Observational astrophysics has produced many large datasets, but they are often stored in formats that are inefficient for both storage and large-scale analysis.

Here we publicly release useful observational datasets converted to single-file (monolithic) HDF5 and/or Zarr formats.

Large file sizes.

Some of these data files are a TB or larger. We strongly recommend using a resumable download tool, so that an interrupted download can continue from where it stopped. For example: wget -c https://www.tng-project.org/path/to/file.hdf5
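
For scripted downloads, the resume behaviour of wget -c can be approximated in Python with an HTTP Range request. The following is a minimal sketch using only the standard library; the function names resume_request and download are illustrative, and it assumes the server honours Range requests (it falls back to a full download if not).

```python
import os
import urllib.error
import urllib.request


def resume_request(url, dest):
    """Build a request that asks only for the bytes missing from `dest`."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if offset > 0:
        # HTTP Range header: ask the server to start at byte `offset`.
        req.add_header("Range", f"bytes={offset}-")
    return req, offset


def download(url, dest, chunk_size=1 << 20):
    """Download `url` to `dest`, continuing a partial file if one exists."""
    req, offset = resume_request(url, dest)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        if err.code == 416 and offset > 0:
            return  # requested range is empty: the file is already complete
        raise
    with resp:
        # 206 = the server honoured the range; 200 = it is sending the whole file.
        mode = "ab" if (offset > 0 and resp.status == 206) else "wb"
        with open(dest, mode) as f:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                f.write(chunk)


# Usage (placeholder path, as in the wget example above):
# download("https://www.tng-project.org/path/to/file.hdf5", "file.hdf5")
```

Calling download again after an interruption appends to the partial file rather than starting over.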

Acknowledgment.

Please make sure to follow the relevant citation requests from the original data sources.
In addition, please cite (e.g. in a Data Availability section) the TNG data release paper (Nelson+2019) to acknowledge use of this resource.

Gaia

The entire Gaia DR3 catalog, converted directly from the raw CSV files available on the ESA Gaia Archive. The full dataset is split into two files; the first contains the fields most users will be interested in. Every dataset has a description attribute, including physical units.

  • gaia_dr3.hdf5 (237 GB) - 1,811,709,771 entries.
    Contains fields: [l, b, parallax, parallax_error, ra, ra_error, dec, dec_error, distance_gspphot, distance_gspphot_lower, distance_gspphot_upper, mh_gspphot, mh_gspphot_lower, mh_gspphot_upper, phot_bp_mean_flux_error, phot_bp_mean_mag, phot_g_mean_flux_error, phot_g_mean_mag, phot_rp_mean_flux_error, phot_rp_mean_mag, pmdec, pmdec_error, pmra, pmra_error, radial_velocity, radial_velocity_error, source_id].
  • gaia_dr3_aux.hdf5 (714 GB) - all other fields not contained in the file above.
  • gaia_dr3.zarr (60 KB) - virtual Zarr wrapper.
  • gaia_dr3_mini.hdf5 (13 MB) - 'mini' subset of the full DR3 dataset, containing only 100,000 bright stars (for testing).
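
As a starting point, the files can be inspected and partially read with h5py. This is a sketch only: it assumes each field is a top-level dataset named as in the list above, and that the attribute is literally called "description". The helper names describe and read_fields are illustrative. Reading a row range, rather than a whole column, keeps memory use small for the 1.8 billion entry file.

```python
import h5py
import numpy as np


def describe(path):
    """Return {name: (shape, dtype, description)} for top-level datasets."""
    info = {}
    with h5py.File(path, "r") as f:
        for name, obj in f.items():
            if isinstance(obj, h5py.Dataset):
                # Assumed attribute name; empty string if it is absent.
                desc = obj.attrs.get("description", "")
                info[name] = (obj.shape, obj.dtype, desc)
    return info


def read_fields(path, fields, start=0, stop=None):
    """Read rows [start, stop) of the requested fields into NumPy arrays."""
    with h5py.File(path, "r") as f:
        return {name: f[name][start:stop] for name in fields}


# Usage, e.g. on the small test file:
# for name, (shape, dtype, desc) in describe("gaia_dr3_mini.hdf5").items():
#     print(name, shape, dtype, desc)
# data = read_fields("gaia_dr3_mini.hdf5", ["ra", "dec", "parallax"], 0, 1000)
# good = np.isfinite(data["parallax"])  # mask out missing values, if any
```

Testing a script on gaia_dr3_mini.hdf5 before running it on the full file is likely the quickest way to check these assumptions.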

SDSS/BOSS

All spectra taken by SDSS/BOSS (DR17), converted from the original FITS files available on the SDSS Science Archive Server into a single HDF5 file. All spectra have been placed on a common wavelength grid of 4700 points, from 3531 to 10324 angstrom. Every dataset has a description attribute, including physical units.

  • sdss-dr17-spectra.hdf5 (256 GB) - N = 4,864,154 entries.
    Contains fields: flux [N, 4700], ivar [N, 4700], model [N, 4700], loglam [4700], wave [4700], specobjid [N], class [N, 0 = galaxy, 1 = star, 2 = qso], ra [N], dec [N], z [N], z_err [N], airmass [N], extinction [N], sn_median_all [N], vdisp [N].
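
Because all spectra share one wavelength grid, a subset can be pulled out by row index without touching the rest of the file. The sketch below assumes the fields listed above are top-level datasets with those names; load_spectra and CLASS_CODES are illustrative names, with the class codes taken from the field list.

```python
import h5py
import numpy as np

# Integer codes of the 'class' field, as given in the field list.
CLASS_CODES = {"galaxy": 0, "star": 1, "qso": 2}


def load_spectra(path, kind, max_spectra=100):
    """Return (wave, flux, ivar) for the first `max_spectra` objects of `kind`."""
    with h5py.File(path, "r") as f:
        wave = f["wave"][:]   # shared grid, one entry per pixel
        cls = f["class"][:]   # one small integer per spectrum
        # Sorted, unique row indices, as h5py requires for index lists.
        idx = np.flatnonzero(cls == CLASS_CODES[kind])[:max_spectra]
        if idx.size == 0:
            empty = np.empty((0, wave.size))
            return wave, empty, empty.copy()
        flux = f["flux"][idx]  # reads only the selected rows
        ivar = f["ivar"][idx]
    return wave, flux, ivar


# Usage:
# wave, flux, ivar = load_spectra("sdss-dr17-spectra.hdf5", "qso", 10)
# mean_qso = flux.mean(axis=0)  # simple stack on the common grid
```

Index-list reads in h5py are convenient but slow for very large selections; for bulk processing, reading contiguous row blocks and masking in memory is usually faster.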

KODIAQ

The Keck Observatory Database of Ionized Absorption toward Quasars (KODIAQ) catalogs contain all quasar spectra taken with the ESI and HIRES spectrographs on Keck. All spectra are fully reduced, coadded, continuum normalized, and publicly available from the Keck Observatory Archive (KOA).

HSLA-COS

All data taken with the Hubble Space Telescope Cosmic Origins Spectrograph (COS), as compiled by the Hubble Spectroscopic Legacy Archive (HSLA) up to HST Cycle 26 (archive made on 15 May 2018). Contains all raw and associated combined ultraviolet (FUV and NUV) spectra from COS. The targets span all science categories. Each grating is stored in a group, with datasets: flux, wave, error, target_dec, target_ra, target_name, target_desc, target_type.

  • HSLA-COS.hdf5 (1.2 GB) - contains FUVM3 and G160M (35,601 total spectra), G140L (18,500 total spectra), and FUVM5 and G130M (58,958 total spectra) gratings.
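
Since each grating is stored in its own group, a quick way to see what the file holds is to walk the groups and record the shape of each dataset. This is a sketch that assumes the grating groups sit at the top level of the file; summarize_gratings is an illustrative name.

```python
import h5py


def summarize_gratings(path):
    """Return {group name: {dataset name: shape}} for top-level groups."""
    summary = {}
    with h5py.File(path, "r") as f:
        for grating, group in f.items():
            if not isinstance(group, h5py.Group):
                continue  # skip any stray top-level datasets
            summary[grating] = {
                name: dset.shape
                for name, dset in group.items()
                if isinstance(dset, h5py.Dataset)
            }
    return summary


# Usage:
# for grating, datasets in summarize_gratings("HSLA-COS.hdf5").items():
#     print(grating, datasets.get("flux"))
```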

XQR-30

The public data release of reduced spectra from the E-XQR-30 quasar sample. These are very high-resolution (R ~ 10,000) spectra of reionization-era (z > 6) quasars from XSHOOTER.

  • XQR-30.hdf5 (21 MB) - contains 30 spectra, with both NIR and VIS coverage.

Coming soon...

For the future: DESI, HETDEX, eROSITA, LoTSS, MUSE.

Interested? Other ideas? Get in touch.