80TB+ of astronomy for the HDD-poor: crossmatch the Universe from your laptop

Community Article Published June 29, 2026

Mike Smith (Smith42)

hugging-science

TL;DR

The Multimodal Universe (MMU) pools more than 80TB of data from over 30 astronomical surveys into one place. Crossmatching (linking observations of the same object across surveys) is its killer feature, but until now it required downloading hefty chunks of data to local disk. We got tired of needing a cluster just to run a crossmatch, so we gathered in the UniverseTBD and Hugging Science Discord servers to fix that. We've converted the MMU to the Parquet-based HATS format so that you can use the LSDB and Hugging Face ecosystems to crossmatch from a laptop. The datasets are in this Hugging Face collection. No bulk downloads are necessary, and 4GB of RAM is enough even at Gaia scale. Here it is in action:

That whole run ☝️ is really just this 👇:

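# Inline script metadata (PEP 723), read by runners such as `uv run`
# /// script
# requires-python = ">=3.12"
# dependencies = ["lsdb", "huggingface-hub"]
# ///
import lsdb

# Open both catalogs lazily, straight from the Hugging Face Hub
gz10 = lsdb.open_catalog("hf://datasets/UniverseTBD/mmu_gz10")
sdss = lsdb.open_catalog("hf://datasets/UniverseTBD/mmu_sdss_sdss")

# Match each gz10 object to its nearest SDSS neighbour, then run the computation
xmatch = gz10.crossmatch(sdss, n_neighbors=1).compute()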

What exactly is the Multimodal Universe?

The MMU in a nutshell. CREDIT: MMU Collaboration (2024)

The MMU is what happens when a bunch of astronomers from a bunch of institutes get tired of every survey having its own quirks and decide to put everything in one format that non-astronomers can use easily. We're talking about over 80TB of galaxy imagery taken across the spectrum, spectra of galaxies and stars, time series of variable stars, and a whole zoo of assorted measurements and physical data. The idea is simple: you shouldn't need a PhD in a specific survey's archival system to load the data from that survey and do cool science with it.

In astronomy, data is abundant yet fragmented; it's out there, but every survey does things a little differently. Each one comes with its own storage format, its own documentation, and its own gotchas (like one survey storing brightness as magnitudes and another as flux in janskys) that you need to know about to get the most out of the data. The MMU standardises all of this, and provides a nice crossmatching utility on top so that you can link objects across surveys to build your own multimodal datasets.
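To make the magnitudes-versus-janskys gotcha concrete, here is a minimal sketch of converting between the two. It assumes the AB magnitude system, where a source of 3631 Jy has magnitude zero; surveys that use Vega magnitudes or their own zero points need different constants. The function names are ours for illustration and are not part of the MMU or any survey's tooling.

```python
import math

# Assumption: AB magnitude system, where 3631 Jy corresponds to magnitude 0.
AB_ZERO_POINT_JY = 3631.0


def jansky_to_ab_mag(flux_jy: float) -> float:
    """Convert a flux density in janskys to an AB magnitude."""
    if flux_jy <= 0:
        raise ValueError("flux must be positive to have a magnitude")
    return -2.5 * math.log10(flux_jy / AB_ZERO_POINT_JY)


def ab_mag_to_jansky(mag: float) -> float:
    """Convert an AB magnitude back to a flux density in janskys."""
    return AB_ZERO_POINT_JY * 10 ** (-0.4 * mag)


# Brighter sources have *smaller* magnitudes: a classic trap for newcomers.
faint = jansky_to_ab_mag(1e-6)   # a microjansky source
bright = jansky_to_ab_mag(1e-3)  # a millijansky source
```

Mixing the two conventions without noticing gives nonsense, because one scale is linear and the other is logarithmic and runs backwards.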

There is a big catch though. In the MMU v1.0, crossmatching requires you to download data in its native HDF5 format. Even for a subset of the full MMU this is a significant chunk of space. The individual datasets stream beautifully from Hugging Face, but the moment you want to combine them (which is the whole point!) you need local storage that is just out of reach for many astronomers. This means that the MMU's most powerful feature is not usable by those of us whose computers are more potato than Ferrari.

A freshly avoidable issue.

So we fixed it. We converted the individual MMU datasets from their old HDF5 format into HATS, and uploaded them to Hugging Face. The LSDB folks had already built streaming crossmatching into their library, and once the data was in the right format and in the right place, it just worked. We'll get to that shortly, but first let's discuss why crossmatching is worth all this trouble.

Why should you care about crossmatching astronomy data?

Crossmatching is the process of saying "this image, this spectrum, and this photometry are descriptions of this object". In astronomy you do this by tracking where in the sky each survey is looking, recording that location, and then calling two observations "the same thing" if they are close enough together. This is a simple enough process in principle, but a real nightmare when you try to do it at scale.
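As a toy illustration of the "close enough together" rule, here is a brute-force nearest-neighbour crossmatch in plain Python. It is a sketch only and not how LSDB implements crossmatching: the tuple layout, the function names, and the 1 arcsecond default radius are all assumptions made for the example.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    h = sin_ddec**2 + math.cos(dec1) * math.cos(dec2) * sin_dra**2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600


def crossmatch(left, right, radius_arcsec=1.0):
    """For each (id, ra, dec) in `left`, keep the nearest `right` object within the radius."""
    matches = []
    for left_id, left_ra, left_dec in left:
        best = None
        for right_id, right_ra, right_dec in right:
            sep = angular_separation_arcsec(left_ra, left_dec, right_ra, right_dec)
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (right_id, sep)
        if best is not None:
            matches.append((left_id, best[0], best[1]))
    return matches


# Two made-up mini "surveys": a1 and b1 sit a fraction of an arcsecond apart.
survey_a = [("a1", 150.0000, 2.0000), ("a2", 10.0, -5.0)]
survey_b = [("b1", 150.0001, 2.0001), ("b2", 200.0, 45.0)]
matches = crossmatch(survey_a, survey_b, radius_arcsec=1.0)
```

The double loop compares every object in one survey against every object in the other, so the cost grows with the product of the two catalog sizes. That is fine for a handful of rows and hopeless for billions, which is where the nightmare at scale comes from.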

So what does crossmatching look like in practice for astronomers and what science does it enable? A great example is documented within the paper "Luminous Late-time Radio Emission from Supernovae Detected by the Karl G. Jansky Very Large Array Sky Survey (VLASS)" by Michael Stroh and others. They took ~70k examples of previously found supernovae, crossmatched their sample with the VLASS radio survey, and then filtered candidates over the Chandra, LOFAR, XMM-Newton, and Swift-XRT surveys. This whittled down 70,000 candidates to just 19 supernovae.

Each of these supernovae was, weirdly, still pumping out radio emission many years after its explosion. Normally a supernova's radio emission fades predictably as the shock wave plows through the relatively thin gas surrounding the dying star. Stroh and friends proposed three explanations for this. The shock wave may have slammed into a dense shell of material that the star shed into space centuries before it exploded. The radio emission might be coming from a jet of material (similar to those seen in...
