Old format, no problem!: Cloud-optimizing the GOES-16 archive as Virtual Zarr | Earthmover
Blog Post<br>June 2, 2026 Old format, no problem!: Cloud-optimizing the GOES-16 archive as Virtual Zarr
Tom Nicholas Software Engineer
Tl;dr: Using VirtualiZarr, Icechunk, and Arraylake, we made a massive archive of GOES-16 satellite imagery available as a single cloud-optimized virtual Zarr store, all without copying any data. View the repo here!
The future of scientific data is cloud-native, but archival formats aren’t
Cloud is clearly the future of scientific data sharing.<br>“Analysis-Ready, Cloud-Optimized” (or ARCO) data stores are now being championed by major public agencies internationally (including USGS, NOAA, NASA, and ECMWF), as well as myriad non-profits and private companies.<br>Compared to traditional fileserver-backed data portals, ARCO data stores scale better to big datasets, are more reliable, perform better for high-throughput workloads such as ML training, serve unlimited numbers of users cost-efficiently, and are globally accessible, which better facilitates collaboration.
However, many scientific file formats still in widespread use predate the invention of the cloud, sometimes by over 15 years!<br>These “archival” file formats don’t work well at all in cloud object storage, because they were never designed with that use case in mind.<br>(For a deep dive on exactly what makes a format “cloud-optimized”, and just how much difference it makes, read our earlier post “What is Cloud-Optimized Scientific Data?”).
Dilemma: Duplicate data or let users suffer?
Pressure to move to the cloud under budget constraints means data providers often begin a cloud migration by performing a “lift and shift”, whereby data archives are uploaded to cloud object storage in their original formats without any other modifications.<br>Their data is now technically accessible through the cloud, but without many of the benefits of a truly cloud-native system.
In an ideal world, data providers would convert all the data into a cloud-optimized format such as Zarr, but often there are requirements to keep the data available in the older formats for several years yet.<br>This presents a dilemma: either they create an additional copy of the data (costing twice as much, which is very expensive for petabyte-scale datasets), or they leave the data in a format that is poorly suited for the cloud.<br>In some situations, there’s a strong case to be made for duplicating the data, but we know that is not always an option.
Solution: Cloud-optimized “virtual stores” referencing existing files
One solution to this dilemma is to provide cloud-optimized access to the contents of the files, without modifying or duplicating them.<br>This is what “virtual Zarr stores” enable, allowing stewards of massive archival scientific datasets to provide a great experience for their users, with only a single copy of the data.
Let’s briefly discuss how it’s possible to provide cloud-optimized access to non-cloud-optimized files.<br>Unlike traditional POSIX file systems, you interact with data in cloud object storage over HTTP.<br>The tricky thing about cloud object storage is finding out exactly where the chunks of data you want are located without the use of filesystem primitives like seek().<br>Once you do know their exact locations, you can fetch many chunks efficiently in parallel using HTTP range requests (i.e. you can achieve very high throughput).<br>Unfortunately, pre-cloud file formats usually don’t come with an efficient way to find out where the chunks you want are.<br>In general you may have to download all the data just to learn what’s in it, which is incredibly inefficient and severely limits what users are able to easily do with the data.
However, if someone has already done the up-front work of scanning the data to find the exact locations of every chunk, you can simply consult that mapping and immediately know exactly where to fetch the chunks you want from.<br>Amazingly, this trick works for many different scientific file formats, including HDF5, netCDF3, netCDF4, GeoTIFF, GRIB, FITS, and even Zarr itself!<br>You end up with a “chunk manifest” for every variable that looks something like this:
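To make the range-request idea concrete, here is a minimal Python sketch. It is not taken from any of the libraries mentioned in this post: `FAKE_OBJECT`, `range_header`, `fetch_range`, and `fetch_many` are hypothetical names, and the "object store" is just an in-memory byte string so the example runs without network access. It shows how a known (offset, length) pair becomes an HTTP `Range` header, and how several known ranges can be fetched concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one object in cloud storage (hypothetical contents, 1024 bytes).
FAKE_OBJECT = bytes(range(256)) * 4


def range_header(offset: int, length: int) -> dict:
    """Build the HTTP Range header for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive at both ends, hence the `- 1`.
    """
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def fetch_range(offset: int, length: int) -> bytes:
    """Simulate a ranged GET against FAKE_OBJECT.

    A real client would send range_header(offset, length) with a GET request
    and read the body of the partial-content response instead of slicing.
    """
    return FAKE_OBJECT[offset : offset + length]


def fetch_many(ranges: list) -> list:
    """Fetch several (offset, length) ranges concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda r: fetch_range(*r), ranges))


# Four byte ranges whose locations are already known.
chunks = fetch_many([(100, 100), (200, 100), (300, 100), (400, 100)])
```

The key point is that each request is independent once the offsets are known, so the client can issue as many in parallel as the network allows.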
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},<br>"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},<br>"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},<br>"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},<br>An example chunk manifest: each Zarr chunk key maps to a byte range (offset and length) within an existing file.
Once generated, this mapping can be...