Data formatted for the cloud offers researchers powerful features that make it faster, easier, and cheaper to use. Now, a new NASA effort is bringing some of those same enhancements to non-cloud-optimized data. Members of NASA’s Openscapes and earthaccess communities recently collaborated to engineer a trio of tool updates that bring cloud-like capability to widely used, non-cloud-optimized formats such as the netCDF/HDF files found in NASA's Earthdata archive.
The updates include:
- Ongoing development of the VirtualiZarr Python package for creating and operating on virtual datacubes using an xarray-like interface
- A DMR++ parser that allows VirtualiZarr to read a variety of non-cloud-optimized file formats
- earthaccess library enhancements that streamline user workflows by combining NASA Earthdata authentication with virtual datacube functionality
“These tools can significantly improve data analysis workflows across Earth science disciplines, including climate, oceanography, and atmospheric science,” said Elizabeth Joyner, product owner for the Satellite Science Mission Support Team and science outreach lead at NASA’s Atmospheric Science Data Center (ASDC). “The tools offer faster processing, greater utilization of archived data collections through quicker subset extraction, and potential cost savings.”
Work on the three tools is carried out through a collaboration spanning multiple parts of NASA, as well as businesses and nonprofit organizations outside the agency. The collaboration and its Open Source Science approach have proven highly effective in driving the project's substantial progress.
“Openscapes provided facilitation and mentorship to guide the effort, while development and technical contributions were shared across the cross-community team,” said co-lead Dr. Daniel Kaufman, the ASDC Tropospheric Emissions: Monitoring of Pollution (TEMPO) lead data scientist and an Openscapes mentor.
The project’s innovation and utility have drawn the appreciation of the Earth science data community and led to an opportunity for the team to present the work at the American Geophysical Union Fall 2025 Meeting.
Creating Virtual Datasets
One of the three tools the team worked on is VirtualiZarr, which allows datasets in other formats to mimic Zarr, a natively cloud-optimized data format, without requiring users to restructure, reformat, or duplicate the original dataset. The virtual dataset provides information in the form of chunk manifests, which catalog data chunks and enable efficient access to specific parts of the datacube.
A datacube is a multi-dimensional array of data representing variables (e.g., temperature or precipitation) across dimensions like time, latitude, and longitude. Datacubes treat datasets spanning many files or cloud objects as cohesive units, simplifying analysis workflows.
VirtualiZarr simplifies complex processes such as handling and translating between files, data chunk manifests, and serialized references, making virtual datacube workflows more accessible, reliable, and consistent with existing Python tools.
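Conceptually, a chunk manifest maps each chunk of a datacube to the file, byte offset, and length where its data already lives, so only the needed byte ranges are ever fetched. The sketch below illustrates that idea in plain Python; the real VirtualiZarr library uses a dedicated `ChunkManifest` class, so the dictionary keys and paths here are purely illustrative.

```python
# Illustrative chunk manifest: chunk index -> byte range in an original file.
# (VirtualiZarr's actual ChunkManifest class differs; this models the concept.)
manifest = {
    "0.0": {"path": "s3://bucket/granule_001.nc", "offset": 4096, "length": 1048576},
    "0.1": {"path": "s3://bucket/granule_001.nc", "offset": 1052672, "length": 1048576},
    "1.0": {"path": "s3://bucket/granule_002.nc", "offset": 4096, "length": 1048576},
}

def bytes_needed(manifest, chunk_keys):
    """Total bytes fetched for a subset of chunks -- the point of a
    manifest is that only these ranges, not whole files, are read."""
    return sum(manifest[k]["length"] for k in chunk_keys)
```

Because the manifest records where data lives rather than copying it, subsetting one chunk touches ~1 MB here instead of the full files.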
Parsing the Data
The team also developed a DMR++ parser that interprets and processes data from a specific file format and is geared for granule-level virtual dataset access. DMR++ is an XML metadata file used by the OPeNDAP/Hyrax data server to describe what’s inside a data granule and where its data chunks are located so users can read, subset, and reformat just what they need without downloading the entire file.
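A DMR++ file describes, in XML, where each variable's chunks sit inside a granule. The sketch below extracts chunk byte ranges from a DMR++-style snippet using only the standard library; the element and attribute names (`dmrpp:chunk`, `offset`, `nBytes`) follow DMR++ convention, but the sample document and namespace handling are simplified and illustrative, not a complete DMR++ file or the team's actual parser.

```python
# Minimal sketch of reading chunk locations from DMR++-style XML metadata.
# Simplified and illustrative -- not the project's actual parser.
import xml.etree.ElementTree as ET

DMRPP_NS = "http://xml.opendap.org/dap/dmrpp/1.0.0#"  # assumed namespace URI

sample = f"""
<Dataset xmlns:dmrpp="{DMRPP_NS}">
  <Float32 name="temperature">
    <dmrpp:chunks>
      <dmrpp:chunk offset="4096" nBytes="1048576" chunkPositionInArray="[0,0]"/>
      <dmrpp:chunk offset="1052672" nBytes="1048576" chunkPositionInArray="[0,1]"/>
    </dmrpp:chunks>
  </Float32>
</Dataset>
"""

def chunk_byte_ranges(xml_text):
    """Return (offset, length) pairs for every chunk described in the
    metadata -- enough to read just those byte ranges from the granule."""
    root = ET.fromstring(xml_text)
    return [
        (int(c.get("offset")), int(c.get("nBytes")))
        for c in root.iter(f"{{{DMRPP_NS}}}chunk")
    ]
```

Reading only these byte ranges, rather than the whole file, is what lets users subset a granule without downloading it.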
At ASDC, DMR++ files are generated as part of the Cumulus ingest workflow and archived alongside the data, enabling users to subset and reformat data through post-processing services such as Harmony. When paired with VirtualiZarr, the parser allows the framework to read different file types and apply a unified processing pipeline to any of them.
The parser can also be configured to improve access performance for existing Earthdata collections that are stored in traditional, non-cloud-optimized formats and unlikely to be converted to optimized formats. These files often lack compatibility with collection-level virtual datacube specifications.
By enabling users to work with these datasets in a manner that feels closer to the experience of cloud-optimized formats, the parser becomes a key tool for enhancing the usability of NASA’s Earthdata archive.
Enhancing the earthaccess Python Library
The earthaccess updates the team made integrate virtual datacube workflows into the earthaccess programming interface while preserving the library’s simplicity and user-friendly design. earthaccess uses other Python tools and the Earthdata Common Metadata Repository (CMR) to dynamically adapt to the format, location, and capabilities of datasets. By abstracting away unnecessary configuration details, these enhancements aim to make programmatic workflows more intuitive and open to a broader range of users.
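An end-to-end workflow with these enhancements might look like the sketch below. The function names follow recent earthaccess releases, but `open_virtual_mfdataset`'s availability and exact signature vary by version, and the collection short name is illustrative; treat this as a sketch under those assumptions, not the library's definitive API.

```python
def load_virtual_tempo_year():
    """Sketch of the streamlined workflow: authenticate, search CMR,
    then open the matching granules as one virtual datacube without
    downloading them. Imports are deferred so the sketch can be
    defined without the libraries installed."""
    import earthaccess  # assumes a recent earthaccess release

    earthaccess.login()  # Earthdata Login credentials (netrc or prompt)
    results = earthaccess.search_data(
        short_name="TEMPO_NO2_L3",           # illustrative collection name
        temporal=("2024-01-01", "2024-12-31"),
    )
    # Builds chunk manifests behind the scenes (e.g., from DMR++ sidecar
    # files) instead of fetching data, returning a lazy virtual datacube.
    return earthaccess.open_virtual_mfdataset(results)
```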
“It is a huge deal for researchers’ workflows with earthaccess to ‘just work’ even when accessing data that is not cloud-optimized,” says Julia Lowndes, a core Openscapes team member and its founding director. “earthaccess is working behind the scenes to help data masquerade as cloud-optimized so that researchers don’t get slowed down with data formats so they can focus on the science.”
TEMPO Data in a Fraction of the Time
Many datasets of high interest to researchers and experts are likely compatible with the VirtualiZarr, DMR++ parser, and earthaccess library enhancements. A short list includes:
- TEMPO L3 air quality concentrations over North America
- PREFIRE Surface Emissivity Sorted All-sky monthly climatologies
- NISAR Synthetic Aperture Radar
- IMERG rainfall estimates
- SMAP L4 (SPL4SMGP), a 3-hour global dataset that would take about a week to process serially but can be analyzed in less than a minute with these tools
- MUR L4 sea surface temperature
- SWOT L2 SSH 2.0
The TEMPO L3 air quality data provides a good example of the benefits of leveraging virtual dataset workflows using DMR++ parsing, VirtualiZarr, and earthaccess. TEMPO measures air pollutants over North America during daylight, producing hourly East-to-West scans that generate ~10 Level 2 netCDF files and one Level 3 file per hour, with DMR++ metadata files generated alongside each netCDF. A year’s worth of Level 3 TEMPO data totals ~5,000 data files/granules (~2.5 TB).
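The "~5,000 granules" figure is consistent with hourly daylight-only scanning, as the back-of-envelope check below shows. The average of ~13 scan hours per day is an assumption (the actual count varies by season); the rest follows from the numbers in the text.

```python
# Back-of-envelope check of "~5,000 Level 3 granules (~2.5 TB) per year".
scan_hours_per_day = 13   # assumed average; daylight-only, varies by season
l3_files_per_hour = 1     # one Level 3 file per hourly scan (from the text)
days_per_year = 365

l3_files_per_year = scan_hours_per_day * l3_files_per_hour * days_per_year
assert 4000 < l3_files_per_year < 6000  # consistent with ~5,000 granules

# ~2.5 TB across the year implies roughly half a gigabyte per granule.
avg_granule_gb = 2.5e3 / l3_files_per_year
```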
The team tested the TEMPO data on the Openscapes JupyterHub, a cloud computing space managed by partners 2i2c in the Amazon Web Services (AWS) Cloud environment. The combination of the DMR++ parser and VirtualiZarr enabled a year’s worth of Level 3 TEMPO data to be virtually opened as granule-level chunk manifests (with merged netCDF groups) and merged into a unified datacube.
Analyses across the full dataset were completed in about 10 minutes, compared to several hours for streaming the data directly or upwards of 24 hours for downloading it locally. earthaccess streamlined the workflow by handling authentication and search seamlessly.
“This example highlights the benefits of integrating DMR++ parsing, VirtualiZarr, and earthaccess to enable fast analysis, reduce data transfer requirements, and handle large datasets efficiently,” said Kaufman, “all without reliance on significant local storage or computing resources.”