As Earth science data archives grow in volume and data density, the traditional science workflow of data download, conditioning, and analysis becomes increasingly unwieldy. Network bandwidth, local storage, and computing performance all impose cost and time constraints that an investigator must account for before hypothesis testing can begin.
Virtualized datasets offer a way around these issues: they are lightweight reference files that give access to an entire data record through Python packages such as Xarray. From there, users can quickly subset to their region and timespan of interest, eliminating the need to download and subset thousands of files and terabytes of data. This opens a path to both streamlined data access and improved science workflows, in which a user can easily iterate over datasets, change space and time bounds, and quickly compare complementary datasets.
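As a minimal sketch of this access pattern, the block below assumes the reference file is a Kerchunk-style JSON read through fsspec's `reference://` filesystem with Xarray's `zarr` engine. That is one common way to open such files and may not match the exact format PO.DAAC distributes. The helper name `open_virtual_dataset`, the variable name `sst`, and the coordinate names `lat`, `lon`, and `time` are hypothetical. A small synthetic dataset stands in for the remote record so that the subsetting step runs offline.

```python
import numpy as np
import pandas as pd
import xarray as xr


def open_virtual_dataset(reference_path, remote_protocol="https", remote_options=None):
    """Lazily open a Kerchunk-style reference file (hypothetical helper, not called below).

    Assumes fsspec and zarr are installed and that any credentials the
    archive requires are supplied through `remote_options`.
    """
    storage_options = {
        "fo": reference_path,                # path or URL of the reference JSON
        "remote_protocol": remote_protocol,  # protocol of the underlying data files
        "remote_options": remote_options or {},
    }
    return xr.open_dataset(
        "reference://",
        engine="zarr",
        chunks={},  # keep arrays lazy; data is read only when computed
        backend_kwargs={"storage_options": storage_options, "consolidated": False},
    )


def make_synthetic_record():
    """Build a small global monthly grid as an offline stand-in for a virtual dataset."""
    time = pd.date_range("2000-01-01", periods=24, freq="MS")
    lat = np.arange(-89.5, 90.0, 1.0)
    lon = np.arange(-179.5, 180.0, 1.0)
    rng = np.random.default_rng(0)
    data = rng.normal(size=(time.size, lat.size, lon.size))
    return xr.Dataset(
        {"sst": (("time", "lat", "lon"), data)},
        coords={"time": time, "lat": lat, "lon": lon},
    )


ds = make_synthetic_record()

# Subset by label: only this box and period would be fetched from a real record.
subset = ds.sel(
    time=slice("2000-06-01", "2001-05-31"),
    lat=slice(30, 45),
    lon=slice(-130, -115),
)
print(dict(subset.sizes))
```

Changing the space or time bounds means editing the `slice` arguments and re-running; no files need to be downloaded again.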
NASA’s Physical Oceanography Distributed Active Archive Center (PO.DAAC) has created 10 virtualized datasets covering ocean currents, winds, bottom pressure, sea surface height, salinity, and temperature from satellite observations and ocean models. In this webinar, we will briefly describe the fundamentals of the technology and demonstrate how to use it in Python scripts and notebooks. We will also present performance metrics from computing regional mean time series of satellite records 25 to 40 years long, showing a full order of magnitude improvement in compute time compared with traditional data access and analysis methods.
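A regional mean time series of the kind described above could be computed along the following lines. This is an illustrative sketch, not the benchmark code behind the reported metrics. The function name `regional_mean_timeseries`, the coordinate names, and the bounding box are assumptions, and the cosine-of-latitude weighting is one common choice for regular latitude-longitude grids. A synthetic field is used so the example runs without network access.

```python
import numpy as np
import pandas as pd
import xarray as xr


def regional_mean_timeseries(da, lat_bounds, lon_bounds, lat_name="lat", lon_name="lon"):
    """Area-weighted mean over a lat/lon box, returning one value per time step."""
    region = da.sel({lat_name: slice(*lat_bounds), lon_name: slice(*lon_bounds)})
    # Grid cells shrink toward the poles, so weight each row by cos(latitude).
    weights = np.cos(np.deg2rad(region[lat_name]))
    return region.weighted(weights).mean(dim=(lat_name, lon_name))


# Synthetic stand-in: each monthly field is spatially uniform and equal to its
# time index, so the regional mean at step k should be exactly k.
n_steps = 36
time = pd.date_range("1990-01-01", periods=n_steps, freq="MS")
lat = np.arange(-59.5, 60.0, 1.0)
lon = np.arange(0.5, 360.0, 1.0)
values = np.arange(n_steps, dtype=float)[:, None, None] * np.ones((1, lat.size, lon.size))
da = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="sst",
)

# An arbitrary tropical Pacific box, chosen only for illustration.
series = regional_mean_timeseries(da, lat_bounds=(-5, 5), lon_bounds=(190, 240))
print(series.values[:5])
```

With a lazily opened virtual dataset in place of `da`, the same call would read only the chunks that intersect the box before reducing them.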
Lastly, we will present examples of using virtual datasets in real-world science investigations, including interdisciplinary links between wind and ocean response during upwelling events, surface characteristics of the Indian Ocean Dipole, and the ocean response to El Niño events.