2024-06-26

Attending: Brianna Pagān (BP), Thomas Nicholas (TN), Dennis Heimbigner (DH), Eric Perlman (EP), Sanket Verma (SV), Davis Bennett (DB)

TL;DR:

The team discussed the release of Zarr-Python 3.0.0a0 and its upcoming V3 release, the adoption and applications of Zarr in NASA projects, including the significant projected growth of Zarr data storage. They also explored issues related to metadata grooming and dataset conversion, and EP highlighted the parallel development tracks in bio and geo data management.

Updates:

Zarr-Python 3.0.0a0 out
- https://pypi.org/project/zarr/3.0.0a0/
Good momentum and lots of things happening with ZP-V3 - aiming for mid July release
SV represented Zarr at CZI Open Science 2024 meeting - various groups looking forward to V3 - https://x.com/MSanKeys963/status/1801073720288522466
- R users at bio-conductor looking to develop bindings for ZP-V3
New blog post: https://zarr.dev/blog/nasa-power-and-zarr/
ARCO-ERA5 got updated this week - ~6PB of Zarr data available - check: https://x.com/shoyer/status/1805732055394959819
https://dynamical.org/ - making weather data easy and accessbile to work with
- Check: https://dynamical.org/about/
- Video tutorial: https://youtu.be/uR6-UVO_3k8?si=cp0jOxrtKL_I6LfV

Meeting Minutes:

BP: Will be talking about how Zarr is utilised at NASA!
- starts screen sharing and presenting
- BP: I work at Goddard GES DISC - deputy manager at one of the centres - manages team of developers and engineers - not representing all the data centres
- BP: Lot of people are coming into Zarr from the SMD (Science mission directorates)
- BP: Earth Science Division - EOSDIS and Distributed Active Archive Centres (DAACs) - DAACs focuses on data distribution and management
- BP: All the centres coming up with the suggestion on best practices and best format - we discuss with them the possibility of what they can, and should use
- BP: Moving to cloud optimized format - DAACs have ton of archival data in various formats
- BP: Projected growth for entire Zarr store across all EOSDIS by 2030 60PB -> 600PB!
- BP: GES DISC holds 7 PBs of data - we have 3000 different collections of datasets - really diverse!
- BP: Giovanni - interactive web-based program have 20+ services associated with it - taking the existing data and grooming the metadata so it’s accessible and useful across broader range
- BP: Over at NASA, we do many Zarr stuff…
  - Zarr V2 spec is approved data format convention for use in NASA Earth Science Data Systems (ESDS)
  - Giovanni in cloud - duplicates Zarr (variable based)
    - Open issue: continuously updating Zarr stores - Exploring lakeFS for managing dynamic data
  - ZEP0005
  - Brianna is leading the GeoZarr work
  - VEDA - no. of things Zarr/STAC related going on in VEDA
- TN: Does Giovanni read Zarr directly? If so which reader does it use? (Can Goivanni use VirtualiZarr?)
  - BP: Goivanni promotes variable first search - most of Goivanni has OpenDAP attached to it - builts with overhead with GES DISC pipeline - in hindsight- Yes!
  - TN: From the slides - Xarray can take care of some of the stuff that Giovanni does
- TN: Very curious about the exact difference between the LakeFS idea and EarthMover’s ArrayLake
  - BP: LakeFS is OS ArrayLake - no vendor lock-in
- SV: What does Giovanni actually do when you say, ‘it grooms metadata’?
  - BP: Standardizes the grid - flip the grid - naming mechanism - smoothing the metadata so that it works across various services
  - BP: other grooming metadata is for example we have alot of time dimension issues. that’s because of scattered best practices for how to store time metadata
  - TN: Can we do the flipping with Zarr/VirtualiZarr?
  - DB: If you flip at the store level - you’d need to find out the how deep you’d need to go
  - BP: Will try to make time standard across the datasets
  - BP: https://github.com/briannapagan/quirky-data-checker
- BP: from the Zoom chat
  - Zarr Storage Specification V2 is an approved data format convention for use in NASA Earth Science Data Systems (ESDS). https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices
  - Giovanni in the Cloud, duplicate archive, zarr, variable-based: https://cmr.earthdata.nasa.gov/search/variables.umm_json?instance-format=zarr&provider=GES_DISC&pretty=True
  - Open issue: continuously updating zarr stores. Exploring lakeFS for managing dynamic data
  - ZEP 0005: Zarr accumulation extension for optimizing data analysis
  - Looking into a GIS service for zarr stores
  - POWER: https://power.larc.nasa.gov/data-access-viewer/
  - https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html
  - https://discourse.pangeo.io/t/metadata-duplication-on-stac-zarr-collections/3193/7
EP: Converting OME datasets in V3 in upcoming months - quirky tool can be useful
- DB: V3 chunking encoding matches with V3 encoding - you just need to re-write the JSON document
- DB: Playing with sharding - tensorstore is fast - need to figure out the nomenclature
- EP: The bio and geo world have parallel tracks and working in silos
- EP: https://forum.image.sc/t/ome2024-ngff-challenge/97363
  - DB: The challenge doesn’t seems interesting to me! - convering JSONs documents - instead we should be focusing on converting existing data to sharded stoes - much interesting problem
- EP: Bunch of data is non-Zarr and would be working on to push them to cloud and convert it to Zarr