2023-11-29

Attending: Mike Zuranski (MZ), Sanket Verma (SV), Dennis Heimbigner (DH), Josh Moore (JM), Florian Ziemen (FZ), Ward Fisher (WF), Eric Perlman (EP), Gābor Kovācs (GB), Davis Bennett (DB), Matt Swanson (MS), Jeremy Maitin-Shepard (JMS), Jeremy Delahanty (JD)

TL;DR:

We had new joiners in the meeting today, so we started the discussion with a quick round of introductions along with everyone’s favourite YouTube channel. After that, MS had a question regarding write_empty_chunks for his weather data app. MZ asked for Zarr’s educational materials and others shared their learning resources. JMS reported that Jack Kelly’s benchmarks helped identify specific performance improvements for Tensorstore when reading small chunks. Finally, the meeting ended with GB asking questions about Dask and DB asking whether Tensorstore can do anything with Zarr groups.

Updates:

Zarr-Python refactor
- PR to add roadmap: https://github.com/zarr-developers/zarr-python/pull/1583
- Initiation of V3 branch: https://github.com/zarr-developers/zarr-python/pull/1584

Meeting Minutes:

Introductions w/ favourite YT channels
- SV: Vox
- WF: jacksepticeye
- GB: Allen Institute
- FZ: John Tucker (British Photography guy)
- MZ: Data Engineer @ Unidata - Twitch - EJ_SA
- EP: NotJustBikes
- DB: Making knives in Japan
- MS: R&D Engineer (Software Engineer) - Ididathing
- JMS: Google Research - Steve1989MRE
- DH: Topics: K-Pop
- JD: Grad Student @ Johns Hopkins University - F1 Racing
SV: Where should the Zarr V3.0 design document live - https://zarr.readthedocs.io/ or https://github.com/zarr-developers/zarr-python?
- DB: Easy for public to find in docs
MS: Zarr to store weather data - also ingesting data - trying to not write_empty_chunks - calling dataset again resets the empty chunks value - how to change that?
- DB: Private attribute - you can modify the attribute - depends on how you’re accessing the array - make sure you’re constructing the array with group
- MS: When appending to a Zarr array - the write_empty_chunks setting goes back to the default
- DB: More of an implementation detail - not specification detail
- DH: Is there a way to set an environment variable?
- DB: No. - If there’s a need for a global setting we can add it
- MS: Every time I access the dataset - I don’t want to write empty chunks - and I don’t want the setting to reset
- later in the meeting MS shares screen and DB helps debug the code
MZ: There are a number of ways you can access/read the data - sometimes using bodo model - where could I go for reference material?
- SV : https://zarr.readthedocs.io/en/stable/tutorial.html
- MS: https://mesowest.utah.edu/html/hrrr/zarr_documentation/html/zarr_api_multiple_hrrr_variables.html
- MS: https://github.com/ProjectPythia/HRRR-AWS-cookbook
- DB: In context of earth data - Pangeo is a good resource
- JD: Dask blog: https://blog.dask.org/2018/01/22/pangeo-2
JMS: Jack Kelly has done benchmarks in Tensorstore - performance improvements for small chunks
- EP: Define small chunks.
- JMS: 4 KB-10 KB; 16-cubed or 32-cubed chunks
- EP: Tradeoff between visualisation and programmatic access
- EP: Decompressed chunks?
- JMS: Yes, decompressed chunks. There is per-chunk overhead - in the case where you’re reading small chunks into a larger array - the way we were copying the smaller chunks was sub-optimal - the runs of contiguous bytes being copied were small
- JMS: Sharded format reduces the tradeoff - for visualisations we can access smaller chunks
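To put numbers on the chunk sizes discussed above, here is a minimal arithmetic sketch. It assumes 1-byte voxels and a hypothetical 1024-cubed volume (neither was stated in the meeting), and `chunk_stats` is an illustrative helper, not part of Zarr or Tensorstore. It shows how quickly the chunk count, and with it any fixed per-chunk overhead, grows as chunks shrink.

```python
from math import ceil, prod


def chunk_stats(array_shape, chunk_shape, itemsize):
    """Return (bytes per uncompressed chunk, number of chunks in the array)."""
    chunk_bytes = prod(chunk_shape) * itemsize
    n_chunks = prod(ceil(a / c) for a, c in zip(array_shape, chunk_shape))
    return chunk_bytes, n_chunks


# Assumed example volume: 1024 x 1024 x 1024 voxels, 1 byte each.
volume = (1024, 1024, 1024)
for edge in (16, 32):
    size, count = chunk_stats(volume, (edge,) * 3, itemsize=1)
    print(f"{edge}^3 chunks: {size} bytes each, {count} chunks in total")
```

Under these assumptions, halving the chunk edge multiplies the number of chunks (and of per-chunk operations) by eight.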
GB: Dask cluster was doing some copying - lazy enough that actual data wasn’t read - is the Zarr-Python code lazy for
- DB: Zarr Array API is not lazy - it immediately triggers computation - actual slicing happens on the worker
- GB: Dask does the lazy computation?
- DB: Dask takes a lazy approach to this operation - array indexing is an operation
- DB: fsspec maintains a cache - an LRU cache might also maintain it
- JMS: Tensorstore does caching as well
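The eager-versus-lazy distinction DB describes can be sketched in plain Python. This is a toy illustration only: `EagerArray` and `LazySlice` are hypothetical classes and do not reflect how Zarr-Python or Dask are implemented. The eager object reads as soon as it is indexed; the lazy wrapper records the indexing operation and reads only when `compute()` is called.

```python
class EagerArray:
    """Toy stand-in for an eager array: indexing reads data immediately."""

    def __init__(self, data):
        self._data = data
        self.reads = 0  # counts how many times data was actually read

    def __getitem__(self, key):
        self.reads += 1
        return self._data[key]


class LazySlice:
    """Toy stand-in for a lazy array: indexing is recorded, not executed."""

    def __init__(self, source, keys=()):
        self._source = source
        self._keys = keys

    def __getitem__(self, key):
        # Return a new lazy object that remembers the extra indexing step.
        return LazySlice(self._source, self._keys + (key,))

    def compute(self):
        # Only now are the recorded indexing steps applied to the source.
        result = self._source
        for key in self._keys:
            result = result[key]
        return result


eager = EagerArray(list(range(10)))
lazy = LazySlice(eager)[2:5]      # nothing has been read yet
reads_before = eager.reads
values = lazy.compute()           # the read happens here
reads_after = eager.reads
```

In this sketch the read count stays at zero until `compute()` runs, which mirrors the "lazy enough that actual data wasn’t read" observation above.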
MS: shares screen
- DB: access the array directly using zarr.open() and change the parameters
DB: Can Tensorstore do anything with Zarr groups?
- JMS: Not atm.
- DB: Want to have a non-Zarr-Python backend in Pydantic-Zarr
- DB: Trying to look at Zarr-Python, Zarrita, Tensorstore
- EP: Would be good to have that as well
- JMS: Part of the issue just came up when opening arrays
- SV: How’s the development going?
- DB: Been using pre-2.0 pydantic-Zarr for work - super useful for my work - helpful for debugging messed-up hierarchies
- SV: Brief on PyDantic-Zarr
- DB: gives the intro
- DB: Totally ignores the values - but there is type checking
- DB: Checking is a big ask - Can use Pydantic checker for that
- DB: Getting value out of inspecting hierarchies