2024-04-17
Attending: Josh Moore (JM), Davis Bennet (DB), Liam Dennis (LD), Eric Perlman (EP), Altay Sansal (AS)
TL;DR:
The team discussed open ZEPs, metadata conventions, and the progress of Zarr-Python V3. Other topics included challenges with slicing data, optimizing compression methods, and the potential for runtime dispatch support for both V2 and V3 groups in Zarr-Python.
Meeting Minutes:
- Happy Birthday, Sanket
- Introductions
- Josh: beverage of choice: whisky –> gin.
- Liam: finance –> energy. forecasting things like Weather. slicing regularly grided. NZ so kiwi fruit juice
- Eric: neuroscience/freelance. dirty chai latte.
- Davis: big image datasets/freelance.
- Altay: TGS lead data scientist. energy companies. wind, solar, gas. core is seismic. manage petabytes of data in zarr. Google cloud. (last 3 years). big fan. things slowing down. looking to help. coffee and old fashions.
- AS: ZEPs that are open. How to move them forward?
- JM: help on getting
- DB: Zarr Object Model is there. Don’t think much about ZEPs.
- AS: discussion on metadata conventions?
- DB: not sure what’s there.
- DB: to conventions, doesn’t go far enough. want to validate the hierarchy as well.
- DB: on the ZOM, will wait until run into a need
- JM: probably when you cross language barriers
- DB: tried a typescript implementation. That’s two. will revisit though as needed.
- AS: model is to validate the structure. We have a parallel effort. MDIO. https://mdio-python.readthedocs.io/en/stable/
- in the current stable branch (working on a v1 release) have json-schema to create zarrs (in the energy domain)
- also building a C++ API for this using tensorstore
- Another can of worms. lacking some tensorstore features like groups.
- DB: possible answer “zarr group is just a prefix with some JSON”
- AS: which zarr should I use. zarr-python slowing down.
- DB: yes, see the issues which are labeled “V3”
- JM: see https://github.com/zarr-developers/zarr-python/issues/1777
- EP: use different implementations depending on what I’m doing
- AS: talked to Joe.
- JM: scipy! can DB propose one?
- DB: logic to support v2 and v3 groups. runtime dispatch on what you read from the group. (not for arrays. get an exception if you hit a v2 array)
- perhaps 1774 (logging), 1773 (typedict/easy), …
- testing! about to merge some work on group tests.
- in v2 the test suite, new one will be tighter. e.g., a template that you can build on.
- DB: but there are still APIs that aren’t figured out. v2 had APIs that weren’t intuitive coming from h5py, but weren’t good for performance.
- DB: still need to answer the question “what is a zarr array?”
- AS: there’s one in tensorstore, even if not perfect. (DB: happy to take issues like that)
- DB: tests for the array API?
- DB: (sidenote) slice a zarr and get a zarr (or future) without depending on dask.
- DB: (sidenote) spec to define nouns not verbs…
- LD: re: slicing any advice/challenges
- efficient data structure for getting time-series slices and geo-graphic slices (region over time)
- compute versus space
- AS: weather forecast models. pre-calculate summary statistics possibly. partial reads help, but still a bit clunky. (similar to sharding which also helps though the zarr-python implementation is slow. tensorstore is fast.)
- was using 1283 cube. went to 323 in shards. and got 4x speed up. (slower in zarr-python)
- LD: colleague is pushing tiledb. lazy might help.
- AS: tileDB too expensive
- miscellaneous compression thoughts
- https://www.blosc.org/pages/btune/
zarr optimize --force
CLI? (Anyone?)- data engineering malpractice! (JPEG-like)
- lossy compression
- in-place codec conversion
- oopsies. 1B calls for “does file exist”