2024-04-17

Attending: Josh Moore (JM), Davis Bennet (DB), Liam Dennis (LD), Eric Perlman (EP), Altay Sansal (AS)

TL;DR:

The team discussed open ZEPs, metadata conventions, and the progress of Zarr-Python V3. Other topics included challenges with slicing data, optimizing compression methods, and the potential for runtime dispatch support for both V2 and V3 groups in Zarr-Python.

Meeting Minutes:

Happy Birthday, Sanket
Introductions
- Josh: beverage of choice: whisky –> gin.
- Liam: finance –> energy. forecasting things like weather. slicing regularly gridded data. NZ so kiwi fruit juice
- Eric: neuroscience/freelance. dirty chai latte.
- Davis: big image datasets/freelance.
- Altay: TGS lead data scientist. energy companies. wind, solar, gas. core is seismic. manage petabytes of data in zarr. Google Cloud. (last 3 years). big fan. things slowing down. looking to help. coffee and old fashioneds.
AS: ZEPs that are open. How to move them forward?
- JM: help on getting
- DB: Zarr Object Model is there. Don’t think much about ZEPs.
- AS: discussion on metadata conventions?
- DB: not sure what’s there.
- DB: to conventions, doesn’t go far enough. want to validate the hierarchy as well.
- DB: on the ZOM, will wait until run into a need
  - JM: probably when you cross language barriers
  - DB: tried a TypeScript implementation. That’s two. will revisit though as needed.
  - AS: model is to validate the structure. We have a parallel effort. MDIO. https://mdio-python.readthedocs.io/en/stable/
    - in the current stable branch (working on a v1 release) have json-schema to create zarrs (in the energy domain)
    - also building a C++ API for this using tensorstore
    - Another can of worms. lacking some tensorstore features like groups.
    - DB: possible answer “zarr group is just a prefix with some JSON”
- AS: which zarr should I use. zarr-python slowing down.
  - DB: yes, see the issues which are labeled “V3”
  - JM: see https://github.com/zarr-developers/zarr-python/issues/1777
  - EP: use different implementations depending on what I’m doing
  - AS: talked to Joe.
  - JM: scipy! can DB propose one?
  - DB: logic to support v2 and v3 groups. runtime dispatch on what you read from the group. (not for arrays. get an exception if you hit a v2 array)
    - perhaps 1774 (logging), 1773 (TypedDict/easy), …
    - testing! about to merge some work on group tests.
    - in v2 the test suite, new one will be tighter. e.g., a template that you can build on.
  - DB: but there are still APIs that aren’t figured out. v2 had APIs that were intuitive coming from h5py, but weren’t good for performance.
  - DB: still need to answer the question “what is a zarr array?”
  - AS: there’s one in tensorstore, even if not perfect. (DB: happy to take issues like that)
  - DB: tests for the array API?
  - DB: (sidenote) slice a zarr and get a zarr (or future) without depending on dask.
    - https://github.com/zarr-developers/zarr-python/discussions/1603
  - DB: (sidenote) spec to define nouns not verbs…
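DB's remark that a “zarr group is just a prefix with some JSON”, together with the idea of runtime dispatch on what is read from the group, can be sketched roughly as below. This is a minimal illustration using only the standard library, not the Zarr-Python implementation: the metadata file names (`.zgroup`, `zarr.json`) and their fields follow the v2 and v3 specs, while `write_group` and `detect_group_format` are hypothetical helper names and a local directory stands in for the store prefix.

```python
import json
import tempfile
from pathlib import Path

# Metadata document names defined by the Zarr v2 and v3 specs.
V2_GROUP_KEY = ".zgroup"
V3_NODE_KEY = "zarr.json"


def write_group(prefix: Path, zarr_format: int) -> None:
    """Hypothetical helper: create a group by writing one JSON document under a prefix."""
    prefix.mkdir(parents=True, exist_ok=True)
    if zarr_format == 2:
        (prefix / V2_GROUP_KEY).write_text(json.dumps({"zarr_format": 2}))
    elif zarr_format == 3:
        doc = {"zarr_format": 3, "node_type": "group", "attributes": {}}
        (prefix / V3_NODE_KEY).write_text(json.dumps(doc))
    else:
        raise ValueError(f"unsupported zarr_format: {zarr_format}")


def detect_group_format(prefix: Path) -> int:
    """Hypothetical helper: decide v2 vs v3 from whatever is stored at the prefix."""
    v3_doc = prefix / V3_NODE_KEY
    if v3_doc.is_file():
        meta = json.loads(v3_doc.read_text())
        # In v3 the same file name is used for arrays, so check the node type.
        if meta.get("node_type") != "group":
            raise ValueError(f"{prefix} holds a v3 {meta.get('node_type')}, not a group")
        return int(meta["zarr_format"])
    v2_doc = prefix / V2_GROUP_KEY
    if v2_doc.is_file():
        return int(json.loads(v2_doc.read_text())["zarr_format"])
    raise FileNotFoundError(f"no group metadata found under {prefix}")


def demo() -> dict:
    """Write one group of each format into a temp dir and detect both."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_group(root / "old", 2)
        write_group(root / "new", 3)
        return {name: detect_group_format(root / name) for name in ("old", "new")}
```

A reader would call `detect_group_format` once on open and then hand off to a format-specific code path; checking for the v3 document first is an arbitrary choice made for this sketch.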
LD: re: slicing any advice/challenges
- efficient data structure for getting time-series slices and geographic slices (region over time)
- compute versus space
- AS: weather forecast models. pre-calculate summary statistics possibly. partial reads help, but still a bit clunky. (similar to sharding which also helps though the zarr-python implementation is slow. tensorstore is fast.)
  - was using 128³ cube. went to 32³ in shards. and got 4x speedup. (slower in zarr-python)
  - LD: colleague is pushing TileDB. lazy might help.
  - AS: TileDB too expensive
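The compute-versus-space trade-off behind the slicing discussion can be made concrete by counting how many chunks a selection touches and how many elements must be read and decoded as a result. The sketch below is a back-of-the-envelope model only: the helper names, the array extent of 1024 along each axis, and the two example selections are assumptions for illustration, and the numbers say nothing about actual throughput in zarr-python or tensorstore.

```python
from math import prod


def chunks_touched(selection, chunk_shape):
    """Count chunks intersected by a selection of half-open (start, stop) ranges."""
    counts = []
    for (start, stop), size in zip(selection, chunk_shape):
        if stop <= start:
            raise ValueError("each range must be non-empty")
        first = start // size        # index of the first chunk hit on this axis
        last = (stop - 1) // size    # index of the last chunk hit on this axis
        counts.append(last - first + 1)
    return prod(counts)


def elements_read(selection, chunk_shape):
    """Elements that must be fetched and decoded, since whole chunks are read."""
    return chunks_touched(selection, chunk_shape) * prod(chunk_shape)


# Assumed (time, y, x) array with extent 1024 on every axis.
time_series = ((0, 1024), (500, 501), (500, 501))  # one location, all time steps
region = ((0, 1), (0, 256), (0, 256))              # one time step, 256 x 256 area

for chunk_shape in ((128, 128, 128), (32, 32, 32)):
    for name, sel in (("time series", time_series), ("region", region)):
        print(chunk_shape, name, chunks_touched(sel, chunk_shape),
              elements_read(sel, chunk_shape))
```

In this model the 32³ chunks touch more chunks per query but read far fewer elements than 128³ chunks; grouping small chunks into shards is one way to keep the object count manageable while still allowing the smaller reads.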
miscellaneous compression thoughts
- https://www.blosc.org/pages/btune/
- zarr optimize --force CLI? (Anyone?)
- data engineering malpractice! (JPEG-like)
- lossy compression
- in-place codec conversion
- oopsies. 1B calls for “does file exist”