2024-04-17

Attending: Josh Moore (JM), Davis Bennet (DB), Liam Dennis (LD), Eric Perlman (EP), Altay Sansal (AS)

TL;DR:

The team discussed open ZEPs, metadata conventions, and the progress of Zarr-Python V3. Other topics included challenges with slicing data, optimizing compression methods, and the potential for runtime dispatch support for both V2 and V3 groups in Zarr-Python.

Meeting Minutes:

  • Happy Birthday, Sanket
  • Introductions
    • Josh: beverage of choice: whisky –> gin.
    • Liam: finance –> energy. forecasting things like weather. slicing regularly gridded data. NZ so kiwi fruit juice
    • Eric: neuroscience/freelance. dirty chai latte.
    • Davis: big image datasets/freelance.
    • Altay: TGS lead data scientist. energy companies: wind, solar, gas. core is seismic. manage petabytes of data in zarr on Google Cloud (last 3 years). big fan. things slowing down. looking to help. coffee and old fashioneds.
  • AS: ZEPs that are open. How to move them forward?
    • JM: help on getting
    • DB: Zarr Object Model is there. Don’t think much about ZEPs.
    • AS: discussion on metadata conventions?
    • DB: not sure what’s there.
    • DB: re: conventions, doesn’t go far enough. want to validate the hierarchy as well.
    • DB: on the ZOM, will wait until run into a need
      • JM: probably when you cross language barriers
      • DB: tried a TypeScript implementation. That’s two. will revisit though as needed.
      • AS: model is to validate the structure. We have a parallel effort. MDIO. https://mdio-python.readthedocs.io/en/stable/
        • in the current stable branch (working on a v1 release) have JSON Schema to create zarrs (in the energy domain)
        • also building a C++ API for this using tensorstore
        • Another can of worms. lacking some tensorstore features like groups.
        • DB: possible answer “zarr group is just a prefix with some JSON”
    • AS: which zarr should I use. zarr-python slowing down.
      • DB: yes, see the issues which are labeled “V3”
      • JM: see https://github.com/zarr-developers/zarr-python/issues/1777
      • EP: use different implementations depending on what I’m doing
      • AS: talked to Joe.
      • JM: scipy! can DB propose one?
      • DB: logic to support v2 and v3 groups. runtime dispatch on what you read from the group. (not for arrays. get an exception if you hit a v2 array)
        • perhaps 1774 (logging), 1773 (TypedDict/easy), …
        • testing! about to merge some work on group tests.
        • compared to the v2 test suite, the new one will be tighter. e.g., a template that you can build on.
      • DB: but there are still APIs that aren’t figured out. v2 had APIs that were intuitive coming from h5py, but weren’t good for performance.
      • DB: still need to answer the question “what is a zarr array?”
      • AS: there’s one in tensorstore, even if not perfect. (DB: happy to take issues like that)
      • DB: tests for the array API?
      • DB: (sidenote) slice a zarr and get a zarr (or future) without depending on dask.
      • DB: (sidenote) spec to define nouns not verbs…
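
DB's description of a Zarr group as "a prefix with some JSON", and of runtime dispatch on what is read from the group, can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the zarr-python implementation: it assumes a local directory store, relies on the v2 convention of a `.zgroup` document and the v3 convention of a `zarr.json` document with `node_type` set to `group`, and every function name here (`detect_group_format`, `open_v2_group`, `open_v3_group`, `open_group`) is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def detect_group_format(prefix: Path) -> int:
    """Return 2 or 3 based on which group metadata document sits under `prefix`."""
    v3_doc = prefix / "zarr.json"  # Zarr v3: one metadata document per node
    v2_doc = prefix / ".zgroup"    # Zarr v2: group metadata document
    if v3_doc.is_file():
        meta = json.loads(v3_doc.read_text())
        if meta.get("zarr_format") == 3 and meta.get("node_type") == "group":
            return 3
        raise ValueError(f"{v3_doc} does not describe a v3 group")
    if v2_doc.is_file():
        meta = json.loads(v2_doc.read_text())
        if meta.get("zarr_format") == 2:
            return 2
        raise ValueError(f"{v2_doc} does not describe a v2 group")
    raise FileNotFoundError(f"no group metadata found under {prefix}")


# Hypothetical per-version handlers; a real library would return group objects.
def open_v2_group(prefix: Path) -> str:
    return f"v2 group at {prefix.name}"


def open_v3_group(prefix: Path) -> str:
    return f"v3 group at {prefix.name}"


# Runtime dispatch: the format found on disk selects the handler.
OPENERS = {2: open_v2_group, 3: open_v3_group}


def open_group(prefix: Path) -> str:
    return OPENERS[detect_group_format(prefix)](prefix)


# Build one group of each format in a temporary directory and open both.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "old").mkdir()
    (root / "old" / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (root / "new").mkdir()
    (root / "new" / "zarr.json").write_text(
        json.dumps({"zarr_format": 3, "node_type": "group", "attributes": {}})
    )
    v2_result = open_group(root / "old")
    v3_result = open_group(root / "new")
```

In this sketch the caller never names a version; the metadata document found under the prefix decides which code path runs.
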
  • LD: re: slicing any advice/challenges
    • efficient data structure for getting time-series slices and geographic slices (region over time)
    • compute versus space
    • AS: weather forecast models. pre-calculate summary statistics possibly. partial reads help, but still a bit clunky. (similar to sharding which also helps though the zarr-python implementation is slow. tensorstore is fast.)
      • was using 128³ cubes. went to 32³ in shards and got a 4x speedup. (slower in zarr-python)
      • LD: colleague is pushing TileDB. lazy might help.
      • AS: TileDB too expensive
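
The compute-versus-space trade-off LD raised, and the chunk-size change AS described, can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 1024³ float32 array with axes (time, y, x), ignores compression and sharding overhead, and uses made-up function names. It counts how many chunks a selection intersects and how many uncompressed bytes would have to be read.

```python
from math import prod


def chunks_touched(selection, chunk_shape):
    """Count chunks intersected by a selection of half-open (start, stop) ranges."""
    count = 1
    for (start, stop), chunk_len in zip(selection, chunk_shape):
        first = start // chunk_len       # index of the first chunk hit on this axis
        last = (stop - 1) // chunk_len   # index of the last chunk hit on this axis
        count *= last - first + 1
    return count


def bytes_read(selection, chunk_shape, itemsize=4):
    """Uncompressed bytes read, given that whole chunks are always fetched."""
    return chunks_touched(selection, chunk_shape) * prod(chunk_shape) * itemsize


# Hypothetical selections on a 1024^3 array with axes (time, y, x).
time_series = [(0, 1024), (500, 501), (500, 501)]  # one point, all time steps
map_slice = [(10, 11), (0, 1024), (0, 1024)]       # one time step, full region

big = (128, 128, 128)
small = (32, 32, 32)

ts_big, ts_small = bytes_read(time_series, big), bytes_read(time_series, small)
map_big, map_small = bytes_read(map_slice, big), bytes_read(map_slice, small)
```

Under these assumptions both access patterns read fewer uncompressed bytes with 32³ chunks, at the cost of touching more chunks (more requests). This does not reproduce the measured 4x speedup, which also depends on compression, storage latency, and the implementation used.
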
  • miscellaneous compression thoughts
    • https://www.blosc.org/pages/btune/
    • zarr optimize --force CLI? (Anyone?)
    • data engineering malpractice! (JPEG-like)
    • lossy compression
    • in-place codec conversion
    • oopsies. 1B calls for “does file exist”