2024-02-22

Attending: Sanket Verma (SV), Ward Fisher (WF), Josh Moore (JM), Martin Durant (MD), Tom Nicholas (TN)

TL;DR:

The meeting covered the use of LLMs in training, feedback on the Zarr-Specs website redesign, progress on the V3 refactor, and discussions on integrating kerchunk into Zarr, focusing on chunk manifest standardization and virtualized array concatenation.

Updates:

HTTP Extension meeting: https://docs.google.com/document/d/14TJfrjbfU1R2REjrZ35GjV74MJ18j_m6geWdj0oB83Y/edit?usp=sharing

Meeting Notes:

LLMs and how WF is using them in trainings
Feedback for new design for Zarr-Specs website (combines ZEP and Zarr-Specs together)
- Link: https://docs-test-sanket.readthedocs.io/en/latest/
MD: How’s V3 refactor work going on these days?
- JM: Quite good progress taking place these days
- SV: V3 PRs can be found here - https://github.com/zarr-developers/zarr-python/pulls?q=is%3Apr+is%3Aopen+label%3AV3
- MD: https://zarr.dev/zeps/draft/ZEP0003.html
TN: Been discussing → https://github.com/zarr-developers/zarr-specs/issues/287
- interested in integrating kerchunk into zarr, especially two ZEPs
- (1) chunk manifest (Joe) - standardizing what chunk json files do
- (2) concatenation - https://github.com/zarr-developers/zarr-specs/issues/288
- 1. manifest: opinion that it’s an incredible idea that is very popular
    - fsspec relationship makes things complicated
    - move to the zarr spec for other implementations?
    - goal is readable in any language
    - difficult position
    - three things to think about
      - read byte ranges
      - write JSON
      - combine module
    - roadmap:
      - standardize json for the chunks. manifest file?
      - JM: storage in zarr array itself
      - JM: log file anytime you read a full file into memory
      - Josh: virtual zarr (access pattern)
- 1. concatenation
    - multi-zarr-to-zarr leads to a loop
    - more sense to think of concat of virtualized arrays objects
    - see kerchunk array notebook
    - read in byte ranges with kerchunk. array class which only stores byte-offset arrays in memory
    - can be done in xarray. concat-classes can be put into xarray and can use higher-order API
    - JM: store that xarray as a zarr :smile: (but need additional metadata for realizing the array)
    - TN: part of notebook that isn’t done. exactly.
    - common case in geo. multiple NC files, concat those array.
      - possibly compression options change over time.
      - prevents it from being one zarr array
    - JM: or just always serialize to the chunk manifest
    - JM: i.e. where do we stop? (when does Zarr become Turing Complete?)
    - TN: thought at concat (clear use case). but jeremy thought indexing (also clear use case)
    - JM: starting to sound like transforms (https://github.com/ome/ngff/pull/138#issuecomment-1948424000)
    - WF: periodically get requests for operations on the data
      - no one has come close to making the argument for adding that into the storage
      - so many math libraries that would do it better
      - TN: no computation since you don’t need the values. can do some subset of concat & indexing without values.
- TN: have now become a zarr producer :tada:
- JM: cross-language motivation
- SV: pyramiding ZEP discussions