2024-02-22

Attending: Sanket Verma (SV), Ward Fisher (WF), Josh Moore (JM), Martin Durant (MD), Tom Nicholas (TN)

TL;DR:

The meeting covered the use of LLMs in training, feedback on the Zarr-Specs website redesign, progress on the V3 refactor, and discussions on integrating kerchunk into Zarr, focusing on chunk manifest standardization and virtualized array concatenation.

Updates:

Meeting Notes:

  • LLMs and how WF is using them in trainings
  • Feedback for new design for Zarr-Specs website (combines ZEP and Zarr-Specs together)
  • MD: How’s V3 refactor work going on these days?
  • TN: Been discussing → https://github.com/zarr-developers/zarr-specs/issues/287
    • interested in integrating kerchunk into zarr, especially two ZEPs
    • (1) chunk manifest (Joe) - standardizing what chunk json files do
    • (2) concatenation - https://github.com/zarr-developers/zarr-specs/issues/288
      1. manifest: opinion that it’s an incredible idea that is very popular
        • fsspec relationship makes things complicated
        • move to the zarr spec for other implementations?
        • goal is readable in any language
        • difficult position
        • three things to think about
          • read byte ranges
          • write JSON
          • combine module
        • roadmap:
          • standardize json for the chunks. manifest file?
          • JM: storage in zarr array itself
          • JM: log file anytime you read a full file into memory
          • Josh: virtual zarr (access pattern)
      1. concatenation
        • multi-zarr-to-zarr leads to a loop
        • more sense to think of concat of virtualized arrays objects
        • see kerchunk array notebook
        • read in byte ranges with kerchunk. array class which only stores byte-offset arrays in memory
        • can be done in xarray. concat-classes can be put into xarray and can use higher-order API
        • JM: store that xarray as a zarr :smile: (but need additional metadata for realizing the array)
        • TN: part of notebook that isn’t done. exactly.
        • common case in geo. multiple NC files, concat those array.
          • possibly compression options change over time.
          • prevents it from being one zarr array
        • JM: or just always serialize to the chunk manifest
        • JM: i.e. where do we stop? (when does Zarr become Turing Complete?)
        • TN: thought at concat (clear use case). but jeremy thought indexing (also clear use case)
        • JM: starting to sound like transforms (https://github.com/ome/ngff/pull/138#issuecomment-1948424000)
        • WF: periodically get requests for operations on the data
          • no one has come close to making the argument for adding that into the storage
          • so many math libraries that would do it better
          • TN: no computation since you don’t need the values. can do some subset of concat & indexing without values.
    • TN: have now become a zarr producer :tada:
    • JM: cross-language motivation
    • SV: pyramiding ZEP discussions