2023-06-14

Attending: Davis Bennett (DB), Josh Moore (JM), Alan Watson (AW), Dennis Heimbigner (DH), Iana (IA), Jeremy Maitin-Shepard (JMS), Norman Rzepka (NR), Sanket Verma (SV)

TL;DR:

JM started the meeting by sharing some important updates. DB initiated a discussion on race conditions during array creation. NR is looking for feedback on Zarrita. AW gave a presentation on custom Zarr storage classes. The attendees participated in a post-presentation discussion, providing their valuable insights.

Updates:

  • JM

Meeting Minutes:

  • DB: race condition might point to a need for a w+ mode
    • making array creation multi-processing friendly
    • w+ would mean “create a new array, but don’t worry about pre-existing stuff” (user’s responsibility)
    • could lead to confusing indexing behavior (no chunks in his case, since fault tolerant)
    • would like the same thing for re-writing an array, but if chunks exist they will only be interpretable with the same params
    • DH: fix code to catch the exception if an object doesn’t exist? (directory wiping routine)
      • DB: possibly. could mention on issue.
      • no cost to delete what’s not there
      • maybe don’t want zarr to touch the storage backend at all except for adding metadata
    • DH: if you can’t recognize something is old, there’s a problem
      • DB: can if we break spec semantics in the chunk name: prepend with a hash of the array metadata (self-describing: shape/chunks/compressor); see the chunk-key sketch below
      • DH: poor man’s versioning
      • JMS: do have “chunk-key encoding” option, so you could include that. codec could also prepend a header.
      • JMS: unclear how common the stale-chunk problem is
      • JMS: also w+ might not be the best option, since for fopen w+ means truncate
      • DB: don’t need to stick to 2 char modes
      • DB: separate access mode for metadata and chunks
      • DB: Ryan Abernathey mentioned
      • JMS: in tensorstore, you have an option to specify the metadata while opening (doesn’t read or write to disk)
      • JMS: with zarr-python you can have a separate metadata store (e.g. memory store); see the sketch below
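A minimal sketch of JMS’s zarr-python point, using the v2 `chunk_store` argument: array metadata lives in an in-memory store while chunks go to the real backend, so metadata operations never touch the chunk storage. Paths and shapes here are illustrative.

```python
# Minimal sketch (zarr-python v2): keep metadata in memory, chunks on disk.
import numpy as np
import zarr

meta_store = zarr.MemoryStore()                 # holds .zarray / .zattrs
chunk_store = zarr.DirectoryStore("data.zarr")  # holds only chunk objects

z = zarr.open_array(
    store=meta_store,         # metadata reads/writes stay in memory
    chunk_store=chunk_store,  # chunk reads/writes go to the backend
    mode="w",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
)
z[:100, :100] = np.ones((100, 100), dtype="f4")
```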
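And a hypothetical sketch of DB’s chunk-name idea from earlier in the thread: prefix every chunk key with a hash of the self-describing array metadata, so chunks left over from an array created with different params can never be confused for live data. None of these names are an existing Zarr API.

```python
# Hypothetical chunk-key scheme: prefix keys with a metadata hash.
import hashlib
import json

def metadata_hash(shape, chunks, dtype, compressor_id):
    # hash the self-describing params (shape/chunks/compressor)
    meta = json.dumps(
        {"shape": shape, "chunks": chunks, "dtype": dtype,
         "compressor": compressor_id},
        sort_keys=True,
    )
    return hashlib.sha256(meta.encode()).hexdigest()[:8]

def chunk_key(coords, shape, chunks, dtype, compressor_id):
    prefix = metadata_hash(shape, chunks, dtype, compressor_id)
    return prefix + "/" + ".".join(str(c) for c in coords)

# Stale chunks from an older array layout get a different prefix:
print(chunk_key((0, 3), (1000, 1000), (100, 100), "f4", "blosc"))
```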
  • NR: Feedback on zarrita sync/async API requested: https://github.com/scalableminds/zarrita/issues/4
    • Advertisement for feedback; thoughts welcome. (A sync/async sketch follows this item.)
    • Added tests, fixed bugs. Integrating into webknossos libraries to get field testing
    • Tensorstore implementation in progress? JMS: Yes. Working on sharding.
    • Would be good to test them against each other.
    • JMS: hopefully finalizing in the next week or so
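For context on the sync/async question, a toy sketch of one common pattern: an async core with a thin synchronous facade. Illustrative only; this is not zarrita’s actual API.

```python
# Toy sync-over-async pattern; illustrative, not zarrita's actual API.
import asyncio

class AsyncArray:
    async def getitem(self, selection):
        await asyncio.sleep(0)            # stand-in for awaited store reads
        return f"data for {selection!r}"

class Array:
    """Synchronous facade that drives the async core to completion."""
    def __init__(self):
        self._impl = AsyncArray()

    def __getitem__(self, selection):
        return asyncio.run(self._impl.getitem(selection))

print(Array()[0:10])                      # usable from plain sync code
```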
  • Alan Watson (Brain Image Library, University of Pittsburgh)
    • https://github.com/CBI-PITT/zarr_stores
    • https://tinyurl.com/fMOST-MouseID-182725-zarr
      • DB: the link is missing the multiscales info for showing the volumes overlaid (it moves the origin; everyone assumes the translation)
      • AW: thanks! (typically use precomputed format)
    • Presentation (link to follow)
      • Large number of files created (with only 20% size growth)
      • Been working on a sharding implementation for over a year. Looking for feedback.
      • CZI grant for bil_data_viewer
      • Thanks for neuroglancer!
      • 29TB example with 9M chunks but only 11000 shards
      • order of magnitude more in the next year: 4 PB uncompressed / 2 PB compressed
      • H5 Nested Store, e.g. with an H5 file per 3D volume (t x c number of shards); see the sketch below
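A hedged sketch of the container idea (illustrative only; AW’s real implementation is in the zarr_stores repo linked above): each shard is one HDF5 file, and each encoded Zarr chunk is stored as an opaque byte dataset inside it.

```python
# Hedged sketch: one HDF5 file per shard, one byte dataset per chunk.
import h5py
import numpy as np

def write_chunk(shard_path, chunk_key, chunk_bytes):
    with h5py.File(shard_path, "a") as f:
        if chunk_key in f:
            del f[chunk_key]              # overwrite-by-replace
        f.create_dataset(chunk_key, data=np.frombuffer(chunk_bytes, dtype="u1"))

def read_chunk(shard_path, chunk_key):
    with h5py.File(shard_path, "r") as f:
        return f[chunk_key][()].tobytes()

write_chunk("shard_0.h5", "0.0.0", b"\x00" * 16)
assert read_chunk("shard_0.h5", "0.0.0") == b"\x00" * 16
```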
    • Discussion:
      • writing? uses a dask distributed cluster for locking (see the locking sketch below)
      • JM: sidenote – as a store is a warning sign
      • JMS: benefit of H5 as sharded container over shard format? H5 was comfortable for colleagues. have also used zip files.
        • getting lots of nested directory stores … no storage overhead.
      • DB: think it’s cool. If you don’t do it, someone will. Some at Janelia are interested. To be a spec, you’d want it to be extensible / configurable with which dimension.
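A minimal sketch of the locking approach AW mentioned, using dask.distributed’s named Lock so concurrent writers to the same shard serialize cluster-wide; the file name and payload are illustrative, not AW’s actual code.

```python
# Minimal sketch: serialize shard writes with a dask.distributed Lock.
from dask.distributed import Client, Lock

def locked_append(path, payload):
    with Lock(path):                      # lock name == shard path
        with open(path, "ab") as f:
            f.write(payload)

if __name__ == "__main__":
    client = Client()                     # local cluster for illustration
    futures = [
        client.submit(locked_append, "shard_0.bin", b"chunk-bytes", pure=False)
        for _ in range(4)
    ]
    client.gather(futures)
```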
      • (Sanket joins)
      • JMS: in practice compressed chunk overwriting isn’t practical
      • JMS: does fit with v3 but urge you to not use HDF5
      • NR: this sounds like where we were with WK 10 years ago
      • AW: no need to use HDF5 but good point about the multi-thread/processes issue. did hair pulling with distributed locking (complicates the code)
      • AW: any concerns with doing this?
        • DB: if it’s your implementation and you expose it over HTTP, then don’t care. Happy with a web API
        • NR: easy to expose HDF as HTTP. good for backwards compatibility.
        • AW: exactly, dynamically allows changing chunks, etc.
        • AW: but am concerned with people downloading them
        • JM: concern is if they leak out and are considered “OME-Zarrs”
        • AW: been calling them “OME-Zarr-like”
        • AW: happy to discuss a way to refer to them
        • JMS: “HDF5-sharded” might help.
        • JM: still feel that representing it as an “HDF5-sharded codec” would help us be explicit about support.
        • DB: inevitable. Don’t tie it to a format. CF conventions say this is a data model.
          • NR/JM: for a separate meeting, but disagreement
        • JMS: codec would work in V3
      • Summarizing
        • NR: fresh look at ZEP2. If that doesn’t meet your needs, then open a new ZEP and go through the community process. Not sure if there are sufficient reasons to go outside of ZEP2.
        • JMS: no technical benefits to HDF5 (perhaps chunk re-use is implemented). The cost of the HDF5 libraries makes it not a great trade-off. Perhaps append a footer, and then it is the sharded format.
        • AW: stepping back a bit, if we did a ZEP, it wouldn’t be about using HDF5 as a container, but a structure for using container formats regardless of their type
        • AW: concern not in v2? correct.
        • JM: Dennis?
          • DH: would likely ignore it
          • JMS: dealing with HDF library would be tricky
          • DB: Martin Durant is not here so should say that’s what kerchunk does
          • DH: parallel reading of chunks may be coming in HDF5
          • AW: no issues with parallel read (write, yes!)
      • Back to the NGFF data model conversation
        • AW: codec as invalidating the spec? possibly.
        • JMS: choice of a new format for features.
        • DB: locking down codecs is un-Zarr-like :smile: JM: but perhaps not un-NGFF-like