2023-06-14
Attending: Davis Bennett (DB), Josh Moore (JM), Alan Watson (AW), Dennis Heimbigner (DH), Iana (IA), Jeremy Maitin-Shepard (JMS), Norman Rzepka (NR), Sanket Verma (SV)
TL;DR:
JM started the meeting by sharing some important updates. DB initiated a discussion on a race condition in array creation. NR is looking for feedback on Zarrita. AW gave a presentation on custom Zarr storage classes. The attendees participated in a discussion post-presentation, providing their valuable insights.
Updates:
- JM
- Zarr-Python 2.15.0 release after two alphas - 2.15.0a1 and 2.15.0a2
- jzarr migration and upcoming 0.4.0
Meeting Minutes:
- DB: race condition might point to the need for a “w+” mode, making array creation multi-processing friendly
- “w+” would mean “create a new array, but don’t worry about pre-existing stuff” (user’s responsibility)
- could lead to confusing indexing behavior (no chunks in his case, since the workflow is fault tolerant)
- would like the same thing for re-writing an array, but if chunks exist they will only be interpretable with the same params
- DH: fix code to catch the exception if an object doesn’t exist? (directory wiping routine)
- DB: possibly. could mention on issue.
- no cost to delete what’s not there
- maybe don’t want zarr to touch the storage backend at all except for adding metadata
- DH: if you can’t recognize something is old, there’s a problem
- DB: can, if we break the spec: put semantics in the chunk name, e.g. prepend a hash of the array metadata (self-describing: shape/chunks/compressor)
- DH: poor man’s versioning
- JMS: do have “chunk-key encoding” option, so you could include that. codec could also prepend a header.
- JMS: unclear how common the stale-chunk problem is
- JMS: also “w+” might not be the best name, since for fopen “w+” means truncate
- DB: don’t need to stick to 2 char modes
- DB: separate access mode for metadata and chunks
- DB: Ryan Abernathey mentioned
- JMS: in tensorstore, you have an option to specify the metadata while opening (doesn’t read or write to disk)
- JMS: with zarr-python you can have a separate metadata store (e.g. memory store); a minimal sketch follows below
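- A minimal sketch of the separate-metadata-store idea in zarr-python (v2): metadata goes to an in-memory store while chunks go to the real backend via the `chunk_store` argument. The path, shape, and chunking below are invented for illustration.

```python
import zarr
from zarr.storage import DirectoryStore, MemoryStore

meta_store = MemoryStore()            # holds .zarray/.zattrs only
chunk_store = DirectoryStore("data")  # holds the chunk objects

# Creating the array only writes metadata (to meta_store); chunk writes go
# to chunk_store, so the chunk backend is untouched until data is written.
z = zarr.open_array(
    store=meta_store,
    chunk_store=chunk_store,
    mode="w",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
)
z[:100, :100] = 1.0  # this chunk lands in ./data; metadata stays in memory
```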
- NR: Feedback on zarrita sync/async API requested (a generic sketch of the pattern follows after this item): https://github.com/scalableminds/zarrita/issues/4
- Advertisement for feedback. Thoughts welcome.
- Added tests, fixed bugs. Integrating into webknossos libraries to get field testing
- Tensorstore implementation in progress? JMS: Yes. Working on sharding.
- Would be good to test them against each other.
- JMS: hopefully finalizing in the next week or so
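- For context on the sync/async question, this is a generic sketch of an “async core with a thin sync wrapper” pattern, not zarrita’s actual API; the class and method names are invented (see the linked issue for the real proposal).

```python
import asyncio


class AsyncArray:
    """Hypothetical async core: chunk I/O would be awaited here."""

    async def get(self, selection):
        await asyncio.sleep(0)  # placeholder for an async chunk fetch + decode
        return selection


class Array:
    """Hypothetical synchronous facade that drives the async core."""

    def __init__(self):
        self._async = AsyncArray()

    def get(self, selection):
        return asyncio.run(self._async.get(selection))


# Sync callers use Array; async frameworks could use AsyncArray directly.
print(Array().get((slice(0, 10),)))
```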
- Alan Watson (Brain Image Library, University of Pittsburgh)
- https://github.com/CBI-PITT/zarr_stores
- https://tinyurl.com/fMOST-MouseID-182725-zarr
- DB: the link is missing multiscales info for showing the volumes overlaid (it moves the origin; everyone assumes the translation)
- AW: thanks! (typically use precomputed format)
- Presentation (link to follow)
- Large number of files created (with only 20% size growth)
- Been working on a sharding implementation for over a year. Looking for feedback.
- CZI grant for bil_data_viewer
- Thanks for neuroglancer!
- 29TB example with 9M chunks but only 11000 shards
- expecting an order of magnitude more in the next year: 4 PB uncompressed / 2 PB compressed
- H5 Nested Store, e.g. with one H5 file per 3D volume (t x c number of shards); a simplified store sketch follows below
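- To make the custom-store idea concrete, here is a much-simplified sketch (not the CBI-PITT implementation; the class name and single-file layout are invented): a MutableMapping that stores every Zarr v2 key, chunks and metadata alike, as a uint8 dataset inside one HDF5 file. The real H5 Nested Store spreads chunks across many HDF5 files, e.g. one per 3D volume.

```python
from collections.abc import MutableMapping

import h5py
import numpy as np


class H5ChunkStore(MutableMapping):
    """Toy Zarr v2 store backed by a single HDF5 file."""

    def __init__(self, h5_path):
        self.h5_path = h5_path

    def __setitem__(self, key, value):
        with h5py.File(self.h5_path, "a") as f:
            if key in f:
                del f[key]
            # store the (already compressed) chunk/metadata bytes verbatim
            f.create_dataset(key, data=np.frombuffer(value, dtype="uint8"))

    def __getitem__(self, key):
        with h5py.File(self.h5_path, "a") as f:
            if key not in f:
                raise KeyError(key)
            return f[key][:].tobytes()

    def __delitem__(self, key):
        with h5py.File(self.h5_path, "a") as f:
            del f[key]

    def __iter__(self):
        keys = []
        with h5py.File(self.h5_path, "a") as f:
            # collect only dataset names (store keys), not group names
            f.visititems(
                lambda name, obj: keys.append(name)
                if isinstance(obj, h5py.Dataset)
                else None
            )
        return iter(keys)

    def __len__(self):
        return sum(1 for _ in self)
```

- Any MutableMapping works as a zarr v2 store, so something like `zarr.open(store=H5ChunkStore("volume.h5"), mode="w", shape=(64, 64), chunks=(16, 16), dtype="u1")` would route both metadata and chunks through the HDF5 file.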
- Discussion:
- writing? using dask distributed cluster for locking
- JM: sidenote – as a store is a warning sign
- JMS: benefit of H5 as a sharded container over the shard format? H5 was comfortable for colleagues; have also used zip files.
- getting lots of nested directory stores … no storage overhead.
- DB: think it’s cool. If you don’t do it, someone will; some at Janelia are interested. To be a spec, you’d want it to be extensible / configurable with respect to which dimension.
- (Sanket joins)
- JMS: in practice compressed chunk overwriting isn’t practical
- JMS: does fit with v3 but urge you to not use HDF5
- NR: this sounds like where we were with WK 10 years ago
- AW: no need to use HDF5 but good point about the multi-thread/processes issue. did hair pulling with distributed locking (complicates the code)
- AW: any concerns with doing this? Is it possible?
- DB: if it’s your implementation and you expose it over HTTP, then I don’t care; happy with a web API
- NR: easy to expose HDF5 over HTTP. Good for backwards compatibility.
- AW: exactly; dynamically allows changing chunks, etc.
- AW: but am concerned with people downloading them
- JM: concern is if they leak out and are considered “OME-Zarrs”
- AW: been calling them “OME-Zarr-like”
- AW: happy to discuss a way to refer to them
- JMS: “HDF5-sharded” might help.
- JM: still feel that representing as “HDF5-sharded codec” would help us to then be explicit about support.
- DB: inevitable. Don’t tie it to a format; CF conventions say this is a data model.
- NR/JM: for a separate meeting, but there is disagreement
- JMS: codec would work in V3
- Summarizing
- NR: fresh look at ZEP2. If that doesn’t meet your needs, then open a new ZEP and go through the community process. Not sure if there are sufficient reasons to go outside of ZEP2.
- JMS: no technical benefits to HDF5 (perhaps chunk re-use is implemented). The cost of the HDF5 libraries makes it not a great trade-off. Perhaps append a footer, then it is the sharded format.
- AW: stepping back a bit, if we did a ZEP, then it wouldn’t be about using HDF5 as a container, but a structure for using container formats regardless of their type
- AW: concern not in v2? correct.
- JM: Dennis?
- DH: would likely ignore it
- JMS: dealing with HDF library would be tricky
- DB: Martin Durant is not here, so someone should say that’s what kerchunk does
- DH: parallel reading of chunks may be coming in HDF5
- AW: no issues with parallel read (write, yes!)
- Back to the NGFF data model conversation
- AW: codec as invalidating the spec? Possibly.
- JMS: choice of a new format for features.
- DB: locking down codecs is un-Zarr-like :smile: JM: but perhaps not un-NGFF-like