2023-03-09
Attending: Hailiang Zhang (HZ), Dieu My Nguyen (DMN), Josh Moore (JM), Ryan Abernathey (RA), Jeremy Maitin-Shepard (JMS)
TL;DR:
The meeting started with a discussion on Sharding as a transformer vs codec. Sharding is implemented as a transformer, and then we discussed the pros and cons of implementing Sharding as a codec. JMS had briefly discussed #219. After this, HZ gave a brief on ZEP0005 but couldn’t present his slides due to Zoom/screen sharing issues - tabled for the next community meeting.
Meeting Minutes:
- JMS: sharding as transform vs. codec
- https://github.com/zarr-developers/zarr-specs/issues/220
- RA: been playing with sharding, trying to get the most out of it
- Key question (impl or spec?): if sharding is a codec how does the outer layer which range it wants? (general problem in zarr-python for blosc)
- requires passing context to context
- transform is explicit; codec is less clear
- Key question (impl or spec?): if sharding is a codec how does the outer layer which range it wants? (general problem in zarr-python for blosc)
- JMS: in zarrita he has the codec take an indexing expression (optional?). defer some of that for ZEP2? arrays vs. bytes vs. additional concept of arrays.
- RA: similar to Martin’s request
- JMS: first codec is fine, but the next one is less clear. need to be more explicit about the interface.
- RA: need to solve this. what information needs to be passed in between (implementors and at spec level)
- RA: e.g. could be a codec that takes an HDF5 file (blosc2, etc.) missed a chance to build the right abstractions there.
- JMS:
codecs := array|bytestream in; array|bytestream out
- JM: recursive zarrs all the way down?
- JMS: concatenation of other arrays
- RA: Norman’s justification. JMS’ proposal. re: how to integrate other things like referncing between arrays, shards defining own chunking, etc. (doesn’t change anything in ZEP1)
- JMS: transforms as bytes, and codecs can access arrays
- JMS: NB: MD wants low level store to be aware of array indexing
- JM: always thought of codecs as the lowest thing that is unaware of arrasy
- JMS: combined compression with filters (which can operate on arrays, transpose)
- RA: sharding fundamentally breaks core abstraction between store / codec. at the impl. level, want an efficient/fast code to fetch chunks of shard, make smart decisions, close to the metal. but the naive thing isn’t fast. do the core abstractions break down. no longer using key/value store API. using offsets into storage.
- JMS: don’t see byte range as breaking. addition to the interface.
- RA: not just a file format, but a protocol for addressing chunks.
- JMS: dimension names metadata
- https://github.com/zarr-developers/zarr-specs/issues/219
- RA: would be for keeping or making it easy to not use
- HZ/DMN: ZEP5 presentation (recorded)
- https://zarr.dev/zeps/draft/ZEP0005.html
- HZ: Tabling because of Zoom issues.
- RA: re: expectations – very limited due to the numbers of people working on the spec. (it’s taken years) so … 6 months?
- HZ: this is an extension, doesn’t blocking anything.