27th July, 2022

Attending: Josh Moore (JM), Brianna Pagán (BP), Ryan Abernathey (RA), Norma Rzepka (NR), Jeremy Maitin-Shepard (JMS), Greg Lee (GL), Trevor Manz (TM), Ward Fisher (WF), Davis Bennett (DB), Matt McCormick (MM), John Kirkham (JK), Parth Tripathi (PT)

Agenda:

  • ZEP0001
    • RA: ongoing review process :tada:
    • JMS: long list of open issues; perhaps we should just go through them
    • NR: higher-level – status of the extensions? going in? ZEP0002, 3, 4…
      • RA: sets the groundwork for extensions; likes the idea of keeping them narrow in scope
      • NR: worried about a ZEP a month and the happiness of the ZIC. Perhaps batching them?
      • JMS: review sharding as part of ZEP0001 since it was the motivation for many to get involved; it's the main benefit of V3
      • MM: on sharding, would like to look towards the future (i.e. not necessarily finalization) to get it adopted across the implementations
      • JM: not a lot of movement (speaking for others); the implementation definitely needs work.
      • NR: Jonathan is on stand-by waiting for a decision. Could be ZEP0002. (He has a conflict at this time.)
        • MM: Great comments https://github.com/thewtex/shardedstore/issues/17
        • Using sharding in a general way for simplicity, incl. with different stores.
        • Looking to go through this in practice for large scale data.
          • See working prototype. Pretty efficient. Works with v2 as well out of the box
        • DB: understands sharding as introducing an abstraction between the array and the store. Will that generalize to all non-sharded stores? (No-op shard?)
        • NR: yes. The store shouldn't need to know about the storage transformers; partial reads are helpful but not required. (See the sketch below.)
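
A minimal sketch of the storage-transformer idea discussed above: a layer with the same key/value interface as a store, sitting between the array and the underlying store. The no-op version just delegates; a sharding transformer would instead translate keys and slice shard objects. The class and method names here are hypothetical, not the ZEP0001 interface.

```python
from collections.abc import MutableMapping

class NoOpStorageTransformer(MutableMapping):
    """Pass-through transformer: behaves exactly like the wrapped store."""

    def __init__(self, inner_store: MutableMapping):
        self.inner = inner_store  # any key/value Zarr store (dict, fsspec mapper, ...)

    def __getitem__(self, key):
        # a sharding transformer would map `key` to a shard object + offset here
        return self.inner[key]

    def __setitem__(self, key, value):
        self.inner[key] = value

    def __delitem__(self, key):
        del self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)
```
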
      • NR: at the specification level (i.e. not just zarr-python) we need to know what it will look like on disk.
      • MM: could see trying to get ZEP0001 out. (Proposal?)
        • but also: yes you can shard arrays, but what about groups (as additional need for the spec)
        • useful for content-addressable/verifiable storage
        • unrelated to all of the hierarchical formats
        • separate shards per scale along with the related metadata (same for xarray)
      • JMS: in the interest of getting ZEP0001 out, perhaps we hold off on sharding and add it later as a delta to the spec.
    • tl;dr
      • ZEP0001: focus on getting current work done but include storage transformer (:+1:)
      • ZEP0002: Jonathan to start ~next week (making necessary adjustments to ZEP0001)
        • MM to comment on PR or open alternative proposal
      • then, in that same batch or as ZEP0003, a definition of extensions
    • RA: process – inventing it as we go. Bootstrapping so there will likely be a lot of feedback on how things work, but try to use that structure for the moment.
    • JMS: on the outstanding issues
      • fill values: consensus that it must be specified?
        • JM: replaces DB’s smart fill value logic?
        • DB: clients can have a mapping or a callable, but it wasn't easy to make it work with the semantics (in zarr-python)
        • DB: easier if we make it required. gets past fundamental ambiguity
        • JM: the upgrade scripts will need to be aware of this too (EOSS4)
        • DB: 0 as sane default for uint8? etc. etc.
        • consensus: require that the fill value be specified (see the example below)
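
For reference, this is what "always specify a fill value" looks like with today's zarr-python (v2) API; the only point illustrated is that the creator states fill_value explicitly instead of relying on an implementation-specific default.

```python
import zarr

# fill_value is stated explicitly at creation time
z = zarr.create(shape=(100, 100), chunks=(10, 10), dtype="uint8", fill_value=0)

# chunks that were never written read back as the fill value
assert int(z[0, 0]) == 0
```
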
      • case sensitivity
        • v3 says “case sensitive”, reasonable except on e.g. OSX.
        • Add a note?
        • Alternative of escaping? (add-on)
        • WF: file-system rather than OS bound (despite tight correlation per platform). netCDF doesn't take on the technical debt of working around file-system limitations. Ergo: a user consideration, not something that can be fixed technically.
      • path structure
        • see previous meetings
        • Options
          • (A) removing “meta/” as nicer paths to metadata files
            • Con: doesn’t work for the consolidated metadata path
              • JM: Workaround with “symlinks”
          • (B) require suffix on the root (“.zarr”)
          • (C) syntax for combining path and key: path//key, path#key, etc. (see the sketch after this block)
        • JM: recently ran into a need/use-case for something like (A): import into OMERO, easier to work with the metadata as the main hierarchy.
        • RA: good to think about having kerchunk style references encodable in Zarr
          • discussions at SciPy
          • the vibe was "why a new format?"
          • needed the ability to have references to blocks in other files
        • MM: “composite store” (like a sharded store, could also add in kerchunking possibly)
          • adds in a layer of indirection, doing indexing; tells you what's present.
          • would need to be more well-supported than consolidated metadata.
          • JM: how does it differ? MM: more flexibility in how it is broken up
          • MM: with a very large dataset, if you're doing analysis on one part of it, that part can be updated independently.
          • JM: similar to Dennis' consolidate-per-hierarchy-level idea. Yeah, like an octree.
        • NR: but doesn’t solve path
        • JMS: if consolidated is a concern, then (A) won’t work
          • JM: plus symlink should work.
          • JMS: proposing to drop “meta”.
          • RA: currently fsspec handles it. could be a formal URL scheme.
          • JMS: some issues with URLs if you’re opening RW
          • RA: could see only RO to begin with.
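
Rough sketch of option (C) above: a combined store-path + in-hierarchy-key syntax. The "#" separator is just one of the candidates mentioned (path//key, path#key, ...), and the helper below is illustrative only; nothing here is spec'd.

```python
def split_store_url(url: str, sep: str = "#") -> tuple[str, str]:
    """Split e.g. 's3://bucket/data.zarr#group/array' into store path and key."""
    store_path, _, key = url.partition(sep)
    return store_path, key

assert split_store_url("s3://bucket/data.zarr#group/array") == (
    "s3://bucket/data.zarr",
    "group/array",
)
```
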
      • root directory
        • reasoning is having a non-empty name
        • JMS: just have “.zarr” as the name? best to skip that. (hidden files were an issue for V2)
        • RA: talked about having the root document be a json file.
        • JMS: special name?
      • difference between .zgroup and .zarray
        • leads to a potential race condition
        • DB: used to attributes.json from n5 and have never had an issue with it
        • NR: easier if it's one name, so you don't need to do two lookups (see the illustration below)
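
Illustration of the "two lookups" point, using v2-style key names against a plain key/value store: with separate .zarray/.zgroup documents a reader may need two probes (and can race with a writer changing the node type); a single well-known metadata name needs one request.

```python
def node_type_v2(store: dict, path: str):
    """Two probes: is it an array, else is it a group?"""
    if f"{path}/.zarray" in store:   # request 1
        return "array"
    if f"{path}/.zgroup" in store:   # request 2
        return "group"
    return None

store = {"foo/.zgroup": b"{}", "foo/bar/.zarray": b"{...}"}
assert node_type_v2(store, "foo") == "group"
assert node_type_v2(store, "foo/bar") == "array"
```
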
      • endianness
        • JMS: the spec currently includes types like <i2, etc.; more logical to say data types are logical (e.g. 16-bit signed integer)
        • JMS: make it a codec issue to deal with endianness since it only matters for raw encoding
        • NR: need to specify it somewhere, even if just in the codec. blosc would need to know (anything byte based)
        • JMS: filter rather than a data type
        • NR: downside of having it in the datatype? codec could ignore it. (JMS: happens at the moment)
        • JMS: numpy’s endianness is a bit unusual. often you want to just use it in the native endianness.
        • (Lots of nodding from Trevor)
        • JMS: main benefit is to always give the user native endianness (sketched below)
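
Sketch of the "endianness as a codec, logical data types" idea above, in plain numpy rather than a real numcodecs codec: the array's logical type is int16, an assumed encoding step fixes the on-disk byte order, and decode always hands the user native-endian data.

```python
import numpy as np

def encode_big_endian(arr: np.ndarray) -> bytes:
    # the byte order lives in the codec step, not in the logical dtype
    return arr.astype(arr.dtype.newbyteorder(">")).tobytes()

def decode_to_native(buf: bytes, dtype: str = "int16") -> np.ndarray:
    big = np.frombuffer(buf, dtype=np.dtype(dtype).newbyteorder(">"))
    return big.astype(dtype)  # user always sees native endianness

data = np.arange(4, dtype="int16")
assert np.array_equal(decode_to_native(encode_big_endian(data)), data)
```
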
      • boolean/complex
        • people were happy to have them (yes?)
        • MM: boolean as 1 bit or 1 byte? JK: one byte, no bit-packing. (That could be a codec; see the example below.)
          • vector<bool> as an example of over-optimization
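
On the "bit-packing could be a codec" remark: numcodecs already ships a PackBits codec that stores booleans one bit per element while the in-memory dtype stays a one-byte bool, for example:

```python
import numpy as np
from numcodecs import PackBits

codec = PackBits()
mask = np.random.rand(100) > 0.5   # bool array, 100 bytes in memory
packed = codec.encode(mask)        # ~14 bytes on disk (1 header byte + 13 packed bytes)
assert np.array_equal(codec.decode(packed), mask)
```
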
  • rawcodec (DB)
    • never need to say “None”
    • “raw”? intuitive?
    • "identity", "noop", "dummy", "pass-through" (see the sketch at the end of this section)
    • JMS: similar to endianness. combining 2 things in codec. codec gets an array and not a stream of bytes. could arguably be split
    • DB: separate configuration for each?
    • JK: similar to filter vs. codec, not well spelled out in the spec. See Categorize for an example.
    • JM: would make the choice to avoid compression explicit (e.g. for images)
      • JK: there's already a meteorological compressor…
    • JMS: a linear chain of filters with a codec has issues
      • current way to do it would be to encode a byte stream and use a compressor
      • perhaps want separate compressors for different parts
      • could the filter itself have additional filters/compressors for the labelled data vs. the indices
      • JK: use cases? JMS: variable length strings, multisets, downsampling segmentations (similar to large number of categories)
      • JMS: should be easy to fit it in now. have a tree and the filter becomes the codec.
      • DB: filter vs codec? why a tree rather than an array?
      • JK: original use case of a filter is categorize, e.g. (R, G, B) -> (0, 1, 2)
        • filter as a transformation (on ndarray)
        • DB: different type signatures?
        • JMS: effectively not different in V2
        • JK: mostly a terminology thing
        • DB: “pipeline”
        • TM: is “raw” just an empty list?
      • JK: look at how parquet does it? (Ask Martin perhaps)
      • TM: one pipeline with inputs/outputs for each codec; then you could encode numpy/bytes as desired and confirm that it's valid
      • JMS: one codec location to an array? (nodding)
        • JK: do we have chained codec use cases
        • DB: someone at Janelia was working on that for segmentation of volumes
          • similar to categorical
          • see related paper. “gzip on top”
        • JMS: similar to something in neuroglancer
        • TM: bitshuffle/gzip for kerchunking? (to read HDF5 file)
        • DB: semantics come from HDF5
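
A hedged sketch of what the proposed "raw"/identity codec could look like as a numcodecs-style codec: encode and decode are pass-throughs, so "no compression" becomes an explicit entry in the pipeline rather than None. The codec_id "raw" is just one of the candidate names above; this is not a registered codec in numcodecs today.

```python
from numcodecs.abc import Codec
from numcodecs.compat import ensure_contiguous_ndarray, ndarray_copy
from numcodecs.registry import register_codec

class RawCodec(Codec):
    """Identity codec: bytes in, the same bytes out."""

    codec_id = "raw"  # candidate name; "identity"/"noop"/"pass-through" were also floated

    def encode(self, buf):
        return ensure_contiguous_ndarray(buf)

    def decode(self, buf, out=None):
        buf = ensure_contiguous_ndarray(buf)
        if out is None:
            return buf
        return ndarray_copy(buf, out)

register_codec(RawCodec)  # makes {"id": "raw"} resolvable via numcodecs.get_codec
```
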