2022-09-07

Attending: Sanket Verma (SV), Josh Moore (JM), Davis Bennett (DB), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS), Hailey Johnson (HJ), Brianna Pagán (BP), Isaac Virshup (IV)

TL;DR:

SV announced new ZEP bi-weekly meetings apart from regular bi-weekly community meetings. Now, the Zarr Community will be meetings two times every two weeks. New illustrations by Henning Falk, check them here. Google Summer of Code 2022 is finally coming to an end. After the updates, JM discussed having a hierarchy that builds a virtual n-dimensional array. Then, JMS started discussing one of the open issues in the zarr-specs repo; check the issue here.

Updates:

ZEP meetings will take place bi-weekly on Thursdays @ 21:30 IST/18:00 CEST/17:00 BST/12:00 EDT/9:00 PDT
- Instructions: https://zarr.dev/community-calls/
- More focused on the spec than these meetings
Check out our new illustrations here: https://github.com/zarr-developers/zarr-illustrations-falk-2022
- More ideas welcome!
- Sharding…
copy button for code snippets in Zarr documentation, check here: https://zarr–1124.org.readthedocs.build/en/1124/ @ Altay Sansal #PR1124
Approaching end of GSOC (12th of Sep)
- https://alt-shivam.github.io/Codecs-Registry
- Looking to participate in Outreachy https://outreachy.org/
- New potential users & developers

Open Agenda (add here 👇🏻):

JMS: plan for resolving v3 spec?
- SV: more on this tomorrow but some progress looking at open issues
- Upcoming work
- JM: proposal to have ZEP0001 moved to a “provisional ZEP” state (only blockers allowed)
- JMS: idea is no spec discussion at this meeting? SV: no, but we’ll communicate back and forth
SV: updates on Brianna/Hailiang’s ZEP? BP: Not yet. SV: also welcome to join tomorrow
IV: zarrR?
- https://github.com/fsspec/kerchunk/issues/212
- Josh:
JM: idea of having a hierarchy that builds a virtual n-dim array
- DB: adds brittleness. would say no.
- IV: kind of like kerchunk but with more indirection
- JMS: sometimes have use cases.
  - stacks of images that you want to view as an array, or multiple images acquired separately.
  - Do have stack driver in tensorstore (with specificed origins. No stored representation)
- DB: similar problem when acquired in HDF. wrote own layer.
- JMS: might should be a layer higher than zarr.
- DB: for bioimaging, if your app depends on this then you can only open HDF and Zarr and not other stuff.
  - doesn’t need to be compiled code. API problem.
NaN/inf/other special values in user-defined attributes: https://github.com/zarr-developers/zarr-specs/issues/141
- zarr-python supports by encoding in non-JSON-compliant way
- DB: nothing that can be stored as data shouldn’t be impossible to store as an attributes
- DH: was dealing with this recently. found “NaN” (quoted string) in existing datasets, expecting it to be treated as such. added support to nczarr (as well as unquoted versions)
- JM: will likely need a deprecation/warning/error cycle (royal pain)
- IV: keep JSON and use them as special values? nice that it is all just JSON.
- DH: nczarr (netcdf API) got ahead of this because typing is stored for attributes (“double”). possibility for v3?
- JMS: good point. perhaps decide the model for attributes in v3 (i.e. proposed change to v3 spec)
- JM: will need an upgrade path
- DB: haven’t seen untyped attributes, but just that JSON is missing values
- DH: so extend constants that are definable
- JM: BSON?
- IV: there are also things that can’t be encoded in zarr
- DH: one problem with extending JSON is that in C code that there are JSON parsers that would choke
  - JMS: zarr-python has essentially already done that
- Enumerating options:
  1. extend JSON parser (generally :-1:)
  2. support existing JSON-variants (BSON) (generally :-1:)
  3. encode objects in JSON a. {"attribute":... a. {"@type":...
  4. add type information somewhere else (like .nczarr)
- JM: (2) might be a metadata-driver like separate chunk-stores
  - DH: if it’s binary, then you need a good spec. and need to show equivalence between binary version and JSON.
  - JMS: you might be writing the non-JSON attribute late in the process, which would cause problems
  - DH: binary could help with speed since string level parsing is expensive
  - DB: always thought of the metadata as the stuff you want to read with editor and you don’t want peformance issues
  - DH: have had number of examples of NC-4 files that are enormous (10s of MB of metadata)
    - also abusing grouping for “namespaces” (even if not a good idea)
- IV: is this Zarr’s responsibility? cf. Pydantic which can turn your values into something else. (i.e. external schema)
  - DB: but Zarr is responsible for storing “fill_value”
    - IV: that’s .zarray rather than .zattrs
- DB: would assume that the .attrs property takes care of encoding/decoding
- JMS: would see saying “.zattrs supports JSON + these encodings”
- IV: do all the languages support this?
  - Javascript?
  - JMS: Array[UInt8Array]
- SV: an extension?
  - JMS: could fail on invalid JSON now and then add encoding/decoding later (since there’s already the issue with V2)
- IV: Arrow requires everything to be an arrow type (everything else is string with encoding)
  - DH: did that in netcdf-4
  - DB: sqlite is the same way
  - DH: include numpy with json type (from string)
    - IV: almost done with PR on awkward arrays (using this). depends on the JSONs
    - JMS: would make sense to standardize that (decide: pure JSON or extended JSON)
- IV: see https://github.com/scverse/anndata/pull/569
SV: heading to California next week for NumFOCUS & CZI summit (also NJ & NYC)