17th June, 2020

Attending: Matthias Bussonnier, Dennis, Lauren P, Alistair Miles, Josh Moore, John Kirkham, Ryan Williams, Matt McCormick, Ward Fisher, Martin Durant, Andrew Fulton

  • Standing items

    • Introductions: newcomers should feel welcome to introduce themselves and their interest in zarr/n5.

    • Lauren: Unidata intern. NetCDF Fortran/Zarr/Matlab. Atmospheric science background. Upcoming PhD in Ocean Science at Duke.

  • As time permits

    • Josh: Conventions/examples for storing GeoJSON within .zattrs? (See the sketch at the end of this discussion.)

      • In the v3 spec so far, .zattrs will be part of .zgroup/.zarray.

        • [*Should we stick with that, or consider changing*](https://github.com/zarr-developers/zarr-specs/issues/72)?

      • GeoJSON is a spec for defining shapes. Could GeoJSON be embedded into a Zarr file? Has anyone tried that?

      • E.g., if there are shapes of interest within an image

      • Matthias: note the spec issue for joining zattrs and zgroup. Probably need to nudge people for strong views.

      • Josh: if something like geojson is used, then the attributes could become large.

      • Alistair: one benefit of a single file is having a single request on group/array creation. (Downside is one has to pull everything down.)

      • https://github.com/zarr-developers/zarr-specs/issues/72 should be raised at an upcoming spec meeting.

      • What’s the worst case? A movie with many cells in each frame, and a shape drawn around each of those.

      • Matthias: does the protocol allow storing json in files other than .zattrs?

      • Josh: Not yet, but Dennis has suggested it.

      • Alistair: the protocol only restricts certain (object-storage) keys; others can be written.

      • Matthias: but the others aren’t reachable.

      • Josh: there is a non-listing use case, i.e. it’s good to have all object keys “registered”.

      • Alistair: has suggested that others use the .zattrs mechanism rather than new keys. (Not explicitly documented.)

      • John: graduating .zgroup and .zattrs to directories with many json blobs.

      • Dennis: won’t help with the non-listing use case.

      • Matthias: unless you know the keys

      • Josh: could imagine registering new extensions (.geojson) in the .zgroup metadata for discoverability

      • Matthias: Locking?

      • Dennis: if you can’t search, then .zgroup needs to list the subgroups (also subject to the locking issue).

      • Alistair: situations where the store isn’t listable could be solved by some form of consolidation. The primary thing we want to support is creating arrays in parallel.

      • Alistair: it’s a user-driven process of choosing to use consolidated metadata (if it even exists… Matthias: and is it up to date?!)

      • Consolidated metadata in an extension.

      • Alistair: one might build a hierarchy over days.

      • Matthias: cf. a PNG that isn’t valid until it’s fully written.

      • Dennis: thinking we need at least a coarse level of file locking.

      • Alistair: typically one creates the hierarchy in parallel and, when done, consolidates.

      • Dennis/Josh: period where consolidated metadata could be invalid in the file system store

      • Martin: for S3 everyone will see things at a different time.

        • Potentially could require versions (all the same, etc.) at a cost.

      • John: consolidated metadata came from wanting multiple keys in one request. If we’re using parallel get then perhaps it’s no longer relevant.

      • Matthias: the other problem is listing stores. Should we disallow non-listing stores? That is an implementation detail of the store. Then .zattrs could be made a directory to get parallel writes.

      • Alistair: multiple intersecting use cases:

        • What to do if not listable?

        • The xarray use case of retrieving all metadata for all arrays in a group upfront (the original motivation for consolidation; latency was the issue)

        • Eventual-consistency compatibility

        • Use case: supporting concurrent array creation

        • Use case: supporting concurrent writes into different regions of the same array
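
      • A minimal sketch of two of the ideas above, assuming zarr-python 2.x: GeoJSON stored as an ordinary attribute (not an agreed convention), plus consolidated metadata so a non-listable store needs only one read to discover it. The store, array name, and shapes are made up for illustration:

```python
import zarr

# Any store works here; an in-memory store keeps the sketch self-contained.
store = zarr.storage.MemoryStore()
root = zarr.group(store=store)
img = root.zeros("image", shape=(512, 512), chunks=(128, 128), dtype="u2")

# Hypothetical convention: a GeoJSON FeatureCollection describing shapes of
# interest in `image`, embedded directly in the group's .zattrs.
root.attrs["geojson"] = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [0, 10], [10, 10], [10, 0], [0, 0]]],
            },
            "properties": {"label": "cell-1"},
        }
    ],
}

# Consolidate .zgroup/.zarray/.zattrs into a single ".zmetadata" key so that
# readers on non-listable stores can fetch all metadata in one request.
zarr.consolidate_metadata(store)
root2 = zarr.open_consolidated(store)
print(root2.attrs["geojson"]["features"][0]["properties"])
```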

    • Josh: Anyone know (and trust) a generic zarr ←→ HDF5 converter?
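
      • One path that does exist (not a dedicated converter): zarr-python's copy utilities can move a hierarchy between h5py and Zarr groups in either direction. A rough sketch with made-up file names; attribute and dtype edge cases would still need checking:

```python
import h5py
import zarr

# HDF5 -> Zarr: copy every group/dataset (and their attributes) across.
with h5py.File("data.h5", mode="r") as src:        # hypothetical input file
    dest = zarr.open_group("data.zarr", mode="w")
    zarr.copy_all(src, dest)

# Zarr -> HDF5: the same call with the arguments reversed.
with h5py.File("roundtrip.h5", mode="w") as dest:
    src = zarr.open_group("data.zarr", mode="r")
    zarr.copy_all(src, dest)
```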

    • Josh: Would there be any support for unifying the API terminology for arrays? cf. create_dataset …

      • Matthias note: there will be an effort in Open Team to have a “PyData API” workgroup. Maybe that fits in there.

      • Alistair: the API is modelled on NumPy, but h5py compatibility was also wanted. A good thing to try to unify (across PyData).
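
      • For reference, the current mixed terminology in zarr-python: the h5py-compatible spelling and the NumPy-style spelling create equivalent arrays (a small illustration; the names are made up):

```python
import zarr

root = zarr.group()

# h5py-compatible spelling
a = root.create_dataset("a", shape=(100,), chunks=(10,), dtype="f8")

# NumPy-style spelling
b = root.zeros("b", shape=(100,), chunks=(10,), dtype="f8")

print(type(a) is type(b))  # both are zarr Array objects
```
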
    • Matthias: has anyone seen TileDB (https://tiledb.com/)?

      • Alistair: it solves a similar problem to Zarr. E.g. two writers writing to different regions of an array could come into contention, possibly leading to data loss. Zarr-python has two ways of working around this: (1) organize writes so that they are aligned with chunk boundaries, or (2) use a chunk-level locking mechanism. TileDB solves that in a different way, without locking: the files that get stored don’t correspond to chunks but to the (time-stamped) write operations, and an index keeps track of them. You can reconstruct any point in time and don’t need to worry about chunk boundaries. There is also a background process to re-organize files. But less hackability.

      • Some discussion with TileDB authors in this issue: https://github.com/zarr-developers/zarr-python/issues/515

      • Alistair: Google recently released tensorstore, which on GCS has another feature: locking with eventual consistency. Each object has a generation number, incremented on each update. “Only write if my write is the next write in the sequence”, i.e. optimistic locking. Matthias: and neighboring chunks? Read all, then write all conditionally… Martin: but that could get inconsistent… i.e. you need transactions. Alistair: it would be interesting to engineer that into the spec so that you think about writes as retriable events. (Note: conditional writes are separate from versioning.) Martin: potentially batch operations.

      • John: recently saw something similar with timestamped directories for the chunks. Should be able to perform unique writes. Bake it into the spec. [Listable as a special edge case….]
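
      • A rough sketch of the generation-number idea, against a hypothetical store interface (`read_with_generation` and `write_if_generation` are assumptions, not part of any existing Zarr store API):

```python
def update_chunk(store, key, modify, max_retries=10):
    """Optimistic-locking update: read a chunk, apply `modify`, and write back
    only if nobody else wrote in between; otherwise re-read and retry."""
    for _ in range(max_retries):
        data, generation = store.read_with_generation(key)  # value + its generation number
        new_data = modify(data)
        # Conditional write: succeeds only if `key` is still at `generation`,
        # i.e. our write is "the next write in the sequence".
        if store.write_if_generation(key, new_data, expected=generation):
            return True
    return False  # gave up after repeated contention
```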

    • Matthias: Re: partial reads.

      • Progressive loading (e.g. of an image): the low-frequency FFT components get loaded first. Related to decompressors with partial decompression. Ask Blosc for a new API to make use of a store that can seek.

      • Alistair: that’s one path, a multi-step “which bytes? Here are the bytes” exchange. Seems complex; it impacts the codec and store APIs.

        • Trying to speed things up, but without loading too much data

        • The other path is storing things differently, with different bits as separate objects.

      • Cf. https://cloudinary.com/blog/progressive_jpegs_and_green_martians

      • Matt: differences between the ML and image use cases

      • Alistair: shuffle operation is big
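
      • A minimal sketch of the first path: a store that can serve byte ranges, so a codec with partial decompression support could request only the blocks it needs (`get_partial` is an assumed method name, not part of the current store API):

```python
import os

class RangeReadStore:
    """Toy file-based store that supports byte-range ("seek") reads."""

    def __init__(self, root):
        self.root = root

    def get_partial(self, key, start, length):
        # Return `length` bytes of the object at `key`, starting at offset `start`.
        with open(os.path.join(self.root, key), "rb") as f:
            f.seek(start)
            return f.read(length)
```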

    • Matthias: Python’s json module does not always respect the JSON “standard”.

      • Especially around NaN and Inf. Need to be careful for interoperability with other languages.
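
      • A concrete illustration of the pitfall: the standard-library json module happily emits NaN/Infinity tokens, which strict parsers in other languages reject; passing `allow_nan=False` surfaces the problem early:

```python
import json

# Python emits a bare NaN token, which is not valid JSON.
print(json.dumps({"fill_value": float("nan")}))   # {"fill_value": NaN}

# Asking for strict output raises instead of producing invalid JSON.
try:
    json.dumps({"fill_value": float("nan")}, allow_nan=False)
except ValueError as exc:
    print("strict encoding failed:", exc)
```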