22nd April, 2020

Attending: RyanW, JoshM, JohnK, WardF, DennisH, Joseph Sheedy, JamesB, StephanS, RyanA, Ralf Gommers, Jonathan Leftman

Apologies:

  • Standing items

    • Introductions: new joinees should feel welcome to introduce

      themselves and their interest in zarr/n5.

    • Joseph Sheedy (Saildrone): autonomous boats for ocean data.

      Weather forecasting team. Looking to replace backend with zarr

    • Jonathan Lefman (Nvidia): Proprietary formats & instruments

      vendors

      • Digital pathology → DICOM, Bio-Formats users. Need to keep metadata.
      • Will send an invitation. Get him on the calendar invite.
  • Dennis/NetCDF update - 20:10 UK

    • See:

      https://github.com/Unidata/netcdf-c/tree/nczarr-merge.wif

    • Committed to supporting zarr (or variant) as a backend.

    • Running about 1.5 months behind, but have experimental working

      implementation in netcdf-c. Pretty complete impl. of v2. Want to shoehorn all of CDM.

    • URL (http/s, s3, file) includes #attributes to specify zarr &

      storage format:

      • S3: tested on non-AWS S3 at UCAR using AWS C++ SDK. Prefers Path format..
      • Interoperability attempt: file directory-based. Every object is a dir. Zipped it should be consistent with ZipStore
      • Netcdf4 format used internally for testing. Single-file storage that’s readable and writable.
    • Modes

      • nczarr : used C backend
      • zarr : code reads and writes pure v2 format (simulates some things)
      • s3 :

Image

    • Issues:

      • Strings
      • User-defined types (enums, vlen, etc)
      • Unlimited dimensions (multiples allowed)
      • Filters
    • Q: RyanA- https scheme? DMH: then need to pass arguments to

      choose e.g. s3. As opposed to using s3:// which makes defaults. Ryan: in python, just serve a zarr directly. DMH: to be explored. What protocol? Just remote file access. Don’t have it but should be straight-forward. RyanA: quite useful since it allows anyone to become a data provider. (But not yet spelled out in the spec) Consolidated metadata is important since you don’t have to do a list on the store.

      • DH: tried to avoid S3 prefix search. One of the ways that this isn’t zarr is by injecting new objects starting with “.nc”. Contain information needed for a faithful translation, but each has a list of what is contained (arrays, dimensions, groups). Pointers to what’s contained. Midway point between consolidated and directory listings. Important for performance. Didn’t want to read all the metadata all in one shot. More JIT.
      • Stephan: how large can the consolidated metadata before it gets slow? Fast if in the cloud at 100MBs.
      • Joseph: had perf. Issues with GB json files.
    • Q: John / Is this the service you are referring

      to?
https://www.quantum.com/en/products/object-storage/?

    • Q: John / how is char different from byte etc.? Poor man’s

      string. Hopefully only until string is implemented.

    • Q: John / unlimited is ragged? No. Vlen (from HDF5) is ragged.

      Unlimited allows to expand. All uses of the unlimited dimension have to have the same size. Ryan A: all dimensions in Zarr are potentially appendable. John: perhaps only in Python but not in the spec.

    • Q: John / compression – some work done in

      https://github.com/constantinpape/zarr_implementations to align all the implementations

      • DMH: would like to keep the ability to use the existing HDF5 filter system. Not necessarily ideal, but it exists and there are a lot of them.

      • DMH: situation on defining filter API in v3? Still room for compromise. James: ongoing. Don’t have a definite v3 spec. Codec specs are being worked on now. Room for input.

      • DMH: filters in HDF5 have some form of identifier and they are in a registry. Parameters are passed (e.g. inflate level) all as strings. … Concerned with how to write a C standard, find filters, load it, initialize it, invoke it. HDF5 model at least worked.

      • DMH: will python coders invoke filters written in C. John: should support them, but writers may need to do specific work to make it happen. DMH: May need to claim some primacy because if it works in C then it’ll work in other languages.

        • RyanA: something like numcodecs but “numfilters”

        • Stephan: numcodecs – difference between filters and

          codecs? (in v2 it’s undefined) DMH: perhaps just their purpose

        • Stephan: have compressor API in n5. Chunks are byte

          streams. Those are input and output. But could imagine that a filter should know about target types. (Type of the pointer) DMH: yeah, that’s the point of the parameters.

        • James:

          https://numcodecs.readthedocs.io/en/stable/delta.html#module-numcodecs.delta
^ Specific filter in numcodecs
The dtype needs to be specified

        • (Josh missed a bit here)

        • Stephan: compressors also have various parameters, so

          coudl filters and compressors become the same interface?

    • Q: Jonathan / sparsity – large volume with empty areas.

      Unnecessary to store them. Finding correct chunk size might be death by a 1000 cuts.

    • C: RyanA / thanks to NetCDF. Great to hear. Need to have that

      discussion move to the spec repo. Lots of talking needing to go on. DMH: add objects, but vanilla zarr implementation could ignore the .nc files. One way of dealing with extensions is define set of prefixes for object name which can be ignored. RyanA: would like extensions to be developed in a community-driven way. E.g. named dimensions but perhaps without full netcdf. Make them composable. We need to look at the nczarr extensions and discuss and iterate.

  • RyanA: report on CEOS WGISS Meeting: 20:55 UK

  • RyanA: Meeting OGC people on Friday

    • One negative is that it’s not certified by any body. (spec is

      “self-certified”)

    • Discussing there having a standards body

    • DMH: netcdf was certified by NASA some time back.

    • Josh: ISO/biobank investigation ongoing

    • Boring but important.
  • Stephan?

    • Specifying filter would be nice.

    • James: unfamiliar with these points. Happy to discuss

    • Stephan: combine filter & codec interface. Hard to use

      compressors outside of zarr.

    • Add new implementations at “VM” startup. No master file.
  • NumFOCUS newsletter

  • Ralf: redoing numpy documentation (into How-tos) incl. reading/writing. Melissa on CZI grant is working on that.

  • Ralf: talking to NumFOCUS about overheads

    • Talking to Stefan v. Walt on the board.

    • See: “Financial” google drive to get an overview of the CZI pot

      of money.

  • Joseph: gridded model data. Previously traditional NetCDF. Moving to zarr with S3.

    • Having performance issues with mobile/web APIs in front of zarr.

    • Trying to reduce latency.

    • Will hopefully have something to present on optimization.

    • Opened a few caching issues. John: was going to bring up.

      LRUcache? Yes. There’s also a cache for the decompressed data. Joseph: cached compressed data is fine.

    • Balancing access times of cubed spatial measurements. General

      chunked problem.

    • John: did you see getitems()? [*Async in

      zarr*](https://github.com/zarr-developers/zarr-python/issues/536)

    • Using xarray … with dask.

    • Josh: would have used nczarr if it were already ready?

    • Joseph: tried tiledb. Wasn’t far enough along.
  • Upcoming/Future topics

  • Done: 21:25 UK