22nd April, 2020
Attending: RyanW, JoshM, JohnK, WardF, DennisH, Joseph Sheedy, JamesB, StephanS, RyanA, Ralf Gommers, Jonathan Lefman
Apologies:
-
Standing items
- Introductions: newcomers should feel welcome to introduce themselves and their interest in zarr/n5.
- Joseph Sheedy (Saildrone): autonomous boats collecting ocean data. On the weather-forecasting team; looking to replace their backend with zarr.
- Jonathan Lefman (Nvidia): proprietary formats & instrument vendors.
- Digital pathology → DICOM, Bio-Formats users. Need to keep metadata.
- Will send an invitation. Get him on the calendar invite.
Dennis/NetCDF update - 20:10 UK
- See:
- Committed to supporting zarr (or a variant) as a backend.
- Running about 1.5 months behind schedule, but have an experimental working implementation in netcdf-c. A fairly complete implementation of v2; want to fit all of the CDM (Common Data Model) in.
- The dataset URL (http/s, s3, file) includes # fragment attributes to specify zarr & the storage format (see the sketch after the Modes list below):
- S3: tested on non-AWS S3 at UCAR using the AWS C++ SDK. Prefers path-style addressing.
- Interoperability attempt: file directory-based; every object is a directory. Zipped, it should be consistent with ZipStore.
- Netcdf4 format used internally for testing. Single-file storage that’s readable and writable.
- Modes:
- nczarr: uses the C backend.
- zarr: reads and writes pure v2 format (simulates some things).
- s3:
- See: (embedded image not captured in the notes)
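A rough sketch of the fragment-based selection described above, using the netCDF4-python binding. The exact fragment keys (`mode=nczarr,s3` etc.) are an assumption based on the discussion, not confirmed final syntax:

```python
import netCDF4

# Hypothetical fragment syntax: the part after '#' tells netcdf-c which
# format (nczarr vs. pure zarr) and which storage (s3, file, zip) to use.
ds = netCDF4.Dataset(
    "https://s3.us-east-1.example.com/bucket/data#mode=nczarr,s3", mode="r"
)

# Pure zarr v2 format in a local directory tree:
ds2 = netCDF4.Dataset("file:///tmp/data.zarr#mode=zarr,file", mode="r")
```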
Issues:
- Strings
- User-defined types (enums, vlen, etc)
- Unlimited dimensions (more than one allowed)
- Filters
- Q: RyanA / https scheme? DMH: then you need to pass arguments to choose e.g. s3, as opposed to using s3:// which sets defaults. RyanA: in Python you can just serve a zarr store directly over HTTP. DMH: to be explored; what protocol? Just remote file access; we don't have it yet, but it should be straightforward. RyanA: quite useful, since it allows anyone to become a data provider (but not yet spelled out in the spec). Consolidated metadata is important here, since you don't have to do a list operation on the store.
- DMH: tried to avoid S3 prefix search. One of the ways this isn't pure zarr is by injecting new objects whose names start with “.nc”. They contain the information needed for a faithful translation; each has a list of what is contained (arrays, dimensions, groups), with pointers to the contents. A midway point between consolidated metadata and directory listings; important for performance. Didn't want to read all the metadata in one shot; more JIT.
- Stephan: how large can the consolidated metadata get before it becomes slow? Fast if in the cloud at ~100 MB.
- Joseph: had performance issues with GB-sized JSON files (a consolidated-metadata sketch follows below).
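For reference, a minimal zarr-python (v2) sketch of the consolidated metadata RyanA mentions; the bucket path is hypothetical:

```python
import s3fs
import zarr

store = s3fs.S3Map("example-bucket/dataset.zarr", s3=s3fs.S3FileSystem(anon=True))

# Writer side: gather all .zgroup/.zarray/.zattrs objects into a single
# .zmetadata key so readers avoid listing the store.
zarr.consolidate_metadata(store)

# Reader side: one GET fetches all metadata up front.
root = zarr.open_consolidated(store, mode="r")
```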
- Q: John / Is this the service you are referring to?
- Q: John / how is char different from byte etc.? A poor man's string; hopefully only until string is implemented.
- Q: John / is unlimited ragged? No. Vlen (from HDF5) is ragged; unlimited lets a dimension expand. All uses of the unlimited dimension have to have the same size. RyanA: all dimensions in zarr are potentially appendable (see the sketch below). John: perhaps only in zarr-python, but not in the spec.
- https://zarr.readthedocs.io/en/v2.0.1/tutorial.html#resizing-and-appending
- DMH: colored by having started with HDF5
- Josh: May need to be added to the spec
- John: perhaps not allowed in all dimensions.
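The resizing/appending behavior from the tutorial linked above, in zarr-python v2 (where, as RyanA notes, any dimension can grow):

```python
import numpy as np
import zarr

z = zarr.zeros((100, 100), chunks=(10, 10), dtype="i4")

# Any dimension can be resized after creation...
z.resize(200, 100)

# ...and append() extends the array along a chosen axis.
z.append(np.ones((100, 100), dtype="i4"), axis=0)
print(z.shape)  # (300, 100)
```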
- Q: John / compression – some work done in
https://github.com/constantinpape/zarr_implementations to align all the implementations
- DMH: would like to keep the ability to use the existing HDF5 filter system. Not necessarily ideal, but it exists and there are a lot of them.
- DMH: what's the situation on defining a filter API in v3? Still room for compromise? James: ongoing; we don't have a definite v3 spec yet. Codec specs are being worked on now; there's room for input.
- DMH: filters in HDF5 have some form of identifier and live in a registry. Parameters (e.g. inflate level) are all passed as strings. … Concerned with how to write a C standard: find a filter, load it, initialize it, invoke it. The HDF5 model at least worked.
- DMH: will Python coders be able to invoke filters written in C? John: we should support them, but writers may need to do specific work to make it happen. DMH: C may need to claim some primacy, because if it works in C then it'll work in other languages.
- RyanA: something like numcodecs but “numfilters”
- Stephan: numcodecs – what is the difference between filters and codecs? (In v2 it's undefined.) DMH: perhaps just their purpose.
- Stephan: n5 has a compressor API. Chunks are byte streams; those are the input and output. But one could imagine that a filter should know about the target types (the type of the pointer). DMH: yeah, that's the point of the parameters.
- James: https://numcodecs.readthedocs.io/en/stable/delta.html#module-numcodecs.delta ^ a specific filter in numcodecs; the dtype needs to be specified (see the sketch below).
- (Josh missed a bit here)
- Stephan: compressors also have various parameters, so could filters and compressors become the same interface?
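The Delta codec James links illustrates the current v2 split: “filters” transform values, and the single “compressor” produces bytes. A minimal runnable sketch:

```python
import numpy as np
import zarr
from numcodecs import Blosc, Delta

data = np.arange(100_000, dtype="i4")

# Delta must be told the dtype; it stores differences between
# consecutive values, which then compress much better.
z = zarr.array(
    data,
    chunks=10_000,
    filters=[Delta(dtype="i4")],
    compressor=Blosc(cname="zstd", clevel=1),
)
print(z.filters, z.compressor)
```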
- Q: Jonathan / sparsity – large volumes with empty areas; unnecessary to store them. Finding the correct chunk size might be death by a thousand cuts.
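One relevant zarr-python behavior for Jonathan's question: chunks are only materialized in the store once written, so fully empty regions of a sparse volume cost nothing:

```python
import zarr

# 10 GB logical array, but untouched chunks are never stored;
# reads of empty regions just return the fill value (0 for zeros()).
z = zarr.zeros((100_000, 100_000), chunks=(1_000, 1_000), dtype="u1")
z[:1_000, :1_000] = 255  # write a single chunk
print(z.nchunks, z.nchunks_initialized)  # 10000 1
```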
- C: RyanA / thanks to the NetCDF team, great to hear. Need to move that discussion to the spec repo; lots of talking still needs to happen. DMH: nczarr adds objects, but a vanilla zarr implementation could ignore the .nc files. One way of dealing with extensions is to define a set of prefixes for object names which can be ignored (a sketch of this idea follows below). RyanA: would like extensions to be developed in a community-driven way, e.g. named dimensions but perhaps without full netcdf, and to make them composable. We need to look at the nczarr extensions, discuss, and iterate.
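A hypothetical illustration of the ignorable-prefix idea (the prefix list is made up; nothing like this is in the spec yet):

```python
# A vanilla zarr reader walks the store and skips extension objects it
# does not understand, e.g. nczarr's ".nc"-prefixed keys.
EXTENSION_PREFIXES = (".nc",)  # hypothetical registry of ignorable prefixes

def visible_keys(store):
    for key in store.keys():
        name = key.rsplit("/", 1)[-1]
        if name.startswith(EXTENSION_PREFIXES):
            continue  # extension metadata; safe for plain zarr to ignore
        yield key
```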
RyanA: report on CEOS WGISS Meeting: 20:55 UK
- Working Group on Information Systems and Services (space agencies)
- RyanA: meeting OGC people on Friday
- One negative is that it's not certified by any body (the spec is “self-certified”)
- Discussing having a standards body there
- DMH: netcdf was certified by NASA some time back.
- Josh: ISO/biobank investigation ongoing
- Boring but important.
- Stephan?
- Specifying filters would be nice.
- James: unfamiliar with these points. Happy to discuss.
- Stephan: combine the filter & codec interfaces. It's hard to use compressors outside of zarr.
- Add new implementations at “VM” startup; no master file (see the registration sketch below).
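Roughly what this looks like today in numcodecs terms: codecs can be registered at runtime rather than declared in a master file. The Negate codec here is a made-up toy:

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray, ndarray_copy
from numcodecs.registry import register_codec

class Negate(Codec):
    """Toy codec (hypothetical): negates int32 values."""
    codec_id = "negate"

    def encode(self, buf):
        return -ensure_ndarray(buf).view("i4")

    def decode(self, buf, out=None):
        dec = -ensure_ndarray(buf).view("i4")
        return ndarray_copy(dec, out)

register_codec(Negate)  # now reachable via numcodecs.get_codec({"id": "negate"})
```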
NumFOCUS newsletter
- “The Zarr community is currently developing version 3 of its core specification (the current status is available [here](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/)). If you’re interested in helping shape the future of Zarr, now is a great time to get involved! Discussion and development of the specification is taking place in the [zarr-specs repository](https://github.com/zarr-developers/zarr-specs/) and engagement from the broader scientific computing community is welcome.”
- Potentially adding governance?
- Ralf: redoing the numpy documentation (into how-tos), incl. reading/writing. Melissa, on the CZI grant, is working on that.
- Ralf: talking to NumFOCUS about overheads
- Talking to Stéfan van der Walt on the board.
- See the “Financial” Google Drive folder to get an overview of the CZI pot of money.
- Joseph: gridded model data. Previously traditional NetCDF; moving to zarr with S3.
- Having performance issues with mobile/web APIs in front of zarr.
- Trying to reduce latency.
- Will hopefully have something to present on optimization.
- Opened a few caching issues. John: was going to bring that up. LRU cache? Yes. There's also a cache for the decompressed data. Joseph: caching compressed data is fine. (A caching sketch follows below.)
- Balancing access times of cubed spatial measurements. A general chunked-data problem.
- John: did you see getitems()? [*Async in zarr*](https://github.com/zarr-developers/zarr-python/issues/536)
- Using xarray … with dask.
- Josh: would you have used nczarr if it had already been ready?
- Joseph: tried tiledb. Wasn’t far enough along.
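The cache John mentions is zarr-python's LRUStoreCache, which keeps recently used (still compressed) chunks in memory; the bucket path here is hypothetical:

```python
import s3fs
import zarr

store = s3fs.S3Map("example-bucket/dataset.zarr", s3=s3fs.S3FileSystem(anon=True))

# Wrap the remote store; repeated reads of hot chunks skip S3 but
# still pay decompression on every access.
cache = zarr.LRUStoreCache(store, max_size=256 * 2**20)  # 256 MiB
root = zarr.open_group(cache, mode="r")
```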
Upcoming/Future topics
- Jonathan: DICOM & Zarr
- Josh: Nested+S3 –
- John: infinitely solvable.
- Filter specification (in spec repo)
- Done: 21:25 UK