12th January, 2022
🎉 Happy New Year!
Attending: Davis Bennett, Josh Moore, John Kirkham, Greg Lee, Eric Perlman, Tobias Kölling, Hailey Johnson, Ward Fisher
Davis: Move forward with set_write_empty_chunks?
- Eric: any metadata to say that it was written this way?
- John: not fatal; a skipped chunk just reads back as the fill value (see the sketch below).
- Josh: https://github.com/zarr-developers/zarr-python/pull/489
- DVB: a different problem arises with shards.
Have a script to test performance
- Depends on latency, compressor, etc.
- Testing chunks for emptiness isn’t the most performant operation.
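A minimal sketch of the behavior under discussion, assuming the write_empty_chunks keyword slated for 2.11 (treat the exact name as an assumption): chunks whose contents equal the fill value are skipped on write, and, per John's point, nothing in the metadata records the skip.

```python
# Minimal sketch, assuming the write_empty_chunks keyword discussed for 2.11.
import zarr

z = zarr.zeros((20, 20), chunks=(10, 10), write_empty_chunks=False)
z[0:10, 0:10] = 1    # non-empty chunk: a chunk key is written
z[10:20, 10:20] = 0  # equals the fill value: the write is skipped
# No metadata records the skip: the second region reads back exactly
# like a chunk that was never touched (both yield the fill value).
print(sorted(z.store))  # array metadata plus the single '0.0' chunk key
```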
No objections. Moving forward with 2.11.
Anything for spec?
- JK: https://github.com/zarr-developers/zarr-python/blob/master/docs/spec/v2.rst#chunks is good to go.
Greg ok with 2.11? Yes. Nothing needs pulling or adding.
Need a release note. (TBD)
Spent some time on V3 PRs.
Sixth passes CI.
- Consolidated metadata now works (but we want that to be an extension!); see the v2 example below for context.
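For context, a minimal example of consolidated metadata with today's v2 API; the v3 work above would move this behind an extension rather than keeping it in core.

```python
# Consolidated metadata via the current v2 API (v3 would make this an
# extension rather than core behavior).
import zarr

store = zarr.DirectoryStore("example.zarr")
root = zarr.group(store=store)
root.create_dataset("a", shape=(10,), chunks=(5,), dtype="i4")
zarr.consolidate_metadata(store)   # writes a single .zmetadata key
g = zarr.open_consolidated(store)  # one store read covers all metadata
```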
Unimplemented stores
- ABSStore & N5Store
- Davis to review N5Store
- Do we need a Hierarchy object? (e.g. when you create a group you have to give it a path) – the main difference; it causes issues in tests, since create_store needs a path.
JK: Davis’ PR for trimming chunks?
- DVB: Languishing. Synchronization issue in appending with co-processing (see the sketch below).
- JK: design decision to make appending easy. But we need to decide whether or not we will handle it in sharding.
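For context on the synchronization issue, a minimal sketch with today's append API: each append resizes the array metadata and then read-modify-writes the partial trailing chunk, two separate store operations on which concurrent appenders can collide.

```python
# Why appending needs synchronization: resize + read-modify-write of the
# trailing chunk are separate store operations.
import numpy as np
import zarr

z = zarr.zeros((0, 4), chunks=(10, 4))
z.append(np.ones((7, 4)))  # leaves a partial trailing chunk (7 of 10 rows)
z.append(np.ones((7, 4)))  # must read back and rewrite that same chunk
print(z.shape)             # (14, 4)
```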
JM: Moving forward with community manager position (2 years)
TK:
- sharding
- checksumming (structure looks like IPLD)
- a JSON document with hashes for the chunks
- how could we write out content-addressable chunks?
- a MutableMapping that’s write-only (remembers content identifiers); see the sketch after this discussion
- DVB: use case?
  - nice if you want to calculate checksums, even for part of an array (helps if it’s in a Merkle tree)
  - makes it easier to share the data in multiple pieces; you can have copies everywhere
  - interesting optimizations for big datasets: to get the difference between two variables, you want them defined on the same grid (even if you don’t care what the grid is)
- JK: a cloud store that’s write-once, since you only need to write each hash once (bonus: a time-exploration feature with the history of the chunks)
- TK: discovered over the holidays that the content-addressable & the IPFS solutions could be the same
  - IPLD has a special type that’s a link via a content identifier
- JAM: “fsspec concern”? TK: reading yes, but writing gets more difficult; there’s no key-value concept. May also be useful to expose the content identifiers to a higher level for optimizations.
- JK: explore implementing on top of the MutableMapping interface, which is what Zarr uses. Naive idea: a special Zarr key that holds the metadata plus the address of each chunk. Gets difficult, since that top special key needs to be writable.
  - Josh: perhaps that key is the Zarr.
  - JK: grab all of them? Request small JSON things from the cloud. Takes the place of consolidated metadata? (“one big request”)
  - TK: you would also need to write it out as well (it’s only visible afterwards).
  - JK: starts running into ACID. A tertiary concern, but lots of writing might create chunks you no longer care about that need cleaning. ZHierarchy could be like UTC time (time since last debug) → git for the cloud.
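A purely hypothetical sketch of TK's write-only mapping (the class and layout are invented here, not an existing API): chunks are written under the hash of their content, the store remembers the key → content-identifier mapping, and dumping that mapping yields the JSON document with hashes for the chunks mentioned above. Since zarr 2.x accepts any MutableMapping as a store, something like this could be dropped in for experiments; identical chunks dedupe for free, which is JK's write-once property.

```python
# Hypothetical content-addressed store: write-oriented, remembering a
# content identifier (CID) per logical key.
import hashlib
import json
from collections.abc import MutableMapping

class ContentAddressedStore(MutableMapping):
    def __init__(self, backend):
        self.backend = backend  # any MutableMapping: dict, fsspec mapper, ...
        self.manifest = {}      # logical key -> content identifier

    def __setitem__(self, key, value):
        cid = hashlib.sha256(bytes(value)).hexdigest()
        self.backend[cid] = bytes(value)  # same content, same key: write-once
        self.manifest[key] = cid

    def __getitem__(self, key):
        return self.backend[self.manifest[key]]

    def __delitem__(self, key):
        del self.manifest[key]  # orphaned content is left for later cleaning

    def __iter__(self):
        return iter(self.manifest)

    def __len__(self):
        return len(self.manifest)

    def manifest_json(self):
        # the "JSON document with hashes for the chunks"
        return json.dumps(self.manifest)
```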
netcdf-java
- On non-zarr stuff. (Big funded push is done now)
netcdf-fortran!
- Zarr API is properly working there.
housekeeping, etc.
- Josh: an interesting issue from the Julia community on order swapping recently.
  - https://github.com/zarr-developers/zarr_implementations/issues/42
JK:
- Nvidia GPU-based Zarr to avoid host-memory transfer
- Requires 2 pieces:
  - CuPy part (JK needs to review that PR); see the sketch after this list
  - Compression (someone is working on that now)
- DVB: which codecs? The most basic ones; blosc is unclear. DVB: super interested! Preferably a simple one, incl. Java clients, but primarily the transfer to the GPU is the bottleneck. Happy to test.
- https://github.com/NVIDIA/nvcomp (snappy, etc.)
- DVB: unsigned ints; no compression automation. Use case is taking the output of an ML model and making predictions on 3D arrays. (the writing use case includes multi-resolution pyramids)
- TK: climate models produce too much data, which needs compression (GRIB-backed …). Trying to convince them to use Zarr. But floats.
- JK: happy to get a list of compressors (on an issue or via email). TK: currently investigating.
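A hypothetical sketch of the CuPy piece (the meta_array keyword below is an assumption about the PR under review, not a released interface): Zarr allocates its output buffers from a user-supplied array type, so decoded chunks land directly in GPU memory instead of taking a host-memory round trip.

```python
# Hypothetical: meta_array tells zarr which array type to allocate, so
# reads come back as cupy.ndarray on the GPU (assumed API, see above).
import cupy
import zarr

z = zarr.open("data.zarr", mode="r", meta_array=cupy.empty(()))
block = z[0:256, 0:256]
print(type(block))  # cupy.ndarray, no host-memory copy of the chunk data
```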
JAM: data-apis?
- JK: not sure we want to be a provider, but a consumer, yes; being a provider gets us into providing computation.
- Ward: that is a long-standing issue with netCDF as well, re: pressure to start performing computations within the netCDF library.
DVB: what happens with array inception? (dask-zarr-dask)
- JK: Martin is adding entrypoint support to numcodecs; perhaps something similar here, say a zarr.array entrypoint (see the sketch below).
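To illustrate what an entrypoint would hook into: numcodecs already keeps a runtime registry keyed by codec_id, and the entrypoint work would let a third-party package register automatically instead of via an explicit call. A toy registration with the existing registry API (the codec itself is illustration only):

```python
# Toy registration against the existing numcodecs registry; entrypoint
# support would perform this automatically for installed packages.
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes, ndarray_copy
from numcodecs.registry import register_codec

class Identity(Codec):
    """Pass-through codec, for illustration only."""
    codec_id = "identity-demo"

    def encode(self, buf):
        return ensure_bytes(buf)

    def decode(self, buf, out=None):
        return ndarray_copy(buf, out)

register_codec(Identity)  # zarr can now resolve {"id": "identity-demo"}
```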