20th October, 2021
Attending: Josh Moore, Davis Bennett, Dennis Heimbigner, Ward Fisher, Martin Durant, Ryan Williams, Gregory Lee
-
Davis:
- interested in
- also have a branch for fixing how writing of chunks reads
- has a race condition in test_sync. Use Juan’s timeout.
- PR to be opened.
- interested in
-
Josh:
- Investigating xarray interaction improvements with b-open
- Also investigating sharding prototype API with scalable minds.
- May want to make use of a translation layer like kerchunk
-
Martin:
- zarr is still the primary use for
- From talking to people: astro data will work only if it is very regular
- List of chunk sizes in dask array
- Could imagine implementing the dask scheme in zarr (breaks assumptions)
- Cause: Detectors of complicated sizes; mosaics with overlaps. Also in the time domain. (e.g. in geo & climate where 365 v 366 days causes a problem)
- Dennis: switching to **variable chunk size** would allow Zarr to support lazy rechunking – take a given chunk and over time divide it into smaller chunks (i.e. attractive). Martin: may need to move chunk names after re-indexing. cf. TileDB’s macrochunks & microchunks.
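Sketch (added for illustration, not discussed code): the dask-style "list of chunk sizes" along one axis can be turned into chunk lookups with a cumulative sum. The function name `chunk_for_index` and the yearly example are assumptions; this is not an existing Zarr API.

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_for_index(chunk_sizes, index):
    """Return (chunk_number, offset_within_chunk) for a 1-D index.

    chunk_sizes is a dask-style tuple of per-chunk lengths along one axis.
    """
    boundaries = list(accumulate(chunk_sizes))  # exclusive end of each chunk
    if not 0 <= index < boundaries[-1]:
        raise IndexError(index)
    chunk = bisect_right(boundaries, index)
    start = boundaries[chunk - 1] if chunk else 0
    return chunk, index - start


# Daily data chunked by calendar year: 365, 366 (leap year), 365 days.
yearly_chunks = (365, 366, 365)
first_day_second_year = chunk_for_index(yearly_chunks, 365)
last_day_second_year = chunk_for_index(yearly_chunks, 365 + 365)
```

With regular chunks the lookup is a single integer division; with variable chunks the boundaries have to be stored (or derived) somewhere, which is the assumption a change like this would break.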
-
numcodecs
- interested in using entry points for declaring codecs
- kerchunk desires this for GRIB files (c-compiled thing, so writes to temporary local file; using codecs for this)
- imagecodecs from cgohlke
- Josh: still need a unique identifier
- Martin: in dask you would need to use an environment file for the right libraries.
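Sketch (added for illustration): one way codec discovery via entry points could look, using only the standard library. The group name `example.codecs` is hypothetical and not an agreed convention; the entry-point name stands in for the unique identifier Josh mentions, and this sketch does not resolve name collisions.

```python
from importlib.metadata import entry_points

# Hypothetical group name; a real convention would have to be agreed on.
CODEC_GROUP = "example.codecs"


def discover_codecs(group=CODEC_GROUP):
    """Map entry-point name -> EntryPoint (not yet loaded) for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points
        found = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict of group name -> list of entry points
        found = eps.get(group, [])
    # If two packages register the same name, the later one wins here.
    return {ep.name: ep for ep in found}


codecs = discover_codecs()
# A codec class would only be imported on demand, e.g. codecs["some-id"].load()
```

Loading lazily means a worker only needs the library for the codecs it actually uses, but as Martin notes the right packages still have to be installed in every environment.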
-
Sparse?
- Josh: see Eric below. Expect to see a PR eventually.
- Davis: how is that not a codec?
- Martin: more a convention concern. Working with the awkward arrays.
-
PyData Global!
- Talk called “All you need is Zarr”
- many in the HDF5 community are interested in Zarr+Kerchunk to get cloud-friendly storage. Basically a translation layer. Likely to need more codecs.
- Josh: strong relationship to the sharding work.
- Martin: metadata tricks can get you a lot without too many downsides. (Even though you’ll need to scan your data)
- May be pushing on this
- zarr is still the primary use for
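Sketch (added for illustration): the "translation layer" idea reduced to its core, a mapping from Zarr-style chunk keys to byte ranges inside an existing file. The in-memory `files` dict, the file name, and the key names are all made up for the example; real references would point at remote objects and be produced by scanning the original data once.

```python
# Hypothetical stand-in for remote HDF5/GRIB objects.
files = {"example.h5": b"HEADERchunk-0chunk-1TRAILER"}

# Zarr-style key -> [location, byte offset, length]
refs = {
    "x/0": ["example.h5", 6, 7],
    "x/1": ["example.h5", 13, 7],
}


def read_ref(key):
    """Resolve a chunk key to the bytes it references."""
    location, offset, length = refs[key]
    return files[location][offset:offset + length]


first_chunk = read_ref("x/0")
second_chunk = read_ref("x/1")
```

The original file is never rewritten; only the small reference table is new, which is why the metadata approach gets a lot without many downsides.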
-
Ward:
- Unidata moving forward with solicitation (v3 + parallelization)
- Letter of collaboration? Josh: sure. Template → Logo → Good.
-
Martin: pangeo meeting pointed to a concern of there being no R implementation
- see
- Dennis talking to someone at the R group. Thinking about using nczarr. (also for MATLAB)
- Contact has been made.
- Internal issue for tracking. Will try to move it more publicly.
- see
-
Greg: preliminaries for V3 implementation
- https://github.com/zarr-developers/zarr-python/pull/789 (BaseStore)
- https://github.com/zarr-developers/zarr-python/pull/839 (Metadata handling)
-
Josh: Integrating them into a 2.11 alpha release
- Greg update
- https://github.com/zarr-developers/zarr-python/pull/789
-
Davis: re: dimension_separator travail
- Move keys from strings to tuple of strings
- Playing around with that. But breaks the ability to use dictionaries for storage
- i.e. perhaps using dict for storage is important
- Splitting prefix & chunk key is a nice advantage.
- How sacred is dictionary as a store?
- Josh: or a Key object?
- Dennis: nczarr uses tuples internally with on-the-fly conversion of tuple to string.
- Gotcha was converting from string to tuple (need heuristics)
- Davis: when is that needed? in S3 you ask for all strings below a certain point. accessing this individually requires parsing them to tuples.
- Josh: the “Hierarchy”-style with a cached test of .zarray existence would also work.
- Dennis: that’s preferable.
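Sketch (added for illustration): why tuple → string is easy and string → tuple needs heuristics. The helper names `key_to_str` and `str_to_key` are hypothetical and not part of zarr-python or nczarr; only chunk keys are handled, not metadata keys such as `.zarray`.

```python
from typing import Tuple


def key_to_str(prefix: str, chunk: Tuple[int, ...], sep: str = ".") -> str:
    """Join an array prefix and chunk coordinates into a flat store key."""
    chunk_part = sep.join(str(i) for i in chunk)
    return f"{prefix}/{chunk_part}" if prefix else chunk_part


def str_to_key(key: str, sep: str = ".") -> Tuple[str, Tuple[int, ...]]:
    """Split a flat chunk key back into (prefix, chunk coordinates)."""
    if sep == "/":
        # Heuristic: treat every trailing all-digit component as a chunk index.
        parts = key.split("/")
        n = len(parts)
        while n > 0 and parts[n - 1].isdigit():
            n -= 1
        return "/".join(parts[:n]), tuple(int(p) for p in parts[n:])
    # With "." the chunk coordinates live entirely in the last path component.
    prefix, _, last = key.rpartition("/")
    return prefix, tuple(int(p) for p in last.split(sep))


flat = key_to_str("foo/bar", (0, 1))             # "foo/bar/0.1"
nested = key_to_str("foo/bar", (0, 1), sep="/")  # "foo/bar/0/1"

# The gotcha: a group that happens to be named "2021" is swallowed
# into the chunk index when the separator is "/".
ambiguous = str_to_key("data/2021/0/1", sep="/")  # ("data", (2021, 0, 1))
```

Checking for `.zarray` at a candidate prefix (the "Hierarchy"-style approach above) removes the guesswork, since the array boundary is then known rather than inferred from the key text.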
-