20th October, 2021

Attending: Josh Moore, Davis Bennett, Dennis Heimbigner, Ward Fisher, Martin Durant, Ryan Williams, Gregory Lee

  • Davis:

  • Josh:

    • Investigating xarray interaction improvements with b-open

    • Also investigating sharding prototype API with scalable minds.

      • May want to make use of a translation layer like kerchunk
  • Martin

    • zarr is still the primary use for

      kerchunk

      • Talking to people, astro data will work only if very regular
      • List of chunk sizes in dask array
      • Could imagine implementing the dask scheme in zarr (breaks assumptions)
      • Cause: Detectors of complicated sizes; mosaics with overlaps. Also in the time domain. (e.g. in geo & climate where 365 v 366 days causes a problem)
      • Dennis: switching to **variable chunk size **would allow Zarr to support lazy rechunking – take a given chunk and over time divide it into smaller chunks (i.e. attractive). Martin: may need to move chunk names after re-indexing. cf. tiledb’s macrochunks & microchunks.
    • numcodecs

      • interested to use entrypoints for declaring codecs
      • kerchunk desires this for GRIB files (c-compiled thing, so writes to temporary local file; using codecs for this)
      • imagecodecs from cgohlke
      • Josh: still need a unique identifier
      • Martin: in dask you would need to use an environment file for the right libraries.
    • Sparse?

      • Josh: see Eric below. Expect to see a PR eventually.
      • Davis: how is that not a codec?
      • Martin: more a convention concern. Working with the awkward arrays.
    • PyData Global!

      • Talk called “All you need is Zarr”
      • many in the HDF5 community are interested in Zarr+Kerchunk to get cloud friendly storage. Basically a translation layer. Likely to need more codecs.
      • Josh: strong relationship to the sharding work.
      • Martin: metadata tricks can get you a lot without too many downsides. (Even though you’ll need to scan your data)
      • May be pushing on this
  • Ward

    • Unidata moving forward with solicitation (v3 + parallelization)

    • Letter of collaboration? Josh: sure. Template → Logo → Good.

  • Martin: pangeo meeting pointed to a concern of there being no R implementation

  • Greg: preliminaries for V3 implementation

  • Davis: re: dimension_separator travail

    • Move keys from strings to tuple of strings

    • Playing around with that. But breaks the ability to use

      dictionaries for storage

    • i.e. perhaps using dict for storage is important

    • Advantage of splitting prefix & chunk key is nice.

    • How sacred is dictionary as a store?

    • Josh: or a Key object?

    • Dennis: nczarr uses tuples internally with on-the-fly conversion

      of tuple ot string.

      • Gotcha was converting from string to tuple (need heuristics)
      • Davis: when is that needed? in S3 you ask for all strings below a certain point. accessing this individually requires parsing them to tuples.
      • Josh: the “Hierarchy”-style with a cached test of .zarray existence would also work.
      • Dennis: that’s preferable.