Skip to main content Link Search Menu Expand Document (external link)

1st July, 2020

Attending: Josh, Dennis, Ryan W, Ward, Matthias, Ryan A, John, Alistair, Matt,

Andrew, Stephan Hoyer

Apologies: Martin (Happy Canada Day!)

  • Standing items

    • Introductions: new joinees should feel welcome to introduce

      themselves and their interest in zarr/n5.

  • Add Community Topics Here

  • FYI / 1 minute update

    • Josh: Still trying to convert

      hdf2zarr without too much success, plus fighting with bool encodings and 403 for remote chunks

    • HDF5Zarr -

      viewing HDF5 files as Zarr

      • Josh: was an issue of supporting types

      • General question of: should every HDF5 file be convertible?

      • John: attributes need to be json serializable, so arrays aren’t supported.

        • Could define a mapping, but there are likely performance

          tradeoffs.

      • John: similar issues in xarray? Ryan A: haven’t looked at that part of the code base. Stephan Hoyer knows. (wrote backends for xarray)

      • Dennis: what’s the use case? Unsure, but generally a small numpy that’s been stored as an attribute rather than as a dataset.

      • Stephan joins (repeat

      • Questions to Stephan

        • Should we be able to convert any HDF5 files in ZARR, is

          there an isomorphism.

        • Are there any changes to the zarr spec to make that

          easier.

        • String is one of the harder datatypes. Particularly

          unicode strings,

      • SH: handling strings, especially unicode strings, is one of the trickier things. HDF has a sane model for strings (always encoding attached).

        • Options:

          • ASCII
          • UTF-8
        • Fixed width & variable width data type, in terms of

          encoded string.

        • Xarray can write any of that. No native way to write

          encoded UTF-8 string in Python. No numpy datatype. Unicode strings are 4-bytes. Or Object strings.

        • Xarray writes into (variable width) numpy object arrays.

          No type safety. Similar to pandas. Works reasonably well in practice.

        • In Zarr xarray needs to choose an encoding. Need to

          review, but use default python encoding. Doesn’t use

        • Matthias: isn’t UTF-8 variable width?

        • Stephan: yes, but it’s the max width of the encoded

          size. Cf. Julia model. (whereas Python is more UTF-32-like)

        • Dennis: can also null terminate (even if not legal)
      • Josh: worth trying H5Zarr [bypasses netcdf library] ? SH: Challenge is that xarray requires the named dimensions. As long as you have dimension scales, then it should be able to convert.

      • SH: having a built string type would be worth thinking about (only at the numcodecs layer)

      • Alistair: certainly thinking that that would be a high-priority extension (Text). Copy HDF5 model? SH: for fixed-width data, it’s a pretty good model. Alistair: and for variable width strings? SH: Also good, but more complex. Depends on the complexity of variable width data in zarr in general.

      • John: does storing arrays in attributes happen in xarray? SH: yes, but I don’t think it’s very useful.

      • Dennis:

        • Attributes are one dimensional arrays of variable length

        • Attributes are type (not in Zarr; hence the extra

          metadata)

  • Add Core Topics Here

    • Matthias Questions:

      • Which approach do you prefer for zarr v3:

        • A complete separate import submodule and utility

          functions ?

        • Try to weave v3 in the current classes and utilities

          functions ?

          • Some refactor of v2 will likely be needed (and stared), in particular try to be a bit stricter on internal types would help.also around attributes that are currently their own key, and would not be.
        • Ryan A: in xarray, passes thing to `open_store` (never

          open groups, though it may be changing to use introspection

        • PR in progress:

          https://github.com/pydata/xarray/pull/4187

        • Josh: an use of namespaces with {zarr, zarr2, zarr3}

        • Matthias: concerned about the isinstance() uses., but

          depends on what API usage you allow

        • Matthias: almost have xarray loading v3 spec without

          touching xarray

        • RyanA: could add a new CI branch to pull the V3 spec

        • Matthias: cool, but currently choking on attributes.

          Does work with a V2 adapter.

Image 3

    • From Martin in absentia: “*I can report that fsspec’s mapper

      class has acquired the methods “getitems”, “setitems” and “delitems”, which will be implemented in async/concurrent when possible. This already works for HTTP (read-only), and I am doing GCS/S3 next.*”

    • SH: been using Zarr recently. Very enjoyable. Kudos!