Skip to main content Link Search Menu Expand Document (external link)

20th May, 2020

Attending: Dennis, Chris Roat, Ryan Abernathey, Josh Moore, Matthias Bussonnier,

John Kirkham, Stephan Wagner-Conrad (Zeiss), Ralf Gommers,

Nicholas Sofroniew, Andrew ?, Ryan Williams, Ben Dichter, Matt McCormick,

Alistair Miles, Martin Durant, Jonathan Lefman, James Bourbeau,

Apologies:

  • Standing items

    • Introductions: new joinees should feel welcome to introduce

      themselves and their interest in zarr/n5.

    • Stephan W-C (Zeiss): have wanted to join for some time.

      Interested in using Zarr for microscopy, and possibly support it moving forward.

    • Chris Roat (?): started 8-9 months ago, had heard

      xarray/zarr/dask for parallel computing. Also use neuroglancer in the backend.

    • Tyler: work with Chris as a PhD student. Reluctant HDF5 user.

      Interested in cross-language support. (e.g. MATLAB, Julia) Ryan A: julia supports zarr, but not feature parity. Matthias: could import Python until feature complete.

    • Andrew Fulton (Quansight): introduced by James.

    • Matthias (Quansight): around for 3 weeks. Working on v3 spec.

      And fixing bugs. Performance optimizing.

    • Ben Dichter (NWB): HDF5 as primary backend. Exploring Zarr as a

      secondary backend.

    • Matt McCormick: work with Ben, some NWB but imaging in general.

    • …old folks introduce themselves…
  • NWB: Evaluate Zarr as a possible alternate array data / storage backend - 20:11 UK

    • Ben: goal of NWB is to create entire neurophysiology experiment

      (h/w, measurements, …) i

    • Resources:

    • .e. recording live animal activity with microscopy/probes

      • Data typically for one experiment. Nearly impossible to share data. Often easier to re-record. Trying to solve that problem.

      • Technically: (1) could just use zarr as a new backend. HDF5 works well with links, region references, etc. Timeline for those features? (2) reading HDF5 file using zarr.

      • Ryan: might want to to check out Cloud-Performant Reading of NetCDF4/HDF5 Data Using the Zarr Library

      • Matt: are there bindings in MATLAB?

        • Nick: is anyone talking to MATLAB? Jonathan: can also

          link up from the NVIDIA side.

        • Josh: would be great to have MATLAB reading zarr but

          likely C layer would be the best route → nczarr?

        • Dennis: R, MATLAB have often relied on NetCDF to do the

          heavy lifting.

      • John: looked at links or region references via nczarr? Dennis: if it’s not in the spec then NetCDF won’t implement it. I.e. being narrow in the implementation or spec-driven.

        • John: could conceivably use symlinks with

          DirectoryStore. But cloud?

        • Martin: no general mechanism

        • Dennis: particular extension for HDF5?

        • some existing discussion of links/references in Zarr:

          #389

        • John: in cloud could potentially have metadata that

          points to the data, but wouldn’t work for hard references.

        • Dennis: have avoided links that form a cycle

        • Martin: NetCDF is a nicer thing to mirror than all of

          HDF5.

        • Ben on the use of links: workaround would probably be

          ok. E.g. have a generic time-series structure. Don’t want the user to duplicate the timestamps but it looks like they exist in many places.

        • Ben: also use references as attributes extensively.

          Table structure which corresponds to putative neurons. Want to link those to the electrodes. An attribute on one of the datasets to a group formalizes that link.

        • RyanA: xarray allows this at least currently in a single

          group. Ben: that would currently be a major refactor. RyanA: no easier answer, doesn’t currently go up or down the tree.

        • John: using hard links? NAFAIK, but *also *external

          links.

      • Alistair: if there’s a way to achieve what you want without adding protocol extensions, worth exploring (depending on cost). But if you need internal soft-links or region references, then those are likely implementable as extensions. Defining conventions for storing it in metadata. Could explore doing so as extensions to v2 and getting those added to zarr-python as a short-term option. Likely no core bandwidth for that since there’s a focus on the v3 spec. Ben: and v3? May be easier, since v2 doesn’t have a concept of the root of a hierarchy making relative links difficult. V3 is ongoing for the rest of this year.

  • DMH: I have put up some documentation for the nczarr netcdf implementation here (I hope it is accessible): NCZARR.md

    • RPA: I made a hack-md version:

      https://hackmd.io/@rabernat/r1yYQZmsU

    • RPA: Done or up for discussion? DMH: Up for discussion. First

      experimental implementation (stake in the ground). Also want to move to v3 spec as soon as possible, but want to surface problems as soon as possible.

  • RPA (5 min): document xarray Zarr encoding conventions:
    http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification

    • RyanA: xarray requires this at the moment, but was the only

      extension.

    • Josh: NB- will soon reach the same issue in (light-)microscopy

      space but won’t be purely in the xarray space.

    • Dennis: you can simulate it for a zarr that’s missing it.

      Implies sameness where there may not be any semantically. E.g. “dim10”. Considered going to an attribute for this. Decided to not do that due to scalability. Basically removes laziness due to how zarr stores attributes. Used a separate object.

      • RyanA: good point. Known to be a hack. But have a petabyte of data formatted like this, will nczarr be able to read it?

      • Dennis: Have many hacks for supporting the diversity of OpenDAP servers. Willing to add special cases if needed for backwards compatibility.

      • RyanA: realized that we made a mistake in just hard-coding without specifying. Not particularly interoperable, but now trying to get there. Hopefully superceded by future specs.

      • Dennis: need to be careful about naming. NetCDF tries to interpret underscore attributes. Next time perhaps consider a new namespace.

      • RyanA: should make named dimensions as simple/interoperable as possible.

      • Dennis: would guess that’s a top candidate for adding to the v3 spec.

      • Alistair: core metadata? Additionally, v3 puts attributes along with array metadata, based on what n5 was doing. Now is the time to challenge that. DMH: challenge! due to scalability. E.g. reading 10Ks of file for very specific information, like a coordinate. Dennis: was mostly worried about the consolidated approach.

      • Matthias: storing attributes in even more files?… Dennis: that then proliferates objects. Alistair: trade-off in cloud storage in latency.

      • RyanA: difference in use of zarr and netcdf – take 1000s of netcdfs and turn it into 1 zarr fileset. Somewhat mitigates problem of reading thousands of different files.

      • DMH: thought dataset was a bucket.

  • RPA (5 min): demo of rechunker tool:
    https://github.com/rabernat/rechunker/
    video: https://www.youtube.com/watch?v=ppNyHOC-FNI&feature=youtu.be

    • (notes only) Josh: have been watching the rechunking threads.

      Will give this a try for terabyte scale image arrays.

  • Matthias: Some questions about spec v3, fallback dtype for extensions … Should I start work with on a draft implementation that we update as the spec goes.