27th February, 2019

Date and time: 20:00 GMT

Joining instructions: https://zoom.us/j/300670033 (see Full Zoom instructions)

Intend to attend: Josh Moore, Ryan Williams, Constantin Pape, Alistair Miles

Attendees: JoshM, RyanA, AlistairM, ConstantinP, WardF, JohnK, RyanW, StephanS, Jan, MartinD, TonyT

Please feel free to propose an item for the agenda below. If you do, please include your name so we know who proposed the item.

Standing items

  • Introductions: new joiners should feel welcome to explain why they are interested in joining.

  • Ward Fisher, Unidata, NetCDF team lead.
  • Tony T - CZI engineer, working on starfish.

Spec proposals ~20:05 UK

(Agenda item proposed by Josh Moore.)

Based on the example of the Zarr/N5 spec diff:

  • What are the steps towards making a spec proposal? (timelines, process)

    • PRs against this repo? (cf. filter branch)

  • More concretely, are there opinions on options to attempt here?

    • Unify? Minimal vs core specs?

Notes:

Josh: What are concrete steps for a spec proposal? Do we create a version file? RFC? Can’t yet deal with voting etc.

Constantin: What would be interesting to discuss is how zarr and n5 will live together. Coexist as similar flavours? Converge on zarr/n5 and deprecate both current versions?

Alistair: being clear about the scope & requirements if we’re modifying the core.

Josh: Having a single entity would be ideal. Layers would also be OK. Is it just a v3 of the spec?

…another idea: define a minimal spec based on the core of n5 and zarr, then have a maximal spec with additional features. Then the NetCDF/ZDF spec is one layer higher.

RyanA: Extensions are a good structure. First, what is the most basic zarr/n5 spec? Then, what are the extensions?

Alistair: a modular spec framework as the first step? Part of that is related to the protocol (keys/values), along with defining community conventions. Then we can focus on the core spec and the rest is more decoupled.

Stephan: Likes the layered idea. Approach this by looking at the existing APIs. E.g., know how to construct a key, know what's behind it, know how to load a block. And how does that then translate into compression and storage implementations? Basic framework: how do we construct keys, and what are the data primitives behind keys? Then, what is a block? Is it always an n-dimensional chunk, or just a blob?
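
As a concrete illustration of the key-construction difference being discussed (a sketch only; the helper functions are hypothetical, though the separators match the current zarr v2 and N5 formats):

```python
# Hypothetical helpers contrasting how the two formats build chunk keys today.

def zarr_v2_chunk_key(array_path, chunk_coords):
    # zarr v2 joins chunk indices with ".", e.g. "foo/bar/0.0"
    return array_path + "/" + ".".join(str(i) for i in chunk_coords)

def n5_chunk_key(array_path, chunk_coords):
    # N5 nests chunk indices as directories, e.g. "foo/bar/0/0"
    return array_path + "/" + "/".join(str(i) for i in chunk_coords)

print(zarr_v2_chunk_key("foo/bar", (0, 0)))  # foo/bar/0.0
print(n5_chunk_key("foo/bar", (0, 0)))       # foo/bar/0/0
```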

Constantin: How is a block defined? E.g., does it include a header?

Stephan: How is a block/chunk defined? Then think about what else a chunk could be. Do we always want to just read a complete chunk? (Like that, actually.) Then, at the next layer: what do I do with a chunk? Deflate it into a bytestream? Is it always an n-dimensional container? How can we make it flexible?
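
For reference, a sketch of the contrast under discussion, per my reading of the two formats today: an N5 block carries a small per-block header (mode, number of dimensions, block shape, all big-endian), while a zarr v2 chunk is just the compressed bytes, with shape and dtype stored once in the array metadata:

```python
import struct

def read_n5_block_header(buf):
    # N5 default-mode block header: mode (uint16), ndim (uint16),
    # then one uint32 per dimension, all big-endian.
    mode, ndim = struct.unpack_from(">HH", buf, 0)
    shape = struct.unpack_from(">" + "I" * ndim, buf, 4)
    payload = buf[4 + 4 * ndim:]  # compressed pixel data follows
    return mode, shape, payload

# A zarr v2 chunk file, by contrast, contains only the compressed data;
# shape, dtype, and order live in the array's .zarray metadata document.
```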

Jan: Storage can just deal with blobs…

John: Challenge with filters: there is currently support for object codecs, used for text data, but we would need to think about how to generalise that.

Josh: From some of the issues I've read, maybe push that out to an extension spec, not part of the core spec. Or maybe we could do it…

John: Is it being used for text data? Ragged arrays?

Martin: Similar concepts exist in the current implementations. Key question: do people use those things now?

Alistair: comes back to capturing the requirements for the core spec, but would like to avoid a heavyweight upfront process. Personally (genomics) needs to store variable-length text strings and ragged arrays of primitive types, which is why there's currently support, but didn't manage to get that properly captured in the spec. As well as: what does a chunk look like? Incomplete chunks?
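
(For reference, the existing support Alistair mentions, as exposed today by the Python implementation via numcodecs; a sketch of current behaviour, not a spec proposal:)

```python
import numpy as np
import zarr
from numcodecs import VLenUTF8, VLenArray

# Variable-length text: an object array of strings, encoded with VLenUTF8.
texts = zarr.array(np.array(["a", "bb", "ccc"], dtype=object),
                   dtype=object, object_codec=VLenUTF8())

# Ragged arrays of a primitive type: an object array of int32 arrays.
ragged = zarr.array(np.array([np.arange(3), np.arange(5)], dtype=object),
                    dtype=object, object_codec=VLenArray("<i4"))
```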

Josh: how do ragged arrays look in the N5 space? → Stephan: dunno. Likely doable, don't know if it matches “core”.

Stephan: How to construct keys, what characters are allowed. Concepts of datasets and groups, agreed. Concept of user attributes for metadata. Don’t know about typed metadata, could be a lot of work.

…Re chunk, basic N-dimensional chunk of pixels, with support for all primitive types supported in most programming languages.

…And then should be room for extensions in the standard. E.g., for Python, store objects. Or for a Java person to stream objects. (Ryan: concur)

Constantin: Only difference between n5 and zarr is the keys.

Stephan: Also zarr is a bit more comprehensive, e.g., it supports different byte orders. N5 only supports network (java) byte order. Think it's a good idea to support different byte orders, as it can make streaming data into memory faster.
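
(Illustration of the current zarr behaviour: the byte order is carried in the dtype recorded in the .zarray metadata, so both orders round-trip. A minimal sketch:)

```python
import zarr

# zarr v2 records byte order as part of the dtype; N5 is fixed to
# big-endian (network/Java) order.
big = zarr.zeros((4, 4), chunks=(2, 2), dtype=">f8")     # big-endian
little = zarr.zeros((4, 4), chunks=(2, 2), dtype="<f8")  # little-endian
print(big.dtype.byteorder, little.dtype.byteorder)       # > <
```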

Josh: We'll try to create a v3 spec. Basically a cut and paste of v2, with each of the requirements spelled out, with modifications, but still core. If anyone wants to propose an addition, they can do so via a PR, for discussion. Then extensions could be separate. Strategy for the next couple of weeks: copy v2, then add bullet points?

…Ward: anything you can see in NetZDF that needs to be in the core zarr/n5 spec?

Ward: NetZDF sits somewhere between the extended and classic models; couldn't do feature-for-feature parity with the extended data model, e.g., ragged arrays. It would be nice for those features to come into this spec. Ideal if we can avoid introducing a new data model and just have the extant netcdf4 data model with a new storage model underneath it. Features like ragged arrays (vlens) aren't used very often, so it's nice to hear there are practical use cases.

…Typed attributes. We have a document that Dennis H wrote, will try to find and share.

RyanA: NetCDF model, user-defined types? (see https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html )

Ward: Yes, main concern is to bridge that with the zarr spec. What's come up: user-defined types are in the extended model, but no one is using enums or compound types; it would be great to have them, but not sure the work is justified as they're rarely used.

Martin: Mentioned that awkward-array has some interesting ideas for working with more unusual data structures.

John: Some example data from the different libraries in this repo.

Zarr/N5 compatibility - 20:30 UK

(Agenda item proposed by John Kirkham)

Core requirements

| Feature | Core or not core | Notes |
| --- | --- | --- |
| Byte order | Core | |
| Layout (C/F) | | |
| Chunk size | | Vs. uniform chunk sizes |
| Edge chunks | | I.e. chunk grid |
| Overlapping chunks | | |
| Variable chunk sizes | | |
| VLen (ragged) | | |
| Appended arrays | | |
| Prepended arrays | | Mapping keys? (offsets) |
| Headers | | Perhaps also optional? |
| Fill value | | |
| Typed attributes | | |
| Gzip | | |
| …other filter stuff… | | |

Discussion

Alistair: explain variable-length chunks?

Stephan: size is primarily beneficial for edge chunks. Variable-length chunks are used to store pixel labels, since those are of arbitrary length. Spatial dimensions are still defined by the chunk size. (Similar to storing text.)

Alistair: different from storing a variable-length item in your chunk? (i.e. the number of items stays the same)

Stephan: store it as a lookup table into a multi-set, though it is related to vlen.

Jan: an encoder/decoder/filter on top of vlen arrays for producing the labels?

Stephan: also …

  • JohnK: some application to computing with dask

Stephan: Orthogonal concepts: grid layout of chunks; size (shape?) of chunks; layout of data within chunk.

…need to extend arrays? In the positive and negative directions? I think we should support prepend as well as append.
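
(Append already exists in the Python implementation; prepend is the open spec question, since growing at the start would shift every chunk key unless keys can be offset. A minimal sketch of the existing append:)

```python
import numpy as np
import zarr

# Start with an empty array and grow it along axis 0.
z = zarr.zeros((0, 5), chunks=(2, 5), dtype="i4")
z.append(np.arange(10, dtype="i4").reshape(2, 5), axis=0)
print(z.shape)  # (2, 5)
```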

Alistair: convert table to issues on zarr-specs so we can work through the various choices? (General nodding)

Stephan: Sometimes you might want a header and sometimes not; it could remain an option.

Alistair: variable-length text arrays: numpy object arrays (each element is a string). On encoding, run them through the VLenUTF8 encoder. Cf. the pandas discussion on the suboptimality of object arrays.

Stephan: would be like using a dictionary per chunk (compresses well).

JohnK: the similar thing in zarr might be a structured array. The records have different shapes. Store that as one record for the chunk.

Alistair: but that still has a fixed storage layout.

Martin: local vs. global encodings, and how to make sense of them.

Stephan: are encoders/decoders for the chunk or for items?

Alistair: transformers for arbitrary sequences of bytes.

Josh: additional blobs per chunk?

Jan: or additional (parallel) datasets.

Alistair: chunks as ND, then for each datatype you have a canonical in-memory representation. So far leveraged off of numpy, but there are other cases like pixels+LUTs. Cf. ndtypes, arrow, etc. Does something like this exist already?

Stephan: sizes could change largely.

RyanA: same with compression. Leads to ragged arrays.

(Stephan, Jan, Ward leave. Tony left at some point, too.)

RyanA: articulate how Zarr is different from Parquet. Perhaps it's largely just the dimensionality.

Alistair: there is work on encodings in Parquet.

Martin: there are other encoders that are better.

Alistair: tried (data, length) and (length, data) layouts and compared compression. Beyond numpy, what's the best encoding for arrays of text?

Martin: cf. https://github.com/scikit-hep/awkward-array and https://root.cern.ch/ . Offsets for text is a good way to go.
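
(A sketch of the offsets-based layout Martin points to, as used by Arrow and awkward-array: one flat data buffer plus an offsets array, instead of per-item (length, data) pairs. Function names here are illustrative:)

```python
import numpy as np

def encode_offsets(strings):
    # Flat UTF-8 buffer plus cumulative byte offsets;
    # offsets[i]:offsets[i+1] slices out item i.
    data = "".join(strings).encode("utf-8")
    offsets = np.cumsum([0] + [len(s.encode("utf-8")) for s in strings])
    return offsets.astype(np.int64), data

def decode_item(offsets, data, i):
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

offsets, data = encode_offsets(["a", "bb", "ccc"])
print(decode_item(offsets, data, 1))  # bb
```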

Library (performance) requirements

  • ImgLib2
  • dask

Other topics

  • FYI (Ward): prepping a release of netcdf (4.6.3) today/tomorrow

  • Any status update on governance templates?

    • Alistair: Definite aspirations to look into it. Part of it is the spec development process. Focus on that first. Then the social agreements side can come later.

    • Constantin: +1 for opening issues. Will try to branch off issues, especially for the chunk-related things: https://github.com/zarr-developers/zarr-specs/issues/7

    • JohnK: example repository - https://github.com/constantinpape/zarr_implementations

      Constantin: Will clean up and then transfer ownership to zarr-developers

  • Other communication media?

  • Chair for March 13th?

    • Alistair to find someone on the issue unless ….

      • Can’t in March: Constantin, John K.
      • Can Ward or Stephan lead next time with their technical expertise?
      • RyanW in 4 weeks might be good. (can potentially include 20-30 minutes next time)