20th April, 2022

Attending: Ryan Abernathey, Josh Moore, Eric Perlman, Sanket Verma, Jeremy Maitin-Shepard, Jonathan Striebel, Gregory Lee, Jim Pivarski, Ishan Bansal, Isaac Virshup, Parth Tripathi, Martin Durant, Ward Fisher, Dennis Heimbigner, Matthew McCormick

Updates:

  • GSoC deadline has passed; we have 3 proposals this year! Between April 19 and May 12 we can decide how many slots to take.
  • Cloud Native Outreach Event went great. Videos will be live shortly!
  • If you have any videos to share, let us know!
  • Using https://github.com/orgs/zarr-developers/discussions
    • a higher level than per-repo discussions (they show up on the “community” repository).
  • ZEP final update!
    • JM: implementation council to be invited
    • JS: great to have the implementors on board to not fragment the landscape
    • MD: some may not implement though, right? JM: true. multiple states of votes:
      • will-implement, may-implement, wont-implement, breaks-us-veto
    • MD: no clear status of what’s up to date
    • RA: veto power since would be bad to lead to forks. worth discussing that provision.
      • MD: since we aim for consensus anyway (and veto is used rarely) should work fine
      • JS: don’t want to end in a place where the spec says something that will never be implemented.
      • JS: separation on veto for core or extension. JM: agreed, focus all ZEPs on core for the moment
      • SV: extensions are V3, which isn’t done, so it’s all core.
      • JMS: only V3? JM: what about C/F order? RA: don’t have to limit it (but we want to focus on V3)
        • MD: agreed, the place to expose breakages
    • RA: core vs. extension
      • is core something that everyone must (eventually) implement
      • MD: some things are already optional like filter
      • MD: extensions were originally synonymous with conventions but dataset is openable without
      • RA: convention is distinct from optional extension (cf. variable length chunks)
      • JMS: another way of seeing extensions is the evolution of the spec. signalling to implementations that they are seeing new data. “must understand”
      • JP: agree about the distinction. can’t-read-data vs. might-need-a-library. have wanted to frame this as an extension that labels a convention, like an annotation.
      • IV: would be useful to specify convention. if you don’t have a way to store the metadata, then goes into the .zattrs
        • good to have a field in the structural metadata to specify conventions
        • JS: separate from convention. orthogonal questions.
      • RA: hierarchy (or ontology)
      • JP: A different example: I’ve seen HDF5 data files, from gravitational waves, that are valid HDF5 but can only be “understood” by the LIGO collaboration’s code. It would have been good to have a label on that HDF5 file warning haphazard users.
      • Unidata?
        • DH: is wont-implement a veto? or if they say wont-implement and breaks, then is veto? That’s a lot of power.
        • JMS: non-zero origin & data-orders other than C are both examples that cause issues (with e.g. Julia). potential vetoes.
        • DH: solvable, but they are saying the cost is high.
        • RA: take sharding. major enhancement but pita to implement. is it core? need to show in ZEP? higher bar for core proposal?
        • DH: have been looking at how to implement. it will be a challenge. decided with Ward that it’s worth doing.
        • RA: meta goal is to have that discussion before the ship has sailed.
        • WF: feels a lot like an internal NetCDF conversation. What is an NC file? vs. what’s in the doc
          • NC file has to be more than a file written by netcdf-c library (the first party implementation)
          • goal of tech. spec. is to take it and write software in any lang. that can write/read a NC
          • needs to specify permissible deviations
        • MD: how to go from v3 to v3.1? (sharding or variable length chunks)
          • WF: have made many mistakes… (e.g. always refer to specific versions, NetCDF 3…)
          • note: unidata doesn’t yet have the iron clad backwards compatibility for nczarr
          • v3 to v3.1 could potentially not be backwards compatible
          • behavior versus definition (this message may self-destruct…)
      • MD: parquet example. people mention v2 but that doesn’t really exist
        • still features that aren’t implemented!
      • JMS: similar in HTML, https://caniuse.com/ – we need the same
      • JS: agreed, important to know. needed to read the data or not.
        • Even more core, must-understand flag & warnings about not being supported. All MUST have this for V3.
        • sharding storage-transformer proposal: sharding could be an extension that uses transformers.
        • then impl. council decides
      • WF: NetCDF isn’t a great analog here; we have no forward-compatibility promise, and the solution when an old version cannot read a newer file is to suggest they upgrade to the latest version. But this is because there are not a lot of independent implementations in the wild.
        • NetCDF is also fortunate to have a number of independently-developed utilities and tools (NetCDF Operators (NCO) and pnetcdf spring to mind). Perhaps a zdump (similar to ncdump or h5dump), provided by the core project, that could provide summary information for a file? This information could then be used to determine whether a specific dataset can be read by the implementation in question.
      • RA: useful concept here – netcdf is focused on interoperability & preservation. Parquet is for performance. Zarr is mainly a high-performance copy of data. But for sharing, might make different choices. Use different extensions then. i.e. have it both ways. Still need the minimal, most-operable version. And need to be clear & upfront about that.
      • MD: perhaps a “maximal-flag” setting? IV: perhaps flags. JMS: agreed.
        • MD: perhaps “conservative” to cover several of these
      • JM: https://xmpp.org/extensions/xep-0115.html
      • JP: version/flag-objects within spec, then could have storage-typed and performance-typed objects
      • IV: do that in AnnData. Sparse array v1 or v2. (i.e. at the object level)
      • RA: are we at the point that the core is the necessary stuff in v3 and we can go with that?
      • JMS: expect to add non-optional features in the future
      • JM: various data types are probably missing now
      • JS: storage transformer falls under this too
      • JMS: (…Josh missed a comment from JMS here…)
      • JS: not clear how optional the extensions are.
      • JM: strip extensions and add it back later? Agreed.
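
Illustration of the “must understand” idea raised by JMS and JS above: a reader refuses data that declares an unknown extension it is required to understand, and only warns when the unknown extension is optional. This is a rough sketch, not taken from the spec or from any Zarr implementation. The metadata layout (an "extensions" list whose entries carry "extension" and "must_understand" keys), the extension names, and the helper check_extensions are all assumptions made up for the example.

```python
import warnings

# Hypothetical identifiers for the extensions this reader implements.
SUPPORTED_EXTENSIONS = {"example.org/sharding"}


class UnsupportedExtensionError(Exception):
    """Raised when the metadata requires an extension the reader lacks."""


def check_extensions(metadata, supported=SUPPORTED_EXTENSIONS):
    """Return the unsupported extensions that were safe to ignore.

    Assumed layout: metadata["extensions"] is a list of dicts with an
    "extension" name and an optional boolean "must_understand".
    """
    ignored = []
    for ext in metadata.get("extensions", []):
        name = ext["extension"]
        if name in supported:
            continue
        # Assumption: a missing flag is treated as "must understand",
        # which is the conservative choice for a reader.
        if ext.get("must_understand", True):
            raise UnsupportedExtensionError(
                f"cannot read data: extension {name!r} is required"
            )
        warnings.warn(f"ignoring unsupported optional extension {name!r}")
        ignored.append(name)
    return ignored


if __name__ == "__main__":
    example = {
        "extensions": [
            {"extension": "example.org/sharding", "must_understand": True},
            {"extension": "example.org/units", "must_understand": False},
        ]
    }
    print(check_extensions(example))
```

The list returned by such a check is also the kind of summary a zdump-style tool or a caniuse-style table could report per implementation.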