20th April, 2022
Attending: Ryan Abernathey, Josh Moore, Eric Perlman, Sanket Verma, Jeremy Maitin-Shepard, Jonathan Striebel, Gregory Lee, Jim Pivarski, Ishan Bansal, Isaac Virshup, Parth Tripathi, Martin Durant, Ward Fisher, Dennis Heimbigner, Matthew McCormick
Updates:
- GSoC deadline has passed; we have 3 proposals this year! Between April 19 and May 12 we can decide how many slots to take.
- Cloud Native Outreach Event went great. Videos will be live shortly!
- If you have any videos to share, let us know!
- Using https://github.com/orgs/zarr-developers/discussions
- organization-level discussions, above the individual repos (specifically, they show up on the “community” repository)
- ZEP final update!
- JM: implementation council to be invited
- JS: great to have the implementors on board to not fragment the landscape
- MD: some may not implement though, right? JM: true. multiple states of votes:
- will-implement, may-implement, wont-implement, breaks-us-veto
- MD: no clear status of what’s up to date
- RA: veto power since would be bad to lead to forks. worth discussing that provision.
- MD: since we aim for consensus anyway (and veto is used rarely) should work fine
- JS: don’t want to end in a place where the spec says something that will never be implemented.
- JS: separation on veto for core or extension. JM: agreed, focus all ZEPs on core for the moment
- SV: extensions are V3, which isn’t done, so it’s all core.
- JMS: only V3? JM: what about C/F order? RA: don’t have to limit it (but we want to focus on V3)
- MD: agreed, the place to expose breakages
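
The four vote states listed above can be sketched as a small enumeration with a tally. This is a minimal illustration only: the notes do not define an acceptance rule, so the sketch just counts votes per state and flags whether any implementation cast `breaks-us-veto`. The names `Vote`, `summarize`, and the `impl-*` entries are hypothetical.

```python
from collections import Counter
from enum import Enum


class Vote(Enum):
    """The four vote states named in the discussion."""
    WILL_IMPLEMENT = "will-implement"
    MAY_IMPLEMENT = "may-implement"
    WONT_IMPLEMENT = "wont-implement"
    BREAKS_US_VETO = "breaks-us-veto"


def summarize(votes):
    """Tally votes per state and report whether any implementation vetoed.

    `votes` maps an implementation name to a Vote. No acceptance rule is
    applied here, because the notes do not specify one.
    """
    tally = Counter(votes.values())
    return {
        "tally": {state.value: tally.get(state, 0) for state in Vote},
        "vetoed": tally.get(Vote.BREAKS_US_VETO, 0) > 0,
    }


# Hypothetical implementation names, for illustration only.
example_votes = {
    "impl-a": Vote.WILL_IMPLEMENT,
    "impl-b": Vote.MAY_IMPLEMENT,
    "impl-c": Vote.WONT_IMPLEMENT,
}
summary = summarize(example_votes)
print(summary)
```

Keeping `wont-implement` separate from `breaks-us-veto` in the tally mirrors the distinction raised in the discussion: declining to implement is recorded, but only an explicit veto is flagged.
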
- RA: core vs. extension
- is core something that everyone must (eventually) implement
- MD: some things are already optional like filter
- MD: extensions were originally synonymous with conventions, but a dataset is openable without them
- RA: convention is distinct from optional extension (cf. variable length chunks)
- JMS: another way of seeing extensions is the evolution of the spec. signalling to implementations that they are seeing new data. “must understand”
- JP: agree about the distinction. can’t-read-data vs. might-need-a-library. have wanted to frame this as an extension that labels a convention, like an annotation.
- IV: would be useful to specify convention. if you don’t have a way to store the metadata, then goes into the .zattrs
- good to have a field in the structural metadata to specify conventions
- JS: separate from convention. orthogonal questions.
- RA: hierarchy (or ontology)
- JP: A different example: I’ve seen HDF5 data files, from gravitational waves, that are valid HDF5 but can only be “understood” by the LIGO collaboration’s code. It would have been good to have a label on that HDF5 file warning haphazard users.
- Unidata?
- DH: is wont-implement a veto? or if they say wont-implement and breaks, then is veto? That’s a lot of power.
- JMS: non-zero origin & data-orders other than C are both examples that cause issues (with e.g. Julia). potential vetoes.
- DH: solvable, but they are saying the cost is high.
- RA: take sharding. major enhancement but pita to implement. is it core? need to show in ZEP? higher bar for core proposal?
- DH: have been looking at how to implement. it will be a challenge. decided with Ward that it’s worth doing.
- RA: meta goal is to have that discussion before the ship has sailed.
- WF: feels a lot like an internal NetCDF conversation. What is an NC file? vs. what’s in the doc
- NC file has to be more than a file written by netcdf-c library (the first party implementation)
- goal of tech. spec. is to take it and write software in any lang. that can write/read a NC
- needs to specify permissible deviations
- MD: how to go from v3 to v3.1? (sharding or variable length chunks)
- WF: have made many mistakes… (e.g. always refer to specific versions, NetCDF 3…)
- note: Unidata doesn’t yet have the ironclad backwards compatibility for nczarr
- v3 to v3.1 could potentially not be backwards compatible
- behavior versus definition (this message may self-destruct…)
- MD: parquet example. people mention v2 but that doesn’t really exist
- still features that aren’t implemented!
- JMS: similar in HTML, https://caniuse.com/ – we need the same
- JS: agreed, important to know. needed to read the data or not.
- Even more core, must-understand flag & warnings about not being supported. All MUST have this for V3.
- sharding storage-transformer proposal: sharding could be an extension that uses transformers.
- then impl. council decides
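
A must-understand check on the reader side might look like the sketch below. It is an assumption-laden illustration, not the spec: the metadata layout (an `extensions` list whose entries carry `name` and `must_understand`), the default of treating a missing flag as `True`, and the names `check_extensions`, `UnsupportedExtensionError`, and `example-ext-a` are all hypothetical.

```python
import warnings

# Hypothetical set of extensions this reader supports.
SUPPORTED = {"example-ext-a"}


class UnsupportedExtensionError(Exception):
    """Raised when data requires an extension the reader lacks."""


def check_extensions(metadata, supported=SUPPORTED):
    """Return the usable extension names, or refuse to read.

    Assumed layout: metadata["extensions"] is a list of dicts with a
    "name" and an optional boolean "must_understand". A missing flag is
    treated as True here, which is a conservative assumption.
    """
    usable = []
    for ext in metadata.get("extensions", []):
        name = ext["name"]
        if name in supported:
            usable.append(name)
        elif ext.get("must_understand", True):
            # The data cannot be read correctly without this extension.
            raise UnsupportedExtensionError(
                f"extension {name!r} is required but not supported"
            )
        else:
            # The data is still readable; tell the user what is skipped.
            warnings.warn(f"ignoring unsupported optional extension {name!r}")
    return usable
```

The two branches correspond to the distinction made earlier in the discussion: an unsupported required extension means the data cannot be read, while an unsupported optional one only produces a warning.
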
- WF: NetCDF isn’t a great analog here; we have no forward-compatibility promise, and the solution when an old version cannot read a newer file is to suggest they upgrade to the latest version. But this is because there are not a lot of independent implementations in the wild.
- NetCDF is also fortunate to have a number of independently-developed utilities and tools (NetCDF Operators (NCO), and pnetcdf spring to mind). Perhaps a zdump (similar to ncdump or h5dump), provided by the core project, that could provide summary information for a file? This information could then be used to determine if a specific dataset could be read by an implementation in question.
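
As a rough idea of what a `zdump`-style summary could report, here is a minimal sketch using only the Python standard library. It assumes a Zarr V2 directory store, where each array directory holds a `.zarray` JSON file; the function name `zdump`, the output format, and the demo store are hypothetical, and no such tool is implied to exist.

```python
import json
import tempfile
from pathlib import Path


def zdump(root):
    """Return one summary line per array found under a V2 directory store."""
    root = Path(root)
    lines = []
    for meta in sorted(root.rglob(".zarray")):
        info = json.loads(meta.read_text())
        rel = meta.parent.relative_to(root).as_posix()
        name = "/" if rel == "." else "/" + rel
        compressor = info.get("compressor") or {}  # compressor may be null
        lines.append(
            f"{name}: shape={info['shape']} chunks={info['chunks']} "
            f"dtype={info['dtype']} order={info['order']} "
            f"compressor={compressor.get('id')}"
        )
    return lines


# Demo: write a tiny store containing metadata only (no chunk data).
with tempfile.TemporaryDirectory() as tmp:
    array_dir = Path(tmp) / "temperature"
    array_dir.mkdir()
    (Path(tmp) / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (array_dir / ".zarray").write_text(json.dumps({
        "zarr_format": 2,
        "shape": [100, 100],
        "chunks": [10, 10],
        "dtype": "<f8",
        "order": "C",
        "fill_value": 0.0,
        "compressor": {"id": "blosc"},
        "filters": None,
    }))
    demo_lines = zdump(tmp)

print("\n".join(demo_lines))
```

A summary like this lists the features a dataset uses (memory order, compressor, and so on), which is the information an implementation would need to decide whether it can read the data.
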
- RA: useful concept here – netcdf is focused on interoperability & preservation. Parquet is for performance. Zarr is mainly high-performance copy of data. But for sharing, might make different choices. Use different extensions then. i.e. have it both ways. Still need the minimal, most-operable version. And need to be clear & upfront about that.
- MD: perhaps a “maximal-flag” setting? IV: perhaps flags. JMS: agreed.
- MD: perhaps “conservative” to cover several of these
- JM: https://xmpp.org/extensions/xep-0115.html
- JP: version/flag-objects within spec, then could have storage-typed and performance-typed objects
- IV: do that in AnnData. Sparse array v1 or v2. (i.e. at the object level)
- RA: are we at the point that the core is the necessary stuff in v3 and we can go with that?
- JMS: expect to add non-optional features in the future
- JM: various data types are probably missing now
- JS: storage transformer falls under this too
- JMS: (…Josh missed a comment from JMS here…)
- JS: not clear how optional the extensions are.
- JM: strip extensions and add it back later? Agreed.