2022-09-22
Attending: Ward Fisher (WF), Josh Moore (JM), Ryan Abernathey (RA), Jeremy Maitin-Shepard (JMS), Dennis Heimbigner (DH)
TL;DR
Consolidated metadata needs an extension for V3, which might result in a new ZEP. JMS then shared a document titled ‘Optionally-cooperative distributed b-tree for Tensorstore’, which the group discussed. After that, JM brought up the codecs-registry, which was built by one of the GSoC students this summer. The meeting ended with a discussion of the paths to the metadata files.
Meeting Minutes:
- Java/NetCDF side:
- JM: Sanket met people
- WF: Unidata should have 3x the staff.
- JM: perhaps starting with a kerchunk implementation?
- WF: looking for more community involvement (like netcdf-c had)
- JM: Greg mentioned consolidated metadata needs an extension for V3
- RA: Iceberg issue, also see JMS’ proposal
- JMS: touches on not needing a file per chunk (as discussed last night)
- https://docs.google.com/document/d/1PLfyjtCnfJRr-zcWSxKy-gxgHJHSZvJ2y4C3JEHRwkQ/edit?resourcekey=0-o0JdDnC44cJ0FfT8K6U2pw#heading=h.8g8ih69qb0v
- DB format that stores a B-tree.
- uniquely: designed to allow distributed writes (S3, etc.) but doesn’t need a persistent database (sketch below)
- can also read it in a non-distributed fashion
- downside: adds quite a bit of complexity (especially for a binary format)
- also good where sharding isn’t appropriate (e.g. the pre-defined shard size required for writing)
- e.g. large number of small arrays (where sharding won’t help)
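To make the structure concrete, here is a minimal, purely illustrative Python sketch of the leaf-node idea; the field names and layout are assumptions, not the actual TensorStore design. A leaf node holds sorted chunk keys plus references into data files, and IDs are random 128-bit values so writers don’t have to coordinate on names:

```python
import bisect
import uuid

def new_node_id() -> str:
    # Random 128-bit identifier, so concurrent writers can create new nodes
    # and data files without coordinating on names.
    return uuid.uuid4().hex

# Hypothetical leaf node: sorted chunk keys plus (data_file, offset, length) refs.
leaf_node = {
    "keys": ["c/0/0", "c/0/1", "c/1/0"],
    "refs": [
        {"data_file": new_node_id(), "offset": 0, "length": 4096},
        {"data_file": new_node_id(), "offset": 4096, "length": 4096},
        {"data_file": new_node_id(), "offset": 0, "length": 8192},
    ],
}

def lookup(node: dict, key: str):
    """Binary search within a single leaf node."""
    i = bisect.bisect_left(node["keys"], key)
    if i < len(node["keys"]) and node["keys"][i] == key:
        return node["refs"][i]
    return None

print(lookup(leaf_node, "c/0/1"))  # reference into some data file
print(lookup(leaf_node, "c/9/9"))  # None
```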
- RA: nice document. comments:
- focused on big distributed writes, but with Iceberg the main motivation was different: more flexibility in mapping keys to chunks. kerchunk-like; virtual concatenation. Can you reference random chunks? Yes.
- JMS: B-tree nodes have references to files (like kerchunk), but data files are identified with a 128-bit path (not an fsspec URL)
- RA: different use case, so can have them be optional transformers/extensions
- RA: really similar to tiledb! why not use it?
- JMS: TileDB is organized by time, not space.
- JM: need a compaction
- JMS: and even after that you still have a million files.
- DH: HDF5? Internally it’s B-trees (which are responsible for most of its complexity). Are you sure this is the path?
- JMS: not sure there’s an alternative to B-trees; they’re used in databases, filesystems, etc.
- DH: if you don’t want ordered searches, then linear hashing is an alternative
- JMS: ordering is useful for a lot of use cases, but there wasn’t an obvious solution for distributed writes
- DH: extendible hashing is a simpler data structure (old paper) that works well with disk storage.
- JMS: think this is more of a key-value store (like zip)
- RA: agreed. Nice that it’s possible to experiment like this.
- RA: can the V3 spec support this experimentation? (right extension points?)
- RA: trying to do that with Iceberg. Martin suggested “IceChunk”.
- See also: Hudi and others. Lots of smart ideas that we can copy.
- Goal is to provide some level of branching & transactions for/on a Zarr store
- Allows you to work in a staging area which all gets written at once.
- Branch non-destructively (or roll back)
- The key is having a “manifest” (they all have some concept of that, even kerchunk)
- Don’t depend on the object store’s listing as the source of truth
- Need storage transformers at the top level, not per-array. But for JMS’s idea, array-level might suffice. (Sketch below.)
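A rough sketch of the manifest-plus-branch idea under discussion; all names, keys, and file layout below are invented for illustration and are not the Iceberg or “IceChunk” format:

```python
import json

# A manifest lists every chunk key -> object key, so readers never have to
# list the object store to discover chunks.
manifest_v1 = {
    "attributes": {"note": "illustrative only"},
    "chunks": {
        "temperature/c/0/0": "objects/chunk-0000",
        "temperature/c/0/1": "objects/chunk-0001",
    },
}

# The branch reference is the only thing that must change atomically:
# staged chunk writes become visible when "main" moves to a new manifest.
refs = {"branches": {"main": "manifests/v1.json"}, "tags": {}}

def commit(refs: dict, branch: str, new_manifest_key: str) -> dict:
    """Point the branch at a freshly written manifest (the 'commit')."""
    refs["branches"][branch] = new_manifest_key
    return refs

print(json.dumps(commit(refs, "main", "manifests/v2.json"), indent=2))
```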
- JMS: wasn’t planning on an extension. root metadata would be in the same data store.
- JM: basically writing a DB/filesystem :+1: ZarrFS ;)
- JMS: planning on Mongo? Yeah, or Dynamo. (They store JSON.)
- JSON in S3 isn’t ideal.
- metadata in a document store and chunks on disk. Beyond just a filesystem; it’s a data lake.
- “meta-store”
- JMS: regarding versioning, how are you representing the delta?
- The chunk is the minimal writable unit. (out-of-scope)
- Every chunk write is uniquely ID’d (e.g. content-addressable). That gets a key. Write that to the DB.
- JMS: expecting the database to provide the versioning?
- RA: no, just a place for documents. Versioning (in Iceberg) has a branch or a tag that points to a specific chunk manifest. You can create a new one and point your HEAD at that. Only rely on the database to atomically change the references. Iceberg tracks a number for the transaction. (Sketch after this exchange.)
- JMS: use the kerchunk model? Limitation on the number of chunks?
- RA: chunks are likely in a separate manifest. Discussed that as another extension with Martin.
- RA: but can just query a chunk from the database.
- JMS: 1M chunks in v1, then update to v2. What’s the diff? A copy.
- RA: yeah need to play with it.
- JMS: when you want to update just a portion of it, you get to B-trees :smile:
- RA: no DB people here; trying to keep it hackable.
- RA: but megabyte-scale kerchunk is already getting :heart: since it’s so easy. Looking for incremental improvement on that. (NASA will be pumping out GRIB forever…)
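A small sketch of the two ingredients mentioned above, with hypothetical names and layout: content-derived chunk keys, plus a compare-and-swap style update of the branch reference, which is the only operation the database is trusted for:

```python
import hashlib

def chunk_object_key(chunk_bytes: bytes) -> str:
    # Content-addressable: identical bytes always map to the same key, and a
    # new version of a chunk never overwrites an older object.
    return "chunks/" + hashlib.sha256(chunk_bytes).hexdigest()

# Toy "database" of references; a real store would use a conditional update
# (compare-and-swap) so the pointer only moves if nobody else moved it first.
db = {"main": "manifests/v1.json"}

def swap_ref(db: dict, name: str, expected: str, new: str) -> bool:
    if db.get(name) != expected:
        return False  # someone else committed first; retry on a fresh manifest
    db[name] = new
    return True

print(chunk_object_key(b"\x00" * 4096))
print(swap_ref(db, "main", "manifests/v1.json", "manifests/v2.json"))  # True
print(swap_ref(db, "main", "manifests/v1.json", "manifests/v3.json"))  # False
```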
- JMS: looking forward to hearing more and exchanging info re: b-trees
- JMS: see also https://github.com/janelia-flyem/dvid (backed by KV database)
- JM: sharing layers with them?
- JMS: complicated by other priorities of the EM team. invite Bill to the Zarr meetings?
- RA: see https://lakefs.io/
- JM: API versus format
- RA: thinking about it more like an API
- JM: briefly, the codecs-registry
- https://zarr.dev/codecs-registry/
- https://github.com/zarr-developers/codecs-registry
- JMS: still want a schema per codec. JM: agreed!
- JMS: talks about codecs having URLs.
- would be an annoyance to have different V2 and V3 identifiers.
- e.g. just numeric constants in the JSON that are from the C API
- e.g. the shuffle parameter, which would be nicer as a string.
- support integer or string for a while (in order to deprecate the integer form); see the sketch at the end of this topic
- JM: plans to have code in each language that checks an ID against the central registry
- DH: approximates that with nczarr; ncdump lists the actual codecs in the file
- would be good to have something more sophisticated
- have the disadvantage of C code and interpreted files
- 3 repositories on the C side: Unidata + Irvine + HDF5
- HDF5 only has names, HDF5 IDs, and a pointer (which is often out of date)
- something universal would be nice
- WF: roping in the HDF5 group would be a heavy lift
- JMS: URL interface :rocket:
- DH: :+1: for the REST API
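A minimal sketch of the “accept integer or string, then deprecate the integer” idea; the registry entries, numeric IDs, and URLs below are invented stand-ins for C-API / HDF5-style constants, not actual registry contents:

```python
# Hypothetical registry entries (illustrative only, not the real registry).
CODEC_REGISTRY = {
    "blosc": {"numeric_id": 32001, "url": "https://zarr.dev/codecs-registry/blosc"},
    "gzip": {"numeric_id": 1, "url": "https://zarr.dev/codecs-registry/gzip"},
}
_BY_NUMBER = {v["numeric_id"]: name for name, v in CODEC_REGISTRY.items()}

def normalize_codec_id(codec_id) -> str:
    """Accept an integer (legacy) or string identifier and return the string name."""
    if isinstance(codec_id, int):
        # Transitional path: map the numeric constant to its registered name so
        # the integer form can eventually be deprecated.
        return _BY_NUMBER[codec_id]
    return codec_id

print(normalize_codec_id(32001))   # -> "blosc"
print(normalize_codec_id("gzip"))  # -> "gzip"
```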
- WF: NSF/CSSI solicitation has opened
- https://beta.nsf.gov/funding/opportunities/cyberinfrastructure-sustained-scientific-innovation-cssi
- perhaps something here
- WF: planning on getting to https://www.egu23.eu/
- tweet something from zarr_dev to see if there is interest :question:
- could collaborate on something re: nc/zarr
- JMS: no clear resolution yet on the paths to the metadata files
- JM: recapped the previous discussion and thinks it’s still good.
- JMS: some details around the root array (the named files, etc.)
- JMS: consolidated metadata? duplicated?
- JM: would make it possible to have everything at the top level (rough sketch at the end of these notes)
- JMS: pointers in the subdirectories? A bit annoying.
- JMS: with Iceberg & co. you likely don’t need consolidated metadata
- JM: so you’d push it to the store level?
- JMS: possibly, but not that simple
- JMS: there are cases where you need path separation anyway (Zips)
- JMS: so could see using a path separation strategy entirely
- JMS: Davis did have a use case …
- (…details zip, consolidated brainstorming…)
- JM: need both solutions…
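For reference, a rough sketch of what “everything at the top level” looks like in the existing V2 consolidated-metadata convention; the array and attributes shown are invented, and the V3 equivalent is exactly what remains unresolved:

```python
import json

# V2-style consolidated metadata: one top-level document holding copies of
# every metadata file, so a reader issues a single request instead of walking
# subdirectories.
zmetadata = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "temperature/.zarray": {
            "zarr_format": 2, "shape": [100], "chunks": [10], "dtype": "<f4",
            "compressor": None, "filters": None, "fill_value": 0, "order": "C",
        },
        "temperature/.zattrs": {"units": "K"},
    },
}

print(json.dumps(zmetadata["metadata"][".zgroup"]))
```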