2022-09-22
Attending: Ward Fisher (WF), Josh Moore (JM), Ryan Abernathey (RA), Jeremy Maitin-Shepard (JMS), Dennis Heimbigner (DH)
TL;DR
Consolidated metadata needs an extension for V3, which might result in a new ZEP. JMS then shared a document titled ‘Optionally-cooperative distributed b-tree for Tensorstore’, which the group discussed. After that, JM brought up the codecs-registry, which was built by one of the GSoC students this summer. The meeting ended with a discussion of the paths to the metadata files.
Meeting Minutes:
- Java/NetCDF side:
- JM: Sanket met people
- WF: Unidata should have 3x the staff.
- JM: perhaps starting with a kerchunk implementation?
- WF: looking for more community involvement (like netcdf-c had)
- JM: Greg mentioned consolidated metadata needs an extension for V3
- RA: Iceberg issue, also see JMS’ proposal
- JMS: touches on not needing a file per chunk (as discussed last night)
- https://docs.google.com/document/d/1PLfyjtCnfJRr-zcWSxKy-gxgHJHSZvJ2y4C3JEHRwkQ/edit?resourcekey=0-o0JdDnC44cJ0FfT8K6U2pw#heading=h.8g8ih69qb0v
- DB format that stores a B-tree.
- uniquely: designed to allow distributed writes (S3, etc.) but doesn’t need a persistent database (sketch below)
- can also read it in a non-distributed fashion
- downside: adds quite a bit of complexity (especially for a binary format)
- also good where sharding isn’t appropriate (e.g. the pre-defined shard size required for writing)
- e.g. large number of small arrays (where sharding won’t help)
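To make the structure concrete, here is a minimal, purely illustrative Python sketch of the leaf-node idea; the field names and layout are assumptions, not the actual TensorStore design. A leaf node holds sorted chunk keys plus references into data files, and IDs are random 128-bit values so writers don’t have to coordinate on names:

```python
import bisect
import uuid

def new_node_id() -> str:
    # Random 128-bit identifier, so concurrent writers can create new nodes
    # and data files without coordinating on names.
    return uuid.uuid4().hex

# Hypothetical leaf node: sorted chunk keys plus (data_file, offset, length) refs.
leaf_node = {
    "keys": ["c/0/0", "c/0/1", "c/1/0"],
    "refs": [
        {"data_file": new_node_id(), "offset": 0, "length": 4096},
        {"data_file": new_node_id(), "offset": 4096, "length": 4096},
        {"data_file": new_node_id(), "offset": 0, "length": 8192},
    ],
}

def lookup(node: dict, key: str):
    """Binary search within a single leaf node."""
    i = bisect.bisect_left(node["keys"], key)
    if i < len(node["keys"]) and node["keys"][i] == key:
        return node["refs"][i]
    return None

print(lookup(leaf_node, "c/0/1"))  # reference into some data file
print(lookup(leaf_node, "c/9/9"))  # None
```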
- RA: nice document. comments:
- focused on big distributed writes, but with Iceberg the main motivation was different: more flexibility in mapping keys to chunks. kerchunk-like; virtual concatenation. Can you reference random chunks? Yes.
- JMS: B-tree nodes have references to files (like kerchunk), but data files are identified with a 128-bit path (not an fsspec URL)
- RA: different use case, so can have them be optional transformers/extensions
- RA: really similar to tiledb! why not use it?
- JMS: TileDB is organized by time, not space.
- JM: need a compaction
- JMS: and even after that you still have a million files.
- DH: HDF5? Internally it’s B-trees (which are responsible for most of its complexity). Are you sure this is the path?
- JMS: not sure there’s an alternative to B-trees; they’re used in databases, filesystems, etc.
- DH: if you don’t want ordered searches, then linear hashing is an alternative
- JMS: ordering is useful for a lot of use cases, but there wasn’t an obvious solution for distributed writes
- DH: extendible hashing is a simpler data structure (old paper) that works well with disk storage.
- JMS: think this is more of a key-value store (like zip)
- RA: agreed. Nice that it’s possible to experiment like this.
- RA: can the V3 spec support this experimentation? (right extension points?)
- RA: trying to do that with Iceberg. Martin suggested “IceChunk”.
- See also: Hudi and others. Lots of smart ideas that we can copy.
- Goal is to provide some level of branching & transactions for/on a Zarr store
- Allows you to work in a staging area which all gets written at once.
- Branch non-destructively (or roll back)
- The key is having a “manifest” (they all have some concept of that, even kerchunk)
- Don’t depend on the object store’s listing as the source of truth
- Need storage transformers at the top level, not per-array. But for JMS’s idea, array-level might suffice. (Sketch below.)
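A rough sketch of the manifest-plus-branch idea under discussion; all names, keys, and file layout below are invented for illustration and are not the Iceberg or “IceChunk” format:

```python
import json

# A manifest lists every chunk key -> object key, so readers never have to
# list the object store to discover chunks.
manifest_v1 = {
    "attributes": {"note": "illustrative only"},
    "chunks": {
        "temperature/c/0/0": "objects/chunk-0000",
        "temperature/c/0/1": "objects/chunk-0001",
    },
}

# The branch reference is the only thing that must change atomically:
# staged chunk writes become visible when "main" moves to a new manifest.
refs = {"branches": {"main": "manifests/v1.json"}, "tags": {}}

def commit(refs: dict, branch: str, new_manifest_key: str) -> dict:
    """Point the branch at a freshly written manifest (the 'commit')."""
    refs["branches"][branch] = new_manifest_key
    return refs

print(json.dumps(commit(refs, "main", "manifests/v2.json"), indent=2))
```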
- JMS: wasn’t planning on an extension. root metadata would be in the same data store.
- JM: basically writing a DB/filesystem :+1: ZarrFS ;)
- JMS: planning on Mongo? Yeah, or Dynamo. (They store JSON.)
- JSON in S3 isn’t ideal.
- metadata in a document store and chunks on disk. Beyond just a filesystem; it’s a data lake.
- “meta-store”
- JMS: regarding versioning, how are you representing the delta?
- The chunk is the minimal writable unit. (out-of-scope)
- Every chunk write is uniquely ID’d (e.g. content-addressable). That gets a key. Write that to the DB.
- JMS: expecting the database to provide the versioning?
- RA: no, just a place for documents. Versioning (in Iceberg) has a branch or a tag that points to a specific chunk manifest. You can create a new one and point your HEAD at that. Only rely on the database to atomically change the references. Iceberg tracks a number for the transaction. (Sketch after this exchange.)
- JMS: use the kerchunk model? Limitation on the number of chunks?
- RA: chunks are likely in a separate manifest. Discussed that as another extension with Martin.
- RA: but can just query a chunk from the database.
- JMS: 1M chunks in v1, then update to v2. What’s the diff? A copy.
- RA: yeah need to play with it.
- JMS: when you want to update just a portion of it, you get to B-trees :smile:
- RA: no DB people here; trying to keep it hackable.
- RA: but megabyte-scale kerchunk is already getting :heart: since it’s so easy. Looking for incremental improvement on that. (NASA will be pumping out GRIB forever…)
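A small sketch of the two ingredients mentioned above, with hypothetical names and layout: content-derived chunk keys, plus a compare-and-swap style update of the branch reference, which is the only operation the database is trusted for:

```python
import hashlib

def chunk_object_key(chunk_bytes: bytes) -> str:
    # Content-addressable: identical bytes always map to the same key, and a
    # new version of a chunk never overwrites an older object.
    return "chunks/" + hashlib.sha256(chunk_bytes).hexdigest()

# Toy "database" of references; a real store would use a conditional update
# (compare-and-swap) so the pointer only moves if nobody else moved it first.
db = {"main": "manifests/v1.json"}

def swap_ref(db: dict, name: str, expected: str, new: str) -> bool:
    if db.get(name) != expected:
        return False  # someone else committed first; retry on a fresh manifest
    db[name] = new
    return True

print(chunk_object_key(b"\x00" * 4096))
print(swap_ref(db, "main", "manifests/v1.json", "manifests/v2.json"))  # True
print(swap_ref(db, "main", "manifests/v1.json", "manifests/v3.json"))  # False
```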
- JMS: looking forward to hearing more and exchanging info re: b-trees
- JMS: see also https://github.com/janelia-flyem/dvid (backed by KV database)
- JM: sharing layers with them?
- JMS: complicated by other priorities of the EM team. invite Bill to the Zarr meetings?
- RA: see https://lakefs.io/
- JM: API versus format
- RA: thinking about it more like an API
- JM: briefly, the codecs-registry
- https://zarr.dev/codecs-registry/
- https://github.com/zarr-developers/codecs-registry
- JMS: still want a schema per codec. JM: agreed!
- JMS: talks about codecs having URLs.
- would be an annoyance to have different V2 and V3 identifiers.
- e.g. just numeric constants in the JSON that are from the C API
- e.g. the shuffle parameter, which would be nicer as a string.
- support integer or string for a while (in order to deprecate the integer form); see the sketch at the end of this topic
- JM: plans to have code in each language that checks an ID against the central registry
- DH: approximates that with nczarr; ncdump lists the actual codecs in the file
- would be good to have something more sophisticated
- have the disadvantage of C code and interpreted files
- 3 repositories on the C side: Unidata + Irvine + HDF5
- HDF5 only has names, HDF5 IDs, and a pointer (which is often out of date)
- something universal would be nice
- WF: roping in the HDF5 group would be a heavy lift
- JMS: URL interface :rocket:
- DH: :+1: for the REST API
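A minimal sketch of the “accept integer or string, then deprecate the integer” idea; the registry entries, numeric IDs, and URLs below are invented stand-ins for C-API / HDF5-style constants, not actual registry contents:

```python
# Hypothetical registry entries (illustrative only, not the real registry).
CODEC_REGISTRY = {
    "blosc": {"numeric_id": 32001, "url": "https://zarr.dev/codecs-registry/blosc"},
    "gzip": {"numeric_id": 1, "url": "https://zarr.dev/codecs-registry/gzip"},
}
_BY_NUMBER = {v["numeric_id"]: name for name, v in CODEC_REGISTRY.items()}

def normalize_codec_id(codec_id) -> str:
    """Accept an integer (legacy) or string identifier and return the string name."""
    if isinstance(codec_id, int):
        # Transitional path: map the numeric constant to its registered name so
        # the integer form can eventually be deprecated.
        return _BY_NUMBER[codec_id]
    return codec_id

print(normalize_codec_id(32001))   # -> "blosc"
print(normalize_codec_id("gzip"))  # -> "gzip"
```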
- WF: NSF/CSSI solicitation has opened
- https://beta.nsf.gov/funding/opportunities/cyberinfrastructure-sustained-scientific-innovation-cssi
- perhaps something here
- WF: planning on getting to https://www.egu23.eu/
- tweet something from zarr_dev to see if there is interest :question:
- could collaborate on something re: nc/zarr
- JMS: no clear resolution yet on the paths to the metadata files
- JM: recapped the previous discussion and thinks it’s still good.
- JMS: some details around the root array (the named files, etc.)
- JMS: consolidated metadata? duplicated?
- JM: would make it possible to have everything at the top level (rough sketch at the end of these notes)
- JMS: pointers in the subdirectories? A bit annoying.
- JMS: with Iceberg & co. you likely don’t need consolidated metadata
- JM: so you’d push it to the store level?
- JMS: possibly, but not that simple
- JMS: there are cases where you need path separation anyway (Zips)
- JMS: so could see using a path separation strategy entirely
- JMS: Davis did have a use case …
- (…details zip, consolidated brainstorming…)
- JM: need both solutions…
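For reference, a rough sketch of what “everything at the top level” looks like in the existing V2 consolidated-metadata convention; the array and attributes shown are invented, and the V3 equivalent is exactly what remains unresolved:

```python
import json

# V2-style consolidated metadata: one top-level document holding copies of
# every metadata file, so a reader issues a single request instead of walking
# subdirectories.
zmetadata = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "temperature/.zarray": {
            "zarr_format": 2, "shape": [100], "chunks": [10], "dtype": "<f4",
            "compressor": None, "filters": None, "fill_value": 0, "order": "C",
        },
        "temperature/.zattrs": {"units": "K"},
    },
}

print(json.dumps(zmetadata["metadata"][".zgroup"]))
```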