2022-09-22

Attending: Ward Fisher (WF), Josh Moore (JM), Ryan Abernathey (RA), Jeremy Maitin-Shepard (JMS), Dennis Heimbigner (DH)

TL;DR

Consolidated metadata needs an extension for V3, which might result in a new ZEP. JMS then shared a document titled ‘Optionally-cooperative distributed b-tree for Tensorstore’, which the group discussed at length. JM then brought up the codecs registry built by one of the GSoC students this summer, and the meeting ended with a discussion of the paths to the metadata files.

Meeting Minutes:

  • Java/NetCDF side:
    • JM: Sanket met people
    • WF: Unidata should be 3x the staff.
    • JM: perhaps starting with a kerchunk implementation?
    • WF: looking for more community involvement (like netcdf-c had)
  • JM: Greg mentioned consolidated metadata needs an extension for V3
  • RA: Iceberg issue, also see JMS’ proposal
  • JMS: touches on not needing a file per chunk (as discussed last night)
    • https://docs.google.com/document/d/1PLfyjtCnfJRr-zcWSxKy-gxgHJHSZvJ2y4C3JEHRwkQ/edit?resourcekey=0-o0JdDnC44cJ0FfT8K6U2pw#heading=h.8g8ih69qb0v
    • db format that stores a btree.
    • uniquely: designed to allow distributed writes (S3, etc.) but doesn’t need a persistent database
      • can also read it in a non-distributed fashion
      • downside: adds quite a bit of complexity (especially for the binary format)
      • also good where sharding isn’t appropriate (sharding requires a pre-defined shard size for writing)
        • e.g. large number of small arrays (where sharding won’t help)
    • RA: nice document. comments:
      • focused on big distributed writes, but with Iceberg the main motivation was different: more flexibility in mapping keys to chunks (kerchunk-like, virtual concatenation). Can you reference arbitrary chunks? Yes.
        • JMS: b-tree nodes have references to files (like kerchunk), but data files are identified with a 128-bit path (not an fsspec URL)
        • RA: different use case, so can have them be optional transformers/extensions
      • RA: really similar to tiledb! why not use it?
        • JMS: tiledb is organized by time not space.
        • JM: need a compaction
        • JMS: and even after that you still have a million files.
      • DH: HDF5? Internally it’s b-trees (which are responsible for most of its complexity). Are you sure this is the path?
        • JMS: not sure there’s an alternative to btrees. used in databases, filesystems, etc.
        • DH: if you don’t need ordered searches, then linear hashing is an alternative
        • JMS: ordered is useful for a lot of use cases. but there wasn’t an obvious solution for distributed writes
        • DH: extendible hashing is a simpler data structure (from an old paper) that works well with disk storage.
      • JMS: think this is more a key-value store (like zip)
        • RA: agreed. Nice that it’s possible to experiment like this.
        • RA: can the V3 spec support this experimentation? (right extension points?)
        • RA: trying to do that with Iceberg. Martin suggested “IceChunk”.
          • See also: Hudi and others. Lots of smart ideas that we can copy.
          • Goal is to provide some level of branching & transactions for/on a Zarr store
          • Allows you to work in a staging area whose changes all get written at once.
          • Branch non-destructively (or roll back)
          • The key is having a “manifest” (they all have some concept of that, even kerchunk)
          • Don’t depend on the object store’s listing as the source of truth
          • Need storage transformers at the top level, not the array level. But for JMS’ idea the array level might suffice.
          • JMS: wasn’t planning on an extension. root metadata would be in the same data store.
          • JM: basically writing DB/filesystem :+1: ZarrFS ;)
        • JMS: planning on mongo? Yeah, or Dynamo. (They store JSON)
          • JSON in S3 isn’t ideal.
          • metadata in document store and chunks on disk. Beyond just filesystem. It’s a data lake.
          • “meta-store”
      • JMS: regarding versioning, how are you representing the delta?
        • The chunk is the minimal writable unit. (out-of-scope)
        • Every chunk write is uniquely ID’d (e.g. content-addressable). That gets a key, which is written to the DB.
        • JMS: expecting the database to provide the versioning?
        • RA: no, just a place for documents. Versioning (in Iceberg) has a branch or a tag that points to a specific chunk manifest; you can create a new one and point your HEAD at it. The database is only relied on to atomically change the references. Iceberg tracks a number for the transaction. (See the sketch after this discussion.)
        • JMS: use kerchunk model? limitation on the number of chunks?
        • RA: chunks are likely in a separate manifest; discussed that as another extension with Martin.
        • RA: but can just query a chunk from the database.
        • JMS: 1M chunks in v1. then update to v2. What’s the diff? A copy.
        • RA: yeah need to play with it.
        • JMS: when you get to wanting to update just a portion of it, then you get to b-trees :smile:
        • RA: we’re not DB people; trying to keep it hackable.
        • RA: but a megabyte-scale kerchunk file is already getting :heart: since it’s so easy; looking for incremental improvement on that. (NASA will be pumping out GRIB forever…)
        • JMS: looking forward to hearing more and exchanging info re: b-trees
        • JMS: see also https://github.com/janelia-flyem/dvid (backed by KV database)
          • JM: sharing layers with them?
          • JMS: complicated by other priorities of the EM team. invite Bill to the Zarr meetings?
          • RA: see https://lakefs.io/
          • JM: API versus format
          • RA: thinking about it more like an API
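As a minimal sketch of the manifest idea RA describes above (not taken from the meeting document or from any existing implementation), the Python below stages content-addressed chunk writes and publishes them by atomically moving a branch reference to a new manifest. All names here (`ManifestStore`, `commit`, the key layout) are hypothetical.

```python
import hashlib
import json


def content_key(data: bytes) -> str:
    """Content-addressed key for an immutable write (hypothetical scheme)."""
    return hashlib.sha256(data).hexdigest()


class ManifestStore:
    """Toy in-memory stand-in for the chunk store plus document store.

    A real system would keep `chunks` in object storage (S3) and
    `manifests`/`refs` in a database that supports an atomic
    compare-and-swap on the branch reference.
    """

    def __init__(self):
        self.chunks = {}             # content key -> chunk bytes
        self.manifests = {}          # manifest id -> {zarr key -> content key}
        self.refs = {"main": None}   # branch or tag -> manifest id

    def write_chunk(self, data: bytes) -> str:
        """Every chunk write gets a unique, content-derived key."""
        key = content_key(data)
        self.chunks[key] = data
        return key

    def commit(self, branch: str, staged: dict) -> str:
        """Publish staged writes as a new manifest and move the branch ref."""
        parent = self.refs.get(branch)
        mapping = dict(self.manifests.get(parent, {}))  # copy parent manifest
        mapping.update(staged)                          # apply staged writes
        manifest_id = content_key(json.dumps(mapping, sort_keys=True).encode())
        self.manifests[manifest_id] = mapping
        self.refs[branch] = manifest_id  # the only step that must be atomic
        return manifest_id


# Usage: stage two chunk writes, then publish them in one commit.
store = ManifestStore()
staged = {
    "temp/c/0/0": store.write_chunk(b"chunk 0,0"),
    "temp/c/0/1": store.write_chunk(b"chunk 0,1"),
}
v1 = store.commit("main", staged)
```

The only operation that has to be atomic is the final reference update, which matches RA’s point that the database is relied on solely to change the references.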
  • JM: briefly codecs-registry
    • https://zarr.dev/codecs-registry/
    • https://github.com/zarr-developers/codecs-registry
    • JMS: still want a schema per codec. JM: agreed!
    • JMS: talks about codecs having URLs.
      • would be an annoyance to have different V2 and V3 identifiers.
      • e.g. just numeric constants in the JSON that are from the C API
      • e.g. the shuffle parameter, which would be nicer as a string.
      • support integer or string for a while (in order to deprecate the integer form; see the sketch after this section)
    • JM: plans to have code in each language that checks for an ID from the central registry
    • DH: we approximate that with nczarr; ncdump lists the actual codecs in the file
      • would be good to have something more sophisticated
      • have the disadvantage of C code and interpreted files
      • 3 repositories on the C side. unidata + irvine + hdf5
      • hdf5 only has names, hdf5-ids and a pointer (which is often out of date)
      • something universal would be nice
      • WF: roping in the HDF5 group would be a heavy lift
    • JMS: URL interface :rocket:
      • DH: :+1: for the REST API
    • WF: NSF/CSSI solicitation has opened
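On the integer-versus-string codec parameters discussed above, here is a minimal sketch (not part of the codecs registry or of any spec) of how an implementation might accept both forms during a deprecation window. The shuffle names and numeric constants are illustrative.

```python
import warnings

# Illustrative mapping from numeric shuffle constants (as they appear in
# V2 metadata via the C API) to the string names a registry might prefer.
_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}


def normalize_shuffle(value):
    """Accept either the legacy integer constant or the string name."""
    if isinstance(value, str):
        if value not in _SHUFFLE_NAMES.values():
            raise ValueError(f"unknown shuffle mode: {value!r}")
        return value
    if isinstance(value, int):
        warnings.warn(
            "integer shuffle constants are deprecated; use the string name",
            DeprecationWarning,
        )
        return _SHUFFLE_NAMES[value]
    raise TypeError("shuffle must be an int or a str")


assert normalize_shuffle("bitshuffle") == "bitshuffle"
assert normalize_shuffle(2) == "bitshuffle"  # still accepted, but warns
```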
  • JMS: don’t have clear resolution on the paths to the metadata files
    • JM: recapped the previous discussion and thinks it’s still good.
    • JMS: some details around the root array (the named files, etc.)
    • JMS: consolidated metadata? duplicated?
      • JM: would make it possible to have everything in the top level (see the consolidated-metadata example at the end of these notes)
      • JMS: pointers in the subdirectories? bit annoying.
    • JMS: with iceberg & co. you likely don’t need a consolidated metadata
    • JM: so you’d push it to the store level?
    • JMS: possibly, but not that simple
    • JMS: there are cases where you need path separation anyway (Zips)
    • JMS: so could see using a path separation strategy entirely
    • JMS: Davis did have a use case …
    • (…details zip, consolidated brainstorming…)
    • JM: need both solutions…
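For context on the consolidated-metadata duplication question above, a short example using the zarr-python V2 API (the V3 equivalent under discussion does not exist yet): consolidation copies every metadata document into one top-level key, so a reader can open the hierarchy with a single request, but the metadata is stored twice.

```python
import zarr

# Create a small hierarchy on disk.
store = zarr.DirectoryStore("example.zarr")
root = zarr.group(store=store)
root.create_dataset("temp", shape=(100, 100), chunks=(10, 10), dtype="f4")

# Copy all .zgroup/.zarray/.zattrs documents into a single ".zmetadata" key.
zarr.consolidate_metadata(store)

# Readers can now open the hierarchy without listing the store.
g = zarr.open_consolidated(store)
print(g["temp"].shape)
```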