2023-02-16

Attending: Sanket Verma (SV), Dieu My Nguyen (DMN), John A. Kirkham (JK), Hailiang Zhang (HZ), Johana Chazaro (JC), Jeremy Maitin-Shepard (JMS), Akshay Subramaniam (AS)

TL;DR:

Jonathan Striebel laid out some discussion points to look at before the meeting. Unfortunately, he couldn’t make it to the meeting, so we decided to work on those points asynchronously via GitHub. Then SV asked everyone for their thoughts on checksums for shards. Finally, HZ gave a summary of his newly submitted ZEP, which can be seen here.

Points of discussion:

ZEP 1:

Meeting Minutes:

  • SV: What are your thoughts on checksums for shards? Check the discussion here
    • AS: Haven’t really thought about it as a ZEP extension, but it could be one!
    • AS: Want to support applications that don’t support Zarr - it would be nice to support shards - send a shard over the network and decompress it at the other end
    • AS: KvikIO doesn’t do compression - it would be nice to support this
    • JMS: There’s some tension with the Zarr model - a shard has data and metadata - maybe duplicate the metadata and add the extra info to it
    • AS: the checksum issue is not critical - some more metadata in the shard - there are applications in genomics and geospatial data - having the number of chunks would help - the application has the context for unpacking (see the sketch after this list)
    • AS: Zarr can have a wrapper which can put the data in the right place
    • JMS: having a container for the bytes is the abstraction of what we want
    • JK: Does it depend on the data? - the compressor and the chunks and shards
    • AS: what compressor works with what data is a subjective choice
    • JK: Could the compressor have branching logic?
    • AS: It could make sense to create a new compressor
    • JK: Would the branching logic change for different datasets?
    • AS: It could.
    • SV: Is the dataset public? Or can it be made public?
    • AS: We can make it public - I’ll look into it
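
The discussion above (AS’s point about sending a shard over the network and verifying it at the other end) suggests something like a checksummed shard container. Below is a minimal Python sketch of that idea, assuming a trailing CRC32 footer; the layout, the choice of CRC32, and the function names are illustrative assumptions, not part of the sharding spec or any ZEP.

```python
import zlib

# Hypothetical shard container with a trailing checksum (illustration only;
# not the Zarr sharding spec). The payload is the concatenated encoded chunks.

def pack_shard(chunks: list[bytes]) -> bytes:
    """Concatenate encoded chunks and append a CRC32 of the payload."""
    payload = b"".join(chunks)
    return payload + zlib.crc32(payload).to_bytes(4, "little")

def unpack_shard(shard: bytes) -> bytes:
    """Verify the trailing CRC32 before handing back the payload."""
    payload, stored = shard[:-4], int.from_bytes(shard[-4:], "little")
    if zlib.crc32(payload) != stored:
        raise ValueError("shard checksum mismatch")
    return payload
```

A receiver that knows nothing about Zarr could still verify integrity this way, which is the use case AS describes: ship the shard as an opaque blob and let the application unpack it with its own context (e.g. the number of chunks).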
  • HZ: Brief summary of ZEP0005 - GES DISC is looking into averaging over chunks - the cost is high - the ZEP introduces the algorithm - the goal is to keep the overhead small - for ~1 TB of regional data the extra overhead after accumulation would be ~5% - it is working really well - big improvement in performance - very accurate - Ryan suggested adding it as an extension during last year’s ESIP meeting (a minimal sketch of the idea follows these notes)
    • SV: Maybe add benchmarks?
    • HZ: We can do that!
    • JMS: A question on the spec PR
    • HZ: The user can specify any random range - the corner values would give the aggregated chunk value - loading a single chunk is easy - this single chunk would contain the aggregated value and you just load it - it would be transparent to the user
    • JMS: it makes sense to have a stride - for images you want to store a pyramid - you can represent a downsampling pyramid - you could do this with your proposal
    • HZ: if you can zoom in and zoom out - regular stride - how do we set up the stride? - is it possible in theory? - is it possible to have a version 2 of this extension ZEP? - the separation would be a good idea
    • JMS: Multiply stride by chunk size
    • HZ: Saving multiple chunks is not a problem - it currently doesn’t support an arbitrary stride
    • JMS: It would mean more work for implementations, but it would cover more generic use cases
    • HZ: I will think hard and include the modification in the PR
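
As a rough illustration of the accumulation idea in ZEP0005, here is a minimal NumPy sketch: per-chunk sums are turned into a small summed-area table, so the average over any chunk-aligned range is computed from at most four corner values instead of reading the raw data. The function names and the 2-D layout are assumptions for illustration, not the actual spec.

```python
import numpy as np

def chunk_sums(data: np.ndarray, chunk: int) -> np.ndarray:
    """Sum each chunk x chunk block of a 2-D array (shape assumed divisible)."""
    h, w = data.shape
    return data.reshape(h // chunk, chunk, w // chunk, chunk).sum(axis=(1, 3))

def accumulate(sums: np.ndarray) -> np.ndarray:
    """2-D prefix sums (summed-area table) over the per-chunk sums."""
    return sums.cumsum(axis=0).cumsum(axis=1)

def range_sum(acc: np.ndarray, r0: int, r1: int, c0: int, c1: int) -> float:
    """Sum over chunk rows [r0, r1) and cols [c0, c1) via inclusion-exclusion,
    touching only four corner values of the accumulation layer."""
    total = acc[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= acc[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= acc[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += acc[r0 - 1, c0 - 1]
    return total

data = np.random.rand(1024, 1024)
acc = accumulate(chunk_sums(data, chunk=128))   # 8x8 accumulation layer
n = (6 - 2) * (8 - 4) * 128 * 128               # elements in the queried range
avg = range_sum(acc, 2, 6, 4, 8) / n            # average from 4 small reads
assert np.isclose(avg, data[2*128:6*128, 4*128:8*128].mean())
```

JMS’s pyramid point maps onto the same structure: differencing the accumulation layer at a regular stride (a multiple of the chunk size, per his remark above) yields block averages at coarser resolutions, i.e. a downsampling pyramid, which is why supporting an arbitrary stride would let the extension cover more use cases.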