2023-02-22

Attending: Davis Bennett (DB), Sanket Verma (SV), Ward Fisher (WF), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS), Norman Rzepka (NR), Eric Perlman (EP), Dieu My Nguyen (DMN), Virginia Scarlett (VS)

TL;DR:

We started the meeting by discussing possible performance improvements for sharding. Ryan Abernathey ran some benchmarks a few weeks ago, which turned out well. The community then offered a few ideas to make sharding even faster, and Jonathan is working on a PR for this. After that, DB wondered how to store just the indexes in a cache, possibly using Zarr itself.

SV asked what everyone thinks about the all-contributors bot. He also asked whether the community would like to participate if Zarr were to organise a hack week. Both ideas received a 👍🏻 from everyone.

Lastly, DB showed us what he’s working on, and JMS discussed the zarr-specs issue he submitted recently.

Updates:

  • Zarr-Python 2.14.1 release with sharding and new docs theme (PyData-Sphinx)
  • SciPy 2023 deadline is next week (03/01) - if you want to collaborate on a tutorial with us, now’s the time

Meeting Notes:

  • DB: Sharding is slow - according to https://github.com/zarr-developers/zarr-python/issues/1343
    • NR: Jonathan is working on a PR
    • DB: Using the MutableMapping store interface for getitems and setitems is the probable cause - the Zarr array API also has some batching logic, which may add to the slowdown
    • DH: What type of caching model is being used?
    • NR: There is no caching! - individual chunks are loaded through byte-range requests
    • DH: Having a cache would be fine
    • NR: Maybe! It depends on the use case
    • JMS: The Zarr array API does have support for batched reading and writing - if you’re reading multiple chunks from a single shard, the overhead would be big - you need to tell the user to cache the index - maybe twice the read requests, because of the index and the shard
    • NR: Having the index in a cache makes sense (see the sketch below)
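
A minimal sketch of the index-caching idea discussed above, assuming a very simplified store (a mapping of key to bytes) and a hypothetical shard layout in which the index is a trailing array of little-endian (offset, nbytes) uint64 pairs, one pair per inner chunk (checksum handling omitted). This is not zarr-python’s sharding implementation; all names are illustrative.

```python
import struct


class ShardIndexCache:
    """Cache decoded shard indexes so that reading several chunks from the
    same shard fetches the index only once instead of once per chunk."""

    def __init__(self, store, chunks_per_shard):
        self.store = store                    # mapping of key -> bytes (stand-in for a real store)
        self.chunks_per_shard = chunks_per_shard
        self._indexes = {}                    # shard key -> list of (offset, nbytes)

    def _index(self, shard_key):
        if shard_key not in self._indexes:
            nbytes = 16 * self.chunks_per_shard          # two uint64 values per chunk
            raw = self.store[shard_key][-nbytes:]        # a suffix byte-range request in practice
            values = struct.unpack(f"<{2 * self.chunks_per_shard}Q", raw)
            self._indexes[shard_key] = list(zip(values[0::2], values[1::2]))
        return self._indexes[shard_key]

    def read_chunk(self, shard_key, chunk_id):
        """Return the raw bytes of one inner chunk of a shard."""
        offset, length = self._index(shard_key)[chunk_id]
        return self.store[shard_key][offset:offset + length]   # a byte-range request in practice
```
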
  • DB: How should the indexes be stored in the cache? Maybe JSON? What would be a good storage format? How about SQLite? (for my use case)
    • JMS: Unrelated to the cache - you can list the chunks
    • DB: That’s recursive and expensive
    • JMS: S3 can give you a flat list - it may not solve the problem, but the abstraction would help
    • JMS: are you doing re-encoding?
    • DB: yes
    • JMS: SQLite could be a pretty reasonable solution (see the sketch below)
    • DB: Is there a clever way to do it using Zarr itself?
    • JMS: Keys can be ordered lexicographically
    • DB: If your data is too big then your metadata becomes another type of data
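
A minimal sketch of the SQLite idea JMS suggested: record chunk keys (e.g. from one flat S3 listing) in a small SQLite database so that later runs can enumerate chunks, in lexicographic key order, without another expensive recursive listing. The table layout and function names are illustrative assumptions, not part of zarr-python.

```python
import sqlite3


def build_chunk_index(db_path, keys):
    """Store an iterable of chunk keys (e.g. from one flat listing) in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS chunks (key TEXT PRIMARY KEY)")
    conn.executemany("INSERT OR IGNORE INTO chunks (key) VALUES (?)", ((k,) for k in keys))
    conn.commit()
    return conn


def chunks_under(conn, prefix):
    """List chunk keys under a prefix without touching object storage again."""
    cur = conn.execute(
        "SELECT key FROM chunks WHERE key LIKE ? ORDER BY key", (prefix + "%",)
    )
    return [row[0] for row in cur]


# Usage sketch with made-up keys
conn = build_chunk_index("chunks.db", ["a/c/0/0", "a/c/0/1", "b/c/0/0"])
print(chunks_under(conn, "a/"))   # ['a/c/0/0', 'a/c/0/1']
```
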
  • SV: What do you think about using all-contributors and hosting a Zarr hack week to sync the zarr-python implementation with the V3 spec?
    • All: Sounds good for all-contributors 👍🏻
    • EP: Having something online would be good and I’d participate
    • DB: Sounds good and I’d participate
    • WF and JMS: 👍🏻
  • DB: Defining the abstract structure of a Zarr array - keys, properties, metadata - OME-Zarr has a try/catch block - I’ve put something together with pydantic - it makes it simple to generate a Zarr group - taking an abstract tree representation and running it backwards to create Zarr groups (see the sketch below)
    • Repo: https://github.com/JaneliaSciComp/pydantic-ome-ngff
      • SV: Could be packaged as something like pip install ztree
    • DB: Trying to define a protocol for HDF5 as well - could dump the hierarchy into a Zarr or HDF5 container
    • DB: Structural subtyping stuff
    • SV: Would it be good to show a demo?
    • DB: Yes!
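
A minimal sketch of the idea DB described, assuming pydantic v2 and the zarr-python 2.x API: pydantic models describe the abstract tree (groups, arrays, attributes), and a small recursive function materialises that tree as a Zarr hierarchy. The model and function names here are illustrative and are not the API of pydantic-ome-ngff.

```python
from typing import Any, Dict, Tuple, Union

import zarr
from pydantic import BaseModel


class ArraySpec(BaseModel):
    """Abstract description of an array node: structure only, no data."""
    shape: Tuple[int, ...]
    chunks: Tuple[int, ...]
    dtype: str
    attributes: Dict[str, Any] = {}


class GroupSpec(BaseModel):
    """Abstract description of a group node; members may be arrays or sub-groups."""
    attributes: Dict[str, Any] = {}
    members: Dict[str, Union[ArraySpec, "GroupSpec"]] = {}


GroupSpec.model_rebuild()  # resolve the self-reference (pydantic v2; use update_forward_refs() on v1)


def materialize(spec: GroupSpec, group: zarr.Group) -> zarr.Group:
    """Walk the abstract tree and create the corresponding Zarr hierarchy."""
    group.attrs.update(spec.attributes)
    for name, member in spec.members.items():
        if isinstance(member, ArraySpec):
            arr = group.create_dataset(
                name, shape=member.shape, chunks=member.chunks, dtype=member.dtype
            )
            arr.attrs.update(member.attributes)
        else:
            materialize(member, group.create_group(name))
    return group


# Usage sketch: one group holding a single 2-D array, written to an in-memory store.
tree = GroupSpec(
    attributes={"description": "example"},
    members={"image": ArraySpec(shape=(512, 512), chunks=(128, 128), dtype="uint16")},
)
root = materialize(tree, zarr.group())
```
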
  • JMS: https://github.com/zarr-developers/zarr-specs/issues/212
    • DB: Maybe similar to what vanilla N5 does!
    • JMS: Reasonable to add a metadata option - planning to add an extension ZEP for this