2023-02-22
Attending: Davis Bennett (DB), Sanket Verma (SV), Ward Fisher (WF), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS), Norman Rzepka (NR), Eric Perlman (EP), Dieu My Nguyen (DMN), Virginia Scarlett (VS)
TL;DR:
We started the meeting by discussing possible performance improvements for sharding. Ryan Abernathey ran some benchmarks a few weeks ago, which turned out well, and the community offered a few suggestions to make sharding even faster. Jonathan is working on a PR for this. After that, DB asked how he could store just the shard indexes in a cache using Zarr.
SV asked what everyone thinks about the all-contributors bot. He also asked whether the community would like to participate if Zarr were to organise a hack week. Both ideas got a 👍🏻 from everyone.
Lastly, DB showed us what he’s working on, and JMS discussed the zarr-specs issue he submitted recently.
Updates:
- Zarr-Python 2.14.1 released with sharding support and a new docs theme (PyData Sphinx Theme)
- SciPy 2023 deadline is next week (03/01) - if you want to collaborate on a tutorial with us, now’s the time
Meeting Notes:
- DB: Sharding is slow - according to https://github.com/zarr-developers/zarr-python/issues/1343
- NR: Jonathan is working on a PR
- DB: Using the `MutableMapping` store interface for `getitems` and `setitems` is the probable cause - the Zarr array API also has some batch logic, which may contribute to the slowdown
- DH: What type of caching model is it using?
- NR: There is no caching - individual chunks are loaded through byte-range requests
- DH: Having a cache would be fine
- NR: Maybe! It depends on the use case
- JMS: The Zarr array API does have support for batch reading and writing - if you’re reading multiple chunks from a single shard, the overhead would be big - you need to tell the user to cache the index - otherwise it’s maybe twice the read requests, because of the index and the shard
- NR: Having the index in a cache makes sense
- DB: How do we store the indexes in the cache? What would be a good storage format - maybe JSON? How about SQLite? (for my use case)
- JMS: Unrelated to the cache - you can list the chunks
- DB: That’s recursive and expensive
- JMS: S3 can give you a flat list - it may not solve the problem, but the abstraction would help
- JMS: are you doing re-encoding?
- DB: yes
- JMS: SQLite could be a pretty reasonable solution (sketched after this thread)
- DB: Is there a clever way to do it using Zarr itself?
- JMS: the keys can be ordered lexicographically
- DB: If your data is too big then your metadata becomes another type of data
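A minimal sketch of the SQLite index-caching idea floated above: keep each shard’s chunk index in a small table so repeat reads skip the extra byte-range request. All names here (`open_index_cache`, `get_index`, `fetch_index`) are hypothetical and not part of zarr-python’s API; `fetch_index` stands in for whatever byte-range read pulls the index block out of the shard object.

```python
import sqlite3


def open_index_cache(path: str = "shard_index_cache.db") -> sqlite3.Connection:
    """Open (or create) an SQLite table mapping shard keys to raw index bytes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shard_index (key TEXT PRIMARY KEY, index_bytes BLOB)"
    )
    return conn


def get_index(conn: sqlite3.Connection, shard_key: str, fetch_index) -> bytes:
    """Return the index for a shard, fetching and caching it on a miss.

    fetch_index(shard_key) is a hypothetical callable standing in for the
    byte-range request that reads the index block out of the shard object.
    """
    row = conn.execute(
        "SELECT index_bytes FROM shard_index WHERE key = ?", (shard_key,)
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: no extra read request for the index
    index_bytes = fetch_index(shard_key)  # cache miss: one byte-range read
    conn.execute(
        "INSERT INTO shard_index (key, index_bytes) VALUES (?, ?)",
        (shard_key, index_bytes),
    )
    conn.commit()
    return index_bytes
```

With the index cached, a repeat read of a chunk needs only the byte-range request for the chunk itself, which addresses the “twice the read requests” overhead JMS mentioned.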
- SV: What do you think about using all-contributors and hosting a Zarr hack week to sync the `zarr-python` implementation to V3?
- All: Sounds good for `all-contributors` 👍🏻
- EP: Having something online would be good and I’d participate
- DB: Sounds good and I’d participate
- WF and JMS: 👍🏻
- DB: Defining an abstract structure of a Zarr array - keys, properties, metadata - OME-Zarr has a try/catch block - have put something together with pydantic - it’s simple to generate a Zarr group by taking the abstract tree representation and running it backwards to create Zarr groups (see the sketch below)
- Repo: https://github.com/JaneliaSciComp/pydantic-ome-ngff
- SV: Could be something like `pip install ztree` or something similar
- DB: Trying to define protocol for HDF5 as well - could dump the hierarchy into Zarr or HDF5 container
- DB: Structural sub typing stuff
- SV: Would it be good to show a demo?
- DB: Yes!
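A minimal sketch of the approach DB described: declare the hierarchy as pydantic models, then walk the tree “backwards” to create the corresponding Zarr groups and arrays. The model and function names here (`ArraySpec`, `GroupSpec`, `to_zarr`) are illustrative, not the actual API of pydantic-ome-ngff; this assumes zarr-python 2.x.

```python
from typing import Any, Dict, List, Tuple

import zarr
from pydantic import BaseModel


class ArraySpec(BaseModel):
    """Abstract description of a Zarr array: name, shape, dtype, attributes."""
    name: str
    shape: Tuple[int, ...]
    dtype: str
    attrs: Dict[str, Any] = {}


class GroupSpec(BaseModel):
    """Abstract description of a Zarr group, possibly containing sub-groups."""
    name: str
    attrs: Dict[str, Any] = {}
    arrays: List[ArraySpec] = []
    groups: List["GroupSpec"] = []


GroupSpec.update_forward_refs()  # resolve the self-reference (pydantic v1 style)


def to_zarr(spec: GroupSpec, store, path: str = "") -> zarr.Group:
    """Run the abstract tree backwards: create the groups/arrays it describes."""
    full_path = f"{path}/{spec.name}".lstrip("/")
    group = zarr.group(store=store, path=full_path)
    group.attrs.update(spec.attrs)
    for a in spec.arrays:
        arr = group.create_dataset(a.name, shape=a.shape, dtype=a.dtype)
        arr.attrs.update(a.attrs)
    for child in spec.groups:
        to_zarr(child, store, path=full_path)
    return group


# e.g. a two-level hierarchy, materialized in an in-memory store:
tree = GroupSpec(
    name="root",
    attrs={"description": "demo"},
    arrays=[ArraySpec(name="data", shape=(64, 64), dtype="uint8")],
    groups=[GroupSpec(name="labels")],
)
root = to_zarr(tree, zarr.MemoryStore())
```

Because the specs are plain pydantic models, the same tree could in principle be serialized to JSON, validated against a schema, or (as DB mentions below) dumped into a different container format such as HDF5.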
- JMS: https://github.com/zarr-developers/zarr-specs/issues/212
- DB: Maybe similar to what vanilla N5 does!
- JMS: Reasonable to add a metadata option - planning to add an extension ZEP for this