2022-11-02

Attending: Sanket Verma (SV), Josh Moore (JM), Dennis Heimbigner (DH), John Kirkham (JK), Norman Rzepka (NR), Jeremy Maitin-Shephard (JMS), Davis Bennett (DB), Ward Fisher (WF), Martin Durant (MD), Isaac Virshup (IV)

TL;DR:

The Outreachy contribution phase is ending after four weeks, and we had a fantastic time working with them. We’ll select 1/2 interns to work with us for three months (December-February). SV is speaking @ PyData Global 2022 next month and planning to host Zarr Sprint. Special kudos to JK for working actively on numcodecs! After that, IV started a discussion on memory mapping.

Updates:

  • Outreachy contribution phase ending on 11/4 (Thanks to all who helped us! 🙌🏻)
    • 2 more days then time to choose an intern to work with us for 3 months
    • JK: choose more than 1? Possible. Feedback welcome.
  • #beautifulzarr born out of Outreachy: https://github.com/zarr-developers/beautiful-zarr
  • Speaking at PyData Global 2022 📣
    • “The Beauty of Zarr”
    • Planning a Zarr Sprint! 🏃🏻‍♂️ Anyone like to volunteer?
      • Collect issues and attendees (some 2000+) get involved
      • 3-4 maintainers/contributors should be present
      • 1-2 hour committment
      • 1-3rd of December
  • ZEP0003 by Martin and Isaac is in draft; read here :tada:

Open Agenda(add here 👇🏻):

  • MD: kudos to the push on numcodecs
  • issue moved to private issue
  • IV: Question about where memory mapping is at (https://github.com/zarr-developers/zarr-python/pull/377, https://github.com/zarr-developers/zarr-python/pull/1131)
    • MD: related to passing contexts down to the reading
    • JK: added in DirectoryStore _from_file() to define how the reading is done
    • MD: with codec, produces a regular array
    • JK: previously updated to use pybuffer protocol (codec with decompression will do a copy)
    • DB: use case? for large amounts of single cell data. resampling for neural networks. DB: chunk size doesn’t help. IV: not because of the sparseness. but even if dense, the random access
    • MD: even better would be on slice selection of zarr object, pass the byte range into the loader (done with blosc blocks in v2. cheap sharding) memory mapping just exposes an extra layer that you might not need.
    • IV: “fast as possible reading” (where disk size isn’t an issue)
    • DB: using zarr array API? doesn’t seem like that would work
    • MD: similar to kerchunk. want to build a utility, pretend N chunks for random-access.
    • IV: how much work until we can pass the byte-range down?
    • MD: discussed in several places. 1131 is likely to win.
    • JM: does FSStore support it? We think so.
    • IV: and ZipStore? MD: requires some work.
    • IV: .. and gets passed to pytorch and is multi-threaded …
    • DH: effect on caches? (general “good question” nodding)
      • JMS: page cache? No. e.g. nczarr’s chunk cache (DB problem)
      • MD: if you’re caching, then whole chunks and read partially from them rather than reading partial chunks. fsstore has something for parts of files, but it’s messay
      • DH: potential for optimizations.
      • MD: good question when we get to subselections of a chunk