2023-04-19

Attending: Sanket Verma (SV), Norman Rzepka (NR), Ward Fisher (WF), Davis Bennett (DB), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS)

TL;DR:

Today marked the final day of PyCon DE and PyData Berlin 2023. SV presented a talk on Zarr titled ‘The Beauty of Zarr’, during which an attendee asked about the data types that Zarr is unable to handle. This topic was also discussed during our meeting. In addition, DB mentioned plans to experiment with S3 IO bandwidth, and we also had a conversation about codecs in the V3 specification.

Meeting Minutes:

  • SV gave a talk on Zarr @ PyCon DE 2023
    • Well received by the audience, with good questions afterwards
    • Will post the video once it becomes publicly available
  • DB will be contracting for Janelia
    • Will be moving to Germany in 6 months for an indefinite period
    • Would love to work on the clean-up of Zarr-Python, i.e. separating out the stores
  • DB: Has been thinking about experimenting with the maximum theoretical limit of S3 IO bandwidth when hitting raw objects directly - or at least seeing how high the bandwidth can get (a rough benchmarking sketch follows this item)
    • NR: Hasn’t played around with that, but would be interested in the experiment and in seeing how Postgres works with blobs
    • NR: Hasn’t used SQLiteStore - a Postgres store could be a good addition
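    A minimal benchmarking sketch of the kind of experiment DB describes, assuming s3fs and a publicly readable bucket; the bucket name and prefix below are placeholders, not anything discussed in the meeting.
    ```python
    # Rough sketch: measure aggregate GET throughput by fetching many raw S3
    # objects concurrently with s3fs. Bucket/prefix are hypothetical placeholders.
    import time
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)            # assumes a publicly readable bucket
    keys = fs.ls("example-bucket/some-prefix/")  # hypothetical bucket and prefix

    start = time.perf_counter()
    blobs = fs.cat(keys)                         # fetches the objects concurrently
    elapsed = time.perf_counter() - start

    total = sum(len(b) for b in blobs.values())
    print(f"{len(blobs)} objects, {total / 1e6:.1f} MB, {total / elapsed / 1e6:.1f} MB/s")
    ```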
  • SV: What types of data can Zarr not handle?
    • DB: Opinionated answer: Tabular
    • SV: Same!
    • NR: Agrees tabular data is not great for Zarr - but if your data types are the same in all the columns, then you can use Zarr there as well!
    • DB: There are a lot of data types out there and Zarr fills a niche - there is a lot of tabular data out there, along with photos of cats! 🐈
  • SV: Conversations about Rust
    • DB: Had a look at the Rust implementation of Zarr by Andrew Champion
    • SV: Folks have been re-writing their libraries in Rust - there are also new tools to build and ship Rust libraries, e.g. rattler
  • DB: Thoughts on V3 - read it and opened PR #228 to make some styling changes - a bit confused about codecs: what do they take as input and what do they output?
    • NR: Finds that a bit odd too - it would be clearer to define how they work
    • JMS joins in
    • JMS: We have 3 different types of codecs - array → array, bytes → bytes, and array → bytes - currently we don’t distinguish them in the metadata - alternatively we could have 3 separate fields for these 3 codec types - maybe that’s not preferable - implementers care about this distinction, but users don’t (see the sketch after this item)
    • NR: Identifying which codec has which interface is fine! - likes the idea of unifying filters and codecs
    • DB: How similar are the semantics in HDF5? - Does NetCDF use the scale-offset filter, or is that done in metadata?
    • DH: The HDF5 filter mechanism is purely bytes-based (bytes → bytes) - it has a mechanism to determine the type of data being fed in, array or bytes, but finds it is rarely used - NCZarr adopted the same model - easier for Python, but not sure it’s easier for other systems
    • JMS: In most cases an implementation would not attempt complex partial IO for the codecs (as described above) - it would just do the complete transform
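    An illustrative sketch only, assuming the draft V3 metadata layout with a single “codecs” list: it annotates the three codec kinds JMS describes. The codec names and configuration keys are assumptions from the draft spec, not its final wording.
    ```python
    # Illustrative only: a possible v3 "codecs" chain, annotated with the three
    # codec kinds from the discussion. Names/keys are assumptions, not final spec text.
    codecs = [
        {"name": "transpose", "configuration": {"order": [1, 0]}},  # array -> array
        {"name": "bytes", "configuration": {"endian": "little"}},   # array -> bytes
        {"name": "gzip", "configuration": {"level": 5}},            # bytes -> bytes
    ]
    ```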
  • SV: Re-asked the question, “What types of data can Zarr not handle?”
    • JMS: Thinks Zarr is not useful for relational databases
    • DH: In NetCDF we can store that because we’re using structs - in general, Zarr doesn’t have a sophisticated querying mechanism, which is a bottleneck for tabular-like datasets (see the structured-dtype sketch at the end of these notes)
    • DB: We should bite the bullet and accept that every type of data has a different format, and Zarr is not suitable for all of them!
    • DB: How about Parquet?
    • JMS: It’s column-oriented and you read the whole thing sequentially - people have been building indexing on top of Parquet - Google BigQuery does the same thing and reads through the whole dataset
    • SV: Arrow is the successor of Parquet and offers hierarchical storage as well
    • DB: It also offers an in-memory representation
    • JMS: Parquet has added indexing! - Maybe Parquet is the closest thing to Zarr for tabular data
    • DB: What would it look like to use Parquet as a chunk format? - What would Zarr+Parquet look like? - Hasn’t thought it through, but it could be exciting!
    • JMS: Could be!
    • SV: Maybe an extension to handle tabular data?
    • JMS: If you want to store a table and read it in a streaming fashion, then we can already do that - but if we’re looking for a querying mechanism, maybe not! - also, it may not be useful to build one on top of Zarr, as it is not built for tabular formats
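    A minimal sketch of the struct-style storage DH mentions, assuming zarr-python v2 and a NumPy structured dtype; the field names are made up for illustration. It shows that rows can be stored and sliced, but, as noted above, there is no querying mechanism on top.
    ```python
    # Minimal sketch (assumed zarr-python v2 semantics): record-like rows stored
    # with a NumPy structured dtype. Field names here are hypothetical.
    import numpy as np
    import zarr

    dtype = np.dtype([("time", "f8"), ("value", "f4"), ("flag", "i1")])
    z = zarr.zeros(1_000, chunks=100, dtype=dtype)  # in-memory store by default

    rows = np.array([(0.0, 1.5, 1), (1.0, 2.5, 0)], dtype=dtype)
    z[:2] = rows
    print(z[:2]["value"])  # slicing and field access work, but there is no query engine
    ```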