2022-10-05
Attending: Josh Moore (JM), Norman Rzepka (NR), Davis Bennett (DB), Ward Fisher (WF), Hailey Johnson (HJ), Martin Durant (MD), Jeremy Maitin-Shepard (JMS), Isaac Virshup (IV), John Kirkham (JK)
TL;DR:
Scalableminds has forked jzarr and started working on it. There was then discussion around how the Java implementation (jzarr) would move forward and implement the V3 spec. Finally, IV discussed the R implementation of Zarr and planned to bring up ragged arrays at the ZEP meeting tomorrow.
Updates:
- vEM sidenote
- NR: did Mathworks reach out to Unidata?
- WF: about NaN?
- NR: reached out and asked about Zarr for MATLAB. They wanted a C/C++ library they could build on top of
- WF: not about that. Ellen Johnson is the main contact. Will add that to the list.
- :tada:
Agenda:
- Java and async
- scalableminds: Java backend. Forked jzarr and implementing in Scala (read only)
- layers: fsspec, pure zarr, parallel (and async at all of those)
- (Tischi) imglib? NR: interop would be cool
- NR: Shared effort would be appreciated
- HJ: filters are pluggable and shareable
- HJ: definitely want to implement async for S3
- HJ: breaking THREDDS into microservices is also an option (they weren’t breaking up netcdf-java)
- MD: python async. zarr is sync. barrier in the storage layer.
- NR: important to have the lower levels async. then join/wait
- NR: but don’t need cutouts; just an interface to get specific chunks and (optionally) decode them (a sketch of this pattern is at the end of these notes)
- abstractions: filesystems, zarr/n5/precomputed variants.
- MD: https://github.com/martindurant/async-zarr was a Python POC (targeted at Pyscript)
- https://github.com/bluesky/tiled (server)
- DB: would love async zarr in python
- JMS: tensorstore has python async. And s3 writing? No.
- JM: wrapping in tensorstore with Java?
- JMS: people tend to like native Java. Unsure how the memory management might work, e.g. managing persistent buffers. Happy to help someone look into it. (Little experience)
- MD: someone doing it for spark? pyspark talking to tensorstore
- JM: netcdf
- HJ: netcdf-java uses netcdf-c for writing NC4.
- WF: JNI/JNA tricky so didn’t do more, but for the writing there was no alternative.
- MD: HDFS drivers in Java land saw lots of barriers; bridging through a native C interface seemed to raise uncertainties.
- R (IV)
- graphblas would also like to help get the libzarr C implementation out of netcdf-c as a standalone library
- goal is a binary sparse format
- MD: Jim works on team. Thought community wasn’t behind Zarr.
- IV: from last Tuesday. Tried Zarr. Had issues. Tried HDF5 had more issues.
- MD: see also awkward arrays (had prototype of AA in zarr. bit clunky. work on convention for v3. would also work for sparse)
- WF: There is a lot of modularization in the C library source; as a proof of concept, we might be able to create a separate library. But I also recognize I’m looking for something to work on besides spinning up a grant proposal.
- WF: internal considerations. capacity. etc.
- but from a code-organization standpoint, the code is in a separate library, compiled as an object that is linked against.
- likely wouldn’t take a ton of work. need to examine.
- HJ: the netcdf-java implementation also has the zarr package separated from CDM (i.e. ditto)
- WF: precedent with the udunits package. similar to how nczarr is now. (now maintained by a team outside Unidata)
- IV: good to hear. maybe also what Jim wanted to hear. segfaults.
- WF: putting new release out soon. bug fixes & messaging. (IV: need conda package)
- MD: sparse data needs a binary storage. so good time to write down those requirements
- MD: gets us into variable chunk sizes which are needed in zarr for these uses.
- IV: don’t care just yet. (not accessing by chunk right now)
- DB: why not just numpy sparse storage? MD: you want configurability
- DB: sparse as a codec? you don’t want to build that in memory. i.e. you want a sparse representation
- also: what does it even mean
- JMS: think we can shoehorn sparse into Zarr but don’t think just encoding them as dense arrays is the best way. need to add variable-sized chunks to do something reasonable. maybe based on the store abstraction but not on the array abstraction.
- MD: what if there are other reasons for variable-sized chunks? JMS: even then it’s shoehorning.
- IV: encoding as zgroup? JMS: depends on operations (e.g. chunks in spatial region)
- MD: you can definitely design a binary format (but there isn’t one); Zarr is already around, though, rather than making a new binary format.
- IV: all formats are a collection of dense 1-D or maybe 2-D arrays
- JMS: in-memory arrays have different constraints. parallel writes?
- JMS: you don’t really care about the linear mapping.
- IV: number of sparse formats is due to the number of use cases
- MD: cf. Arrow’s Feather format: a dump of each buffer it takes to reproduce it, and it’s fine (no chunk-wise access, no parallel reads from remote)
- i.e. Arrow is not better or worse
- Zarr with a convention is Arrow
- IV: chunking in arrow
- IV: another use case is variable length strings (in V3)? MD: not as numpy arrays
- JMS: v2 does but you can’t index in the string
- MD: numpy object arrays (an object-dtype string example is at the end of these notes)
- JMS: most obvious zarr encoding is having each chunk sparse. DB: why isn’t that the “zarrthonic” way?
- JMS: not on top of zarr-python. MD: don’t want foo[:]. IV: most libraries don’t comply with numpy. dtypes of each 1-D array.
- JM: https://data-apis.org/
- MD: see cupy work
- DB: in dask you should be able to use such arrays; there’s a meta keyword argument that gets passed around that contains a description of the underlying array type (the source of truth for the array)
- JM: generally good :+1: let’s work on it
- IV: have already done this for some sparse array types in anndata (a CSR-as-group sketch is at the end of these notes)
- JM: should capture as convention/extension
- IV: variable size chunks would help
- JM: Martin, someone already working on that spec?
- MD: spec change is easy, but implementation work is harder. (dask array may have some of the logic)
- DB: any objections?
- MD: there were some, but can’t summarize
- JMS: boundaries in separate place or in metadata
- MD: Ryan Abernathey thinks only the storage layer should know about it (not in metadata)
- any read would need to pass not just a key but also where in the array it is
- IV: https://github.com/zarr-developers/zarr-specs/issues/62 (a chunk-lookup sketch over explicit boundaries is at the end of these notes)
- JMS: doesn’t seem very natural in the storage layer. i.e. not MutableMapping API.
- MD: don’t expect chunks in one dimension to get so large that the metadata is a problem
- IV: have students who could work on this
- JMS: what happens on re-sizing
- MD: API is to be determined but that’s an implementation question
- JMS: unless you need a “default chunk size” metadata or similar.
- IV: ZEP meeting (tomorrow) the place to bring up ragged arrays
- JMS: harder than sparse
- IV: ok not being able to index into the variable length arrays
- vlencodecs … need spec …
- running out of steam …
- IV: was looking at dtype extensions and thought perhaps ragged arrays could go there. but perhaps another one. which one is right?
- JMS: dtype, generic extensions, … would need custom codec and generic.
- IV: might make sense to reduce what’s allowed for the dtype extensions
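Sketches:

These are illustrative sketches for points referenced above, not agreed designs or existing APIs.

For the async discussion under “Java and async” (make the lowest storage layer async, join/wait, then optionally decode specific chunks): a minimal Python sketch. AsyncStore and get_chunks are hypothetical names, not zarr-python API, and the codec is just an example; MD’s async-zarr repo linked above is the actual prototype.

```python
import asyncio
from typing import Dict, Iterable, List

import numcodecs  # used only for an example decode step


class AsyncStore:
    """Hypothetical lowest layer: async key -> raw chunk bytes (e.g. S3/HTTP)."""

    async def get(self, key: str) -> bytes:
        raise NotImplementedError


async def get_chunks(store: AsyncStore, keys: Iterable[str],
                     codec=numcodecs.Zstd(), decode: bool = True) -> Dict[str, object]:
    """Fetch chunk keys concurrently, join/wait, then optionally decode them."""
    key_list: List[str] = list(keys)
    raw = await asyncio.gather(*(store.get(k) for k in key_list))
    return {k: (codec.decode(buf) if decode else buf)
            for k, buf in zip(key_list, raw)}
```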
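For the sparse discussion: a CSR matrix stored as a Zarr group of dense 1-D arrays plus attributes, roughly the shape of what anndata already does (per IV). Uses the zarr-python 2.x API; the attribute keys (sparse_format, shape) are illustrative, not an agreed convention.

```python
import numpy as np
import scipy.sparse as sp
import zarr


def write_csr(group: zarr.Group, mat: sp.csr_matrix) -> None:
    # A CSR matrix is fully described by three dense 1-D buffers plus its shape.
    group.attrs["sparse_format"] = "csr"               # illustrative convention key
    group.attrs["shape"] = [int(n) for n in mat.shape]
    group.create_dataset("data", data=mat.data)
    group.create_dataset("indices", data=mat.indices)
    group.create_dataset("indptr", data=mat.indptr)


def read_csr(group: zarr.Group) -> sp.csr_matrix:
    return sp.csr_matrix(
        (group["data"][:], group["indices"][:], group["indptr"][:]),
        shape=tuple(group.attrs["shape"]),
    )


g = zarr.group()  # in-memory store
original = sp.random(1000, 200, density=0.01, format="csr")
write_csr(g, original)
assert np.array_equal(read_csr(g).toarray(), original.toarray())
```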
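For the variable-sized (“rectilinear”) chunking thread (zarr-developers/zarr-specs#62): if per-dimension chunk sizes were listed explicitly in the array metadata, chunk lookup becomes a search over cumulative boundaries. The chunk_sizes key and metadata layout are hypothetical, one of the options debated above (metadata vs. storage layer).

```python
import bisect

# Hypothetical metadata: explicit chunk sizes per dimension instead of one
# fixed chunk shape (10 + 40 + 50 == 100 rows; a single 60-wide column chunk).
array_meta = {
    "shape": [100, 60],
    "chunk_sizes": [[10, 40, 50], [60]],  # hypothetical key, per zarr-specs#62
}


def locate(index: int, chunk_sizes: list) -> tuple:
    """Map an index along one dimension to (chunk id, offset within chunk)."""
    bounds, total = [], 0  # cumulative boundaries, e.g. [10, 50, 100]
    for size in chunk_sizes:
        total += size
        bounds.append(total)
    cid = bisect.bisect_right(bounds, index)
    start = bounds[cid - 1] if cid else 0
    return cid, index - start


# Row 42 falls in chunk 1 at offset 32; column 59 is in the only column chunk.
assert locate(42, array_meta["chunk_sizes"][0]) == (1, 32)
assert locate(59, array_meta["chunk_sizes"][1]) == (0, 59)
```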
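For the variable-length string point (MD: numpy object arrays): Zarr v2 already handles this via an object dtype plus a value codec, with whole-element access only (matching JMS’s caveat that you can’t index into a string). A minimal zarr-python 2.x example:

```python
import numpy as np
import zarr
from numcodecs import VLenUTF8

strings = np.array(["a", "much longer string", ""], dtype=object)
z = zarr.array(strings, dtype=object, object_codec=VLenUTF8(), chunks=2)
assert z[1] == "much longer string"  # returns whole strings, not substrings
```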