13th February, 2019
Date and time: 20:00 GMT (register preferences here)
Joining instructions: https://zoom.us/j/300670033 (see Full Zoom instructions)
Attendees: Alistair, John, Josh, RyanW, RyanA, martindurant, Constantin, Stephan, Jan
Please feel free to propose an item for the agenda below. If you do, please include your name so we know who proposed the item.
Meta - about this meeting
(Agenda item proposed by Alistair Miles.) - 20:07 GMT
Brief recap of previous meeting for those who didn’t attend (discussion/minutes here - Zarr / n5 imaging use cases call 2019-01-15). Kicked off around microscopy and CZI use cases.
Brief introductions from new attendees and anyone who didn’t introduce themselves on previous call.
- RyanW: single cell analysis with CZI & their matrix service (zarr-based). Working on scala implementation, and excited about javascript implementation (scala can compiler to) for reading directly from cloud stores.
- John K: Janelia. Neural data from different parts of the cortex in mice, e.g., motor cortex, visual cortex. Calcium image data; video image data. Needed parallel writes, use zarr for that.
- Josh: Stuck between zarr community, link to CZI and human cell atlas, but also Java community, why I flagged Stephan. Looking towards single file format for bioimaging community, covering as many use cases as possible, can sell that to microscope vendors.
- RyanA: Columbia University, representing climate data community, high-dimensional data, satellite data, climate model output, 5-6 independent dimensions, many variables, increasingly high resolution, petabyte scale. World traditionally covered by netcdf. Leading pangeo project, scalable data analysis platforms on HPC and cloud, hit the HDF limitations particularly in cloud, object stores. Also developer on xarray Python package, main library for reading netcdf, I implemented the zarr backend with Joe Hamman, learned about zarr and dask, but now have xarray as translational layer between netcdf, zarr. Excited about performance and scalability of the stack. Encouraging climate community to embrace zarr, hopefully will happen through official netcdf team at unidata. Next version of netcdf should have a zarr backend, but still lots of detail.
- MartinD: Array formats: dicom in mri imaging; FITS in astronomy. Work for Anaconda now. Got connected with this community by working on remote filesystems (s3, gcs…). Dask team for 3 years. Being able to access data in remote storage is critical for Dask. So zarr or parquet important. Intake also cares about zarr as input format for xarray.
- Constantin: Visited Stephan’s lab in Janelia. Working with EM data of neural tissue, segmenting of cells, reconstructing connectivity, datasets 100s of TB, even PB. Being able to write in parallel is critical. I needed a way to connect with N5 from Python and C++. Also stumbled on zarr, saw specs very similar, I implemented C++ wrapper that can read both formats, and expose to PYthon. Only library to read both formats in C++. I use it heavily, mostly using N5, but wouldn’t be much to change to zarr.
- Stephan: I wrote N5. Working with large volumes of 3D to 5D from electron and light microscopy. We hit parallel writing when implemented stitching, many tile acquisition, wanted to use HDF5 for export format, previously people used TIF. Took 2 weeks to write the HDF5 file. Many features of HDF5 were nice. Implemented N5, used it on a Spark cluster. Exporting files took 30 mins instead of 2 weeks. Implemented first for Java with file system backend. Then block API could support other backends. Added backends for S3, GCS and HDF5 for fun. Now we can use N5 API to read from all these formats which works nicely in JVM. Wrote a consumer layer for block-based API for image processing library. Then we found zarr, reached out to Alistair, talked about differences in specs. Not super straightforward to replace one with the other, irregular block sizes. Constant came on to support python, z5py, network predictions.
- Jan: Work with Stephan, similar data. I wrote an N5 backend for zarr. Also had a heavy HDF5 problem.
RyanA: Connecting with Zuckerman Institute at Columbia.
RyanA: Looking at ways of implementing a data catalog. How does intake etc.
MartinD: intake language agnostic.
RyanW: How to you read HDF5?
Stephan: JHDF5. HDF5 backend always planning for Zarr, but HDF5 attributes strongly typed by limited structure.
RyanW: Uses C API, requires HDF5 local files.
RyanA: HDF5 people say there is a way to access cloud files.
Josh: I have notes. See hdfgroup call below.
John: There are some pure Java implementations of HDF5. Am not sure how well they cover the spec though.
Purposes of this meeting:
- Connecting people in the zarr/n5 community
- Capturing use cases & requirements, understanding how zarr/n5 is being used and where are major gaps in functionality / pain points / ….. Alistair: can use this regular meeting as a good place for capturing those.
- Establishing a community process for development of zarr/n5 specs and implementations.
- Stephan: interested in technical discussion for bringing the spec forward. See N5 as a vehicle for supporting Zarr. (Good to minimize dialects) Could start with use cases, e.g. why to have varlengths or unaligned chunk grids. Alistair: all feel that there need to be some iteration(s) on the spec including N5 and NetCDF. Can work through them collaboratively, but need some (lightweight) process for making those decisions. RyanA: giving (large) external organizations the feeling that they have a part in the discussion about what gets merged. Pangeo based governance on Jupyter. Alistair: keeping it hackable, not more complicated than we can all handle.
- Constantin: what models of governance? AM: NumFOCUS as a legal entity, but also RFC processes for example in numpy or python. MD: numpy does deal with interfaces. Dask is pursuing NumFOCUS affiliation as well. ( https://github.com/dask/governance )
- (Dropping down to next section for notes)
Schedule - continue every two weeks, at least for now? Sounds like yes.
Rotating chairperson? Josh Moore (next week)
Taking actions.
Any other comments/suggestions about how to run these meetings.
ACTION: create governance repo,
Dask governance discussions: https://github.com/dask/governance/issues
RyanA: Process for becoming a zarr core developer / member of decision making team. Need a pathway for people to become part of that group. The way to get a voice is to contribute to it. E.g., that’s how you get on the Jupyter steering council. Not just code contributions, come to meetings, comment on specs. Also something people can say as an achievement.
…we have working about that in pangeo governance doc, stole it from pangeo.
John: Compare other models. E.g., outside Python, Apache model?
Relationships with other groups
Links to HDF community?
Governance and affiliation
(Agenda item proposed by Alistair Miles.)
Related discussion - https://github.com/zarr-developers/zarr/issues/400
Some questions/discussion points:
- Do we pursue NumFOCUS affiliation like NumPy etc.? If so, what steps are needed to apply?
- What governance model should we adopt? Same as NumPy? Does anyone have direct experience of setting up governance for an OSS project, and can share any insights/experiences?
- If we do copy NumPy, what are the key things we’d need to implement?
- RyanA: BDFL? Mechanism when 2 communities don’t agree. See https://github.com/pangeo-data/governance/blob/master/governance.md Alistair: What does governance address? Martin: technical governance versus wider governance. JohnK: sweeping changes like rebuilding all of conda-forge. RyanA: not large enough to have multiple.
Spec proposals (Tabled)
(Agenda item proposed by Josh Moore.)
Based on the example of Zarr N5 spec diff:
-
What are the steps towards making a spec proposal? (timelines, process)
-
More concretely, are there opinions on options to attempt here?
- Unify? Minimal vs core specs?
Zarr/N5 compatibility (Tabled)
(Agenda item proposed by John Kirkham)
- Next steps for the N5 PR – https://github.com/zarr-developers/zarr/pull/309
- How to handle GZip discrepancies – https://github.com/zarr-developers/numcodecs/issues/169
- Handling shortened edge chunks or standardizing around uniform chunk size
- Other points I’ve missed?
AOB
-
Chair for next meeting? (Create agenda, link on ticket, etc.)
- Josh will set up the agenda for the next meeting and send
around.
- Josh will set up the agenda for the next meeting and send
Misc
Full Zoom instructions
Josh Moore is inviting you to a scheduled Zoom meeting.
Topic: Zarr/N5 weekly
Time: This is a recurring meeting Meet anytime
Join Zoom Meeting
One tap mobile
+16699006833,,300670033# US (San Jose)
+16468769923,,300670033# US (New York)
Dial by your location
+1 669 900 6833 US (San Jose)
+1 646 876 9923 US (New York)
+44 203 695 0088 United Kingdom
+44 203 051 2874 United Kingdom
+49 69 8088 3899 Germany
+49 30 3080 6188 Germany
+49 30 5679 5800 Germany
Meeting ID: 300 670 033
Find your local number: https://zoom.us/u/afcafu8bE
hdfgroup call notes
Here are my
notes from the conversation in case I’ve misunderstood anything:
- John as a part of work with NASA wrote a similar spec some years
ago [1]. Josh will review this for differences but briefly from the
discussion:
a. Intended as a direct representation of the HDF object model
b. An underscore is used as separator
c. A fileset represents a directed-graph
d. Datasets and Groups are stored flatted within top-level
directories and referenced by ID
- Some of these differences are mitigated by the fact that the
object storage schema is intended to have a HSDS service sitting in
front of it
- We discussed briefly the requirement for synchronization between
writing threads. An interesting consideration between the posix and
the cloud use case is the different interpretations of “corruption”:
a. in the posix case, two writers can lead to unintelligible data
b. while in the cloud case, two writers will lead to “hidden” data
(as previous versions of the bucket are overwritten)
- Dax from the sales team pointed out that there have been
involvement with the Broad (Fred Hutch?) for benchmarking formats for
the matrix service
- Elena from the engineering team pointed out that features like
those for parallel writing in the Imaris case have been available for
a while.
- Alexander from the engineering team point out that there are
number of features that don’t exist in zarr like object references.
- In terms of using the virtual object layer (VOL) in upcoming releases:
a. The upcoming version (1.12) will contain documentation, but
early access is possible.
b. A single shared library for all zarr access (s3 & local) would
work, but no S3 primitives are available.
c. The benefit would then be a consistent API across multiple platforms.
d. At the moment, the hdfgroup won’t have capacity for working
directly on integration.
[1] https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.rst