Namaste Zarr Community! 🙏🏻

I hope you are doing great. Recently, there has been a lot of exciting development on the Zarr front: new collaborations, conventions, research publications, new ZEP submissions, and more. I’ll cover those in separate blog posts. Meanwhile, I want to talk about something interesting for the Java community in the Zarr ecosystem. ;)

OME hosted a 4-day event last November, and on one of the days, they extensively discussed the future of the Java implementation of Zarr, i.e. Zarr-Java. The discussion centred on the needs, the current state and the future work required to build a solid foundational Zarr implementation in Java, which the community badly needs. This blog post aims to summarise the important sections of the meetings, so that it can be used as a reference for future work on Zarr-Java.

Disclaimer: This blog post is a naive attempt by me to understand the vast Java and Zarr ecosystem and summarise it in a few sentences, so if you think I misinterpreted something, feel free to point it out. I’m more than happy to be told that I was wrong. :)

Thanks to Chris, Sebastien, and Norman on behalf of Glencoe Software, OME and Scalable Minds for putting the slides together, delivering and moderating both sessions.

The reason for bringing everyone together @ OME2022 👩🏻‍🤝‍👨🏼👨🏿‍🤝‍👨🏻👩🏿‍🤝‍👩🏼

First, I’m going to focus on the reason why this meeting took place. There is no single Java implementation of Zarr on which the whole Java community can rely. It might be too big of an ask, but having something like Zarr-Python for the Java community would be perfect. The JVM Zarr community is fragmented, which increases friction and delays the adoption of a single OSS project for the whole community. Until now, the developers, research groups and companies in the Java ecosystem who need a Zarr package have been forking various implementations of Zarr and adapting them to their own use cases. That might help a single cause, but it certainly doesn’t help the larger community. Moreover, these forked libraries are left unmaintained once the desired use case is achieved, due to the lack of resources and developer support. Having multiple similar libraries also erodes confidence and trust, and leaves the community unsure which project to rely on.

As is evident from the above, there is a strong need for a community-wide accepted Zarr-Java project which covers all the essential baseline features from the Zarr specification. This also creates room to strengthen and improve the existing Zarr specification through better cross-language engagement and participation.

There’s also a need to define the baseline features that a reference Zarr-Java implementation should support. The requirements, as shared during the sessions, are:

Baseline requirements:

  • Java 8+
  • Support for Zarr V2 Specification (including dimension separator)
  • Inspired by Zarr Python API foundational concepts (store, compression, chunk)
  • Data types: signed/unsigned integers 1 -> 8 bytes, 4 and 8-byte floating point
  • Stores: Filesystem, in-memory, HTTP, Amazon S3
  • Extensible compression: blosc, zstd, lz4, zlib, bzip2, lzma at least
  • Chunk API
  • Basic Slice API
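To make a couple of these bullets a bit more concrete, here is a minimal sketch of what "Zarr V2 with dimension separator" and a "basic slice" boil down to at the storage level. In Zarr V2, a chunk is stored under a key made of its chunk-grid indices joined by the dimension separator ("." by default, or "/"). The class and method names below (ChunkKeySketch, chunkKey, chunksForSlice2D) are hypothetical and made up for illustration; they are not the API of jzarr or of any other library mentioned in this post, and the slice helper assumes a non-empty, half-open 2-D slice with non-negative bounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of jzarr or any existing library.
final class ChunkKeySketch {

    // Build a Zarr V2 chunk key by joining chunk-grid indices
    // with the array's dimension separator ("." or "/").
    static String chunkKey(long[] chunkIndex, String dimensionSeparator) {
        StringJoiner joiner = new StringJoiner(dimensionSeparator);
        for (long i : chunkIndex) {
            joiner.add(Long.toString(i));
        }
        return joiner.toString();
    }

    // List the keys of every chunk that overlaps the half-open 2-D slice
    // [start, stop). Assumes 0 <= start[d] < stop[d] for both dimensions.
    static List<String> chunksForSlice2D(long[] start, long[] stop,
                                         int[] chunkShape, String dimensionSeparator) {
        List<String> keys = new ArrayList<>();
        long firstRow = start[0] / chunkShape[0];
        long lastRow = (stop[0] - 1) / chunkShape[0];
        long firstCol = start[1] / chunkShape[1];
        long lastCol = (stop[1] - 1) / chunkShape[1];
        for (long r = firstRow; r <= lastRow; r++) {
            for (long c = firstCol; c <= lastCol; c++) {
                keys.add(chunkKey(new long[] {r, c}, dimensionSeparator));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // Chunk (1, 2) with the default "." separator.
        System.out.println(chunkKey(new long[] {1, 2}, "."));
        // Rows 5..14 and columns 0..9 of an array chunked as 10 x 10, with "/" separator.
        System.out.println(chunksForSlice2D(
                new long[] {5, 0}, new long[] {15, 10}, new int[] {10, 10}, "/"));
    }
}
```

A real Slice API would of course work for any number of dimensions and would also read, decompress and stitch the chunks together; this only shows the key lookup that sits underneath it, using nothing beyond what Java 8 ships with.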

Nice to have features:

These requirements are a fair ask and, if/when developed, will serve as a solid foundational block for the Zarr-Java ecosystem.

Moving on, let’s see what Zarr’s history of development in the Java sphere has been like for the past years.

What’s the history been like? 🕥🕣🕡

As you can see from the above timeline, the earliest development started in October 2011 by scalableminds on webknossos, which lets you annotate, visualise and share N-dimensional arrays. After that, folks at Janelia began working on Java NGFF (Next Generation File Format) via N5. Finally, the first conversation about having a JVM Zarr implementation started in 2018 in zarr-developers/community, which can be seen here.

This led Ryan Williams from the Zarr Steering Council to work on lasersonlab/ndarray.scala. Zarr’s first pure Java implementation did not appear until 2019, when Brockmann Consult released it; it lives here at bcdev/jzarr. The efforts of the Brockmann group are commendable, as jzarr is one of the most precise adoptions of the Zarr specification. Even though it’s been almost a year since the last commit, no other Zarr Java implementation comes close to what jzarr can achieve.

After this, various interesting projects showed up, as seen on the timeline, which included the adoption of Zarr Specification in some manner. Chris did an excellent job explaining these various projects in the morning session, which can be seen here. I’d highly recommend listening to him before going further.

Despite these outstanding efforts by exceptional groups and individuals, the community remains somewhat fragmented, and there is a strong need to unite and work on a collaborative project.

Current state of work 🗂️

jzarr 0.3.5, jblosc 1.0.1 and the Amazon S3 JSR-203 Java 7 NIO2 implementation are the most stable, well-documented and cohesive OSS projects. The jzarr adaptation of the Zarr specification is quite good, and most of the community is using it. But despite its merits, it has certain limitations:

  • The S3 anonymous access is somewhat broken and doesn’t play nicely with S3-compatible storage
  • The project hasn’t been maintained properly in a long time
  • Neither jzarr itself nor the community around it is well placed to add new features like Sharding and V3

There are other options the community could look at, like N5+N5-Zarr, Z5, ndarray.scala or NetCDF-Java. But when we dive deep into their codebases, existing frameworks, learning curves and adoption of the Zarr specification, it seems that every other project falls short on one or more critical features which are absolutely needed. Again, Chris did a fantastic job explaining those, and you can listen to it here.

If you like great visuals instead, Josh Moore prepared a neat matrix of the current state of projects.

Here S denotes fully supported, P denotes partially supported, and N denotes not supported.

This clearly shows that the community needs a Zarr implementation that ticks all the boxes mentioned above. So let’s have a look at what is being proposed.

What’s coming next? 🔮

The proposal for moving forward looks something like this:

The idea here is to assemble people with specific skill sets and bring them to work together under the zarr-developers umbrella. There was a Q&A session after the presentation. Some of the crucial insights from the Q&A are:

  • There are many opinions on several important matters like design choices, what compression to use, consolidating arguments etc., and we’d like to hear from you and get as many hands as we can to work together. We welcome community participation and contributions
  • Forming a consensus on some critical design choices for Zarr-Java
  • Getting Zarr-Java in momentum is not only a developer’s task but also a matter of community participation and engagement
  • The Java community needs to participate in the discussions related to SPEC to help their cause, and if they don’t, they are going to be left behind
  • We’re mostly going to learn things by getting our hands dirty

Since this blog post aims to summarise the meetings, I can only cover a tiny portion of the Q&A sessions. So I’d encourage you to listen to the Q&A from the morning and afternoon sessions to see what the community thinks of this effort and how excited they are.

Session recordings 🎬

You can watch the full recording of the morning session here 👇🏻:

And the afternoon session here 👇🏻:

~

That’s it from my side. I hope this post was helpful and summarised the discussions well. If anything is unclear, criticism is welcome!

Keep watching this space as I try to cover the advancements of what we’ve discussed. As always, if you’d like to get involved, feel free to drop ‘Hi’ on our Gitter channel. Until next time. Peace! ✌🏻

~Sanket Verma