We’ve got an exciting announcement for you all! 😄
We’re excited to announce that the Zarr community has made a move from Gitter to Zulip as our primary chat platform. This transition marks a new chapter for our community and offers several advantages for our members.
Join here → https://ossci.zulipchat.com/
Zulip offers a robust and versatile platform for communication and collaboration. Its threading model allows for organized and focused discussions, making it easier for community members to follow and participate in conversations effectively. Additionally, Zulip provides powerful search capabilities, ensuring that valuable information shared in the past remains accessible to all.
Zulip’s unique message sharing feature allows conversations to be easily shared around the web via unique links. In addition, Zulip’s indexing of all content by search engines ensures that the knowledge base is easily accessible to all users.
We extend our sincere gratitude to the good humans at the Open Source Science Initiative (OSSCi) for generously hosting the Zulip server. Their commitment to supporting open science and collaborative research is commendable, and we’re thrilled to partner with them on this endeavour.
Shoutout to Jonathan Starr for helping us! 🙌🏻
The OSSCi Zulip server will serve as a hub for various projects in the scientific Python ecosystem, starting with Zarr. By centralising communication within this platform, we aim to foster greater collaboration, knowledge sharing, and community building among like-minded individuals passionate about open science and research.
With this migration, the OSSCi Zulip server becomes the official chat platform for the Zarr community. We encourage all Zarr users, contributors, and enthusiasts to join us on Zulip to stay updated on the latest developments, seek assistance, and engage with fellow community members.
At Zarr, we value the input and ideas of our community members. We’re committed to continuously improving our platform and user experience. Therefore, we welcome any feedback, suggestions, or ideas you may have regarding the Zulip migration or any other aspect of our community. Your input helps us better serve the needs of our users and advance our shared goals.
Please create an issue in zarr-developers/community or join one of our community meetings if you’d like to chat with us!
Ready to join the conversation? Head over to the OSSCi Zulip server and dive into discussions surrounding Zarr and other exciting projects in the scientific Python ecosystem.
With our shift from Gitter to Zulip, it’s worth mentioning that the majority of discussions on Zulip have involved the core developers of Zarr. Now, we’re extending our warm invitation to the wider community to join us on Zulip. Your involvement is crucial as we foster a more inclusive and vibrant community.
We look forward to connecting with you there! ✌🏻
~Sanket Verma
Recently, several community members and I have been speaking at various conferences and events. There have been exciting developments in the Zarr ecosystem, such as finalising the V3 specification, submitting new ZEPs, and initiating new implementations.
While I’m mostly giving beginner talks on Zarr, which answer the how, why, and what, the enthusiastic community members have been talking about other exciting stuff!
In this blog post, I highlight a few talks which were delivered in the past two months. Also, we’re maintaining a playlist on YouTube, which has a more extensive collection of talks from various domains and diverse speakers. Check the playlists: Zarr: Introductory Talks and Zarr: Projects, Uses, Research and Workflows.
I went to Berlin, Germany, in April to speak at PyCon DE and PyData Berlin 2023. My talk was titled “The Beauty of Zarr”, where I emphasised the inner workings using some neat illustrations by Trevor Manz. I highlighted how simple, convenient and hackable it is to use Zarr. After going through various explanations, I focused on some critical issues that Zarr addresses through its design, i.e. chunking, compression and cloud-friendly storage.
Towards the end, I prepared a Jupyter notebook where I walked through Zarr 101 code to create, read, write and manipulate arrays. I also converted the Zarr pixelated logo from .png to .zarr format, which was a neat closing for my talk.
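To give a feel for the chunking and compression ideas mentioned above, here is a minimal sketch in plain Python. It does not use the Zarr API and is not the notebook from the talk. The function names, the `zlib` compressor and the dictionary standing in for a store are all assumptions made for illustration.

```python
import struct
import zlib


def write_chunked(values, chunk_len):
    """Split a 1-D list of ints into fixed-size chunks and compress each one."""
    store = {}  # stands in for a key-value store (directory, zip file, bucket)
    for start in range(0, len(values), chunk_len):
        chunk = values[start:start + chunk_len]
        raw = struct.pack(f"<{len(chunk)}i", *chunk)  # little-endian int32
        store[str(start // chunk_len)] = zlib.compress(raw)
    return store


def read_slice(store, chunk_len, start, stop):
    """Read values[start:stop], decompressing only the chunks that overlap it."""
    out = []
    first, last = start // chunk_len, (stop - 1) // chunk_len
    for idx in range(first, last + 1):
        raw = zlib.decompress(store[str(idx)])
        chunk = list(struct.unpack(f"<{len(raw) // 4}i", raw))
        lo = max(start - idx * chunk_len, 0)
        hi = min(stop - idx * chunk_len, len(chunk))
        out.extend(chunk[lo:hi])
    return out


data = list(range(1000))
store = write_chunked(data, chunk_len=100)
part = read_slice(store, 100, 250, 420)  # touches chunks 2, 3 and 4 only
```

The point of the sketch is that a read only touches the chunks it needs, which is what makes this layout friendly to object stores where every key is a separate request.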
The slides and notebook can be accessed here.
Please watch the video here: 👇🏻
Earth Science Information Partners (ESIP) is a community of data and information technology practitioners working together to coordinate earth science interoperability efforts. ESIP has various collaboration areas. ESIP Collaboration areas are made up of administrative committees and small working groups that are called clusters. Some of them are:
And many more.
The ESIP Cloud Computing Cluster organised a three-part series on Zarr titled “Zarr: The Next Generation”. In every part, the Zarr Community members talked about several things ranging from V3 to conventions to ZEPs.
The first part took place on March 27th where:
The video recording of the session can be seen here: 👇🏻
The second part took place on April 24th where:
The video recording of the session can be seen here:
The third part took place on May 22nd where:
The video recording of the session can be seen here:
These meetings covered a great deal of recent developments in the Zarr ecosystem. The ZEPs mentioned above explained the V3 specification, sharding, and a couple of exciting new features the community is working on. The interesting thing to note here is that ZEP0003 and ZEP0005 were written by community members to support use cases in their own domains. This shows the openness and flexibility of the Zarr open-source community and how we support everyone. Though these ZEPs are still in the draft state, they’ll be finalised soon for adoption.
I will discuss the V3 specification in a separate blog post, so I won’t go into the details here. But it’s worth noting the GeoZarr specification and what Briana presented. GeoZarr is one of the conventions built on top of the Zarr specification; it supports various use cases of the geospatial community for storing their data and metadata. The GeoZarr SWG (Standards Working Group) has been working quickly despite the roadblocks (as mentioned by Briana). The progress and specification can be seen here.
These are some of the public engagements done by the Zarr Community members in the past months. If you spoke on Zarr recently or in the past and would like me to highlight your talk, please don’t hesitate to contact me. If you’re working on something interesting which involves Zarr and want to share it with the community, please say ‘Hi’ to me!
I’ll be talking to you all soon.
Until next time, peace! ✌🏻
~Sanket Verma
I hope you are doing great. Recently, there has been a lot of exciting development on the Zarr front, including new collaborations, conventions, research publications and the submission of new ZEPs. I’ll cover those in separate blog posts. Meanwhile, I want to talk about something interesting for the Java community in the Zarr ecosystem. ;)
OME hosted a 4-day event last November, and on one of the days, they discussed the future of the Java implementation of Zarr, i.e. Zarr-Java, extensively. The discussion was centred around the needs, current state and future work needed to have a solid foundational Zarr implementation in Java which is much needed by the community. This blog post aims to summarise the important sections of the meetings, which could be used as a reference for future work in the development of Zarr-Java.
Disclaimer: This blog post is a naive attempt by me to understand the vast Java and Zarr ecosystem and summarise it in a few sentences, so if you think I misinterpreted something, feel free to point it out. I’m more than happy to be told that I was wrong. :)
Thanks to Chris, Sebastien, and Norman on behalf of Glencoe Software, OME and Scalable Minds for putting the slides together, delivering and moderating both sessions.
First, I’m going to focus on the reason why this meeting took place. There is no single Java implementation of Zarr on which the whole Java community could rely. It might be too big of an ask, but having something like Zarr-Python for the Java community would be perfect. The JVM Zarr Community is fragmented, which increases friction in the community and further affects and delays the adoption of a single OSS for the whole community. Until now, the developers/research groups/companies of the Java ecosystem who need the Zarr package have been forking various implementations of Zarr and trying to get them to work according to their use case. It might help a single cause, but it certainly doesn’t help the larger community. Moreover, these forked libraries are unmaintained when the desired use case is achieved due to the lack of resources and developer support. Having multiple similar libraries also affects confidence and trust and puts the community in a state of ambiguity on which project to rely on.
As it is evident from above, there is a strong need for a community-wide accepted Zarr-Java project which covers all the essential baseline features from the Zarr specification. This also creates room to strengthen and improve the existing Zarr specification by introducing better cross-language engagement and participation.
There’s also a need to define baseline features for Zarr-Java that the reference implementations should support. The requirements, as shared during the sessions, are:
Baseline requirements:
Nice to have features:
These requirements are a fair ask and, if/when developed, will serve as a solid foundational block for the Zarr-Java ecosystem.
Moving on, let’s see what Zarr’s history of development in the Java sphere has been like for the past years.
As you can see from the above timeline, the earliest development started in October 2011 by scalableminds on webknossos, which lets you annotate, visualise and share N-dimensional arrays. After that, folks at Janelia began working on Java NGFF (Next Generation File Format) via N5. Finally, the first conversation about having a JVM Zarr implementation started in 2018 in zarr-developers/community, which can be seen here.
This led Ryan Williams from the Zarr Steering Council to work on laseronlab/ndarray.scala. Zarr’s first pure Java implementation was not seen until 2019, when Brockmann Consult released it; it lives at bcdev/jzarr. The efforts from the Brockmann group are commendable, as jzarr is one of the most precise adoptions of the Zarr specification. Even though it’s been almost a year since the last commit, no other Zarr Java implementation comes close to what jzarr can achieve.
After this, various interesting projects showed up, as seen on the timeline, which included the adoption of Zarr Specification in some manner. Chris did an excellent job explaining these various projects in the morning session, which can be seen here. I’d highly recommend listening to him before going further.
Despite these outstanding efforts by exceptional groups and individuals, the community remained somewhat fragmented, and there is a strong need to unite and work on a collaborative project.
jzarr 0.3.5, jblosc 1.0.1 and the Amazon S3 JSR-203 Java 7 NIO2 implementation are the most stable, well-documented and cohesive OSS projects. jzarr’s adoption of the Zarr specification is quite good, and most of the community is using it. But despite its merits, it has certain limitations:
There are other options the community could look at, like N5+N5-Zarr, Z5, ndarray.scala or NetCDF-Java. But when we deep dive into their codebase, existing framework, learning curve, and adoption of the Zarr specification, it seems like every other project falls short on one or many critical features which are absolutely needed. Again, Chris did a fantastic job explaining those, and you can listen to it here.
If you like great visuals instead, Josh Moore prepared a neat matrix of the current state of projects.
Here S denotes fully supported, P denotes partially supported, and N denotes not supported.
This clearly says the community needs a Zarr implementation that ticks all the boxes mentioned above. So let’s have a look at what is being proposed.
The proposal for moving forward looks something like this:
zarr-developers/zarr-java
and bring the best ideas, concepts, and code from N5-Zarr, Zarr, and NetCDF-Java into a reference library
The idea here is to assemble people with a specific skill set and bring them to work together under the zarr-developers umbrella. There was a Q&A session after the presentation. Some of the crucial insights from the Q&A are:
Since this blog post aims to summarise the meetings, I can only cover a tiny portion of the Q&A sessions. So I’d encourage you to listen to the QnA sessions from morning and afternoon sessions to see what the community thinks of this effort and how excited they are.
You can watch the full recording of morning session here 👇🏻:
And afternoon session here 👇🏻:
~
That’s it from my side. I hope this post was helpful and summarised the discussions well. If anything is unclear, criticism is welcome!
Keep watching this space as I try to cover the advancements of what we’ve discussed. As always, if you’d like to get involved, feel free to drop ‘Hi’ on our Gitter channel. Until next time. Peace! ✌🏻
~Sanket Verma
The holidays are just around the corner, and we wanted to share the good news about the Outreachy participation, which shows the drive and motivation of contributors, the trust and strength of the open-source community, and the resiliency of the Zarr project.
The Initiation 🏁
It was October’s first week, and our community Gitter channel saw a sudden wave of incoming messages from Outreachy participants (i.e. Outreachies). Initially, the messages mainly stated that they were excited to find the Zarr project and looked forward to contributing. I was happy to see the initial response from the Outreachies. I thought I had done a good job writing the project descriptions on the Outreachy portal, since everyone seemed to like them and was now considering working with us for three months.
Speed-bumps 🚧
But after a few days, the messages shifted the tone from “Hi, I’m excited to be here…” to “Hi, I have this technical issue…” and their intensity increased manifold. It started getting difficult to manage the Zarr community and incoming Outreachies on a single Gitter channel, and we thought of creating a separate Gitter channel for the Outreachies. We started interacting with all the participants and helping them out late at night.
🤝🏻
Everything was going fine until we noticed that there should be a central place/guide that every aspiring intern could refer to when starting their open-source contribution journey to Zarr with Outreachy. We also thought it’d save us from answering the same questions multiple times in multiple places (Gitter, emails, Twitter etc.). After this, I wrote two blog posts to help the Outreachies during their contribution phase, which can be seen here:
The strength of open-source 💪🏻
I remember that in one of the Zarr community meetings, all of us were happy to see the participation by Outreachies during the phase. During the meeting, I mentioned that Josh and I were the ones primarily engaging with the participants, and that we’d love to see others from the community helping us. After that meeting, I could see some of the long-time contributors of Zarr jumping in and helping us support the Outreachies by answering their questions, reviewing pull requests, providing suggestions and finally merging their PRs. I believe that’s the real spirit of open source, and I was overjoyed to see it.
After four weeks of the contribution phase, here are some stats:
Finally… 🥺
It took us some time to pick the best of the many qualified applicants, and after going through the immensely talented pool of Outreachies, we finally selected two interns to work with Zarr for the December 2022 cohort. They are:
Though it’s only been a week since they started their work, they’re off to an excellent start. Weddy’s blog can be seen here, where she’ll be posting her updates. Similarly, Awa’s blog can be seen here, and I like the blog’s theme. Both of them have posted their first introductory blog post, and I was motivated to learn about the core values which drive them to work hard with integrity.
I’ll be cross-posting their upcoming blog posts regularly here, so please keep an eye on this page. I’m excited to help them build new things for Zarr and look forward to working with them. Until next time. Peace! ✌🏻
~Sanket Verma
The specification is now being finalized and needs your feedback before the Zarr Steering Council and the Zarr Implementation Council vote on it. In two weeks (on the 19th of December), the spec will go into feature-freeze, meaning only changes (issues) that have been previously discussed will be incorporated. The documents in question are:
Motivation and Context
The Actual Spec
If you are familiar with V2, you might want to start at the comparison with V2 section of the specs.
Please feel free to:
~Jonathan Striebel, scalableminds and Jeremy Maitin-Shepard, Google
It’s been more than a week since the Outreachy Contribution Phase opened, and we’ve been actively engaging with the applicants through our Gitter chat, GitHub issues, PRs etc.
If you haven’t read the first blog post, please check it here: https://zarr.dev/blog/outreachy-contributor-guide/.
We’re so happy and excited to see the enthusiasm of the applicants, and sometimes it’s been hard to keep up with all of the queries and messages (sort of a good thing). :)
We were going through the existing PRs and wanted to find an additional way for applicants to contribute to Zarr and engage with the Zarr community. So we came up with the idea of #beautifulzarr.
It is a repository under zarr-developers GitHub, which would host beautiful use cases, visualisations, code snippets, screenshots and Twitter/social media mentions of #zarr and Zarr data.
We couldn’t think of a better time to start the compilation of fantastic work the community is doing with the help of Zarr. We plan to build and evolve this repository over time and curate its contents on our homepage.
How can you contribute to #beautifulzarr?
Fork the repository, create a new folder inside the _data folder and name it your GitHub username. It should look like this: _data/<YOUR-USERNAME>. Ex. _data/MSanKeys963/
Browse the vast internet 🌏 (WWW) and find how the community uses Zarr for their work. Try to capture their work in screenshots, code snippets (along with visualisations), use cases, and mentions on Twitter/social media (#zarr or similar). Add your result in a .md file in the folder created above. Ex. _data/MSanKeys963/README.md
Submit a GitHub Pull Request. The maintainers for this repo will review your PR and then merge it.
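If you’d rather script the folder and file creation from the steps above, here is a minimal sketch. The helper name `add_entry`, the username `your-username` and the README text are placeholders, and a temporary directory stands in for the local clone of your fork.

```python
from pathlib import Path
import tempfile


def add_entry(repo_root, username, markdown):
    """Create _data/<username>/README.md inside a local clone and return its path."""
    folder = Path(repo_root) / "_data" / username
    folder.mkdir(parents=True, exist_ok=True)
    readme = folder / "README.md"
    readme.write_text(markdown, encoding="utf-8")
    return readme


# A temporary directory stands in for your fork's working copy.
with tempfile.TemporaryDirectory() as repo_root:
    path = add_entry(repo_root, "your-username", "# My #beautifulzarr finds\n")
    relative = path.relative_to(repo_root).as_posix()
    content = path.read_text(encoding="utf-8")
```

In a real contribution you would point `repo_root` at your clone, then commit the new file and open the pull request as described above.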
Here are some tips to get you started:
→ Tip 1: A few code snippets show how easy it is to visualise Zarr data. Check here. 💡
→ Tip 2: Check out MSanKeys963’s folder and README.md to get some inspiration! 🌳
Once your PR is accepted and merged, you’ve successfully passed the Outreachy contribution phase. 🎉
Now you have to wait for the final results, or you can start working on additional issues to increase your chances of selection. Please refer to the first blog here for additional work. 🤞🏻
If you have any queries during the contribution phase, don’t hesitate to ask them in the Gitter. We’re here to help you!
Happy Contributing! ✌🏻
~Sanket Verma
We’re elated to see the initial participation from interns in the contribution period for the Outreachy December 2022 cohort. This blog post aims to briefly explain how you can contribute to the Zarr open-source project and prepare yourself for the listed projects here.
Zarr has submitted three projects for Outreachy this time, and every project requires a fundamental understanding of Zarr along with some additional skills. So though the initial steps for all three projects remain the same, I’ll explain the contribution steps for every project separately, emphasising some of the essential skills required for each. Before we start, here are some of the important links:
Outreachy project link: https://www.outreachy.org/outreachy-december-2022-internship-round/communities/zarr/#create-tutorials-for-zarr
Please go through the complete project details. You may need to sign in/sign-up to see them. 👀
Feel free to ask questions related to setting up Zarr in the Gitter chat. 🙋🏻‍♂️
Once you’re done setting up Zarr:
PS. We’d strongly encourage you to go through the list of open issues first and then ask for help in the Gitter chat.
Outreachy project link: https://www.outreachy.org/outreachy-december-2022-internship-round/communities/zarr/#managing-zarr-releases-with-rever
Please go through the complete project details. You may need to sign in/sign-up to see them. 👀
Once you’re done setting up Zarr:
Once you’re done with Rever’s documentation, try to create rever.xsh in the zarr-python repository.
PS. We’d strongly encourage you to go through the list of open issues first and then ask for help in the Gitter chat.
Outreachy project link: https://www.outreachy.org/outreachy-december-2022-internship-round/communities/zarr/#testing-the-support-and-interoperability-of-zarr-z
Please go through the complete project details. You may need to sign in/sign-up to see them. 👀
Once you’re done setting up Zarr:
Once you’re done with the above steps, create a ZipStore using zarr-python and test the support across various Zarr implementations.
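To get a feel for what a zipped Zarr array contains before you start testing, here is a hand-rolled sketch using only the Python standard library. It does not use zarr-python’s ZipStore. The metadata fields follow the Zarr V2 layout as I understand it, while the array values and the in-memory buffer are made up for illustration.

```python
import io
import json
import struct
import zipfile

# Array metadata following the Zarr V2 layout: a 2x3 int32 array in one chunk.
zarray = {
    "zarr_format": 2,
    "shape": [2, 3],
    "chunks": [2, 3],
    "dtype": "<i4",
    "compressor": None,
    "fill_value": 0,
    "order": "C",
    "filters": None,
}
values = [1, 2, 3, 4, 5, 6]

buffer = io.BytesIO()  # in-memory stand-in for a .zip file on disk
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr(".zarray", json.dumps(zarray))
    zf.writestr("0.0", struct.pack("<6i", *values))  # the only chunk, uncompressed

# Read it back the way another implementation would have to.
with zipfile.ZipFile(buffer) as zf:
    names = sorted(zf.namelist())
    meta = json.loads(zf.read(".zarray"))
    chunk = list(struct.unpack("<6i", zf.read("0.0")))
```

ZIP_STORED is chosen here so that the zip container adds no compression of its own. For the actual project you would create the store with zarr-python and then try opening the same file from other implementations.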
PS. We’d strongly encourage you to go through the list of open issues first and then ask for help in the Gitter chat.
→ Once your PR is accepted and merged, you’ve successfully passed the Outreachy contribution phase. 🎉
Now you have to wait for the final results, or you can start working on additional issues to increase your chances of selection. 🤞🏻
If you have any queries during the contribution phase, don’t hesitate to ask them in the Gitter. We’re here to help you!
Happy Contributing! ✌🏻
~Sanket Verma
I hope my previous blog was a good read and worth your time. Just to shed some light on ZEPs: recently, ZEP1 was submitted by Alistair Miles and Jonathan Striebel and is currently under review by the Zarr Implementations Council and the Zarr community. Feel free to leave your thoughts on ZEP1 here. I’m pleased to see the ZEP process at work and hope it assists the Zarr community in systematically achieving critical milestones.
In early 2021, we submitted a proposal to the Chan Zuckerberg Initiative’s (CZI) Essential Open Source Software for Science (EOSS) grant program. The proposal aimed to accelerate Zarr’s development on issues often too significant to tackle through volunteers’ contributions. Some of the high-level goals we focused on using the grant were API unification across open-source projects like NumPy, Dask, Xarray, project maturity, and efficient community engagement. The Zarr Community along with the Zarr Steering Council spent almost a year working towards these goals, and we’re proud to say that we’ve made significant progress.
As promised in the last blog, I will talk about what we’ve accomplished so far apart from ZEPs and what the upcoming months for the Zarr project and the community look like. Also, I’ll shed some light on the deliverables we’ve completed under the CZI EOSS4 grant.
API Unification
The Zarr format lets you store large arrays as small compressed chunks, which makes collaboration with array-providing projects like NumPy and Dask a must. API unification plays a crucial role in interoperability. It will allow the OSS community to choose transparently between implementations, making algorithms more generalisable and scalable.
We identified several discrepancies between Zarr and related projects (NumPy and Dask) and corrected them. Juan Nunez-Iglesias worked on adding support for fancy indexing, and Ben Jeffrey fixed indexing for scalar NumPy values. See zarr-python #725 and zarr-python #974 respectively.
Mads R.B. Kristensen worked on adding support for multiple array types. See numcodecs #305. If you know of other ways that we could make Zarr work more cleanly with Dask, NumPy or other array APIs, please let us know.
Xarray / NetCDF Interoperability
NetCDF (a long-time provider of stable file formats) and Xarray (N-D labelled arrays) have been updated to support each other’s representation of named dimensions. Mattia Almansi worked on adding support from Xarray’s side (see xarray #6420), and Dennis Heimbigner worked from NetCDF’s side (see netcdf-c #2257). Also, both projects have agreed to discuss a common Zarr Specification, and a proposal is being drafted for a common standard for named dimensions.
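For context, Xarray’s Zarr V2 encoding records dimension names in an `_ARRAY_DIMENSIONS` attribute stored alongside each array. Here is a minimal sketch of what such attributes could look like and how a reader might use them. The dimension names, units and shape are made up for illustration, and the sketch says nothing about how the pull requests above are implemented.

```python
import json

# Hypothetical attributes for a temperature array with three named dimensions.
zattrs = {
    "_ARRAY_DIMENSIONS": ["time", "lat", "lon"],
    "units": "K",
}
shape = (365, 180, 360)

encoded = json.dumps(zattrs, indent=2)  # what would be written to .zattrs

# A reader pairs each dimension name with the matching axis length.
dims = json.loads(encoded)["_ARRAY_DIMENSIONS"]
sizes = dict(zip(dims, shape))
```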
Multiscale array representation
The ‘datatree’ library by Thomas Nicholas and supported by B-open can now be used to represent a pyramid of related arrays and has been proposed as a standard data structure. Also, bioimaging users from ITK have tested the data structure, and discussions have begun for integration into Napari.
These goals mainly focus on Zarr’s technical development, which revolves around working collaboratively with critical open-source projects in the array storage ecosystem. We will continue working towards strengthening the bridges of interoperability with other projects in the upcoming months.
In this section, I’ll mainly be talking about the community engagement part of Zarr. For my part, I’ve focused on:
The first and foremost thing I did when I started my role was to relaunch the Zarr Blog at the new URL: https://zarr.dev/blog. The newly launched blog contains posts regarding releases, ZEPs and any further events/information vital for the Zarr community. I also worked on revamping Zarr’s webpage, which is at https://zarr.dev/. Currently, I’m asynchronously working on a new website for Zarr, and if you have any thoughts, feel free to share them with me.
Zarr is participating in Google Summer of Code for the first time this year. We made an exhaustive list of potential projects, which can be seen here. After going through several applications, we shortlisted Shivank Chaudhary and Parth Tripathi to work on Building Codecs Registry and Benchmarking Zarr Implementations respectively. I believe participating in open-source programs led by organisations is an excellent way to invite and collaborate with new contributors.
We also worked on increasing participation in conferences and meet-ups. For example, I spoke about Zarr at Open Geospatial Conference along with Ryan Abernathey. I also presented at my local PyData chapter and was elated to see the engaging interaction with the community.
Zarr V2 is now an OGC Standard thanks to efforts led by Ryan Abernathey.
Apart from physically reaching out to the community, we also worked on our social media presence by actively tweeting and blogging about Zarr.
The Zarr community needed a structural process to handle incoming changes to the Zarr Specification and accelerate the development of Zarr Specification V3. This led to the inception of ZEPs and ZIC.
We made new stickers for the project, and I was thrilled when they were delivered. We’ve already distributed many of them and will give them in future meetings.
We achieved a few high-level goals that would help strengthen and bring the Zarr community close. Apart from these, I’ve also been assisting with Zarr-Python releases, managing community calls, regular maintenance of Zarr repositories and working closely with various Zarr Implementations.
I’m very excited and looking forward to Zarr’s future. Having a systematic process in place and a dedicated community manager has streamlined the technical and community development for Zarr and its various implementations. Since ZEP0001 is in its initial review phase, we believe that the implementation of Zarr V3 is the next potential and upcoming change. In upcoming months, we will be focusing on:
In conclusion, our first year with the CZI EOSS4 grant has achieved some important milestones. We solved some crucial technical and community problems, which has paved a smooth path for further development. We believe the upcoming progress will be streamlined and much more systematic.
As for me, it’s been six months since I started working with the wonderful humans of Zarr, and every day I get to learn something new in terms of community engagement, technical skills or as simple as talking and teaching about Zarr to a group of humans. I believe that the future of Zarr looks promising and there are many more exciting things yet to come!
Thanks for reading this blog post. If you’d like to contribute to Zarr in any manner, feel free to ping me or drop a ‘Hi 🙋🏻‍♂️’ over at our Gitter channel. Talk to you soon!
~Sanket Verma
It’s been a long time since I’ve talked to you, and I can’t think of a better time to do it than with the announcement of our newly created community feedback process, which I’ve been working on for quite some time now. It’s been fun and intensive learning since I started working on the ZEPs.
So first, I’ll discuss the motivation and need for a process for the Zarr community. After that, I’ll do a walk-through of different stages during its creation.
But first, I’d like to share my experience working with the Zarr community briefly. June 2022 marks the completion of 5 months of me working with the fantastic humans of the Zarr Steering Council, the Zarr community and the overall open-source space.
These last few months have been full of learning new technical and interpersonal skills, understanding Zarr deeply, making great new friends, meeting and interacting with a diverse and vibrant community, and speaking about Zarr at conferences and local meet-ups. So here’s me talking about Zarr at OGC and PyData Delhi.
Before joining Zarr, I was under the impression that Zarr was a young project which needed someone with experience in handling communities. By young, I mean I thought the Zarr community was still in the early growth phase. But a few weeks after joining Zarr, I realised how large, sophisticated and mature the Zarr community is. Handling a large community has its challenges, and I needed to put the somewhat scattered pieces of the puzzle together to understand the big picture (a.k.a. the community). That helped me better understand the needs and necessary steps to manage the community and ensure its constructive growth.
Out of all the things I’ve done so far, creating ZEP (Zarr Enhancement Proposal) with the community’s help is one of the crucial and somewhat challenging tasks I’ve encountered in my time here.
I plan to cover my other achievements in successive blogs. However, I’ll mostly talk about the ZEP in this blog.
I’d first like to talk about the motivation for having a community feedback process for the Zarr project and its community. The Zarr community is significant and touches the users & developers of various programming languages like Python, Julia, JavaScript, C++ etc. All of these languages have successfully developed an implementation of Zarr based on the V2 Specification.
The users and developers of these implementations have diverse needs and expectations from the Zarr specification. I’ve seen in the community calls how folks come up with an idea/feature/request that, if implemented in Zarr, would greatly benefit the broader community. But most of the time, these proposals are vague and lack proper motivation and, most notably, a well-defined process to execute and implement them.
Talking with Alistair
In February 2022, I was on a call with Alistair Miles (author of Zarr), and we briefly discussed how, in the past, the members of the Zarr community came up with ideas to change the spec. Alistair’s general concern was that their ideas/proposals were only remotely connected to the problems they were facing. In simpler words, the ideas and the problems were not entirely related. Also, sometimes the discussion from one community call doesn’t necessarily get carried over to the next call due to the absence of participation/lack of interest/poor presentation, which halts the discussion on what could’ve been a potentially good proposal to the specification.
It’s not that every idea/proposal is bad. We came across some good proposals, and after establishing the community’s consensus, we moved forward. But converting the ideas into proposals and then implementing them in the Zarr specification along with a minimal working POC requires a ton of work. For the time being, the Zarr Steering Council was shepherding this, but IMHO that’s not sustainable for an ever-expanding community like Zarr.
Considering what I’ve said above, we were in dire need of a process that would not only have a well-defined manner of handling the incoming changes to the specification but also focus on involving the community and stakeholders in the decision-making process.
As time went by, the need for having a process only increased. The need for a community feedback process was strongly evident from a few things we saw in the previous months. So, I’d like to briefly talk about sharding and the evolution of the Zarr Specification from V2 to V3.
Adding Sharding Support
The addition of sharding to Zarr was proposed here and here. By going through the discussions over these GitHub issues, we can see that the Zarr community favours having sharding support added to the Zarr Specification. Also, upon further reading and looking at Ryan’s comment here, we see that the author of the sharding proposal (Jonathan Striebel, Scalable Minds) was not clear on how introducing a significant change to the spec works, which IMO was quite disheartening to see.
Evolution of Specification V2 → V3
Speaking about Spec V2, the Zarr Specification V2 has been widely adopted and implemented in several programming languages. It has enabled the use of cloud and distributed computing to process various large and challenging datasets, particularly in the scientific domain. However, as the usage of Zarr has grown and broadened, several limitations of Zarr V2 have surfaced. Some of them are:
Interoperability: Zarr V2 has been implemented in Python, C, C++, Julia, Java and JavaScript. However, there is no feature parity across all implementations. This is partly because the Zarr V2 spec was originally developed with the Python implementation, which leans heavily on NumPy concepts.
High latency storage: Zarr V2 was originally developed to support local file system storage. Because of this, the design of Zarr V2 implicitly made assumptions about the performance characteristics of the underlying storage technology, such as low latency for storage I/O.
There are other limitations too, like extensibility and storage layout flexibility. These limitations have motivated the evolution towards Zarr Specification V3. Evolving the spec to a newer version needs heavy involvement from the users and developers of Zarr, the core devs of several implementations, the Zarr Steering Council and the overall Zarr community.
We’ve also seen that scattered discussions on several GitHub issues/PRs in the past have led to delayed responses, loss of interest and unexpected outcomes. The process would help facilitate the discussions in a well-defined manner where it would be easy for everyone to follow and participate in it. We primarily aim to use GitHub to conduct discussions on the proposals where the Zarr community could participate asynchronously.
I’ve mentioned some of the many issues the community faced over the past several months, and to me, having a well-defined structured process in place would mitigate these issues.
After talking with the Zarr Steering Council multiple times, hearing what the Zarr community had to say and going through past GitHub issues/PRs, it was time for me to take over and start working on a community feedback process for Zarr, which would later be named the Zarr Enhancement Proposal, abbreviated as ZEP. In this section, I’ll walk through what I read, learnt, and did over the months to formulate the ZEP. I’ll try to limit this section and highlight only the essential parts; otherwise, this would go on and on. But if you’re reading this and are interested in learning more, feel free to ping me. I’d be happy to discuss it with you or help you make one for your project/team/organisation.
Experience with processes
First, I’ll talk very briefly about my experience with creating processes. If you have gone through my first blog here, you’ll notice that I’ve been heavily involved with PyData and NumFOCUS. In the past, working with them, I helped establish a process for the program committees at PyData conferences, which helped us fairly evaluate the proposals we received during the CFP period. I’ve also been taking care of the local PyData chapter in my hometown Delhi, where I had the opportunity to create well-defined systems to run the monthly meet-ups, collaborate with organisations around me, invite speakers etc. I very much liked doing all of it and learnt a great deal about organisation and creating a system from the ground up when needed. But ZEP was completely different from what I had done so far, so it came with a challenge. I’ll walk you through the ZEP process in several small sections, each of which emphasises a different stage of the ZEP’s development. Here we go:
Reading and understanding existing community process
I’ve been using Python for almost six years now and am familiar with PEPs. I’ve gone through various PEPs in the past, including controversial ones like PEP 572, but I had never read PEP 1. PEP 1 lays out the workings and fundamentals of the Python Enhancement Proposal (PEP) process. It is beautifully written and covers crucial sections like PEP types, workflow, review & resolution, templates etc. It helped me understand the structural requirements for developing a robust process. I want to extend my warm thanks to its authors for this. After this, I moved on to understanding NEP and STAC. NEP is somewhat similar to PEP, but it is closer to what we wanted, as PEP is for a programming language and NEP is for an open-source project. NEP is like a simpler version of PEP, focusing on the sections essential for an open-source project. It helped me understand what to include in and what to discard from ZEP. After digesting NEP, I moved to STAC. STAC stands for SpatioTemporal Asset Catalog, a specification for a common language to describe geospatial information. I liked that STAC’s process for handling incoming changes is interwoven with the specification. STAC uses Git and GitHub to their full potential to achieve its goal.
After I was done reading, I understood that ZEP needed to sit somewhere between PEP, NEP and STAC. The system, documentation and workings of PEP were something I wanted to adopt, as PEP has been a successful system for Python for a long time. On the other hand, PEP is made for a language, so I needed to discard a few items from it. This is where NEP proved helpful. The Zarr specification lives in a GitHub repo and uses Git for its workflows, so it made complete sense to use GitHub for handling incoming PRs to the spec, and this is what STAC does. To summarise what I said above, I’d like to draw your attention to the Venn diagram below:
PEP as starting point and evolving it to ZEP
Initially, I used PEP for the first draft of ZEP, which can be seen on the GitHub pull request here. The first draft wasn’t the final one, but it helped me gather feedback from the Zarr community on what was missing, what was needed, and what was expected from the ZEP. The two critical changes incorporated after the feedback were:
I also took care of minor changes like defining governance process around less-interesting ZEPs or the ZEPs where the consensus would be difficult to achieve, removal of Standard Track ZEPs etc. If you’re interested in going through the commit history and the full conversation, please refer here.
Different types of ZEPs
After lengthy discussions, the community agreed on having the following ZEP types. Every category has its importance for the community:
Governance process
The governance process for ZEP needed to be different from what we’ve seen in other processes. The Zarr community operates differently from programming-language communities and other open-source projects. The versatility of the Zarr Specification and its open-source nature have enabled implementations in several programming languages. We wanted the core developers of these implementations to be included in the decision-making process of ZEPs. Since these implementations have followed the specification during their development, we didn’t want to accept a ZEP that would prove beneficial for one implementation and be a breaking change for another. The Zarr Steering Council plays a vital role in the workflow of ZEP, and they should unanimously approve the ZEPs. Also, there should be no vetoes from the Zarr Implementations Council (which essentially consists of core developers of several Zarr implementations). Please read more about the governance of ZEPs here.
Zarr Implementations Council a.k.a. ZIC
The invites were sent to all of the implementations, which can be seen here. I’m happy to share that almost all of the invites were accepted by the implementations, and a diverse group of humans represents the ZIC. These wonderful humans will help us further develop the Zarr Specification, the respective open-source implementations and the overall Zarr community. Please refer to the full list of ZIC members here.
The next step was to set up a website to render and display the incoming ZEPs.
ZEP Website
After the PR was merged, I started working on the website for ZEP. The website shows the active and draft ZEPs as of now. ZEP 0 lives under active ZEPs as it’s a process ZEP. Once we accept draft ZEPs, they’ll be moved under the accepted ZEPs section. Feel free to browse the website here: https://zarr.dev/zeps/.
So, this was my journey in the formulation of the ZEP process. Over the past few months, I’ve learned much about Zarr, its community and the project, and I look forward to solving interesting problems and helping Zarr.
I want to extend my special thanks to Josh Moore, Alistair Miles, Ryan Abernathey and John A. Kirkham from the Zarr Steering Council and the Zarr Community for helping me along the way.
Thanks for reading my blog. I hope I was able to convey my thoughts in a clear and structured manner. If you like this blog, feel free to share it on social media and please mention Zarr too.
I’d appreciate any feedback for this blog, the website or anything remotely related to Zarr. You can reach me at svsanketverma5@gmail.com. See you soon! ✌🏻
~Sanket Verma
Pre-releases of Zarr Python 2.12 are now available! 🎉
This blog post gives an overview of the new features, especially the newly added experimental support for reading and writing Zarr V3, the upcoming format for storing N-dimensional chunked compressed data.
It also highlights other enhancements like creating FSStore from an existing fsspec filesystem, a performance improvement for Zarr arrays when appending data to S3, bug fixes, documentation and a maintenance fix.
Zarr Python 2.12 provides experimental infrastructure for reading and writing the upcoming V3 spec of the Zarr format. Users wishing to prepare for the migration can set the environment variable ZARR_V3_EXPERIMENTAL_API to begin experimenting; however, data written with this API should not yet be considered final.
The new zarr._storage.v3 module has the necessary classes and functions for evaluating Zarr V3. Since the design is not finalised, the classes and functions are not automatically imported into the regular Zarr namespace.
The pre-release can be installed via: pip install --pre zarr
How to create arrays using Zarr V3:
Add ZARR_V3_EXPERIMENTAL_API=1 to your shell by typing this in your terminal:
export ZARR_V3_EXPERIMENTAL_API=1
Then create the array in Python:
>>> import zarr
>>> z = zarr.create((10000, 10000),
...                 chunks=(100, 100),
...                 dtype='f8',
...                 compressor='default',
...                 path='path-where-you-want-zarr-v3-array',
...                 zarr_version=3)
Use z.info to see details about the array you just created:
>>> z.info
Name : path-where-you-want-zarr-v3-array
Type : zarr.core.Array
Data type : float64
Shape : (10000, 10000)
Chunk shape : (100, 100)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr._storage.v3.KVStoreV3
No. bytes : 800000000 (762.9M)
No. bytes stored : 557
Storage ratio : 1436265.7
Chunks initialized : 0/10000
You can also check the Store type here, which indicates Zarr V3.
We see Chunks initialized: 0/10000 because we haven’t written anything to our arrays yet. The chunks will be initialized when we start writing data to the arrays.
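The figures in the z.info output above follow directly from the array's shape, chunk shape and data type. As a rough sketch using only the standard library (the variable names are illustrative and not part of Zarr's API; the 557 bytes is simply the "No. bytes stored" value reported above), the arithmetic works out like this:

```python
import math

shape = (10000, 10000)
chunks = (100, 100)
itemsize = 8        # bytes per float64 ('f8') element
bytes_stored = 557  # "No. bytes stored" from the output above (metadata only)

# Number of chunks along each dimension, rounding up for partial chunks.
chunk_grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
total_chunks = math.prod(chunk_grid)

# Uncompressed size of the full array in bytes.
nbytes = math.prod(shape) * itemsize

print(chunk_grid)                            # (100, 100)
print(total_chunks)                          # 10000
print(nbytes, f"({nbytes / 2**20:.1f}M)")    # 800000000 (762.9M)
print(f"{nbytes / bytes_stored:.1f}")        # 1436265.7
```

The very large storage ratio is a consequence of no chunk having been written yet, so only metadata counts towards the stored bytes.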
There have been significant changes to Zarr’s Python codebase to implement V3 functionality. Highlights of the main changes include:
store.py, which verifies that a key conforms to the V3 specification.
store.py to ensure internally that Zarr stores are always a class with a specific interface derived from Store, which is slightly different from MutableMapping.
metadata files from the data (arrays). Previously metadata and arrays were stored together in a consolidated group known as .zgroup.
convenience.py to use Zarr V3. The default value is None; it will attempt to infer the version from store if possible; otherwise, it will fall back to V2.
convenience.py.
creation.py, which enables the creation of an array using Zarr V3. If None, it will be inferred from store or chunk_store; otherwise defaults to V2.
meta.py with the new V3 data types links. The V3 data types are listed here.
If you’re interested in browsing through all of the code changes, please refer to PR #898.
The work on V3 was done by Gregory Lee and was funded by the CZI. The Zarr Community extends its wholehearted gratitude to Gregory for completing this! 🙌
The old implementation iterated through all the old chunks and removed those that didn’t exist in the new chunks. As a result, it led to significant time delays when appending data to Zarr arrays in cloud services like S3.
The new and improved implementation iterates through each dimension and only finds and removes the chunk slices that are in old but not in new data. It also introduces a mutable list to dynamically adjust the number of chunks along the already-processed dimensions to avoid duplicate chunk removal.
This improvement was added by hailiangzhang with PR #1014.
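To illustrate the difference between the two strategies described above, here is a minimal sketch in plain Python. It is not the code from PR #1014; the function names are made up for illustration, and the chunk grids are given as tuples holding the number of chunks along each dimension.

```python
import itertools


def chunks_to_remove_naive(old_grid, new_grid):
    """Old approach: visit every old chunk and keep those outside the new grid."""
    return [
        idx
        for idx in itertools.product(*(range(n) for n in old_grid))
        if any(i >= n for i, n in zip(idx, new_grid))
    ]


def chunks_to_remove_by_dimension(old_grid, new_grid):
    """New approach: per dimension, visit only the slab that was cut off."""
    working = list(old_grid)  # mutable list, shrunk as dimensions are processed
    to_remove = []
    for dim, (n_old, n_new) in enumerate(zip(old_grid, new_grid)):
        ranges = [
            # Along the current dimension, only indices beyond the new extent;
            # along the others, the (possibly already shrunk) working extent.
            range(n_new, n_old) if axis == dim else range(working[axis])
            for axis in range(len(working))
        ]
        to_remove.extend(itertools.product(*ranges))
        # Shrinking this dimension keeps later slabs from revisiting its chunks.
        working[dim] = min(n_old, n_new)
    return to_remove


shrunk = chunks_to_remove_by_dimension((4, 4), (2, 3))
appended = chunks_to_remove_by_dimension((4, 4), (6, 4))
print(len(shrunk))    # 10
print(len(appended))  # 0
```

When data is only appended, every per-dimension range is empty, so nothing is visited at all, whereas the naive version still walks over every existing chunk. On a store like S3, where each chunk check can mean a network round trip, that is where the time goes.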
FSStore. This feature was added by Ryan Abernathey with PR #911.
json.dumps to support NumPy integers in chunks arguments. This enhancement was added by Eric Prestat with PR #933, and the issue was raised by Mark Dickinson with #697.
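For context, json.dumps only knows how to encode Python's built-in types, so an integer-like object that is not a built-in int (as NumPy integers are) raises a TypeError unless a fallback is supplied. The sketch below shows one general way to provide such a fallback through the default parameter. It is not the code from PR #933: IntegerLike is a hypothetical stand-in so the example runs without NumPy, and encode_integer_like is an illustrative name.

```python
import json
import operator


class IntegerLike:
    """Hypothetical stand-in for a NumPy integer: not an int, but has __index__."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


def encode_integer_like(obj):
    """Fallback for json.dumps: convert integer-like objects to built-in int."""
    try:
        return operator.index(obj)
    except TypeError:
        raise TypeError(
            f"Object of type {type(obj).__name__} is not JSON serializable"
        ) from None


metadata = {"chunks": [IntegerLike(100), 100]}
encoded = json.dumps(metadata, default=encode_integer_like)
print(encoded)  # {"chunks": [100, 100]}
```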
Fix bug that made it impossible to create an FSStore on unlistable filesystems (e.g. some HTTP servers) by Ryan Abernathey; #993.
Update resize doc to clarify surprising behaviour by hailiangzhang; #1022.
Pre-commit configuration now includes YAML check by Shivank Chaudhary; #1015 & #1016.
Details on these features as well as the full list of all changes in 2.12.0a2 & 2.12.0a1 are available on the release notes.
Before the pre-release versions 2.12.0a2 and 2.12.0a1, there were releases 2.11.1, 2.11.2, & 2.11.3 of the Zarr Python package. A special shout-out to all the contributors who made previous releases possible:
Also, a huge thanks to the contributors who made the current versions 2.12.0a2 and 2.12.0a1 possible! 🙌🏻
If you find the above features useful and end up using them, please mention @zarr_dev on Twitter and tweet using #ZarrData, and we’ll make sure to get it featured! ✌🏻
~Sanket Verma