Zarr 2.3 release
Recently we released version 2.3 of the Python Zarr package, which implements the Zarr protocol for storing N-dimensional typed arrays, and is designed for use in distributed and parallel computing. This post provides an overview of new features in this release, and some information about future directions for Zarr.
A key feature of the Zarr protocol is that the underlying storage system is decoupled from other components via a simple key/value interface. In Python, this interface corresponds to the MutableMapping interface, which is the interface that the Python dict implements. In other words, anything dict-like can be used to store Zarr data. The simplicity of this interface means it is relatively straightforward to add support for a range of different storage systems. The 2.3 release adds support for storage using SQLite, Redis, MongoDB and Azure Blob Storage.
For example, here’s code that creates an array using MongoDB:
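A minimal sketch along those lines, assuming the MongoDBStore class added to zarr.storage in this release (the database name and connection details are illustrative, and extra keyword arguments are passed through to pymongo):

```python
import zarr
from zarr.storage import MongoDBStore

# Connect to a MongoDB server (requires pymongo); the database name,
# host and port here are placeholders for your own deployment.
store = MongoDBStore(database='zarr_demo', host='localhost', port=27017)

# Create a group and an array backed by the MongoDB store.
root = zarr.group(store=store, overwrite=True)
z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
z[:] = 42
```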
To do the same thing but store the data in the cloud via Azure Blob Storage, replace the instantiation of the store object with:
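Something along these lines, assuming the ABSStore class added in this release (the container, prefix and credentials shown are placeholders, and the constructor signature may differ in later versions):

```python
from zarr.storage import ABSStore

# Requires the Azure Blob Storage client library. Replace the container,
# prefix, account name and key with your own values.
store = ABSStore(container='my-container', prefix='zarr-demo',
                 account_name='my_account', account_key='...')
```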
Support for other cloud object storage services was already available via other packages, with Amazon S3 supported via the s3fs package, and Google Cloud Storage supported via the gcsfs package. Further notes on using cloud storage are available from the Zarr tutorial.
The attraction of cloud storage is that total I/O bandwidth scales linearly with the size of a computing cluster, so there are no technical limits to the size of the data or computation you can scale up to. Here’s a slide from a recent presentation by Ryan Abernathey showing how I/O scales when using Zarr over Google Cloud Storage:
One issue with using cloud object storage is that, although total I/O throughput can be high, the latency involved in each request to read the contents of an object can be >100 ms, even when reading from compute nodes within the same data centre. This latency can add up when reading metadata from many arrays, because in Zarr each array has its own metadata stored in a separate object.
To work around this, the 2.3 release adds an experimental feature to consolidate metadata for all arrays and groups within a hierarchy into a single object, which can be read once via a single request. Although this is not suitable for rapidly changing datasets, it can be good for large datasets which are relatively static.
To use this feature, two new convenience functions have been added. The consolidate_metadata() function performs the initial consolidation, reading all metadata and combining them into a single object. Once you have done that and deployed the data to a cloud object store, the open_consolidated() function can be used to read data, making use of the consolidated metadata.
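A rough sketch of the workflow, assuming store is any Zarr store such as the ones shown above:

```python
import zarr

# One-off consolidation step, run after the data has been written:
# reads the metadata for every array and group in the hierarchy and
# writes it into a single '.zmetadata' key in the store.
zarr.consolidate_metadata(store)

# Later, when reading (e.g. from a cloud object store), open the
# hierarchy using only the consolidated metadata object, avoiding one
# request per array or group.
root = zarr.open_consolidated(store)
```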
Support for the new consolidated metadata feature is also now available via xarray and intake-xarray (see this blog post for an introduction to intake), and many of the datasets in Pangeo’s cloud data catalog use Zarr with consolidated metadata.
Here’s an example of how to open a Zarr dataset from Pangeo’s data catalog via intake:
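Something like the following should work, though the catalog URL and entry name below are illustrative and the catalog layout may have changed since this post was written:

```python
import intake

# Open the Pangeo master catalog (location illustrative).
cat = intake.open_catalog(
    'https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/'
    'master/intake-catalogs/master.yaml'
)

# Open one of the Zarr-backed entries as an xarray dataset backed by
# dask arrays; the entry name here is illustrative.
ds = cat.ocean.sea_surface_height.to_dask()
print(ds)
```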
…and here’s the underlying catalog entry.
Around the same time that development on Zarr was getting started, a separate team led by Stephan Saalfeld at the Janelia Research Campus was experiencing similar challenges storing and computing with large amounts of neuroimaging data, and developed a software library called N5. N5 is implemented in Java but is very similar to Zarr in the approach it takes to storing both metadata and data chunks, and to decoupling the storage backend to enable efficient use of cloud storage.
There is a lot of commonality between Zarr and N5 and we are working jointly to bring the two approaches together. As a first experimental step towards that goal, the Zarr 2.3 release includes an N5 storage adapter which allows reading and writing of data on disk in the N5 format.
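Here's a rough sketch of writing an N5 container with the new adapter, assuming the N5Store class in zarr.n5 (GZip is used as the compressor because it is one of N5's standard codecs):

```python
import numpy as np
import zarr
from numcodecs import GZip
from zarr.n5 import N5Store

# Write an array to an N5 container on disk via the new storage adapter.
store = N5Store('example.n5')
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f8',
               store=store, compressor=GZip(), overwrite=True)
z[:] = np.random.random((1000, 1000))

# The data can now be read by N5 (Java) tooling, or reopened with Zarr.
z2 = zarr.open(store, mode='r')
```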
Zarr is intended to work efficiently across a range of different storage systems with different latencies and bandwidth, from cloud object stores to local disk and memory. In many of these settings, making efficient use of local memory, and avoiding memory copies wherever possible, can make a substantial difference to performance. This is particularly true within the Numcodecs package, which is a companion to Zarr and provides implementations of compression and filter codecs such as Blosc and Zstandard. A key aspect of achieving fewer memory copies has been to leverage the Python buffer protocol.
The Python buffer protocol is a specification for how to share large blocks of memory between different libraries without copying. The protocol was originally introduced in Python 2 and later revamped in Python 3 (with backports to Python 2.6 and 2.7). Because of these behavioural changes, and differences in which objects support which version of the protocol, it was challenging to leverage effectively in Zarr.
Thanks to some under-the-hood changes in Zarr 2.3 and Numcodecs 0.6, the buffer protocol is now cleanly supported on both Python 2 and Python 3 in both libraries when working with data. In addition to improved memory handling and performance, this should make it easier for users developing their own stores, compressors, and filters to use with Zarr. It has also cut down on the amount of code specialised for handling different Python versions.
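As a small illustration of what this enables, here's a sketch using the Zstandard codec from Numcodecs: any object exposing the buffer protocol (a NumPy array, bytes, bytearray, memoryview, and so on) can be handed to a codec directly, without first converting it to bytes.

```python
import numpy as np
from numcodecs import Zstd

codec = Zstd(level=1)

# Encode directly from the array's underlying buffer.
arr = np.arange(1_000_000, dtype='i4')
compressed = codec.encode(arr)

# Decode back to a buffer of bytes...
roundtrip = codec.decode(compressed)

# ...and reinterpret it as the original array without an extra copy.
restored = np.frombuffer(roundtrip, dtype='i4')
assert np.array_equal(arr, restored)
```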
There is a growing community of interest around new approaches to storage of array-like data, particularly in the cloud. For example, Theo McCaie from the UK Met Office Informatics Lab recently wrote a series of blog posts about the challenges involved in storing 200TB of “high momentum” weather model data every day. This is an exciting space to be working in and we’d like to do what we can to build connections and share knowledge and ideas between communities. We’ve started a regular teleconference which is open to anyone to join, and there is a new Gitter channel for general discussion.
The main focus of our conversations so far has been scoping the development of a new set of specifications that supports the features of both Zarr and N5 and provides a platform for exploring and developing new features, while also identifying a minimal core protocol that can be implemented in a range of different programming languages. It is still relatively early days and there are lots of open questions to work through, both on the technical side and in terms of how we organise and coordinate efforts. However, the community is very friendly and supportive, and anyone is welcome to participate, so if you have an interest please do consider getting involved.
If you would like to stay in touch with or contribute to new developments, keep an eye on the zarr and zarr-specs GitHub repositories, and please feel free to raise issues or add comments if you have any questions or ideas.
If you’re coming to SciPy this year, we’re very pleased to be giving a talk on Zarr on day 1 of the conference (Wednesday 10 July). Several members of the Zarr community will be at the conference, and there are sprints going on after the conference in a number of related areas, including an Xarray sprint on the Saturday. Please do say hi or drop us a comment on this issue if you’d like to connect and discuss anything.
Blog post written by Alistair Miles and John Kirkham.