Pre-release versions of Zarr Python 2.12 are now available! 🎉

This post gives an overview of the new features, in particular the newly added experimental support for reading and writing Zarr V3, the upcoming format for storing N-dimensional chunked compressed data. It also highlights other enhancements, such as creating an FSStore from an existing fsspec filesystem and faster appends to Zarr arrays stored on S3, along with bug fixes, documentation updates and a maintenance fix.

Add support for reading and writing Zarr V3

Zarr Python 2.12 provides experimental infrastructure for reading and writing the upcoming V3 spec of the Zarr format. Users wishing to prepare for the migration can set the environment variable ZARR_V3_EXPERIMENTAL_API to begin experimenting. However, data written with this API should not yet be considered final.

The new zarr._storage.v3 package has the necessary classes and functions for evaluating Zarr V3. Since the design is not finalised, the classes and functions are not automatically imported into the regular Zarr namespace.

The pre-release can be installed via: pip install --pre zarr.

How to create arrays using Zarr V3:

  • First, set ZARR_V3_EXPERIMENTAL_API=1 in your shell by typing this in your terminal:

export ZARR_V3_EXPERIMENTAL_API=1

  • Here’s a small code snippet for creating V3 arrays:
>>> import zarr
>>> z = zarr.create((10000, 10000),
...                 chunks=(100, 100),
...                 dtype='f8',
...                 compressor='default',
...                 path='path-where-you-want-zarr-v3-array',
...                 zarr_version=3)
  • Further, you can use z.info to see details about the array you just created:
>>> z.info
Name               : path-where-you-want-zarr-v3-array
Type               : zarr.core.Array
Data type          : float64
Shape              : (10000, 10000)
Chunk shape        : (100, 100)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr._storage.v3.KVStoreV3
No. bytes          : 800000000 (762.9M)
No. bytes stored   : 557
Storage ratio      : 1436265.7
Chunks initialized : 0/10000

The Store type field (zarr._storage.v3.KVStoreV3) confirms that the array is backed by a Zarr V3 store.

We see Chunks initialized: 0/10000 because we haven’t written anything to the array yet. The chunks will be initialized when we start writing data to the array.
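To see where the figures in the z.info output come from, here is a small dependency-free sketch of the chunk arithmetic. The helper chunks_touched is a hypothetical name used only for illustration and is not part of Zarr.

```python
import math

shape = (10000, 10000)
chunks = (100, 100)
itemsize = 8  # bytes per float64 ('f8') element

# Number of chunks along each dimension, rounding up for partial chunks.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
total_chunks = math.prod(chunks_per_dim)

# Uncompressed size of the whole array in bytes.
nbytes = math.prod(shape) * itemsize


def chunks_touched(selection, chunks):
    """Count the chunks overlapped by a tuple of (start, stop) ranges.

    Hypothetical helper for illustration only, not a Zarr function.
    """
    count = 1
    for (start, stop), c in zip(selection, chunks):
        first = start // c
        last = (stop - 1) // c
        count *= last - first + 1
    return count


print(total_chunks)  # 10000
print(nbytes)        # 800000000
print(chunks_touched(((0, 100), (0, 100)), chunks))  # 1
print(chunks_touched(((0, 150), (0, 100)), chunks))  # 2
```

The first two values match the "Chunks initialized" denominator and the "No. bytes" line above. The last two show how many chunks a write to a given region would touch.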

There have been significant changes to Zarr’s Python codebase to implement V3 functionality. Highlights of the main changes include:

  • A new function in store.py verifies that a key conforms to the V3 specification.
  • A new function in store.py ensures internally that Zarr stores are always instances of a class derived from Store, whose interface differs slightly from MutableMapping.
  • Metadata is now kept separate from the array data. Previously, metadata documents such as .zarray and .zgroup were stored alongside the chunks they describe.
  • Changes in convenience.py add a zarr_version argument for working with Zarr V3. Its default value is None, in which case the version is inferred from the store if possible and otherwise falls back to V2.
  • All metadata for groups and arrays within a given store can be consolidated into a single resource stored under a given key. The changes can be seen here in convenience.py.
  • Modifications in creation.py enable the creation of arrays using Zarr V3. If zarr_version is None, it is inferred from store or chunk_store; if that is not possible, it defaults to V2.
  • Updated meta.py with links to the new V3 data types. The V3 data types are listed here.
  • New tests added for all the new and modified features!
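The version fallback described for convenience.py and creation.py can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function resolve_zarr_version, the two stand-in store classes and the _store_version attribute are hypothetical names, not a statement of how Zarr implements the check.

```python
DEFAULT_ZARR_VERSION = 2  # the fallback described in the text


class StandInV2Store(dict):
    """Hypothetical stand-in for a V2 store."""
    _store_version = 2  # assumed attribute name, for illustration only


class StandInV3Store(dict):
    """Hypothetical stand-in for a V3 store."""
    _store_version = 3


def resolve_zarr_version(store=None, zarr_version=None):
    """Return the format version to use (illustrative, not Zarr's code).

    An explicit zarr_version wins. If it is None, try to infer the
    version from the store, and fall back to V2 when that is not possible.
    """
    if zarr_version is None:
        zarr_version = getattr(store, "_store_version", DEFAULT_ZARR_VERSION)
    if zarr_version not in (2, 3):
        raise ValueError(f"unsupported zarr_version: {zarr_version!r}")
    return zarr_version


print(resolve_zarr_version(StandInV3Store()))        # 3
print(resolve_zarr_version(StandInV2Store()))        # 2
print(resolve_zarr_version({}))                      # 2
print(resolve_zarr_version(None, zarr_version=3))    # 3
```

A plain dict carries no version information, so the sketch falls back to V2 for it, mirroring the behaviour described in the bullets above.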

If you’re interested in browsing through all of the code changes, please refer to PR #898.

The work on V3 was done by Gregory Lee and was funded by CZI. The Zarr Community extends its wholehearted gratitude to Gregory for completing this! 🙌

Appending performance improvement

The old implementation iterated through all the old chunks and removed those that didn’t exist in the new chunks. This caused significant delays when appending data to Zarr arrays on cloud storage such as S3.

The new implementation iterates through each dimension and only finds and removes the chunk slices that exist in the old data but not in the new data. It also introduces a mutable list to dynamically adjust the number of chunks along the already-processed dimensions, which avoids removing the same chunk twice.

This improvement was added by hailiangzhang with PR #1014.
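The difference between the two strategies can be sketched in plain Python. This is a simplified illustration working on chunk-grid shapes only, under the assumption that the text above describes the algorithm completely; the function names are hypothetical and no store is involved.

```python
import itertools


def chunks_to_remove_naive(old_grid, new_grid):
    """Old strategy: visit every old chunk index, keep those outside new_grid."""
    return [
        cidx
        for cidx in itertools.product(*(range(n) for n in old_grid))
        if not all(i < n for i, n in zip(cidx, new_grid))
    ]


def chunks_to_remove_per_dim(old_grid, new_grid):
    """New strategy: per dimension, visit only the slab that falls away."""
    working = list(old_grid)  # mutable list, shrunk as dimensions are processed
    removed = []
    for dim in range(len(working)):
        ranges = [
            # Along the current dimension, only indices beyond the new extent.
            range(new_grid[d], working[d]) if d == dim else range(working[d])
            for d in range(len(working))
        ]
        removed.extend(itertools.product(*ranges))
        # Shrink this dimension so later passes do not revisit removed chunks.
        working[dim] = min(working[dim], new_grid[dim])
    return removed


# Shrinking a 4x4 chunk grid to 2x2 removes 12 chunks either way.
print(len(chunks_to_remove_naive((4, 4), (2, 2))))    # 12
print(len(chunks_to_remove_per_dim((4, 4), (2, 2))))  # 12

# Appending grows the grid, so the per-dimension version visits nothing,
# while the naive version still walks all 10000 old chunk indices.
print(chunks_to_remove_per_dim((100, 100), (101, 100)))  # []
```

The append case is where the saving shows up: the array only grows, so every per-dimension range is empty and no chunk keys need to be checked against the store.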

Other enhancements

  • If you have created a fsspec filesystem outside of Zarr, you can now pass it as a keyword argument to FSStore.

This feature was added by Ryan Abernathey with PR #911.

  • Added a number encoder for json.dumps to support NumPy integers in chunks arguments.

This enhancement was added by Eric Prestat with PR #933 and the issue was raised by Mark Dickinson with #697.
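The idea behind such an encoder can be sketched with the standard library alone. This is an illustration rather than Zarr's own code: NumberLikeEncoder and StandInInt are hypothetical names, and StandInInt stands in for a NumPy integer scalar (NumPy registers its integer types with numbers.Integral) so that the example runs without NumPy.

```python
import json
import numbers


class NumberLikeEncoder(json.JSONEncoder):
    """Convert number-like objects that json cannot serialize into int or float."""

    def default(self, o):
        if isinstance(o, numbers.Integral):
            return int(o)
        if isinstance(o, numbers.Real):
            return float(o)
        return super().default(o)


class StandInInt:
    """Hypothetical stand-in for a NumPy integer scalar such as np.int64."""

    def __init__(self, value):
        self._value = value

    def __int__(self):
        return self._value


# Register the stand-in as an integral type, as NumPy does for its integers.
numbers.Integral.register(StandInInt)

meta = {"chunks": [StandInInt(100), StandInInt(100)]}
encoded = json.dumps(meta, cls=NumberLikeEncoder)
print(encoded)  # {"chunks": [100, 100]}
```

Without the cls argument, json.dumps raises a TypeError for the same input, which is the kind of failure this enhancement addresses.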

Bugs, Documentation and Maintenance

Fix bug that made it impossible to create an FSStore on unlistable filesystems (e.g. some HTTP servers) by Ryan Abernathey; #993.

Update resize doc to clarify surprising behaviour by hailiangzhang; #1022.

Pre-commit configuration now includes YAML check by Shivank Chaudhary; #1015 & #1016.

More information

Details on these features, as well as the full list of changes in 2.12.0a1 and 2.12.0a2, are available in the release notes.

Appreciation 🙌🏻

Before the pre-releases 2.12.0a1 and 2.12.0a2, there were releases 2.11.1, 2.11.2 and 2.11.3 of the Zarr Python package. A special shout-out to all the contributors who made the previous releases possible:

Also, a huge thanks to the contributors who made the current pre-releases 2.12.0a1 and 2.12.0a2 possible! 🙌🏻

If you find the above features useful and end up using them, please mention @zarr_dev on Twitter and tweet using #ZarrData, and we’ll make sure to get it featured! ✌🏻

~Sanket Verma