Pre-release of Zarr Python 2.12
Pre-release versions of Zarr Python 2.12 are now available! 🎉
This blog post gives an overview of the new features, especially the newly added
experimental support for reading and writing Zarr V3, the upcoming
format for storing N-dimensional chunked, compressed data.
It also highlights other enhancements, such as creating an FSStore from an
existing fsspec filesystem, a performance improvement when appending data to
Zarr arrays on S3, bug fixes, a documentation update and a maintenance fix.
Add support for reading and writing Zarr V3
Zarr Python 2.12 provides experimental infrastructure for reading and writing
the upcoming V3 spec of the Zarr format. Users wishing to prepare for the
migration can set the environment variable ZARR_V3_EXPERIMENTAL_API to begin
experimenting. However, data written with this API should not yet be considered
final.
The new zarr._storage.v3 package has the necessary classes and functions for
evaluating Zarr V3. Since the design is not finalised, the classes and
functions are not automatically imported into the regular Zarr namespace.
The pre-release can be installed via: pip install --pre zarr
How to create arrays using Zarr V3:
- First, set ZARR_V3_EXPERIMENTAL_API=1 in your shell by typing this in your terminal:
export ZARR_V3_EXPERIMENTAL_API=1
- Here’s a small code snippet for creating V3 arrays:
>>> import zarr
>>> z = zarr.create((10000, 10000),
...                 chunks=(100, 100),
...                 dtype='f8',
...                 compressor='default',
...                 path='path-where-you-want-zarr-v3-array',
...                 zarr_version=3)
- Further, you can use z.info to see details about the array you just created:
>>> z.info
Name : path-where-you-want-zarr-v3-array
Type : zarr.core.Array
Data type : float64
Shape : (10000, 10000)
Chunk shape : (100, 100)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr._storage.v3.KVStoreV3
No. bytes : 800000000 (762.9M)
No. bytes stored : 557
Storage ratio : 1436265.7
Chunks initialized : 0/10000
You can also check the Store type field here, which indicates Zarr V3.
We see Chunks initialized: 0/10000 because we haven’t written anything to our array yet. The chunks will be initialized when we start writing data to the array.
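The numbers in the z.info output can be checked by hand. Here is a small sketch using only the Python standard library; the variable names are illustrative and not part of Zarr. It derives the chunk count, the uncompressed size and the storage ratio from the shape, chunk shape and dtype shown above. The 557 stored bytes are copied from that output and may differ on your machine.

```python
import math

shape = (10000, 10000)
chunks = (100, 100)
itemsize = 8          # bytes per float64 ('f8') element
bytes_stored = 557    # copied from the z.info output above

# Number of chunks along each dimension, rounding up for partial chunks.
chunk_grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
total_chunks = math.prod(chunk_grid)

# Uncompressed size of the full array in bytes.
nbytes = math.prod(shape) * itemsize

# Ratio of the logical size to what is actually stored.
storage_ratio = nbytes / bytes_stored

print(chunk_grid)               # (100, 100)
print(total_chunks)             # 10000
print(nbytes)                   # 800000000
print(round(storage_ratio, 1))  # 1436265.7
```

The storage ratio is so large only because no chunk has been written yet; it will drop once data is stored.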
There have been significant changes to Zarr’s Python codebase to implement V3 functionality. Highlights of the main changes include:
- A new function in store.py verifies that a key conforms to the V3 specification.
- A function was added in store.py to ensure internally that Zarr stores are always a class with a specific interface derived from Store, which is slightly different from MutableMapping.
- Metadata files are now separated from the data (arrays). Previously, metadata and arrays were stored together in a consolidated group known as .zgroup.
- Changes in convenience.py to use Zarr V3. The default value of zarr_version is None; it will attempt to infer the version from store if possible; otherwise, it will fall back to V2.
- Consolidating all metadata for groups and arrays within the given store into a single resource and putting it under the given key. The changes can be seen here in convenience.py.
- Modification in creation.py, which enables the creation of an array using Zarr V3. If zarr_version is None, it will be inferred from store or chunk_store; otherwise, it defaults to V2.
- Updated meta.py with links to the new V3 data types. The V3 data types are listed here.
- New tests added for all the new and modified features!
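The version fallback described in the list can be pictured with a minimal sketch. It does not reproduce Zarr's code: the helper name resolve_zarr_version, the _store_version attribute and the FakeV3Store class are all assumptions made for illustration.

```python
DEFAULT_ZARR_VERSION = 2  # fall back to V2 when nothing else is known


def resolve_zarr_version(store=None, zarr_version=None):
    """Pick a format version: explicit argument, then store hint, then V2.

    Hypothetical helper; `_store_version` is an assumed attribute name.
    """
    if zarr_version is None:
        # Try to infer the version from the store, else use the default.
        zarr_version = getattr(store, "_store_version", DEFAULT_ZARR_VERSION)
    if zarr_version not in (2, 3):
        raise ValueError(f"unsupported zarr_version: {zarr_version!r}")
    return zarr_version


class FakeV3Store(dict):
    """Stand-in for a store that advertises itself as V3."""

    _store_version = 3


print(resolve_zarr_version(FakeV3Store()))       # 3 (inferred from the store)
print(resolve_zarr_version({}))                  # 2 (fallback)
print(resolve_zarr_version({}, zarr_version=3))  # 3 (explicit argument)
```

Under this reading, existing V2 code that never passes zarr_version keeps behaving as before.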
If you’re interested in browsing through all of the code changes, please refer to PR #898.
The work on V3 was done by Gregory Lee and was funded by the CZI. The Zarr Community extends its wholehearted gratitude to Gregory for completing this! 🙌
Appending performance improvement
The old implementation iterated through all the old chunks and removed those
that didn’t exist in the new chunks. As a result, appending data to Zarr arrays
on cloud storage services like S3 incurred significant delays.
The new implementation iterates through each dimension and finds and removes
only the chunk slices that exist in the old data but not in the new data. It
also introduces a mutable list to dynamically adjust the number of chunks along
the already-processed dimensions to avoid duplicate chunk removal.
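The per-dimension idea can be sketched in a few lines of plain Python. This is an illustration of the approach as described here, not a copy of Zarr's code; the function names and the representation of a chunk grid as a tuple of chunk counts per dimension are assumptions.

```python
import itertools


def chunks_to_remove(old_grid, new_grid):
    """Return indices of chunks in old_grid that fall outside new_grid.

    old_grid / new_grid give the number of chunks along each dimension.
    """
    working = list(old_grid)  # mutable copy, shrunk as dimensions are processed
    removed = []
    for dim, (n_old, n_new) in enumerate(zip(old_grid, new_grid)):
        # Along `dim`, take only the slice of chunk indices that disappears;
        # along every other dimension, take the current working extent.
        ranges = [
            range(n_new, n_old) if i == dim else range(working[i])
            for i in range(len(working))
        ]
        removed.extend(itertools.product(*ranges))
        # Shrink this dimension so later passes do not revisit these chunks.
        working[dim] = min(n_old, n_new)
    return removed


def chunks_to_remove_naive(old_grid, new_grid):
    """Old approach in spirit: visit every old chunk and test it."""
    return [
        idx
        for idx in itertools.product(*(range(n) for n in old_grid))
        if any(i >= n for i, n in zip(idx, new_grid))
    ]


# Shrinking a 3x3 chunk grid to 2x2 removes 5 chunks, each exactly once.
print(sorted(chunks_to_remove((3, 3), (2, 2))))
# Appending along the first dimension yields no chunks to remove.
print(chunks_to_remove((100, 100), (101, 100)))  # []
```

In the append case every "disappearing" range is empty, so no candidate chunk keys are generated at all, whereas the naive version still walks all 10,000 old chunk indices. On a remote store, where each check can be a network round trip, that difference is what matters.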
This improvement was added by hailiangzhang with PR #1014.
Other enhancements
- If you have created an fsspec filesystem outside of Zarr, you can now pass it
as a keyword argument to FSStore.
This feature was added by Ryan Abernathey with PR #911.
- Added a number encoder for json.dumps to support NumPy integers in chunks
arguments.
This enhancement was added by Eric Prestat with PR #933; the issue was raised by Mark Dickinson in #697.
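To see why a custom encoder is needed, note that json.dumps rejects any number that is not a built-in int or float. Below is a minimal standard-library sketch of the general technique; it is not necessarily how Zarr implements it. FakeNumpyInt is a hypothetical stand-in for a NumPy integer scalar so that the example runs without NumPy, and the name NumberEncoder is chosen here for illustration.

```python
import json
import numbers


class NumberEncoder(json.JSONEncoder):
    """JSON encoder that accepts any registered integral or real number type."""

    def default(self, o):
        if isinstance(o, numbers.Integral):
            return int(o)
        if isinstance(o, numbers.Real):
            return float(o)
        return super().default(o)


class FakeNumpyInt:
    """Hypothetical stand-in for a NumPy integer scalar (not a real NumPy type)."""

    def __init__(self, value):
        self._value = value

    def __int__(self):
        return self._value


# NumPy registers its integer scalars with numbers.Integral in the same way.
numbers.Integral.register(FakeNumpyInt)

meta = {"chunks": [FakeNumpyInt(100), 100]}
print(json.dumps(meta, cls=NumberEncoder))  # {"chunks": [100, 100]}
```

Without cls=NumberEncoder, the same json.dumps call raises a TypeError because the stand-in object is not JSON serializable.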
Bugs, Documentation and Maintenance
Fix bug that made it impossible to create an FSStore on unlistable filesystems (e.g. some HTTP servers) by Ryan Abernathey; #993.
Update resize doc to clarify surprising behaviour by hailiangzhang; #1022.
Pre-commit configuration now includes a YAML check by Shivank Chaudhary; #1015 & #1016.
More information
Details on these features as well as the full list of all changes in 2.12.0a2 & 2.12.0a1 are available in the release notes.
Appreciation 🙌🏻
Before the pre-release versions 2.12.0a2 and 2.12.0a1, there were releases
2.11.1, 2.11.2, & 2.11.3 of the Zarr Python package. A special shout-out to all
the contributors who made the previous releases possible.
Also, a huge thanks to the contributors who made the current versions 2.12.0a2
and 2.12.0a1 possible! 🙌🏻
If you find the above features useful and end up using them, please mention @zarr_dev on Twitter and tweet using #ZarrData, and we’ll make sure to get it featured! ✌🏻
~Sanket Verma