ZEP 9 — Zarr Extension Naming
Authors:
- Norman Rzepka, scalable minds
- Josh Moore, German BioImaging
Status: Draft
Type: Specification
Created: 2025-01-21
Abstract
The Zarr v3 specification is unclear on several matters regarding the definition of extensions in the zarr.json metadata. This proposal defines immediate clarifications for the specification which will allow the centralized creation of “raw” names as well as the decentralized use of URIs for extensions. At the same time, we propose a longer-term strategy for evolving extensions and extension points for discussion and potential approval by the community to be rolled out to implementations at a future time.
Introduction
The intention of the Zarr v3 spec was to provide a framework where community-driven extensions can be created and managed by the community itself. However, the process for that was unclear and, therefore, implicitly linked to the ZEP process. Issues have been surfaced by users, who wanted to create and use new extensions. For example, users wanted to create new codecs, data types and an extension for consolidated metadata. We thank the members of the community for their efforts in discovering and communicating these issues.
The underlying problem is that the extension mechanism for v3 lacks detail in its definition. While several extension points are defined in the core spec, there is no advice about selecting names for these extensions or how naming conflicts are avoided. Additionally, there is no mechanism for defining new extensions that do not fit into the existing extension points. In fact, there are contradictions in what is entirely permissible, including for example whether codecs are within or required by the core v3 spec.
Implementations have started to use the v3 spec and are making use of extensions (e.g. numcodecs.* codecs and zstd in zarr-python), which means that any changes made to the v3 spec need to be compatible with the current reality in the community.
Proposal
This proposal has three phases.
The first phase clarifies that extensions can be created by the community without the ZEP process. Extensions with URI-based names can be created without any further coordination, and extensions with raw names only need to get the name registered in the zarr-extensions GitHub repository. The process for registering will be a lightweight PR-based process.
The naming mechanism, as well as the clarifications to the core spec documented here, will be implemented by the ZSC through PR330 in the zarr-specs repository, for discussion concurrently with this ZEP. The changes we are proposing, marked with a “🛠️” below, are interpretations of the current Zarr v3 core spec as well as additions that are in line with the spec evolution policy of the Zarr core spec. The Zarr core spec document will be bumped to version 3.1. The metadata in the zarr.json files will remain unchanged (zarr_format=3).
The second phase defines new extension points and therefore will follow the active ZEP process. The goal is to encourage more exploration by the Zarr community outside of what’s currently defined with the core specification.
The third phase defines a possible future evolution of the naming strategy, largely as a justification for decisions made in the first two phases. It is non-normative and can be further discussed in future ZEPs.
Revisions to ZEP 0 will be paused until we are clear about the extension mechanism and the implications for the role of the ZEP process in the future. The ZEP process remains in place for changes to the Zarr v3 core spec including the adoption of extensions into the core spec.
Definitions
- “Core” refers to the Zarr v3 core specification as defined in the document https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html. It does not necessarily indicate whether something is a MUST for implementations.
- “Metadata” is the Zarr metadata as stored in zarr.json files for arrays and groups.
- “Extensions” as used in this document are components used in the metadata to define and configure how metadata are interpreted by implementations. These components include codecs, data types, chunk key encodings, chunk grids and storage transformers.
- “Extension points” are locations within metadata where extension-related metadata can be found. Current extension points are listed in the core spec, e.g. codecs, data_type. As part of the second phase, we intend to add another general-purpose extension point called extensions.
- “Extension maintainers” are a person or a team that is responsible for creating and maintaining an extension.
Phase 1: Immediate clarifications
Extension naming
🛠️ We propose defining two categories of names for immediate use by extensions: raw names and URI-based.
- Raw names MUST be assigned within a central repository and follow the compatibility and versioning v3 stability policy. The name assignment is managed through the zarr-extensions GitHub repository, where each extension is listed and either contains a spec document or links to a spec document. Names are never unassigned or reassigned. The ZSC, or by delegation a maintainer team, reserves the right to refuse name assignment at its own discretion.
  - Example: zstd
  - Accepted regex: ^[a-z0-9-_.]+$
- URI-based names can be used by anyone without further coordination, though the assumption is that users reasonably “own” the URI. Users MAY make use of a persistent redirecting URL like PURL. URIs have been chosen due to their potential for being self-documenting. The URL SHOULD resolve to a human-readable explanation of the extension, but implementations SHOULD NOT attempt to resolve the URL during processing. There are no guarantees in terms of versioning or compatibility. However, preserving backwards-compatibility is strongly encouraged. See the versioning section below.
  - Example: https://example.com/zarr3/consolidated-metadata
  - Accepted regex: ^https?:\/\/[^/?#]+[^?#]*$
All names currently listed in the v3 specification are raw names. URIs were mentioned in previous drafts of the v3 specification and are still referenced under the codecs section:
“Each codec must be defined via a separate specification. In order to refer to codecs in array metadata documents, each codec must have a unique identifier, which is a URI that dereferences to a human-readable specification of the codec.” (https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#specifying-codecs).
🛠️ This proposal would relax the MUST requirement on the dereferencing of the URI to a SHOULD and more widely specify that implementations MAY use URI-based key names throughout the specification. (See community registry below for a proposal on coordinating the use of URIs.)
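The two name categories can be told apart mechanically with the regular expressions quoted above. The following is a minimal sketch, not part of the proposal: the function name classify_extension_name is hypothetical, and only the two patterns are taken from the text.

```python
import re

# Patterns copied from the "Accepted regex" entries in this section.
RAW_NAME_RE = re.compile(r"^[a-z0-9-_.]+$")
URI_NAME_RE = re.compile(r"^https?:\/\/[^/?#]+[^?#]*$")


def classify_extension_name(name: str) -> str:
    """Return "raw" or "uri" for an accepted name; raise ValueError otherwise.

    Hypothetical helper: the proposal defines the patterns, not this function.
    """
    # fullmatch avoids accepting a trailing newline, which "$" alone would allow.
    if RAW_NAME_RE.fullmatch(name):
        return "raw"
    if URI_NAME_RE.fullmatch(name):
        return "uri"
    raise ValueError(f"not an accepted extension name: {name!r}")
```

Under these patterns, zstd is classified as a raw name and https://example.com/zarr3/consolidated-metadata as a URI-based name, while a name containing uppercase letters or a query string is rejected. Note that the check is purely syntactic: it never dereferences the URL, in keeping with the SHOULD NOT above.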
Extension definitions
Extensions are defined in the zarr.json metadata either as objects or as short-hand names.
🛠️ In order to unify the general design of future extensions we would add a MUST requirement for future extension objects to adhere to the following definition.
Objects have the following keys:
{
"name": "<name>",
"configuration": { ... } # optional
}
If such an object is present, the field must_understand is implicitly set to True. Implementations which encounter a must_understand extension that they have not implemented MUST raise an exception. An extension object can explicitly set must_understand=False if implementations can ignore its presence, following the current guidelines in the v3 specification.
Instead of extension objects, short-hand names MAY continue to be used if no configuration metadata is required. They would be equivalent to extension objects with just a name key. This is in line with the current wording of the spec.
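The equivalence between short-hand names and extension objects, together with the implicit must_understand default, can be sketched as a small normalization step. This is an illustrative sketch only: the Extension tuple and the functions normalize_extension and require_supported are hypothetical names, not part of any Zarr implementation.

```python
from typing import Any, Dict, NamedTuple, Set, Union


class Extension(NamedTuple):
    name: str
    configuration: Dict[str, Any]
    must_understand: bool


def normalize_extension(entry: Union[str, Dict[str, Any]]) -> Extension:
    """Turn a short-hand name or an extension object into one uniform shape."""
    if isinstance(entry, str):
        # A short-hand name is equivalent to an object with just a "name" key.
        return Extension(entry, {}, True)
    if not isinstance(entry, dict) or not isinstance(entry.get("name"), str):
        raise ValueError(f"not an extension definition: {entry!r}")
    return Extension(
        name=entry["name"],
        configuration=entry.get("configuration", {}),
        # must_understand is implicitly true unless explicitly set to false.
        must_understand=entry.get("must_understand", True),
    )


def require_supported(ext: Extension, supported: Set[str]) -> bool:
    """Return True if the extension can be used, False if it may be ignored.

    Raises NotImplementedError for an unsupported must_understand extension.
    """
    if ext.name in supported:
        return True
    if ext.must_understand:
        raise NotImplementedError(f"unsupported extension: {ext.name}")
    return False
```

With this shape, normalize_extension("zstd") and normalize_extension({"name": "zstd"}) produce the same value, which is the equivalence the text describes.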
Extension specifications
There is no strict requirement for extensions to have a formal specification. However, for adoption in the community it is STRONGLY RECOMMENDED to write a specification.
For an extension with a raw name, the zarr-extensions repository SHOULD be used to either publish the specification directly or link to another location which does so. For extensions with URI-based names, it is RECOMMENDED to publish the specification under the URI of the extension.
Extension points
The v3 core spec defines a number of extension points for arrays and groups that can hold extensions following the above recommendations:
- data_type (array only)
- codecs (array only)
- chunk_grid (array only)
- chunk_key_encoding (array only)
- storage_transformers (array only) // extendable to groups
The Zarr v3 core spec was originally designed to not contain any extensions, such as data types, codecs etc. However, over time and through changing authorship, some extensions were, in fact, listed in the core spec document.
🛠️ We propose to acknowledge that the Zarr v3 core spec includes a few extensions that are expected to be supported by all implementations and fix the wording in the core spec document accordingly. Wording in each section will be updated to clarify whether or not an implementation MUST support the extension point. Other extensions become “core” by being listed in the core spec document through a ZEP. The use of a “raw name” does NOT automatically make an extension “core”.
The following extensions are (currently) listed or referred to in the core spec and would be declared as “core” extensions under this proposal.
- Data types: (u)int{8,16,32,64}, float{32,64}, complex{64,128}, r*
- Codecs: bytes, transpose
- Chunk grids: regular
- Chunk key encoding: default, v2
- Storage transformers: (none)
Example
The following example represents an Array showing many of the proposed changes described above:
{
"zarr_format": 3,
"data_type": "https://example.com/zarr/string", // URI-based name, short-hand name
"chunk_key_encoding": {
"name": "default", // core
"configuration": { "separator": "." }
},
"codecs": [
{
"name": "https://numcodecs.dev/vlen-utf8" // URI-based name
},
{
"name": "zstd", // raw name
"configuration": { ... }
}
],
"chunk_grid": {
"name": "regular", // core
"configuration": { "chunk_shape": [ 32 ] }
},
"shape": [ 128 ],
"dimension_names": [ "x" ],
"attributes": { ... },
"storage_transformers": []
}
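To make the example concrete, the following sketch walks the existing extension points of an array metadata document and lists the extension name found at each. It is an illustration under stated assumptions: iter_extension_names is a hypothetical helper, and the dictionary is a cleaned copy of the example above with comments removed and the elided zstd configuration and attributes left out rather than invented.

```python
from typing import Any, Dict, Iterator, Tuple

# Extension points that hold a single extension versus a list of extensions.
SINGLE_POINTS = ("data_type", "chunk_grid", "chunk_key_encoding")
LIST_POINTS = ("codecs", "storage_transformers")


def _name_of(entry: Any) -> str:
    # Short-hand names are plain strings; objects carry a "name" key.
    return entry if isinstance(entry, str) else entry["name"]


def iter_extension_names(metadata: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (extension point, extension name) pairs found in the metadata."""
    for point in SINGLE_POINTS:
        if point in metadata:
            yield point, _name_of(metadata[point])
    for point in LIST_POINTS:
        for entry in metadata.get(point, []):
            yield point, _name_of(entry)


array_metadata = {
    "zarr_format": 3,
    "data_type": "https://example.com/zarr/string",
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "."}},
    "codecs": [
        {"name": "https://numcodecs.dev/vlen-utf8"},
        {"name": "zstd"},  # configuration elided, as in the example
    ],
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [32]}},
    "shape": [128],
    "dimension_names": ["x"],
    "storage_transformers": [],
}

for point, name in iter_extension_names(array_metadata):
    print(f"{point}: {name}")
```

An implementation could feed each yielded name into its own registry lookup to decide whether the array can be opened.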
Discussion
Community registry
Externally to this ZEP, we will work towards unification of these extension points. We propose that the community register and discuss their extensions on the zarr-extensions repository. Eventually, we recommend that a maturity ranking be included in those listings as in other plugin ecosystems.
Reassigning or unassigning names
We consider it a design goal to not allow datasets written with one set of naming expectations to be unintentionally interpreted with other names by a future version of an implementation. Without further mechanics, this means that raw and URI-based names, once assigned, cannot be changed. (For a possible evolution of this naming scheme, see phase 3 below.) With this limitation, current implementations either already support the above proposal or can quickly be updated to do so, since no additional logic is necessary beyond checking must_understand.
Name assignment
Raw names will be assigned through the zarr-extensions repository. While the Zarr steering council will initially maintain this repository, it is intended that a community team will be formed to maintain the repository long-term.
An alternative would be to use the zarr-specs repository to deposit spec documents for every assigned name. The process would be that extension maintainers would open a PR (with a template) to register their desired name. Extension maintainers could choose to publish their extension’s spec in the repository directly or link to an externally hosted spec. Specs hosted in zarr-specs would be updated through PRs. A zarr-specs maintainer team would be required to ensure a timely assignment of names and updates to the specs.
A third option would be to only store the names of the extensions in the zarr-specs repository and always link out to an externally hosted spec. In that case, externally hosted could also be another repo under zarr-developers that manages some extensions through its own maintenance structure.
Review process
To register an extension with a raw name, a community member needs to open a new PR in the zarr-extensions repository.
Each extension MUST have a README.md file that describes the extension and its metadata specification. Extensions SHOULD have a schema.json file that contains the JSON schema for the metadata, if the README.md does not provide a link to an external schema. Please note that all extension documents will be licensed under the Creative Commons Attribution 3.0 Unported License. Only open a PR if you are willing to license your extension under this license.
The PR will be reviewed by the Zarr steering council. We aim to be very open about registering extensions. The review will be done largely based on avoiding confusing extension names and preventing malicious activity as well as maintaining the formal requirements of the extensions. Extension maintainers are responsible for their extensions. Updates to the extensions will also be reviewed by the steering council.
Phase 2: New extension points
Beyond the immediate clarifications that are outlined in phase 1, we believe that additional extension points will further improve the Zarr v3 specification. Since these introduce new interfaces that implementations SHOULD be aware of, we discuss them here for consideration as a ZEP.
Using must_understand
The v3 core spec allows for the addition of new metadata keys as part of spec evolution. This is defined in the “Extension Points” section of the core spec. Currently, implementations MUST parse the entire metadata and fail if they find keys they cannot parse, unless those are marked by must_understand=False. We propose using this mechanism to evolve the spec and add the keys that are necessary to achieve a well-defined extension mechanism.
New keys in the metadata MAY contain objects that have a must_understand key with value false. In that case, the key may be ignored by implementations that cannot parse it. This is useful for extensions that aren’t strictly required for interaction with the data. If the new key holds a scalar value or an array, or does not contain the must_understand key, it is implicitly must_understand=True. In that case, implementations MUST fail if they cannot parse the key.
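The rule for unknown top-level keys can be sketched as follows. This is a minimal illustration, assuming a hypothetical implementation that knows a fixed set of keys; the function name ignorable_unknown_keys and the key names used in the usage note are made up for the example.

```python
from typing import Any, Dict, List, Set


def ignorable_unknown_keys(metadata: Dict[str, Any], known_keys: Set[str]) -> List[str]:
    """Return the unknown keys that may be ignored; fail on any other unknown key.

    An unknown key may be ignored only if its value is an object that
    explicitly carries "must_understand": false. Scalars, arrays and objects
    without that marker are implicitly must_understand=true.
    """
    ignorable = []
    for key, value in metadata.items():
        if key in known_keys:
            continue
        if isinstance(value, dict) and value.get("must_understand") is False:
            ignorable.append(key)
        else:
            raise ValueError(f"metadata key {key!r} must be understood but is unknown")
    return ignorable
```

For instance, with known keys zarr_format and node_type, a document that adds "stats": {"must_understand": false} is accepted and the key is reported as ignorable, whereas a document that adds "offset": [12] is rejected because an array value cannot opt out.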
“extensions” extension point
🛠️ To provide for more flexible, immediate, and de-centralized use cases, we propose to also add another general-purpose extension point extensions on both arrays and groups into which extensions MAY be added.
The extensions object holds an array of extension definitions. The held array MUST either have one or more extensions or the object MUST be omitted entirely.
The key itself is implicitly must_understand=True. Implementations MAY set must_understand=False if they can reliably determine that all extensions are also must_understand=False.
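A reader-side check of the extensions extension point might look like the sketch below. It assumes that entries follow the extension definition from phase 1 (an object with a name key, or a short-hand name); the function name active_extensions and the supported set are hypothetical.

```python
from typing import Any, Dict, List, Set


def active_extensions(metadata: Dict[str, Any], supported: Set[str]) -> List[str]:
    """Return the names of supported extensions listed under "extensions".

    Raises ValueError for an empty or malformed "extensions" value and
    NotImplementedError for an unsupported must_understand extension.
    """
    if "extensions" not in metadata:
        return []  # omitting the key entirely is allowed
    entries = metadata["extensions"]
    if not isinstance(entries, list) or not entries:
        raise ValueError("'extensions' must hold one or more extensions or be omitted")
    active = []
    for entry in entries:
        if isinstance(entry, str):
            name, required = entry, True
        else:
            name = entry["name"]
            required = entry.get("must_understand", True)
        if name in supported:
            active.append(name)
        elif required:
            raise NotImplementedError(f"unsupported extension: {name}")
        # Unsupported extensions with must_understand=false are skipped.
    return active
```

The design choice here is to fail early: an implementation that cannot honor a required extension refuses to open the node rather than risk misinterpreting the data.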
Additional extension points
We support the creation of additional extension points in the future but their introduction should follow the ZEP process. In general, the overall number of “core” extension points should be well-maintained and provide clear APIs which can be implemented by a large number of libraries. ZEPs are the appropriate mechanism to encourage a wide variety of opinions and consensus building.
Versioning and spec evolution
We propose leaving the versioning of the core spec unchanged. That means the value of zarr_format remains 3, new keys added to the metadata MUST be understood by implementations (i.e. they MUST fail if they find a key they don’t know), and all changes to the core spec MUST go through the ZEP process.
Extensions SHOULD follow the stability policy defined in the Zarr v3 core spec. However, this is not a strict requirement and extension maintainers can define their own evolution processes.
It is recommended that extensions evolve in a backwards-compatible manner without explicitly stored versions, meaning (credit to Jeremy Maitin-Shepard):
- Any metadata compatible with a previous version of the extension continues to be correctly interpreted by implementations of the new version.
- New metadata written according to the new version of the extension either: (a) is correctly interpreted by existing implementations of previous versions of the extension, or (b) causes existing implementations of previous versions of the extension to report an error and not load it.
- New metadata written according to the new version of the extension MUST NOT be successfully loaded by existing implementations of previous versions of the extension with an incorrect interpretation.
While it is recommended to maximize backwards-compatible changes, it is also possible to evolve the extension in an intentionally backwards-incompatible way, e.g.:
- Choose a new name, e.g. append a version number to the existing name.
- Add a new key to the configuration (like version) that was disallowed by the previous version of the extension, such that existing implementations will fail when they encounter it.
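The second option relies on existing implementations rejecting configuration keys they do not know. A minimal sketch of such strict parsing is shown below for a hypothetical first version of an offset extension; the function name and the configuration layout are assumptions made for illustration.

```python
from typing import Any, Dict, List


def parse_offset_configuration(configuration: Dict[str, Any]) -> List[int]:
    """Parser for version 1 of a hypothetical "offset" extension.

    Version 1 allows only the "offset" key, so metadata written by a later,
    incompatible version that adds e.g. "version" is rejected instead of
    being silently misinterpreted.
    """
    allowed = {"offset"}
    unexpected = set(configuration) - allowed
    if unexpected:
        raise ValueError(f"unsupported configuration keys: {sorted(unexpected)}")
    return list(configuration["offset"])
```

Parsing {"offset": [12]} succeeds, while {"offset": [12], "version": 2} raises an error, which is case (b) in the list above: the old implementation reports an error and does not load the data.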
Example
The following example represents an Array showing many of the proposed changes described above:
{
"zarr_format": 3,
...,
"extensions": [ // new general-purpose extension point
{
"name": "https://example.com/zarr/offset", // uri-based name
"configuration": { "offset": [ 12 ] }
},
{
"name": "https://example.com/zarr/array-statistics", // uri-based name
"configuration": {
"min": 5,
"max": 12
},
"must_understand": false // optional extension
},
{
"name": "https://example.com/zarr/consolidated-metadata", // uri-based name
"configuration": { ... },
"must_understand": false // optional extension
}
]
}
Discussion
Alternatives for the extensions extension point
This proposal contains a new general-purpose extension point extensions, which holds an array of extensions. This design allows the same extension definition syntax to be used across all extension points. It also avoids using URIs as keys in JSON metadata and reduces pollution of the top-level namespace in a zarr.json. Thus, the addition of top-level metadata keys remains reserved to changes in the core spec. This MAY happen as part of the core spec adopting functionality of an extension.
There are alternative designs to be considered:
Top-level metadata keys
Instead of a general-purpose extension point, we could also add new top-level extension keys to the metadata.
{
"zarr_format": 3,
...
"https://example.com/zarr/offset": { "offset": [ 12 ] },
"https://example.com/zarr/array-statistics": {
"min": 5,
"max": 12
},
"https://example.com/zarr/consolidated-metadata": {
"must_understand": false,
...
}, // optional extension
...
}
In this case, there would be no explicit configuration key within an extension definition, but instead all the keys of such a configuration would be in the object itself. This would mean that there are two separate types of extension definitions, i.e. {"name":"<name>", "configuration": {...}} in specialized extension points (e.g. codecs) and "<name>": {...} for other extensions.
It would still be recommended to use an object with keys for the extension definition to allow for evolution of the extension.
In case an extension becomes adopted into the core spec, implementations wouldn’t need to be changed (only when changing the name from URI-based to raw name).
There has been some controversy about using URLs as keys in JSON metadata. However, it has also been used effectively in formats such as JSON-LD (see below).
extensions object
Instead of an array that holds the extension definitions, we could also use an object.
{
"zarr_format": 3,
...
"extensions": {
"https://example.com/zarr/offset": { "offset": [ 12 ] },
"https://example.com/zarr/array-statistics": {
"min": 5,
"max": 12
},
"https://example.com/zarr/consolidated-metadata": {
"must_understand": false,
...
} // optional extension
},
...
}
This alternative is similar to the top-level keys, with mostly the same implications.
However, this alternative would reserve the top-level namespace for changes to the core spec and, therefore, reduce pollution of the top-level namespace.
Application to subnodes
Conceptually, we propose that extensions defined on groups should be valid for their child nodes. However, the details of how an implementation should identify which extensions are active within a hierarchy are unclear. Relying on traversing the hierarchy towards the root node is undesirable from a performance point of view. Writing some metadata within the contained subgroups and arrays could make this easier. Options for what this metadata could be include:
- A copy of the metadata
{
"extensions": [
{
"name": "https://example.com/my-extension",
"configuration": { ... full copy of the metadata ...}
}
]
}
- A reference to the metadata as part of the extension itself
{
"extensions": [
{
"name": "https://example.com/my-extension",
"configuration": {
"reference": "../.."
}
}
]
}
- A complementary reference extension
{
"extensions": [
{
"name": "https://example.com/my-extension-ref",
"configuration": {
"reference": "../.."
}
}
]
}
- A shared or even core reference extension
{
"extensions": [
{
"name": "https://zarr.dev/extensions/parent-ref",
"configuration": {
"reference": "../.."
}
}
]
}
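For the reference-based options, an implementation would need to turn a value such as "../.." into the path of the node that holds the extension metadata. The proposal does not define these semantics; the sketch below assumes that the reference is a POSIX-style relative path interpreted from the referencing node, and that references cannot escape the root of the hierarchy. The function name resolve_reference is hypothetical.

```python
import posixpath


def resolve_reference(node_path: str, reference: str) -> str:
    """Resolve a relative reference against a node path within a hierarchy.

    Paths are relative to the root of the hierarchy, and the root itself is
    returned as the empty string. Anchoring at "/" before normalizing clamps
    references that would otherwise climb above the root.
    """
    resolved = posixpath.normpath(posixpath.join("/", node_path, reference))
    return resolved.lstrip("/")
```

Under these assumptions, an array at group/sub/array with the reference "../.." resolves to the node group, so only a single additional metadata read is needed instead of a traversal towards the root.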
Phase 3: Future name evolution
The phase 1 proposal above prioritizes changes that can be made to the specification in line with the current text, aware that there are contradictions and a lack of clarity.
It does not provide, however, a mechanism for evolving the assigned names.
An extension that begins life with a URI-based name and eventually migrates to a raw name will require implementations to be updated to check for both values.
This section outlines why choices above were made (e.g., use of full URIs rather than a shorter alternative) as the basis for a possible pathway to evolving the naming strategy. These changes would require updating implementations and therefore would require an additional ZEP but would be backwards compatible with the clarifications above.
URIs everywhere
By having a mechanism which maps all extension names to a global URI, the two names (URI and raw) associated with an extension can be resolved to a single identifier. In fact, the original name assigned to the extension becomes its permanent internal representation even if it can later be referred to by a shorter name.
To demonstrate, under this representation the example above could be made equivalent to:
{
"https://zarr.dev/v3/array/data_type": "https://example.com/zarr/string",
"https://zarr.dev/v3/array/chunk_key_encoding": {
"name": "https://zarr.dev/v3/chunk_key_encodings/default"
},
"https://zarr.dev/v3/array/codecs": [
{
"name": "https://numcodecs.dev/vlen-utf8"
},
{
"name": "https://zarr.dev/v3/codecs/zstd"
}
],
"https://zarr.dev/v3/array/chunk_grid": {
"name": "https://zarr.dev/v3/chunk_grid/regular"
},
"https://zarr.dev/v3/extensions": [
{
"name": "https://example.com/zarr/offset"
},
{
"name": "https://example.com/zarr/array-statistics"
},
{
"name": "https://example.com/zarr/consolidated-metadata"
}
],
"https://zarr.dev/v3/array/dimension_names": [ "x" ],
"https://zarr.dev/v3/attributes": { ... },
"https://zarr.dev/v3/storage_transformers": []
}
where all identifiers are now prefixed with a URI fully scoping the value, prefixed in this example with https://zarr.dev/v3/. The specification would clearly list which prefix applies to all “raw” names within the specification, and any new “raw” names which are brought into the specification might exist within URIs outside of https://zarr.dev/.
Implicit Naming Context
For all intents and purposes, the immediate proposal defined in this document defines a single, unversioned global naming context owned by the Zarr organization which can provide such a mapping. This can be seen as a JSON document of the form:
"@context": {
"bool": "https://zarr.dev/bool", // used for brevity and doesn't represent the final URI
...etc...
}
This context is not written into each Zarr document but is implied.
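The implied context can be pictured as a lookup table that an implementation ships with. The sketch below is illustrative only: the entries reuse the placeholder URIs from this section and from the "URIs everywhere" example, which the text itself notes are not final, and the function name to_uri is hypothetical.

```python
from typing import Dict

# Placeholder entries; the URIs are taken from the examples in this document
# and do not represent final assignments.
IMPLICIT_CONTEXT: Dict[str, str] = {
    "bool": "https://zarr.dev/bool",
    "zstd": "https://zarr.dev/v3/codecs/zstd",
    "regular": "https://zarr.dev/v3/chunk_grid/regular",
}


def to_uri(name: str, context: Dict[str, str] = IMPLICIT_CONTEXT) -> str:
    """Map a raw name to its URI; URI-based names are returned unchanged."""
    if name.startswith(("http://", "https://")):
        return name
    try:
        return context[name]
    except KeyError:
        raise KeyError(f"raw name {name!r} is not in the naming context") from None
```

Because the table is bundled with the implementation, no network access is needed to resolve a raw name, and both the raw and the URI-based spelling of an extension compare equal after resolution.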
Explicit Naming Context
Through the ZEP process, we could make it possible to define an explicit context within a zarr.json file which would override the default, unchanging naming context. This would let us in the future slowly correct raw names as necessary without a full breaking change to the format.
"@context": "https://zarr.dev/context/v3.1"
This explicit context reference would be written in the zarr.json itself.
Relationship with JSON-LD
Readers familiar with JSON-LD will recognize the “@context” definition:
- Each JSON-LD file MAY have a “@context” key which defines how name resolution works. With that it’s possible to either load a remote context field, or define namespaces and fields inline. (For Zarr naming resolution, we would likely want to avoid loading remote resources in favor of well-known contexts which are cached within the implementations.)
- All keys within the JSON-LD are then resolved against this context. EVERYTHING is a URI but can alternatively be referred to as “https://example.com/field”, “example:field” or, for the default namespace, “field”.
Prefixes and aliases
As an existing and well-defined resolution process, other features of the JSON-LD naming mechanism might be worth considering for future adoption. For example, prefixes can be defined within the context, whether implicit or explicit, which prevents the need to use the full URI. For example, with the context:
"@context": {
"example": "https://example.com/some-prefix/"
}

{
"zarr_format": 3,
"data_type": "https://example.com/some-prefix/utf-16-string",
becomes:
{
"zarr_format": 3,
"data_type": "example:utf-16-string",
A more advanced feature might be to introduce a short name inline with an explicit context such that all uses of a given string can be replaced:
"@context": {
"shrt": "https://example.com/some-prefix/long-name",
},
...
"shrt": values
...
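Both the prefix and the alias case can be expressed as one small expansion function. This is a simplified sketch inspired by JSON-LD term and compact-IRI expansion, not an implementation of the JSON-LD algorithm; the function name expand_name is hypothetical and the context entries are the ones used in the examples above.

```python
from typing import Dict


def expand_name(name: str, context: Dict[str, str]) -> str:
    """Expand an alias or a "prefix:suffix" name to a full URI."""
    # Full URIs are checked first because they also contain a colon.
    if name.startswith(("http://", "https://")):
        return name
    # Alias: the whole string is defined in the context.
    if name in context:
        return context[name]
    # Prefix: the part before the first colon is defined in the context.
    prefix, sep, suffix = name.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    raise KeyError(f"cannot expand {name!r} with the given context")


context = {
    "example": "https://example.com/some-prefix/",
    "shrt": "https://example.com/some-prefix/long-name",
}
```

With this context, example:utf-16-string expands to https://example.com/some-prefix/utf-16-string and shrt expands to https://example.com/some-prefix/long-name, while names that are already URIs pass through unchanged.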
Internal implications
Under the hood, all name resolutions, whether via implicit, explicit, or inline context, would convert identifier names to a URI which implementations could reliably use for determining support. Once a URI is assigned, however, it should never be changed.
Copyright
This proposal is licensed under the Apache License, Version 2.0.