ZEP 4 — Metadata Conventions

Authors:

  • Ryan Abernathey (@rabernat), Earthmover PBC

Status: Draft

Type: Process

Created: 2023-01-03

Abstract

This ZEP describes how communities can standardize conventions around metadata and layout of Zarr data using user-defined attributes in order to meet domain-specific application needs without changes to the core data model and specification, and without specification extensions.

Motivation and Scope

Zarr is a highly general array storage format. It is used widely in diverse domains, from genomics and microscopy to oceanography and climate science. While all of these domains use multi-dimensional arrays, each has distinct practices around how data and metadata are stored, queried, and processed. Supporting these domain-specific needs is out of scope for Zarr implementations; instead they should be met by higher-level libraries which may depend on Zarr for their I/O.

However, these domains may still benefit from standardization of the way their data and metadata are stored in Zarr. For example, they may expect that

  • arrays have specific names
  • specific attributes be present in arrays and groups and conform to certain constraints
  • groups are structured with a specific hierarchy, with defined relationships between different arrays in a group

Such standardization helps interoperability across different languages and packages.

This ZEP aims to formalize the concept of a Zarr convention, a mechanism for a domain community to document a standard way of storing their specific data model in Zarr. As of the time of this draft, many such ad-hoc conventions are already in use across the Zarr landscape. A goal of this ZEP is to formalize this existing practice and enhance coordination around conventions.

Conventions must fit completely within the Zarr data / metadata model of groups, arrays, and attributes thereof, requiring no changes or extension to the specification. A Zarr implementation itself should not even be aware of the existence of the convention. The line between a convention and an extension may be blurry in some cases. The key distinction lies in the implementation: the responsibility for interpreting a convention relies completely with downstream, domain-specific software, while an extension must be handled by the Zarr implementation itself. A good rule of thumb is that a user should be able to safely ignore the convention and still be able to interact with the data via the core Zarr library, even if some domain-specific context or functionality is missing. If the data are completely meaningless or unintelligible without the convention, then it should be an extension instead.

It seems likely that some features that are first supported as conventions may be promoted to extensions if they are deemed sufficiently general.

Conventions can also help users switch between different storage libraries more flexibly. Since Zarr and HDF5 implement nearly identical data models, a single convention can be applied to both formats. This allows downstream software to maintain better separation of concerns between storage and domain-specific logic.

Conventions are modular and composable. A single group or array can conform to multiple conventions.

Usage and Impact

We demonstrate the usage and impact with a simple example: unit encoding.

A convention around unit encoding might look like this

Convention Name: “units-v1”

Information about the units of a Zarr array should be stored in the units attribute and encoded as strings using the MIXF-08 standard.

To create data with this convention using zarr-python, one would write

import zarr
z = zarr.open(
    'example_with_units.zarr', mode='w', shape=(10000, 10000), chunks=(1000, 1000), dtype='f4'
)
z.attrs['units'] = 'm^2' 
z.attrs['zarr_conventions'] = ["units-v1"]

Reading back the data with zarr-python would work just fine. However, reading the data with a unit-aware package would enable automatic decoding. This is possible today with Pint, Xarray, and Pint-Xarray

import xarray as xr
import pint_xarray
ds = xr.open_dataset('example_with_units.zarr', engine='zarr').pint.quantify()

Now the variables have been decoded to unit-aware Pint arrays, based on the unit information encoded via the conventions. Likewise, we can serialize the data using the convention back to Zarr automatically.

ds.dequantify().to_zarr("array_with_units.zarr")

None of this requires any awareness of units on part of Zarr itself.

Backward Compatibility

Conventions by definition are fully backwards and forwards compatible with all versions of Zarr, including both V2 and V3. They require no changes to the spec or extensions and can be ignored by Zarr implementations.

Existing Zarr data conforming to ad-hoc conventions that predate this ZEP can be supported via the legacy conventions mechanism described below. New Zarr data written after this ZEP has been accepted should use the explicit conventions approach.

Detailed description

This ZEP itself describes the structure and process by which conventions may be proposed, discussed, accepted, and published by the Zarr community. This process is intended to be much more lightweight and informal than a spec change, which requires a ZEP.

Conventions will be described by a convention document. Conventions documents will be added to the https://zarr-specs.readthedocs.io/ website as a new top-level heading and a corresponding folder will be created in the zarr-developers/zarr-specs repo. A convention document template, which accompanies this ZEP, will be added to that folder. The template is intentionally simple.

Identifying a Convention

In its convention document, the author must declare the proposal as either an explicit convention or a legacy convention.

Explicit Convention

The preferred way of identifying the presence of a convention in a Zarr group or array is via the attribute zarr_conventions. This attribute must be an array of strings; each string is an identifier for the convention. Multiple conventions may be present.

For example, a group metadata JSON document with conventions present might look like this

{
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {
        "zarr_conventions": ["units-v1", "foo"],
    }
}

where units-v1 and foo are the convention identifiers.

Legacy Convention

A legacy convention is a convention already in use that predates this ZEP. Data conforming to legacy conventions will not have the zarr_conventions attribute. The conventions document must therefore specify how software can identify the presence of the convention through a series of rules or tests.

For those comfortable with the terminology, legacy conventions can be thought of as a “conformance class” and a corresponding “conformance test”.

An example of a legacy convention might be existing Zarr data written following CF Conventions. Such data will have the group attribute conventions set to the value CF-1.10 (or perhaps a different version number). This forms the basis for a test for whether the group conforms to the convention.

Namespacing

Conventions may choose to store their attributes on a specific namespace. This ZEP does not specify how namespacing works; that is up to the convention. For example, the namespace may be specified as a prefix on attributes, e.g.

{
    "attributes": {"units-v1:units": "m^2"}
}

or via a nested JSON object, e.g.

{
    "attributes": {"units-v1": {"units: "m^2"}}
}

The use of namespacing is optional and is up to the convention to decide.

Versioning

There may be multiple versions of a convention. It is recommended for a convention to explicitly declare its version. For an explicit convention, the version identifier may be encoded into the convention identifier string, but this is not required. The convention document should specify how to identify the convention version.

New Convention Process

New conventions are proposed via a pull-request to the zarr-specs repo which adds a new conventions document. If the convention is already documented elsewhere, the convention document can just contain a reference to the external documentation. The author of the PR is expected to convene the relevant domain community to review and discuss the ZEP. This includes posting a link to the PR on relevant forums, mailing lists, and social-media platforms.

The goal of the discussion is to reach a consensus among the domain community regarding the convention. The Zarr steering council, together with the PR author, will determine if a consensus has been reached, at which point the PR can be merged and the convention published on the website. If a consensus cannot be reached, the steering council may still decide to publish the convention, accompanied by a disclaimer that it is not a consensus, and noting any objections that were raised during the discussion.

It is also possible that multiple, competing conventions exist in the same domain. While not ideal, it’s not up to the Zarr community to resolve such domain-specific debates. These conventions should still be documented in a central location, which hopefully helps move towards alignment.

Updating a Convention

Conventions should be versioned using incremental integers, starting from 1. Or, if the community already has an existing versioning system for their convention, that can be used instead (e.g. CF conventions). The community is free to update their convention via a pull request using the same consensus process described above. The conventions document should include a changelog. Details of how to manage changes and backwards compatibility are left to the domain community.

This proposal is inspired heavily by the way Zarr is already used today in the weather and climate community. The Xarray software package is built around the NetCDF Data Model. The original netCDF4 format is implemented on top of HDF5. When implementing a Zarr reader / writer for Xarray, the developers found a way to map the NetCDF data model to Zarr without requiring any changes to Zarr itself. This is a de-facto convention.

The mapping between Zarr and NetCDF enables further layering of conventions. Specifically, the Climate-Forecast conventions (CF Conventions) sit on top of the NetCDF data model. It’s useful to quote from the CF conventions, as they provide a good summary of the value of conventions more broadly:

The netCDF interface enables but does not require the creation of self-describing datasets. The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the sense that each variable in the file has an associated description of what it represents, including physical units if appropriate, and that each value can be located in space (relative to earth-based coordinates) and time.

An important benefit of a convention is that it enables software tools to display data and perform operations on specified subsets of the data with minimal user intervention. It is possible to provide the metadata describing how a field is located in time and space in many different ways that a human would immediately recognize as equivalent. The purpose in restricting how the metadata is represented is to make it practical to write software that allows a machine to parse that metadata and to automatically associate each data value with its location in time and space. It is equally important that the metadata be easy for human users to write and to understand.

By separating the conventions from the underlying storage container, we enable more modularity in a domain’s software ecosystem. With conventions, Zarr can be substituted more easily for HDF5 with minimal changes to the rest of the software stack.

There are already many conventions operating today in Zarr. Prominent examples include

Implementation

Implementation of this ZEP requires only changes to the zarr-specs website.

Alternatives

The main alternative to the concept of conventions is the concept of an optional extension, discussed in ZEP0001. All of the features described above could be implemented as spec extensions instead. Then the responsibility for parsing the extensions and implementing the expected features (e.g. creating geographical coordinate reference systems, decoding units) would have to be handled by the Zarr implementation. This would lead to potentially unlimited scope increases for the zarr core libraries.

Discussion

License

CC0
To the extent possible under law, the authors have waived all copyright and related or neighboring rights to ZEP 4.