pizzarr ships in two tiers. The CRAN build is pure R — no Rust
compilation, no system dependencies. It handles local and HTTP Zarr
stores with sequential chunk I/O via lapply. The r-universe
build additionally compiles the zarrs Rust crate via extendr,
adding parallel decompression, cloud-native store backends (S3, GCS),
and codecs beyond what R packages provide.
The split exists because CRAN’s macOS build machines ship a Rust
toolchain (rustc 1.84) that is too old for zarrs, which requires rustc
>= 1.91. r-universe builds against the latest stable toolchain, so it
can compile zarrs and distribute pre-built binaries. End users on either
tier install with install.packages() — no Rust toolchain
needed.
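Assuming the usual r-universe distribution model, the install takes the additional-repos form sketched below. The universe URL is a placeholder; pizzarr_upgrade() prints the exact command for the current release.

```r
# Sketch only: the universe URL is a placeholder, not the real repository.
# Run pizzarr_upgrade() to see the exact command.
install.packages(
  "pizzarr",
  repos = c("https://<universe>.r-universe.dev", getOption("repos"))
)
```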
Checking availability
pizzarr_compiled_features() lists the feature flags
compiled into the zarrs backend. On the CRAN tier it returns
character(0) with a message; on the r-universe tier it
returns the compiled capabilities:
pizzarr_compiled_features()
#> [1] "zarrs" "filesystem" "http_sync" "gzip" "blosc"
#> [6] "zstd" "object_store" "s3" "gcs"
The internal flag .pizzarr_env$zarrs_available is a
logical scalar set once at package load. Dispatch logic throughout
pizzarr checks this flag to decide whether to call into Rust or fall
through to the R-native path:
pizzarr:::.pizzarr_env$zarrs_available
#> [1] TRUE
Upgrading to the zarrs tier
pizzarr_upgrade() prints the r-universe install command
when zarrs is not compiled in, or confirms that the backend is already
present:
pizzarr_upgrade()
#> zarrs backend is already available.
The startup message that CRAN users see on
library(pizzarr) can be silenced with
options(pizzarr.suggest_runiverse = FALSE).
Probing store metadata
The examples below require the zarrs backend. When this vignette is built without it, the code chunks are not evaluated.
zarrs_node_exists() opens a filesystem store via the
Rust backend, probes for V2 and V3 metadata keys at a given path, and
returns a list with three fields: exists (logical),
node_type (character), and zarr_format
(integer or NULL). The store handle is cached on the Rust side —
subsequent calls to the same store path reuse it without re-opening.
V2 store
v2_root <- pizzarr_sample("fixtures/v2/data.zarr")
# Root group
zarrs_node_exists(v2_root, "")
#> $exists
#> [1] TRUE
#>
#> $node_type
#> [1] "group"
#>
#> $zarr_format
#> [1] 2
# An array within the store
zarrs_node_exists(v2_root, "1d.contiguous.lz4.i2")
#> $exists
#> [1] TRUE
#>
#> $node_type
#> [1] "array"
#>
#> $zarr_format
#> [1] 2
# A path that does not exist
zarrs_node_exists(v2_root, "does_not_exist")
#> $exists
#> [1] FALSE
#>
#> $node_type
#> [1] "none"
#>
#> $zarr_format
#> NULL
V3 store
V2 and V3 detection is automatic. zarrs probes for
zarr.json first (V3), then falls back to
.zarray / .zgroup (V2):
v3_root <- pizzarr_sample("fixtures/v3/data.zarr")
zarrs_node_exists(v3_root, "")
#> $exists
#> [1] TRUE
#>
#> $node_type
#> [1] "group"
#>
#> $zarr_format
#> [1] 3
Store cache management
The Rust backend holds open store handles in a process-global cache
keyed by normalized path. zarrs_close_store() removes a
handle from the cache and returns TRUE. A second call to
the same path returns FALSE — it was already removed:
zarrs_close_store(v2_root)
#> [1] TRUE
zarrs_close_store(v2_root)
#> [1] FALSE
zarrs_close_store(v3_root)
#> [1] TRUE
Array metadata
zarrs_open_array_metadata() opens a zarrs array and
returns its metadata as a named list. The store handle is cached, so
repeated calls to the same store are fast. The returned list contains
shape, chunks, dtype,
r_type, fill_value_json,
zarr_format, and order.
V2 array
v2_root <- pizzarr_sample("fixtures/v2/data.zarr")
zarrs_open_array_metadata(v2_root, "1d.contiguous.raw.i2")
#> $shape
#> [1] 4
#>
#> $chunks
#> [1] 4
#>
#> $dtype
#> [1] "int16 / <i2"
#>
#> $r_type
#> [1] "integer"
#>
#> $fill_value_json
#> [1] "[0, 0]"
#>
#> $zarr_format
#> [1] 2
#>
#> $order
#> [1] "C"
V3 array
V3 arrays work the same way. The zarr_format field
distinguishes V2 from V3:
v3_root <- pizzarr_sample("fixtures/v3/data.zarr")
zarrs_open_array_metadata(v3_root, "1d.contiguous.gzip.i2")
#> $shape
#> [1] 4
#>
#> $chunks
#> [1] 4
#>
#> $dtype
#> [1] "int16 / <i2"
#>
#> $r_type
#> [1] "integer"
#>
#> $fill_value_json
#> [1] "[0, 0]"
#>
#> $zarr_format
#> [1] 3
#>
#> $order
#> [1] "C"
Data type classification
The r_type field maps zarrs data types to R-compatible
type families. zarrs numeric types are classified as
"double", "integer", or "logical"
based on what R can represent natively:
- double: float64 (zero-cost), float32 (widened), uint32/int64/uint64 (widened, precision risk > 2^53)
- integer: int32 (zero-cost), int8/int16/uint8/uint16 (widened)
- logical: bool
Unsupported types (strings, complex) report
"unsupported" and fall back to the R-native code path.
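The precision caveat for the widened 64-bit types can be seen directly in base R, where doubles cannot distinguish consecutive integers beyond 2^53:

```r
# Doubles carry 53 bits of mantissa, so 2^53 + 1 rounds back to 2^53 ...
(2^53 + 1) - 2^53
#> [1] 0
# ... while 2^53 + 2 is still exactly representable.
(2^53 + 2) - 2^53
#> [1] 2
```

This is why int64/uint64 (and uint32 near its upper range) values round-trip safely only below 2^53 once widened to double.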
zarrs_close_store(v2_root)
#> [1] TRUE
zarrs_close_store(v3_root)
#> [1] TRUE
Runtime info and tuning
zarrs_runtime_info() reports the current zarrs
configuration — the codec concurrency target, thread pool size, how many
store handles are cached, and which features were compiled in:
zarrs_runtime_info()
#> $codec_concurrent_target
#> [1] 4
#>
#> $nthreads
#> [1] 4
#>
#> $store_cache_entries
#> [1] 0
#>
#> $tokio_active
#> [1] FALSE
#>
#> $compiled_features
#> [1] "zarrs" "filesystem" "http_sync" "gzip" "blosc"
#> [6] "zstd" "object_store" "s3" "gcs"
pizzarr_config()
pizzarr_config() is the main interface for viewing and
changing concurrency settings. Called with no arguments it returns the
current state; with arguments it sets the specified values:
# View current settings
pizzarr_config()
#> $nthreads
#> [1] 4
#>
#> $concurrent_target
#> [1] 4
#>
#> $http_batch_range_requests
#> [1] TRUE
# Set codec concurrency to 2 parallel operations per read/write
pizzarr_config(concurrent_target = 2L)
zarrs_runtime_info()$codec_concurrent_target
#> [1] 2
Three settings are available:
- nthreads — rayon thread pool size. Set once per R session (the thread pool can only be initialised once). For reliable session-level control, set the PIZZARR_NTHREADS environment variable before starting R.
- concurrent_target — how many codec operations zarrs runs in parallel within a single read or write call. Can be changed at any time.
- http_batch_range_requests — whether HTTP stores use multipart range requests (default TRUE). Set to FALSE for servers with incomplete multipart support. Takes effect on the next store open.
All three settings can also be configured via environment variables
(PIZZARR_NTHREADS, PIZZARR_CONCURRENT_TARGET,
PIZZARR_HTTP_BATCH_RANGE_REQUESTS) or R options
(pizzarr.nthreads, etc.), which are read at package load
time. Environment variables persist across sessions without needing
.Rprofile edits.
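For example, a sketch of pinning the thread pool for a single session. This must run in a fresh session, since the environment variable is read at load time:

```r
# Hypothetical fresh-session setup: PIZZARR_NTHREADS is read when the
# package loads, so it must be set before library(pizzarr).
Sys.setenv(PIZZARR_NTHREADS = "2")
library(pizzarr)
pizzarr_config()$nthreads
```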
The lower-level zarrs_set_codec_concurrent_target()
function is still available for direct use:
zarrs_set_codec_concurrent_target(2L)
#> NULL
zarrs_runtime_info()$codec_concurrent_target
#> [1] 2
Reading data via zarrs
When the zarrs backend is available and the selection is a contiguous
slice (step == 1), ZarrArray$get_item() dispatches reads to
zarrs automatically. zarrs handles chunk identification, parallel
decompression, and codec execution internally, bypassing pizzarr’s
R-native chunk loop. Scalar integer selections (e.g., selecting a single
row of a matrix) are also eligible — they become length-1 ranges on the
Rust side. Unsupported selections (step > 1 slices, fancy indexing,
MemoryStore) fall through to the R-native path transparently.
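A small sketch of this fall-through behaviour, assuming slice() accepts a step argument as its third parameter:

```r
d <- tempfile("dispatch_demo_")
z <- zarr_create(store = d, shape = c(10L), chunks = c(5L), dtype = "<f8")
z$set_item("...", array(as.double(1:10), dim = c(10L)))

z2 <- zarr_open(store = d)
# Contiguous slice (step 1): eligible for the zarrs fast path
z2$get_item(list(slice(1L, 4L)))$data
# Step-2 slice: handled by the R-native path instead, with no code change
z2$get_item(list(slice(1L, 9L, 2L)))$data

zarrs_close_store(d)
unlink(d, recursive = TRUE)
```

Both calls return correct data; only the execution path differs.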
Basic read
d <- tempfile("zarrs_vignette_")
z <- zarr_create(store = d, shape = c(100L, 50L), chunks = c(10L, 10L),
dtype = "<f8")
z$set_item("...", array(as.double(seq_len(5000)), dim = c(100, 50)))
#> NULL
# Re-open and read a subset --- zarrs handles the chunk I/O
z2 <- zarr_open(store = d)
result <- z2$get_item(list(slice(1L, 10L), slice(1L, 5L)))
dim(result$data)
#> [1] 10 5
Direct zarrs_get_subset call
For lower-level access, zarrs_get_subset() reads a
contiguous subset directly via the Rust backend. Ranges are 0-based with
exclusive stop, matching zarrs conventions:
result <- zarrs_get_subset(d, "", list(c(0L, 10L), c(0L, 5L)), NULL)
str(result)
#> List of 2
#> $ data : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
#> $ shape: int [1:2] 10 5
Concurrency control
The optional concurrent_target parameter (or the
pizzarr.concurrent_target R option) controls how many
parallel codec operations zarrs uses within a single read call. Setting
it to 1L disables parallel decompression:
result <- zarrs_get_subset(d, "", list(c(0L, 10L), c(0L, 5L)), 1L)
length(result$data)
#> [1] 50
zarrs_close_store(d)
#> [1] TRUE
unlink(d, recursive = TRUE)
Creating arrays via zarrs
When the zarrs backend is available and the store is a writable
filesystem path, zarr_create() dispatches array creation to
zarrs instead of building metadata JSON in R. zarrs validates the
metadata structure, writes it to the store, and the array is ready for
data. The dispatch is transparent — the same zarr_create()
call works on both tiers, and unsupported configurations (MemoryStore,
object dtypes, custom filters) fall through to the R-native path.
Transparent dispatch
The zarr_create() examples earlier in this vignette
already use this path when zarrs is available. The zarrs backend handles
V2 and V3 formats, all 11 numeric data types, and four codec
presets:
# V3 array with gzip compression
d <- tempfile("zarrs_create_vignette_")
z <- zarr_create(store = d, shape = c(20L, 10L), chunks = c(10L, 10L),
dtype = "<f8", zarr_format = 3L)
z
#> <ZarrArray> /
#> Shape : (20, 10)
#> Chunks : (10, 10)
#> Data type : <f8
#> Fill value : 0
#> Order : C
#> Read-only : FALSE
#> Compressor : ZstdCodec
#> Store type : DirectoryStore
#> Zarr format : 3
# Confirm V3 metadata was written
file.exists(file.path(d, "zarr.json"))
#> [1] TRUE
zarrs_close_store(d)
#> [1] FALSE
unlink(d, recursive = TRUE)
Direct zarrs_create_array call
zarrs_create_array() provides lower-level access to the
Rust creation path. It accepts V3-style data type names
("float64", "int32", "bool",
etc.) and a codec preset string ("none",
"gzip", "blosc", or "zstd"). The
return value is the same metadata list as
zarrs_open_array_metadata():
d <- tempfile("zarrs_create_direct_")
dir.create(d)
meta <- zarrs_create_array(
store_url = d,
array_path = "",
shape = c(100L, 50L),
chunks = c(10L, 10L),
dtype = "float64",
codec_preset = "gzip",
fill_value = 0.0,
attributes_json = "{}",
zarr_format = 3L
)
str(meta)
#> List of 7
#> $ shape : int [1:2] 100 50
#> $ chunks : int [1:2] 10 10
#> $ dtype : chr "float64"
#> $ r_type : chr "double"
#> $ fill_value_json: chr "[0, 0, 0, 0, 0, 0, 0, 0]"
#> $ zarr_format : int 3
#> $ order : chr "C"
The array is immediately usable for reads and writes:
zarrs_set_subset(d, "", list(c(0L, 10L), c(0L, 5L)),
as.double(1:50), NULL)
#> [1] TRUE
result <- zarrs_get_subset(d, "", list(c(0L, 10L), c(0L, 5L)), NULL)
head(result$data)
#> [1] 1 2 3 4 5 6
zarrs_close_store(d)
#> [1] TRUE
unlink(d, recursive = TRUE)
Codec presets
The zarrs creation path supports four named codec presets. Custom codec configurations fall through to the R-native path.
| Preset | V2 compressor | V3 codec chain | Notes |
|---|---|---|---|
| "none" | null | bytes only | No compression |
| "gzip" | gzip, level 1 | bytes + gzip(1) | Fast, reasonable ratio |
| "blosc" | blosc, lz4, clevel 5 | bytes + blosc(lz4, 5) | Requires blosc feature |
| "zstd" | — | bytes + zstd(3) | V3 only; requires zstd feature |
One difference from the R-native path: zarrs uses the
"gzip" compressor id for V2 arrays, while zarr-python uses
"zlib". Both produce gzip-compatible output, and zarrs
reads either id when opening existing arrays.
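This can be observed by creating a V2 array with the gzip preset and inspecting the raw .zarray document. A sketch, assuming jsonlite is installed and the zarrs backend is active:

```r
library(jsonlite)

d <- tempfile("compressor_id_")
dir.create(d)
zarrs_create_array(
  store_url = d, array_path = "",
  shape = c(10L), chunks = c(10L),
  dtype = "float64", codec_preset = "gzip",
  fill_value = 0.0, attributes_json = "{}",
  zarr_format = 2L
)
# zarrs writes "gzip" here, where zarr-python would write "zlib"
fromJSON(file.path(d, ".zarray"))$compressor$id
zarrs_close_store(d)
unlink(d, recursive = TRUE)
```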
Writing data via zarrs
The write path mirrors the read path. When the zarrs backend is
available and the selection qualifies (contiguous slices,
filesystem-backed store), ZarrArray$set_item() dispatches
writes to zarrs instead of iterating over chunks in R. zarrs encodes the
data, splits it across the affected chunks, and writes them to disk —
using its internal thread pool for parallel compression when multiple
chunks are involved.
Data type narrowing happens on the Rust side. R doubles narrow to the array’s stored type (float32, int64, uint32, etc.) and R integers narrow to smaller integer types (int16, int8, uint8, uint16) with range checking. An out-of-range value produces an error rather than silent truncation.
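A sketch of that range check, assuming an int8 array created via the direct API:

```r
d <- tempfile("narrow_demo_")
dir.create(d)
zarrs_create_array(
  store_url = d, array_path = "",
  shape = c(4L), chunks = c(4L),
  dtype = "int8", codec_preset = "none",
  fill_value = 0.0, attributes_json = "{}",
  zarr_format = 3L
)
# In-range integers narrow from R's int32 to int8 cleanly
zarrs_set_subset(d, "", list(c(0L, 4L)), c(1L, 2L, 3L, 4L), NULL)
# 300 exceeds the int8 range, so this errors instead of truncating silently
try(zarrs_set_subset(d, "", list(c(0L, 4L)), c(300L, 0L, 0L, 0L), NULL))
zarrs_close_store(d)
unlink(d, recursive = TRUE)
```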
Basic write
d <- tempfile("zarrs_write_vignette_")
z <- zarr_create(store = d, shape = c(20L, 10L), chunks = c(10L, 10L),
dtype = "<f8")
# set_item dispatches to zarrs when eligible
z$set_item("...", array(as.double(1:200), dim = c(20, 10)))
#> NULL
# Read back to confirm
z2 <- zarr_open(store = d)
result <- z2$get_item(list(slice(1L, 5L), slice(1L, 3L)))
result$data
#> [,1] [,2] [,3]
#> [1,] 1 21 41
#> [2,] 2 22 42
#> [3,] 3 23 43
#> [4,] 4 24 44
#> [5,] 5 25 45
Partial overwrite
Writing to a subset of an existing array works the same way. zarrs reads the affected chunks, merges the new data, and writes them back:
# Overwrite rows 3-7, columns 1-2
z$set_item(list(slice(3L, 7L), slice(1L, 2L)),
array(rep(-1.0, 10), dim = c(5, 2)))
#> NULL
result <- z2$get_item(list(slice(1L, 10L), slice(1L, 3L)))
result$data
#> [,1] [,2] [,3]
#> [1,] 1 21 41
#> [2,] 2 22 42
#> [3,] -1 -1 43
#> [4,] -1 -1 44
#> [5,] -1 -1 45
#> [6,] -1 -1 46
#> [7,] -1 -1 47
#> [8,] 8 28 48
#> [9,] 9 29 49
#> [10,] 10 30 50
Direct zarrs_set_subset call
zarrs_set_subset() provides lower-level access to the
Rust write path. Data is a flat vector in R’s native F-order
(column-major) — the Rust backend handles the F-to-C order conversion
internally. The function returns TRUE on success:
# Write 10 values to the first row (0-based range [0, 1) x [0, 10))
zarrs_set_subset(d, "", list(c(0L, 1L), c(0L, 10L)),
as.double(101:110), NULL)
#> [1] TRUE
result <- zarrs_get_subset(d, "", list(c(0L, 1L), c(0L, 10L)), NULL)
result$data
#> [1] 101 102 103 104 105 106 107 108 109 110
zarrs_close_store(d)
#> [1] TRUE
unlink(d, recursive = TRUE)
HTTP reads via zarrs
When the http_sync feature is compiled in, the zarrs
backend can read directly from HTTP/HTTPS Zarr stores using the zarrs_http crate. This
bypasses pizzarr’s R-native crul-based chunk loop, giving
parallel chunk decode on remote data.
HTTP stores are read-only in zarrs — write dispatch
(set_item) falls through to the R-native path
automatically.
Transparent dispatch
The zarrs fast path activates automatically when an
HttpStore-backed array is read with a contiguous selection.
No code changes are needed compared to the R-native path:
url <- "https://raw.githubusercontent.com/DOI-USGS/rnz/main/inst/extdata/bcsd.zarr"
z <- zarr_open(store = HttpStore$new(url))
# zarrs handles the HTTP reads + parallel decompression
pr <- z$get_item("pr")
pr
#> <ZarrArray> /pr
#> Shape : (12, 33, 81)
#> Chunks : (12, 33, 81)
#> Data type : <f4
#> Fill value : 1.00000002004088e+20
#> Order : C
#> Read-only : TRUE
#> Compressor : ZstdCodec
#> Store type : HttpStore
#> Zarr format : 2
Direct zarrs_get_subset from HTTP
zarrs_get_subset() also works with HTTP URLs. The store
handle is cached on the Rust side, so repeated reads to the same URL
reuse the connection:
meta <- zarrs_open_array_metadata(url, "pr")
str(meta[c("shape", "dtype", "zarr_format")])
#> List of 3
#> $ shape : int [1:3] 12 33 81
#> $ dtype : chr "float32 / <f4"
#> $ zarr_format: int 2
# Read a single element (first along each dimension)
ranges <- lapply(seq_along(meta$shape), function(i) c(0L, 1L))
result <- zarrs_get_subset(url, "pr", ranges, NULL)
result$data
#> [1] 159.08
zarrs_close_store(url)
#> [1] TRUE
Feature detection
Check whether HTTP support is compiled in with
pizzarr_compiled_features(). When "http_sync"
is present, zarrs can open http:// and
https:// URLs. When it is absent, HTTP reads fall through
to the R-native crul-based path:
"http_sync" %in% pizzarr_compiled_features()
#> [1] TRUE
S3 reads via zarrs
When the s3 feature is compiled in, the zarrs backend
can read from Amazon S3 buckets using the object_store crate with an
async-to-sync adapter. Public buckets work without credentials (unsigned
requests). Authenticated access uses standard AWS environment variables
(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
AWS_REGION).
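For authenticated buckets, a sketch of setting the standard variables before the first S3 open (placeholder values):

```r
# Placeholder credentials; in practice prefer your existing AWS profile,
# an instance role, or a secrets manager over hard-coded values.
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "<access-key-id>",
  AWS_SECRET_ACCESS_KEY = "<secret-access-key>",
  AWS_REGION            = "us-east-1"
)
```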
S3 stores are currently read-only via zarrs — write operations fall through to the R-native path.
# OME-Zarr bonsai dataset on AWS Open Data (V2, zstd, uint8)
s3_url <- "s3://ome-zarr-scivis/v0.4/64x0/bonsai.ome.zarr"
# Read array metadata
meta <- zarrs_open_array_metadata(s3_url, "scale0/bonsai")
str(meta[c("shape", "dtype", "zarr_format")])
#> List of 3
#> $ shape : int [1:3] 256 256 256
#> $ dtype : chr "uint8 / |u1"
#> $ zarr_format: int 2
# Read a small subset (first 4x4x4 corner)
result <- zarrs_get_subset(s3_url, "scale0/bonsai",
list(c(0L, 4L), c(0L, 4L), c(0L, 4L)), NULL)
str(result)
#> List of 2
#> $ data : int [1:64] 40 40 40 41 40 40 40 41 40 40 ...
#> $ shape: int [1:3] 4 4 4
zarrs_close_store(s3_url)
#> [1] TRUE
GCS and other cloud stores
GCS data hosted on Google Cloud Storage is publicly accessible via HTTPS endpoints. The zarrs HTTP backend reads these directly:
# Pangeo ECCO ocean basins (V2, blosc/lz4, float32)
gcs_url <- "https://storage.googleapis.com/pangeo-data/ECCO_basins.zarr"
meta <- zarrs_open_array_metadata(gcs_url, "basin_mask")
cat("Shape:", paste(meta$shape, collapse = " x "), "\n")
#> Shape: 13 x 90 x 90
cat("Dtype:", meta$dtype, "\n")
#> Dtype: float32 / <f4
# Read a single basin mask slice
result <- zarrs_get_subset(gcs_url, "basin_mask",
list(c(0L, 1L), c(0L, 90L), c(0L, 90L)), NULL)
cat("Slice dimensions:", paste(result$shape, collapse = " x "), "\n")
#> Slice dimensions: 1 x 90 x 90
zarrs_close_store(gcs_url)
#> [1] TRUE
Authenticated GCS access via gs:// URLs requires the
gcs compiled feature and GCP credentials (environment
variables or application default credentials). The S3Store
and GcsStore R6 classes provide URL wrappers for high-level
use with zarr_open().
C/F order handling
zarrs stores data in C-order (row-major), while R uses F-order (column-major). The Rust backend handles this conversion transparently:
- Reads: zarrs_get_subset() returns data in F-order, ready for array(data, dim = shape) with no aperm() needed.
- Writes: zarrs_set_subset() accepts F-order data and converts it to C-order internally before writing to the store.
The transpose uses cache-blocked tiling for 2D arrays and
output-order iteration with incremental index tracking for higher
dimensions, matching or exceeding the performance of R’s C-level
aperm().
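The F-order contract can be checked with a tiny round trip (a sketch assuming the zarrs backend is available):

```r
d <- tempfile("order_demo_")
z <- zarr_create(store = d, shape = c(2L, 3L), chunks = c(2L, 3L),
                 dtype = "<f8")
m <- matrix(as.double(1:6), nrow = 2)   # R fills matrices column-major
z$set_item("...", m)

res <- zarrs_get_subset(d, "", list(c(0L, 2L), c(0L, 3L)), NULL)
# The flat vector is already F-ordered: array() reassembles it directly
identical(array(res$data, dim = res$shape), m)

zarrs_close_store(d)
unlink(d, recursive = TRUE)
```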
