18th November, 2020
Attending:
Matthias Bussonnier, Jim Pivarski, Eric Pelman, Martin Durant, Ryan Williams,
Ryan Abernathey, Josh Moore
-
Josh: Thoughts on
- mat files:
any parser at all? Not even JSON.
- Martin: why? … Matthias: via Julia.
- Ryan: biggest elephant is R. Biggest problem.
-
- Josh: movement towards needing interlinks. Can handle it at the metadata layer, but wondering if it’s of interest at the zarr layer.
- Martin: or awkward array did the reverse. Task: store an awkward thing in zarr (all structured as one dimensional things) Pretty fast. No problems. But the python arrow implementation is difficult to walk through a nested structure. Unclear if this is a good idea. (Parquet is designed to do this) Arrow doesn’t support anything nested other than list-of-lists.
- POC to show that you can (if there’s some convention)
- Jim: “shredding” if you get things that don’t make sense on their own.
-
s3 difficulties across language implementations (re: netcdf-c)
-
For Pangeo Python+big 3+non-public data
-
Eric: lots of edge cases. Often a one line change (e.g. http 403 vs 404)
-
Josh: Difficulties with AWS-SDK-C++ etc.
-
Ryan: motivation behind tiledb cloud (proprietary platform) is the challenge of IAM “stuff”. We need integration test
-
https://github.com/constantinpape/zarr_implementations
-
But not cross-product with the different cloud providers
- Martin: supposed to be learning rust. Could implement
fsspec in rust → c → call it from anything.
-
Matthias: separate store vs. zarr protocol.
-
Ryan: what does that look like in C?
- Matthias: communication issue. Not a zarr issue. But a
SDK issue.
- Josh: will look into adding something to z_i that fails
to show the problem. Think we’re setting netcdf-c up to fail.
-
-
Ryan: cloud permissions. Pre-signed URLs! Key element to make private data accessible without going through cloud providers credential system. (json-web-token) NASA doesn’t want all users to need s3 credentials. Little lambda-based API generates URLs for users. Want to do the same thing for zarr … but a million objects.
-
Eric: great use case for the dense format with an index file. Martin: don’t think there’s a problem with a million hash files. Ryan: interested in the latency.
-
Ryan: need a standalone zarr-cloud-provider-framework
-
Matthias: tension in the spec. Is a store more than a mutable mapping? If you want more features, then stores need to be more complicated than k/v.
-
Ryan: Zarr as an API. We’re using HTTP/S3 + paths as a defacto API. good to encourage others doing things under the hood. Mention it in the spec?
-
- mat files:
-
Ryan: file chunk store:
-
How do we move forward?
-
Josh: trevor will have reduced capacity
- Ryan: can we agree on the process? Common use-case amongst
multiple fields where we want to access chunks within a binary file, map them to an array or group. Unlocks cool hacks. Can we agree on the data structure, then others can go off and make creative things. In geospatial, 100s of PBs in netcdf that shouldn’t be converted.
- Eric: index file specification or an API where you plugin in
code to do the hash table lookup to calculate the offset. Ryan: could you encode the hash table somehow
- Martin: not specifically zarr-centric. “How to treat
byte-offsets/ranges as a filesystem?” Ryan: it’s a type of store. Martin: It’s a type of filesystem. Matthias: a level beyond zarr. Ryan: data formatted this way should follow some specification so we can work together.
- See
- Josh: potentially add templates with some form of function
- Eric: prefixes to reference another json (i.e. breaking it up)
- Ryan: consider it a success if a few people open issues to demonstrate that others (beyond Ryan & Martin) are engaged.
- Need a version number.
-
Josh: n5 compression
-
Eric: how do you document the dimension restrictions?
- John: see
https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/lzma.py#L17 on how to specify parameters
- Eric: