27th March, 2019

Time: 20:00 GMT (16:00 EDT).

Joining instructions: see above.

Attendees: Alistair, Martin D, Fabian G, Stephan S, Ryan W, Ryan A, Constantin, Ward, …

Please feel free to propose an item for the agenda below. If you do, please include your name so we know who proposed the item.

Standing items

  • Introductions: new joinees should feel welcome to introduce themselves and their interest in zarr/n5.

External storage and fsspec (Martin D)

#414 #301

Martin: Storage backends for zarr. Currently zarr can store data in anything exposing a dictionary-like interface. Some other backends have emerged, e.g., S3 and GCS. Recently there was talk about a direct mapping implementation for Google storage. For Azure Blob a new mapping was implemented directly, not via a file-system-like layer. There has also been talk about HTTP.
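As a rough illustration of the "dictionary-like interface" mentioned above, here is a minimal sketch of a store that maps string keys to bytes values. The class name InMemoryStore is hypothetical and not part of zarr; the metadata content is abbreviated and only illustrative.

```python
from collections.abc import MutableMapping


class InMemoryStore(MutableMapping):
    """Hypothetical store: string keys mapped to bytes values."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        # Copy into immutable bytes so later changes to the caller's buffer
        # do not alter what is stored.
        self._data[key] = bytes(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = InMemoryStore()
store[".zarray"] = b'{"shape": [4, 4]}'  # array metadata (abbreviated, illustrative)
store["0.0"] = b"\x00" * 16              # one chunk, keyed by its grid position
```

Anything offering these five methods, whether backed by memory, files or an object store, could in principle be handed to zarr as a store.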

…I was involved in building the filesystem-like interfaces. Work done in context of dask. Dask needs glob-like pattern matching, also random access. That led to design of s3fs, gcsfs, hdfs, azure data lake.

…These file system interfaces were similar. Different from pyfilesystems2 (?). Design is for cloud use case. Comes automatically with ability to have a key/value mapping interface over the file system interface. File system spec (fsspec) standardises this. Is being used for zarr work within dask. Pulling that into central place called fsspec. Model for other file system implementations.

…Of potential interest to zarr, may simplify adding support for other file systems. On the other hand zarr only needs key/value storage, not full file system. Fsspec is released but not in use yet. S3fs and gcsfs have PRs.

…Could be an optional dependency for zarr. E.g., fsspec could handle URL dispatching to appropriate mapping implementation.
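The URL dispatching idea could look roughly like the sketch below: a registry keyed by URL scheme that returns a mapping for a given location. The names register and open_mapping and the stand-in factories are hypothetical, chosen for illustration; this is not fsspec's actual API.

```python
from urllib.parse import urlsplit

# Hypothetical registry: URL scheme -> factory that builds a mapping.
_REGISTRY = {}


def register(scheme, factory):
    """Associate a URL scheme with a callable taking a location string."""
    _REGISTRY[scheme] = factory


def open_mapping(url):
    """Pick a mapping implementation from the URL scheme."""
    parts = urlsplit(url)
    scheme = parts.scheme or "file"  # bare paths are treated as local files
    try:
        factory = _REGISTRY[scheme]
    except KeyError:
        raise ValueError(f"no mapping registered for scheme {scheme!r}") from None
    return factory(parts.netloc + parts.path)


# Stand-in factories; real ones would return S3, GCS, etc. mappings.
register("memory", lambda location: {})
register("file", lambda location: ("file", location))
```

A backend package would call register once at import time, and users would then only need to pass a URL.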

Constantin: Where do s3fs and gcsfs live?

Martin: Live in dask organisation. Mapping layer is not used so much outside zarr.

Constantin: fsspec is an effort to generalise?

Martin: Yes. E.g., if you look at PR to s3fs, it cuts out a lot of code.
https://github.com/dask/s3fs/pull/161

RyanA: we were pushing for this in pangeo. We wanted to use zarr via xarray and dask to read and write data from GCS and S3. So we created the zarr backend for xarray. Then gcsfs and s3fs appeared. We use that all together now and it works wonderfully well.

Martin: Yes, and the effort improved the libraries. This is a WIP but can be used now, e.g., can import fsspec. Zarr could work with fsspec. Fsspec is for file-system-like things, OK for S3, maybe less so for HTTP. In some cases want a real key-value store, e.g., wouldn’t want a file-system-like interface to redis. So zarr might want its own mapping implementations.

StephanS: How do you use fsspec?

Martin: … the only thing that takes time is listing all keys … but for zarr generally you are asking for a certain key. Fsspec interprets that as reading a certain file, which reads from an endpoint.
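To make the point about key lookup versus listing concrete, here is a minimal sketch of a key/value mapping layered over a file-system-like backend. A local directory stands in for a remote endpoint, and the class name DirectoryMapping is hypothetical; this is not how fsspec itself implements its mapping.

```python
import os
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class DirectoryMapping(MutableMapping):
    """Hypothetical mapping where each key is a file path under a root."""

    def __init__(self, root):
        self.root = Path(root)

    def __getitem__(self, key):
        # Asking for one key is a single file read.
        try:
            return (self.root / key).read_bytes()
        except (FileNotFoundError, IsADirectoryError):
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def __delitem__(self, key):
        try:
            (self.root / key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        # Listing all keys walks the whole tree: the expensive operation.
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                yield (Path(dirpath) / name).relative_to(self.root).as_posix()

    def __len__(self):
        return sum(1 for _ in self)


with tempfile.TemporaryDirectory() as tmp:
    mapping = DirectoryMapping(tmp)
    mapping["group/0.0"] = b"chunk"
    value = mapping["group/0.0"]  # one read, no listing needed
    keys = sorted(mapping)        # requires walking the directory tree
```

Only iteration and len touch the whole tree; reading or writing a chunk touches a single path.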

StephanS: Currently zarr is key/value oriented. Read/write …

Martin: If it’s just key/value storage, can do a simpler implementation. But a file-system-like layer is useful for file-system-like things. Should zarr duplicate any of that?

StephanS: This is useful for things other people have already spent work on, e.g., GCS or S3. But if you now have a backend that’s new, would you have to implement fsspec?

Martin: We’d never take away option of directly implementing the mapping interface. I’d like to think the design of fsspec for things that are file system like should be simpler.

Alistair: For new backend developer would have a choice of whether to implement fsspec or direct mapping.

Martin: Yes, and fsspec would also do URL dispatching.

RyanA: Wise for zarr-python to outsource many things it does currently. But those implementations of the storage classes are the de facto spec for those storage media. The spec is not specific enough to tell you how to implement e.g. posix file system storage.

…So what about e.g. a C implementation? This has to happen in parallel with spec development, where for common storage systems like file systems, zip files, object storage, how should a zarr store be stored there. Then once that spec is clear, we can ask what’s the easiest way to target that from Python. Maybe fsspec.

…Should also maintain hackability, i.e., being able to throw any mapping in. But also need to clearly define, in a language-independent way, what the core storage media are.

Martin: “current implementations provide a spec of how to view storage as a mapping” that’s a spec …

Alistair: Second RyanA’s comment about having storage specs.

Martin: Comments on fsspec welcome. Waiting for dask to be PY3 only. That will enable movement of code from dask to fsspec. Work to implement for s3fs and gcsfs in progress. Talk at Anaconda to support the project.

Introductions

Fabian: ESA Earth System DataLab. Lots of earth observation data. Data cube. Started storing in NetCDF files, but wanted to use object storage, and our partners suggested the zarr format. We provide a Python API using xarray, everything worked smoothly. We also have julia users, so I started experimenting. Looked at the format, started working on a julia implementation. Status is, the storage backends we have are file system and in memory. Working well, can produce most of zarr tutorial. Will have our data cube on S3 soon, so I’ll experiment with S3. Currently very project-driven, based on needs, but have some time here and there.

Alistair: Native Julia implementation?

Fabian: Yes. Started with pycall, but found some overhead. Also some practical things like order. In the end more convenient to do in julia.

…re specs, so far my testing has been testing against a reference implementation (Python). So have Python zarr as a test dependency. Could also test against a spec, but that needs interpreting the spec. So implementing against a reference implementation is not too bad.

RyanA: How does julia access object storage?

Fabian: There are packages for S3, GCS.

RyanA: Pangeo very much promoting this spirit, support 100%, very cool to read the same data from Python and Julia.

Alistair: Any pain points? E.g., memory order?

Fabian: Yes Julia is F order. I have also dealt with that in julia netcdf implementation. Expose dimensions in other order. Other thing was parsing the data types e.g. i4 etc. a little inconvenient.
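For the data type parsing mentioned above, a minimal sketch of decoding simple type strings such as "<i4" might look as follows. It assumes the NumPy-style layout of byte-order character, kind character and item size in bytes, and only covers basic numeric kinds; strings, datetimes and structured types are deliberately left out. The names ParsedDtype and parse_typestr are hypothetical.

```python
import re
from typing import NamedTuple


class ParsedDtype(NamedTuple):
    byteorder: str  # "<" little-endian, ">" big-endian, "|" not applicable
    kind: str       # human-readable kind, e.g. "int"
    itemsize: int   # size of one element in bytes


_TYPESTR = re.compile(r"^([<>|])([biufc])(\d+)$")
_KIND_NAMES = {"b": "bool", "i": "int", "u": "uint", "f": "float", "c": "complex"}


def parse_typestr(typestr):
    """Split a simple numeric type string into its three components."""
    match = _TYPESTR.match(typestr)
    if match is None:
        raise ValueError(f"unsupported type string: {typestr!r}")
    byteorder, kind, size = match.groups()
    return ParsedDtype(byteorder, _KIND_NAMES[kind], int(size))


parsed = parse_typestr("<i4")  # little-endian 4-byte signed integer
```

An implementation in another language would then map the (kind, itemsize) pair to its own native element type.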

StephanS: Inverting vectors. If vectors in space, if do it for something you know, ok. But example …

Constantin: I spent weeks to go from Java library which works in F order to numpy in C order. Hard to wrap your head around.

StephanS: With vectors in julia, how would you store them?

Fabian: Invert metadata.

Alistair: Should Fortran order be default in all implementations including Python?

Constantin: Downstream analysis code might be inefficient where it expects C order.

StephanS: It’s just a spec thing.

Fabian: Reverse metadata.

StephanS: Because all this comes from Python where numpy prefers C order, and a strong relationship to HDF5 which also comes with C order addressing, think it’s a good idea to stick with that.
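The "reverse metadata" approach discussed above can be illustrated with a small sketch: a chunk stored in C order can be read by a column-major (F order) implementation without moving any bytes, provided the dimensions are exposed in reverse. The function names are hypothetical, and this is only an assumption about how such an implementation might handle it.

```python
def c_offset(index, shape):
    """Flat offset of `index` in row-major (C) order: last axis varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def f_offset(index, shape):
    """Flat offset of `index` in column-major (F) order: first axis varies fastest."""
    offset = 0
    for i, n in zip(reversed(index), reversed(shape)):
        offset = offset * n + i
    return offset


def reverse_metadata(meta):
    """Return a copy of the metadata with dimension-ordered fields reversed."""
    out = dict(meta)
    out["shape"] = list(reversed(meta["shape"]))
    out["chunks"] = list(reversed(meta["chunks"]))
    return out


meta = {"shape": [2, 3, 4], "chunks": [1, 3, 4], "order": "C"}
exposed = reverse_metadata(meta)  # what an F-order language would present

# The same stored element, addressed from each side.
c_side = c_offset((1, 2, 3), meta["shape"])
f_side = f_offset((3, 2, 1), exposed["shape"])
```

Both offsets point at the same position in the stored buffer, which is why reversing the exposed dimensions avoids transposing the data, at the cost of users seeing the axes in the opposite order.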

Spec development (Alistair)

System architecture: https://github.com/zarr-developers/zarr-specs/issues/8

Proposed specs file structure: https://github.com/zarr-developers/zarr-specs/issues/11

Proposed spec development process: https://github.com/zarr-developers/zarr-specs/issues/13