13th March, 2019
Date and time: 20:00 GMT (16:00 EDT)
Joining instructions: https://zoom.us/j/300670033 (see Full Zoom instructions)
Intend to attend: Stephan S, Josh M, Alistair, MartinD, TonyT (please add yourself here)
Please feel free to propose an item for the agenda below. If you do, please include your name so we know who proposed the item.
Standing items
- Introductions: new joinees should fill welcome to explain why they are interested in joining.
Attending: Stephan S, Josh M, Alistair, MartinD, TonyT
Misc introduction points
- Josh: https://github.com/joshmoore/star5 - support for writing multifile HDF5
Josh: Does it make sense to consider using multiple files as chunks. Comes from a partitioning project where multitple HDF5 datasets are used for chunking to circumvent parallel write limitation.
Alistair: If it’s purely for parallel writing then there is may be no point.
Martin: Access chunks from HDF5 that are linked from outside, e.g. Zarr.
Alistair: Did not implement external links nor region references because tricky and no use case. File-systems could be used to do external links (symlinks). Zarr has no definite root
Stephan: n5 doesn’t have a hard root, however file format version is only stored in root.
Josh: Users need to be able to tell what a unique dataset is. Users do not like directories.
Stephan: maybe having roots is a good idea.
Stephan: What are ragged arrays good for.
Alistair: Genome analysis with vectors of arbitrary length being elements of a 2D-array
https://github.com/libdynd/libdynd
https://datashape.readthedocs.io/en/latest/
https://ndtypes.readthedocs.io/en/latest/
https://xtensor.readthedocs.io/en/latest/
Josh: multi-label sets, can you use masks? Store masks?
Stephan: more interesting, sparse datasets, only specific coordinates filled with values, so can iterate over values or
Alistair: adding “memory layouts” to support the multisets from last time. At the moment, there are just C & F layouts. Could also be extensible. Applied before encoding chain. Stephan: was wording the same thing. Moving beyond “chunk is a piece of memory”. For example, a C point-of-view might want to apply a struct. Could this be the basis for the entire format, and the standard primitive types are then special cases.
Stephan: keys to find chunks, keys to find metadata. Then need to be able to load chunks, (decode), end up with contiguous block of memory. In general case, unknown length of decoded chunk. In standard case, where we know number of elements, and know primitive type, is a special case. In most general case, chunk would need to be able to say how long it is. Not sure what the best way. In n5 I use a header.
Alistair: IF you know the memory layout, you do not need to store header information.
Stephan: ability to store no memory layout? Just bytes. Gives someone this API but without needing to specify it. That’s the fundamental level of the spec. Have that in N5/Java with templated objects. Implementations must be able to transfer blocks into a ByteBuffer. That brings a worry about the variable-length arrays. Alistair: codecs commonly add headers. Stephan: but currently memory layout is in the dataset metadata? Yes. dtype, shape, layout, order.
Josh: Anything that I need to know as a user when I open a file is problematic. What is discoverable?
Alistair: Both Zarr and N5 need array metadata to decode a chunk. N5 chunks have some metadata in the header but it is not sufficient to decode the chunk.
Memory layout is not an operation but how to interpret it. Filter chain tells what to do with the memory content to generate the bytes stored in the chunk (and the inverse).
Filters and compressors are basically the same, can be all empty.
Josh: Can end-chunking issues and other things be resolved by filters?
Alistair: No.
Interpretation |
Codec | Metadata
Storage |
Identify primitives for core spec
(Agenda item proposed by Stephan S.)
We discussed this to some extent last time, but haven’t come to a final word. Byte order seems to be the only agreed upon core concept ;).
Drill into more use-cases. I do not fully understand the use-case for ragged arrays and where they are actually ragged. When we cannot come up with more use-cases, we probably have it covered.
Majority agrees that a layered spec is a good idea, higher layers are built from primitives defined in lower layers. So what is the real core (take Zarr and N5 as examples but both stem from mindsets deeply rooted in their favorite programming language and therefore make some assumptions about ‘core’ primitives (uint8, int16, etc.). A C/C++ programmer would have started with bytes that can be loaded into memory and used as a pointer of any type. May be the core should be bytes? Compression works on bytes anyways. Extended types can then tell how many bytes they consume in which order (varlength?) and how many of them are expected in a chunk. The current Zarr and N5 specs would then be a 2nd layer spec on top of this?
Also: Super important to keep it as simple as possible (or it loses its appeal)!
…