2022-09-21
Attending: Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Ahmet Can Solak (AS), Sanket & the Do-a-thon, Martin Durant (MD), Davis Bennett (DB)
TL;DR:
With a new attendee joining from the CZI Open Science Summit, we took a deep dive into the best way to capture data directly from microscopes, comparing the pros and cons of Zarr, HDF5, Zip, and more. Additionally, we worked through remotely visualizing a Zarr that was created on a cluster, driven from a Jupyter notebook.
Updates:
- Sanket at CZI/NumFOCUS Summits
- Coming to San Fran next week, lunch!
Open Agenda (add here 👇🏻):
- Ahmet: BioHub
- Collaborators interested in Java implementation
- Need a good implementation
- ImageJ / BDV (folks at Janelia)
- V3: collaborators to help read it
- JMS: explicit opt-in for V3 (need to know a priori)
- Though auto-detection could be added
- neuroglancer likely has a stronger case for auto-detection
- AS: happy tensorstore users. Thanks a lot! :star:
- https://github.com/zarr-developers/zarr-python/issues/1140
- resize manually? more internal with a skinnier API
- JMS: assume things within old bounds are old?
- AS: perhaps request chunks (from last savepoint) more compute heavy
- keyword argument?
- MD: “don’t bother writing where there’s no new data”
- JM: see related https://github.com/zarr-developers/zarr-python/issues/1017
- JMS/MD: use selection to fill in the new bits (sketched below)
- AS: append() is only for one axis. This might be for arbitrary axes.
- perhaps append_chunks()
- use case
- instruments generating lots of data quickly.
- don’t want to resize if not necessary, and with as few method calls as possible.
- most efficient way?
- of course, better to know exact size.
- MD: just make the size much larger and accept missing chunks?
- AS: only if we know when the biologists will stop
- Clarification: doesn’t write the empty chunks
- MD: do edge chunks need special handling?
- JMS: no. always write the full chunk.
- (not in N5, and didn’t implement in tensorstore)
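A minimal sketch of the resize-then-fill pattern discussed above, assuming zarr-python v2; the append_timepoint helper is hypothetical, and only the newly acquired region is written, so existing chunks stay untouched:

```python
import numpy as np
import zarr

# Open (or create) the acquisition array with an extensible time axis.
arr = zarr.open("acquisition.zarr", mode="a",
                shape=(0, 2048, 2048), chunks=(1, 512, 512), dtype="u2")

def append_timepoint(arr, frame):
    """Hypothetical helper: grow the time axis by one and fill only the new slice."""
    t = arr.shape[0]
    arr.resize(t + 1, *arr.shape[1:])  # resize only rewrites array metadata
    arr[t] = frame                     # selection write: only the new chunks are created
    return t

frame = np.zeros((2048, 2048), dtype="u2")  # stand-in for a camera frame
append_timepoint(arr, frame)
```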
- DB: wouldn’t suggest having everything in one array
- 1 array per timepoint (doesn’t work for NGFF)
- growable arrays
- or use HDF5 for the acquisition
- AS: why? HDF5 is faster than zarr-python, but faster than tensorstore? Don’t know.
- JM: let’s do that benchmark (rough sketch below)
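A rough sketch of that benchmark, assuming local filesystem writes and frame-at-a-time appends; file names, shape, and chunk sizes are placeholders, and compression settings should be matched across libraries for a fair comparison:

```python
import time
import numpy as np
import zarr, h5py
import tensorstore as ts

frame = np.random.randint(0, 2**16, size=(2048, 2048), dtype="u2")
n_frames, chunks = 50, (1, 512, 512)

def bench(label, write_one):
    """Time frame-at-a-time writes and report throughput."""
    t0 = time.perf_counter()
    for i in range(n_frames):
        write_one(i)
    mb_s = n_frames * frame.nbytes / (time.perf_counter() - t0) / 1e6
    print(f"{label}: {mb_s:.0f} MB/s")

# zarr-python (default Blosc compressor)
z = zarr.open("bench.zarr", mode="w", shape=(n_frames, *frame.shape),
              chunks=chunks, dtype="u2")
bench("zarr-python", lambda i: z.__setitem__(i, frame))

# h5py (uncompressed by default)
f = h5py.File("bench.h5", "w")
d = f.create_dataset("data", shape=(n_frames, *frame.shape), chunks=chunks, dtype="u2")
bench("h5py", lambda i: d.__setitem__(i, frame))
f.close()

# tensorstore, zarr driver
t = ts.open({"driver": "zarr", "kvstore": {"driver": "file", "path": "bench_ts.zarr"}},
            create=True, dtype=ts.uint16, shape=(n_frames, *frame.shape),
            chunk_layout=ts.ChunkLayout(chunk_shape=chunks)).result()
bench("tensorstore", lambda i: t[i].write(frame).result())
```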
- DB: Windows doesn’t like lots of small files
- MD: could write Zarr into Zip with no compression (basically what HDF5 does)
- DB: save data in the way that’s most effective for the acquisition
- Zarr as a great format after that
- AS: that’s what we were doing previously, but the additional time adds up and people want the results faster. Was asked to add a ZarrWriter to the acquisition package; can then easily transfer to data storage.
- DB: easier to transfer than HDF5? No, than the raw files. Compression is a benefit.
- AS: set chunk size bigger rather than using HDF5
- JM: per camera. but can’t compress chunks.
- HDF5 can compress in parallel but not write in parallel
- JMS: eventually all use cases of HDF5 but not there yet
- granularity at which you can read and write
- AS: re-chunking is faster than converting camera offline
- AS: with two cameras we don’t try to write to the same array with both, but to multiple places
- JM: zip support in tensorstore? JMS: not yet
- JMS: also thought about LMDB. single file. pretty efficient.
- zip e.g. doesn’t support deleting.
- also only has one directory structure
- MD: HDF also has that problem.
- DB: re-writing isn’t a problem for acquisition.
- JMS: do need to checkpoint the zip directory periodically.
- AS: saving single-array per timepoint, then zip might work quite well.
- converting to a zipped Zarr saw somewhat worse performance; not sure where.
- MD: make sure the zip isn’t compressed (see the ZipStore sketch below).
- JM: need Zip spec
- DB: would love to hear where this goes
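A sketch of MD’s uncompressed-zip suggestion, assuming zarr-python v2’s ZipStore: the zip container itself is stored (not deflated), while any chunk-level compression is left to the Zarr codec; paths and shapes are placeholders.

```python
import zipfile
import numpy as np
import zarr
from numcodecs import Blosc

# Write a Zarr hierarchy into a single zip file to avoid many small files.
# ZIP_STORED keeps the container itself uncompressed, as MD suggested.
store = zarr.ZipStore("acquisition.zarr.zip", mode="w", compression=zipfile.ZIP_STORED)
arr = zarr.open(store, mode="w", shape=(100, 2048, 2048), chunks=(1, 512, 512),
                dtype="u2", compressor=Blosc(cname="zstd", clevel=3))
arr[0] = np.zeros((2048, 2048), dtype="u2")  # stand-in frame
store.close()  # finalizes the zip central directory

# Reading back works the same way (note: zip entries cannot be deleted or rewritten).
arr = zarr.open(zarr.ZipStore("acquisition.zarr.zip", mode="r"), mode="r")
```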
- MD: inverse problem
- massive HDF5 files inside a tar file on S3, forming a multi-file dataset
- desire to distribute them as individual files
- 20G tar containing HDF5
- Kerchunk’s job was to point to these files within the tar
- or “find all the chunks in all of the files”
- works nicely!
- fetches are short but there are many of them.
- had to download it (for scanning) but don’t want users to have to do that.
- i.e. if you push for a single file, perhaps you can get the best of both worlds.
- DB: lambda function? probably. (but this was custom S3)
- JM: need Java implementation of Kerchunk (for BDV)
- DB: generate from json-schema
- AS: with kerchunk can you point to your data centers…
- MD: each chunk is a key but is a URL
- JM: "chunk-name" → (URL, offset, length)
- JMS: can get the correct endpoint for a chunk
- add s3 syntax
- IPFS, mutable hashes, …
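A hand-written sketch of the kerchunk-style reference mapping described above, assuming fsspec’s "reference" filesystem; the bucket, offsets, and lengths are made up, and in practice kerchunk generates them by scanning the container files.

```python
import fsspec
import zarr

# Each Zarr chunk key points at (URL, byte offset, length) inside the big
# container file on S3; metadata keys are stored inline as JSON strings.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "data/.zarray": (
        '{"shape": [64, 128], "chunks": [64, 64], "dtype": "<u2", '
        '"compressor": null, "filters": null, "fill_value": 0, '
        '"order": "C", "zarr_format": 2}'
    ),
    "data/0.0": ["s3://example-bucket/archive.tar", 512, 8192],
    "data/0.1": ["s3://example-bucket/archive.tar", 8704, 8192],
}

fs = fsspec.filesystem("reference", fo=refs,
                       remote_protocol="s3", remote_options={"anon": True})
group = zarr.open(fs.get_mapper(""), mode="r")
print(group["data"][:64, :64])  # fetches only the referenced byte range
```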
- DB: interesting workflow. any help?
- couldn’t get napari on cluster over VDI
- transforming images and saving them as zarr.
- starting static server and pointing neuroglancer at it.
- would prefer to do things programmatically in neuroglancer and it spits out a URL
- also convenient to have static file server as background process from main python (notebook)
- JMS: definitely convenient and it’s “just a web server”
- DB: don’t save that to disk? dask arrays in memory?
- JMS: neuroglancer-py does have a way to share numpy array or tensorstore object
- Socket based? Internally starting a web server.
- DB: and if it gets updated? does it block? No, background thread
- There is a method to invalidate the cache.
- Python API for making URLs? Yes.
- Could be attractive to people (Janelia) for when computing on the cluster
- JM: See also Wei’s imjoy-rpc for the usability
- JMS: works as iframe in jupyter now (DB: desirable)
- JMS: possibly using jupyter protocols would work around firewall
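A sketch of the in-memory sharing JMS described, assuming the neuroglancer Python package and that the notebook/cluster host is reachable from the browser; dimension names, units, and scales are placeholders.

```python
import numpy as np
import neuroglancer

# Let neuroglancer's built-in web server serve an in-memory array
# (a dask- or tensorstore-backed array can be shared the same way).
neuroglancer.set_server_bind_address("0.0.0.0")  # assumption: node visible to the browser
viewer = neuroglancer.Viewer()

data = np.random.randint(0, 2**16, size=(64, 256, 256), dtype=np.uint16)

with viewer.txn() as s:
    s.layers["raw"] = neuroglancer.ImageLayer(
        source=neuroglancer.LocalVolume(
            data=data,
            dimensions=neuroglancer.CoordinateSpace(
                names=["z", "y", "x"], units="nm", scales=[40, 4, 4]),
        )
    )

print(viewer)  # the shareable URL produced by the background server
```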