2022-09-21

Attending: Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Ahmet Can Solak (AS), Sanket & the Do-a-thon, Martin Durant (MD), Davis Bennett (DB)

TL;DR:

Having a new attendee from the CZI Open Science Summit, we took a deep dive into the best way to capture data directly from microscopes, comparing the pros and cons of Zarr/HDF5/Zip and more. Additionally, we worked through remotely visualizing a Zarr when it’s been created on the cluster in a Jupyter notebook.

Updates:

Sanket at CZI/NumFOCUS Summits
Coming to San Fran next week, lunch!

Open Agenda (add here 👇🏻):

Ahmet: BioHub
- Collaborators interested in Java implementation
  - Need a good implementation
  - ImageJ / BDV (folks at Janelia)
  - V3: collaborators to help read it
  - JMS: explicit opt-in for V3 (need to know a priori)
    - Though auto-detection could be added
    - neuroglancer likely has a stronger case for auto-detection
  - AS: happy tensorstore users. Thanks a lot! :star:
- https://github.com/zarr-developers/zarr-python/issues/1140
  - resize manually? more internal with a skinnier API
  - JMS: assume things within old bounds are old?
  - AS: perhaps request chunks (from last savepoint) more compute heavy
    - keyword argument?
  - MD: “don’t bother writing where there’s no new data”
  - JM: see related https://github.com/zarr-developers/zarr-python/issues/1017
  - JMS/MD: use selection to fill in the new bits
  - AS: append() is only for one axis. This might be for arbitrary axes.
    - perhaps append_chunks()
- use case
  - instruments generating lots of data quickly.
  - don’t want to resize if not necessary. with fewer methods if possible.
  - most efficient way?
  - of course, better to know exact size.
  - MD: just make the size much larger and have missing chunks?
  - AS: only if know when biologists will stop
  - Clarification: doesn’t write the empty chunks
  - MD: do edge chunks need special handling?
    - JMS: no. always write the full chunk.
    - (not in N5, and didn’t implement in tensorstore)
  - DB: wouldn’t suggest having everything in one array
    - 1 array per timepoint (doesn’t work for NGFF)
    - growable arrays
    - or use HDF5 for the acquisition
    - AS: why? faster than zarr-python. but tensorstore? Don’t know.
    - JM: let’s do that benchmark
    - DB: Windows doesn’t like lots of small files
    - MD: could write Zarr into Zip with no compression (basically what HDF5 does)
    - DB: save data in the way that’s most effective for the acquisition
      - Zarr as a great format after that
    - AS: that’s what we were doing previously. but additional time adds up. people want the results faster. was asked to add ZarrWriter in acquisition package. Can then easily transfer to data storage.
      - DB: easier to transfer than HDF5? No, than the raw files. Compression is a benefit.
      - AS: set chunk size bigger rather than using HDF5
      - JM: per camera. but can’t compress chunks.
      - HDF5 compress in parallel but not write in parallel
      - JMS: eventually all use cases of HDF5 but not there yet
        - granularity at which you can read and write
      - AS: re-chunking is faster than converting camera offline
      - AS: with two cameras we don’t try to write to same array with both, but multiple places
  - JM: zip support in tensorstore? JMS: not yet
    - JMS: also thought about LMDB. single file. pretty efficient.
      - zip e.g. doesn’t support deleting.
      - also only has one directory structure
    - MD: HDF also has that problem.
    - DB: re-writing isn’t a problem for acquisition.
    - JMS: do need to checkpoint the zip directory periodically.
    - AS: saving single-array per timepoint, then zip might work quite well.
      - converting to zip zarr saw some worse performance. not sure where.
      - MD: make sure the zip isn’t compressed.
    - JM: need Zip spec
    - DB: would love to hear where this goes
    - MD: inverse problem
      - massive HDF5 files in tar file on S3 for the purpose of multi-file dataset
      - desire to distribute them as individual files
      - 20G tar containing HDF5
      - Kerchunk’s job was to point to these files within the tar
      - or “find all the chunks in all of the files”
      - works nicely!
      - fetches are short but there are many of them.
      - had to download it (for scanning) but don’t want users to have to do that.
      - i.e. if you push for a single file, perhaps you can get the best of both worlds.
      - DB: lambda function? probably. (but this was custom S3)
      - JM: need Java implementation of Kerchunk (for BDV)
      - DB: generate from json-schema
      - AS: with kerchunk can you point to your data centers…
        - MD: each chunk is a key whose value is a URL
        - JM: "chunk-name": (URL, offset, length)
        - JMS: can get the correct endpoint for a chunk
        - add s3 syntax
      - IPFS, mutable hashes, …
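
To make the single-file idea from the discussion above concrete, here is a minimal sketch using only the Python standard library. It writes chunk-like payloads into an uncompressed (stored) zip, then builds a table mapping each key to a (file, offset, length) triple and reads one chunk back by byte range. The function names, the chunk keys, and the `{"version": 1, "refs": {...}}` layout are assumptions for illustration; this is not the API of Kerchunk, zarr-python, or tensorstore.

```python
import struct
import zipfile


def write_stored_zip(path, chunks):
    """Write each chunk as an uncompressed (ZIP_STORED) member of one zip file."""
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as zf:
        for key, payload in chunks.items():
            zf.writestr(key, payload)


def build_references(path):
    """Map each member name to [file, offset, length] of its raw bytes.

    Hypothetical helper: the returned layout only imitates a
    key -> (URL, offset, length) reference table.
    """
    refs = {}
    with zipfile.ZipFile(path) as zf, open(path, "rb") as raw:
        for info in zf.infolist():
            if info.compress_type != zipfile.ZIP_STORED:
                raise ValueError(f"{info.filename} is compressed; ranges would not be raw data")
            # A zip local file header has 30 fixed bytes; the name length and
            # extra-field length sit at byte offsets 26 and 28 of that header.
            raw.seek(info.header_offset + 26)
            name_len, extra_len = struct.unpack("<HH", raw.read(4))
            offset = info.header_offset + 30 + name_len + extra_len
            refs[info.filename] = [str(path), offset, info.compress_size]
    return {"version": 1, "refs": refs}


def read_reference(references, key):
    """Fetch one chunk with a plain byte-range read, without opening the zip."""
    target, offset, length = references["refs"][key]
    with open(target, "rb") as raw:
        raw.seek(offset)
        return raw.read(length)
```

Calling `write_stored_zip("acq.zip", {"t0/0.0": b"..."})` followed by `build_references("acq.zip")` yields a table that could be serialized as JSON. The byte-range read only returns usable chunk data because the zip members are stored without zip-level compression, which matches MD's advice to make sure the zip isn't compressed.
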
DB: interesting workflow. any help?
- couldn’t get napari on cluster over VDI
- transforming images and saving them as zarr.
- starting static server and pointing neuroglancer at it.
- would prefer to do things programmatically in neuroglancer and it spits out a URL
- also convenient to have static file server as background process from main python (notebook)
- JMS: definitely convenient and it’s “just a web server”
- DB: don’t save that to disk? dask arrays in memory?
- JMS: neuroglancer-py does have a way to share numpy array or tensorstore object
  - Socket based? Internally starting a web server.
  - DB: and if it gets updated? does it block? No, background thread
  - There is a method to invalidate the cache.
  - Python API for making URLs? Yes.
  - Could be attractive to people (Janelia) for when computing on the cluster
- JM: See also Wei’s imjoy-rpc for the usability
- JMS: works as iframe in jupyter now (DB: desirable)
- JMS: possibly using jupyter protocols would work around firewall
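
The "static file server as a background process from the notebook" idea above can be sketched with the Python standard library alone. The handler class and `serve_in_background` are hypothetical names, and the permissive CORS header is an assumption about what a browser-based viewer on another origin would need; this is not neuroglancer's own Python server.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that allows cross-origin reads (assumed requirement)."""

    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()

    def log_message(self, format, *args):
        pass  # keep notebook output quiet


def serve_in_background(directory, host="127.0.0.1", port=0):
    """Serve `directory` from a daemon thread; port=0 lets the OS pick a free port."""
    handler = functools.partial(CORSRequestHandler, directory=str(directory))
    server = ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, f"http://{host}:{server.server_port}"
```

The call returns immediately, so the notebook stays usable while files are served. The returned base URL can be combined with the array's path, for example as a `zarr://http://...` source in neuroglancer. Call `server.shutdown()` and `server.server_close()` when done. On a cluster, the host and port would still need to be reachable from the browser, which is the firewall issue raised above.
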