2022-09-21

Attending: Josh Moore (JM), Jeremy Maitin-Shepherd (JMS), Ahmet Can Solak (AS), Sanket & the Do-a-thon, Martin Durant (MD), Davis Bennett (DB)

TL;DR:

With a new attendee joining from the CZI Open Science Summit, we took a deep dive into the best way to capture data directly from microscopes, comparing the pros and cons of Zarr, HDF5, Zip, and more. Additionally, we worked through how to remotely visualize a Zarr from a Jupyter notebook when it has been created on a cluster.

Updates:

  • Sanket at CZI/NumFOCUS Summits
  • Coming to San Fran next week, lunch!

Open Agenda (add here 👇🏻):

  • Ahmet: BioHub
    • Collaborators interested in Java implementation
      • Need a good implementation
      • ImageJ / BDV (folks at Janelia)
      • V3: collaborators to help read it
      • JMS: explicit opt-in for V3 (need to know a priori)
        • Though auto-detection could be added
        • neuroglancer likely has a stronger case for auto-detection
      • AS: happy tensorstore users. Thanks a lot! :star:
    • https://github.com/zarr-developers/zarr-python/issues/1140
      • resize manually? or handle it internally with a skinnier API?
      • JMS: assume things within old bounds are old?
      • AS: perhaps request chunks (from the last savepoint), though that’s more compute heavy
        • keyword argument?
      • MD: “don’t bother writing where there’s no new data”
      • JM: see related https://github.com/zarr-developers/zarr-python/issues/1017
      • JMS/MD: use selection to fill in the new bits
      • AS: append() is only for one axis. This might be for arbitrary axes.
        • perhaps append_chunks() (rough sketch below)
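      • A rough sketch of what a multi-axis append could look like with today’s
        zarr-python API; the append_chunks() name is hypothetical and is emulated
        here with resize() plus a selection write, per the JMS/MD suggestion above
        (shapes, chunk sizes, and the example.zarr path are made up):

            import numpy as np
            import zarr

            # Existing array: 2 timepoints x 4 channels x 512 x 512
            z = zarr.open(
                "example.zarr", mode="a",
                shape=(2, 4, 512, 512), chunks=(1, 1, 256, 256), dtype="u2",
            )

            def append_chunks(arr, block, axis):
                """Hypothetical helper: grow arr along axis, write only the new region."""
                old = arr.shape[axis]
                new_shape = list(arr.shape)
                new_shape[axis] = old + block.shape[axis]
                arr.resize(new_shape)            # cheap: only metadata changes
                sel = [slice(None)] * arr.ndim
                sel[axis] = slice(old, new_shape[axis])
                arr[tuple(sel)] = block          # fill in just the new chunks

            # Append one new timepoint along axis 0 (any axis works)
            append_chunks(z, np.zeros((1, 4, 512, 512), dtype="u2"), axis=0)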
    • use case
      • instruments generating lots of data quickly.
      • don’t want to resize if not necessary. with fewer methods if possible.
      • most efficient way?
      • of course, better to know exact size.
      • MD: just make the size much larger and leave missing chunks? (sketch below)
      • AS: only if we know when the biologists will stop
      • Clarification: doesn’t write the empty chunks
      • MD: do edge chunks need special handling?
        • JMS: no. always write the full chunk.
        • (not in N5, and didn’t implement in tensorstore)
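      • For the “just make it much larger” approach, a minimal sketch (assumes
        zarr-python >= 2.11 for the write_empty_chunks flag; sizes are made up).
        Chunks that are never touched are simply absent from the store, so
        oversizing the time axis costs nothing on disk:

            import zarr

            # Pre-allocate far more timepoints than the run will ever produce.
            z = zarr.open(
                "acq.zarr", mode="w",
                shape=(100_000, 4, 2048, 2048), chunks=(1, 1, 512, 512),
                dtype="u2", fill_value=0, write_empty_chunks=False,
            )

            # Write only what was actually acquired; untouched chunks never hit disk.
            z[0] = 1
            z[1] = 2

            print(z.nchunks_initialized, "of", z.nchunks, "chunks written")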
      • DB: wouldn’t suggest having everything in one array
        • 1 array per timepoint (doesn’t work for NGFF)
        • growable arrays
        • or use HDF5 for the acquisition
        • AS: why? HDF5 is faster than zarr-python. But compared to tensorstore? Don’t know.
        • JM: let’s do that benchmark (rough sketch below)
        • DB: Windows doesn’t like lots of small files
        • MD: could write Zarr into Zip with no compression (basically what HDF5 does)
        • DB: save data in the way that’s most effective for the acquisition
          • Zarr as a great format after that
        • AS: that’s what we were doing previously. but additional time adds up. people want the results faster. was asked to add ZarrWriter in the acquisition package. Can then easily transfer to data storage.
          • DB: easier to transfer than HDF5? No, than the raw files. Compression is a benefit.
          • AS: set chunk size bigger rather than using HDF5
          • JM: per camera. but can’t compress chunks.
          • HDF5 can compress in parallel but not write in parallel
          • JMS: eventually all use cases of HDF5 but not there yet
            • granularity at which you can read and write
          • AS: re-chunking is faster than converting the camera data offline
          • AS: with two cameras we don’t try to write to the same array with both, but to multiple places
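        • Rough sketch of the zarr-python vs tensorstore write benchmark mentioned
          above (local filesystem; frame size, chunking, and paths are made up, and
          real acquisition rates will differ):

              import time
              import numpy as np
              import zarr
              import tensorstore as ts

              frame = np.random.randint(0, 2**12, size=(16, 1024, 1024), dtype="u2")

              # zarr-python write
              z = zarr.open("bench_zp.zarr", mode="w", shape=frame.shape,
                            chunks=(1, 512, 512), dtype="u2")
              t0 = time.perf_counter()
              z[:] = frame
              t_zarr = time.perf_counter() - t0

              # tensorstore writing the same data to a Zarr store
              spec = {
                  "driver": "zarr",
                  "kvstore": {"driver": "file", "path": "bench_ts.zarr"},
                  "metadata": {"shape": list(frame.shape), "chunks": [1, 512, 512],
                               "dtype": "<u2"},
                  "create": True, "delete_existing": True,
              }
              tstore = ts.open(spec).result()
              t0 = time.perf_counter()
              tstore[...].write(frame).result()
              t_ts = time.perf_counter() - t0

              print(f"zarr-python: {t_zarr:.3f}s  tensorstore: {t_ts:.3f}s")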
      • JM: zip support in tensorstore? JMS: not yet
        • JMS: also thought about LMDB. single file. pretty efficient.
          • zip e.g. doesn’t support deleting.
          • also only has one directory structure
        • MD: HDF also has that problem.
        • DB: re-writing isn’t a problem for acquisition.
        • JMS: do need to checkpoint the zip directory periodically.
        • AS: saving single-array per timepoint, then zip might work quite well.
          • converting to a zipped Zarr saw somewhat worse performance; not sure where.
          • MD: make sure the zip isn’t compressed. (sketch below)
        • JM: need Zip spec
        • DB: would love to hear where this goes
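        • A minimal sketch of the “Zarr inside an uncompressed Zip” idea (file and
          dataset names are made up; note that a zip entry cannot be rewritten or
          deleted, which is fine for write-once acquisition):

              import zipfile
              import numpy as np
              import zarr

              # Zip layer itself uncompressed (ZIP_STORED), so the already
              # Blosc-compressed chunks are not re-compressed and reads stay cheap.
              store = zarr.ZipStore("acquisition.zarr.zip", mode="w",
                                    compression=zipfile.ZIP_STORED)
              root = zarr.group(store=store)
              t0 = root.create_dataset("t0", shape=(4, 1024, 1024),
                                       chunks=(1, 256, 256), dtype="u2")
              t0[:] = np.random.randint(0, 2**12, size=t0.shape, dtype="u2")
              store.close()   # flushes the zip central directory

              # read back
              z = zarr.open(zarr.ZipStore("acquisition.zarr.zip", mode="r"))
              print(z["t0"].shape)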
        • MD: inverse problem
          • massive HDF5 files in a tar file on S3, forming a multi-file dataset
          • desire to distribute them as individual files
          • 20G tar containing HDF5
          • Kerchunk’s job was to point to these files within the tar
          • or “find all the chunks in all of the files”
          • works nicely!
          • fetches are short but there are many of them.
          • had to download it (for scanning) but don’t want users to have to do that.
          • i.e. if you push for a single file, perhaps you can get the best of both worlds.
          • DB: lambda function? probably. (but this was custom S3)
          • JM: need Java implementation of Kerchunk (for BDV)
          • DB: generate from json-schema
          • AS: with kerchunk can you point to your data centers…
            • MD: each chunk is a key whose value is a URL
            • JM: "chunk-name": (URL, offset, length) (sketch below)
            • JMS: can get the correct endpoint for a chunk
              • add s3 syntax
          • IPFS, mutable hashes, …
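          • For reference, the Kerchunk mapping discussed above is just JSON from
            chunk key to (URL, offset, length); a hand-written sketch (bucket,
            offsets, and array metadata are made up) that fsspec’s reference
            filesystem can expose to zarr:

                import json
                import fsspec
                import zarr

                refs = {
                    "version": 1,
                    "refs": {
                        ".zgroup": json.dumps({"zarr_format": 2}),
                        "data/.zarray": json.dumps({
                            "zarr_format": 2, "shape": [512, 1024],
                            "chunks": [512, 512], "dtype": "<u2",
                            "compressor": None, "fill_value": 0,
                            "order": "C", "filters": None,
                        }),
                        # "chunk-name": [URL, offset, length] into the big tar on S3
                        "data/0.0": ["s3://example-bucket/archive.tar", 1536, 524288],
                        "data/0.1": ["s3://example-bucket/archive.tar", 525824, 524288],
                    },
                }

                fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3",
                                       remote_options={"anon": True})
                z = zarr.open(fs.get_mapper(""), mode="r")
                print(z["data"].shape)   # chunk byte ranges are fetched lazily on read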
  • DB: interesting workflow. any help?
    • couldn’t get napari on cluster over VDI
    • transforming images and saving them as zarr.
    • starting a static server and pointing neuroglancer at it.
    • would prefer to do things programmatically in neuroglancer and have it spit out a URL
    • also convenient to have the static file server running as a background process from the main Python process (notebook)
    • JMS: definitely convenient and it’s “just a web server”
    • DB: don’t save that to disk? dask arrays in memory?
    • JMS: neuroglancer-py does have a way to share a numpy array or tensorstore object (sketch below)
      • Socket based? Internally it starts a web server.
      • DB: and if it gets updated? does it block? No, background thread
      • There is a method to invalidate the cache.
      • Python API for making URLs? Yes.
      • Could be attractive to people (Janelia) for when computing on the cluster
    • JM: See also Wei’s imjoy-rpc for the usability
    • JMS: works as an iframe in Jupyter now (DB: desirable)
    • JMS: possibly using Jupyter protocols would work around firewall issues
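    • A minimal sketch of the neuroglancer-python route JMS describes (serving an
      in-memory array from the cluster and printing a URL to open locally; array
      contents, dimension names, and scales are made up, and the port must be
      reachable, e.g. via an SSH tunnel):

          import numpy as np
          import neuroglancer

          # Bind to a cluster-visible address so the URL works from a local browser.
          neuroglancer.set_server_bind_address("0.0.0.0")

          data = np.random.randint(0, 255, size=(64, 512, 512), dtype=np.uint8)
          dims = neuroglancer.CoordinateSpace(names=["z", "y", "x"], units="nm",
                                              scales=[40, 8, 8])
          volume = neuroglancer.LocalVolume(data=data, dimensions=dims)

          viewer = neuroglancer.Viewer()
          with viewer.txn() as s:
              s.layers["raw"] = neuroglancer.ImageLayer(source=volume)

          # Served by a background web server; the URL can be opened in a browser
          # or embedded as an iframe in Jupyter.
          print(viewer.get_viewer_url())

          # After mutating `data` in place, volume.invalidate() (the cache-invalidation
          # method mentioned above) asks the viewer to re-fetch chunks.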