
Saving Event Based Data #216

Open
CSSFrancis opened this issue Jan 26, 2024 · 10 comments

@CSSFrancis
Member

Describe the functionality you would like to see.

This is something that I've talked about offline a couple of times, but it would be good to have a more serious discussion on this and get ahead of it before it is too late :). It would be nice to develop a standard for saving event-based data. This is usually saved in something like 4 columns: x, y, intensity, time, but there are potentially other ways that this can be saved.

For example, you could have a counted dataset where you have a grouping of [x, y] positions and some associated time, etc. Note that this would be equivalent to the ragged-array storage that we currently have implemented.

If you want to create an image or a spectrum then you would just integrate the events over some time period.

Describe the context

This type of data doesn't conform very well to formats like HDF5 or Zarr. We could segment it into something like 1-second chunks or groups of 1,000,000 events (64 bits × 4 columns × 1,000,000 events → ~32 MB), which would make the recreation much faster. For compression's sake you probably want some sort of chunk ID which records the start time and end time of each chunk. This is how Parquet handles larger column-based arrays.

As far as recreation goes, this is a pretty fun lazy-computation problem. Each integrated image can be created lazily after you slice in time, and you can reslice without penalty because the dense dataset is never materialized until needed.
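As a rough sketch of the lazy-integration idea (function and column names are illustrative, not a proposed API), integrating one time slice of a sorted event stream into a dense frame could look like:

```python
import numpy as np

def integrate_events(t, x, y, t_start, t_stop, frame_shape):
    """Integrate all events with t_start <= t < t_stop into a dense frame.

    Because the stream is sorted by time, the window is located with
    binary search rather than a full scan.
    """
    lo = np.searchsorted(t, t_start, side="left")
    hi = np.searchsorted(t, t_stop, side="left")
    frame = np.zeros(frame_shape, dtype=np.uint32)
    # np.add.at accumulates repeated hits on the same pixel
    np.add.at(frame, (y[lo:hi], x[lo:hi]), 1)
    return frame

# hypothetical stream: 6 events on a 4x4 detector
t = np.array([0.1, 0.2, 0.5, 0.9, 1.1, 1.4])
x = np.array([0, 1, 1, 3, 2, 2])
y = np.array([0, 0, 0, 2, 1, 1])
frame = integrate_events(t, x, y, 0.0, 1.0, (4, 4))  # first 1-second "frame"
```

Wrapping calls like this in dask delayed tasks would give the reslice-without-penalty behavior described above.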

Additional information

A couple of key points:

  1. The data is sorted by time, which means that searching the data should be O(N). (For lots of points this is probably too slow, and we want to pre-index the data somehow; see above for saving a time index or a constant-size index.)
  2. This problem won't be embarrassingly parallel (unless we dictate something like a frame rate for the events), but it should be close, with each chunk only having to be read at most 2 times.

I feel like this might make more sense in rosettasciio than in other places (such as hyperspy etc.), as it would be good to handle the integration (or lack of integration) at load time rather than later. There is also the question of saving and operating on the data: do we keep it as event-based data, or is the dataset converted to an integrated dataset at some point during the analysis?

Actionable Items:

  1. Is HDF5 or Zarr sufficiently capable of storing this type of data? (What other options are there?)
  2. Is prioritizing compression important for this kind of data? (What size of data can we expect? Are there substantial speed benefits from compression?)
  3. How should integration be handled? (Using something like scipy.sparse, converting to a dask array, etc.?)
  4. How should saving be handled? Should we only load files, or save files as well?

@ericpre @francisco-dlp @sk1p @uellue @cophus @bbammes

@ericpre
Member

ericpre commented Jan 28, 2024

To give more context on handling sparse data in Python, there are two main libraries that can be used: pydata sparse and scipy.sparse.

In rosettasciio, there is precedent for reading event-based data in the Velox EDS and JEOL EDS readers. For the Velox EDS data, the stream of events is converted to a COO sparse array (pydata-sparse). As there is no support for sparse arrays in hyperspy, we use the pragmatic approach of converting the sparse array to a dense array for processing, and this can be done lazily - there are some caveats for the Velox EDS case.
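For illustration, building a COO array from an event stream and densifying it might look like the following (using scipy.sparse here for concreteness; pydata-sparse has an equivalent COO class, and the column names are hypothetical):

```python
import numpy as np
from scipy import sparse

# hypothetical event columns: flattened navigation index and energy channel
nav = np.array([0, 0, 1, 2, 2, 2])
chan = np.array([5, 5, 3, 7, 1, 7])
counts = np.ones(len(nav), dtype=np.uint32)

# COO is a natural fit for an event stream: one (row, col, value) triple per event
coo = sparse.coo_matrix((counts, (nav, chan)), shape=(3, 8))
# duplicate events on the same coordinate are summed during densification
spectra = coo.toarray()
```

The duplicate-summing behavior is what turns a raw event list into an integrated spectrum image.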

My general impressions are:

  • quite a lot could be done using pydata sparse, and it will mature significantly in the coming months/years
  • reading data shouldn't be a problem, and rosettasciio will be a good place for that
  • saving sparse arrays is less clear: there is some precedent, but it would be good to investigate how mature it is and how well it integrates with the Python scientific ecosystem. Zarr is most likely the most advanced on this topic. A reference thread on sparse array support in zarr: Adding sparse array support zarr-developers/zarr-specs#245

A pragmatic approach would be to have some reader/tooling in rosettasciio that can return a sparse array (starting with the COO format?). This would allow people to start using and experimenting with it.

Tagging some more people, who may be interested and/or have experience on this topic: @kociak, @matkraj.

@uellue

uellue commented Jan 29, 2024

Hi, in LiberTEM since 0.11 we have support for a range of sparse and dense array backends: https://libertem.github.io/LiberTEM/udf/advanced.html#sparse

Under the hood, https://github.com/LiberTEM/sparseconverter/ is taking care of detecting backends and converting data between backends.

For 4D STEM, and likely also spectroscopic data, the CSR format is much better than COO since it allows O(1) slicing of the navigation dimension, while canonical (sorted) COO is O(log(n)) and unsorted COO requires a full scan. In LiberTEM we've implemented a simple raw format with a TOML descriptor and three raw binary files for CSR. You can already create lazy signals from LiberTEM datasets with both sparse and dense array chunks: https://github.com/LiberTEM/LiberTEM/blob/master/examples/sparse_udf.ipynb, Section Dask and HyperSpy integration. Since parallelization and chunking depend on slicing, I would strongly recommend storing data as CSR and not COO.
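A small illustration of why CSR slices cheaply in the navigation dimension: the indptr array bounds each frame's events directly, so no search or scan is needed (toy data, not the LiberTEM layout itself):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# toy stack: 16 probe positions (flat nav) x 64 detector pixels (flat sig)
frames = (rng.random((16, 64)) < 0.05).astype(np.uint8)
csr = sparse.csr_matrix(frames)

# the events of frame i sit in the contiguous range indptr[i]:indptr[i+1],
# so navigation slicing is O(1) pointer arithmetic
i = 7
start, stop = csr.indptr[i], csr.indptr[i + 1]
frame_pixels = csr.indices[start:stop]  # which detector pixels were hit
frame_counts = csr.data[start:stop]
```

Slicing in the signal dimension, by contrast, would require inspecting every row, which is the asymmetry described above.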

By using LiberTEM as a file reader (https://libertem.github.io/LiberTEM/dask.html#load-datasets-as-dask-arrays) you can already work with event-based data in HyperSpy: it is just converted to dense on the fly, since the analysis methods in HyperSpy don't generally work with sparse data. Ensuring that different algorithms receive data in the formats they support, while at the same time working with all file formats, was one of the trickier parts of sparse support in LiberTEM. We introduced a signaling system where processing routines and datasets indicate which array backends they support natively. From that, an execution plan is derived with the necessary conversions. With that we can gradually roll out native sparse support for datasets, detectors and algorithms (UDFs).

If you are interested in sparse data for HyperSpy I can only invite you to make use of that infrastructure! As you may know, we added pervasive Dask array support in LiberTEM for such integrations: https://libertem.github.io/LiberTEM/dask.html

@uellue

uellue commented Jan 29, 2024

...I should add, the Amsterdam Scientific Instruments CheeTah T3 camera can already save data in the CSR file format that LiberTEM supports, as a result of our cooperation. As you can see, the format is really simple and based on an established standard, so it is straightforward to open it with all kinds of software. You can create a scipy.sparse.csr_matrix pretty much directly from the files with memory mapping. sparse.GCXS doesn't work because ASI doesn't save in canonical form (sorted row index), which GCXS doesn't like.
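A sketch of that idea: the three raw binary files can be memory-mapped and wrapped in a scipy.sparse.csr_matrix. File names and dtypes here are placeholders; in the actual format they would come from the TOML descriptor:

```python
import numpy as np
from scipy import sparse

def load_raw_csr(indptr_path, indices_path, data_path, sig_size,
                 indptr_dtype=np.int64, indices_dtype=np.int32,
                 data_dtype=np.uint8):
    """Memory-map three raw CSR arrays and wrap them in a csr_matrix.

    Paths and dtypes are illustrative placeholders; scipy may cast or
    copy the index arrays internally when building the matrix.
    """
    indptr = np.memmap(indptr_path, dtype=indptr_dtype, mode="r")
    indices = np.memmap(indices_path, dtype=indices_dtype, mode="r")
    data = np.memmap(data_path, dtype=data_dtype, mode="r")
    return sparse.csr_matrix((data, indices, indptr),
                             shape=(len(indptr) - 1, sig_size))
```

The lack of any container format is what makes this kind of zero-copy access possible.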

@CSSFrancis
Member Author

CSSFrancis commented Jan 29, 2024

@uellue Thanks for the detailed information! It seems like LiberTEM is once again ahead of the curve :)

For 4D STEM and likely also spectroscopic data the CSR format is much better than COO since it allows O(1) slicing of the navigation dimension, while canonical (sorted) COO is O(log(n)) and unsorted COO requires a full scan.

The CSR format is usually only applied to 2 dimensional data correct? In this case you would still have a fully dense navigation array? Are you using GCXS or something like a dense object array of CSR arrays?

Edit: I actually looked at the example given: you are flattening the data into a 2D array where one axis is the flattened navigation and the other is the flattened signal. Interesting, is there any performance difference from this vs a GCXS type implementation?

In LiberTEM we've implemented a simple raw format with a TOML descriptor and three raw binary files for CSR. You can already create lazy signals from LiberTEM datasets with both sparse and dense array chunks:

Have you given any thought to raw binary vs some sort of zip-like storage such as hdf5 or zarr? I can see benefits to each method, but would there be any specific reason against packaging things as one "file"?

...I should add, the Amsterdam Scientific Instruments CheeTah T3 camera can already save data in the CSR file format that LiberTEM supports, as a result of our cooperation. As you can see, the format is really simple and based on an established standard, so it is straightforward to open it with all kind of software. You can create a scipy.sparse.csr_matrix pretty much directly from the files with memory mapping. sparse.GCXS doesn't work because ASI doesn't save in canonical form (sorted row index), which GCXS doesn't like.

I'd assume that it's not in a canonical form because the data comes in as a stream based on the time the event occurred? I'd assume that under the hood the scipy.sparse.csr_matrix is sorting the data when casting? That seems like something that would be unavoidable but I'm not super knowledgeable about all of the ways to store sparse information.

Out of curiosity what happens to the time information? Are you just integrating over some time frame to create the data? Is the intensity information stored as well?

If you are interested in sparse data for HyperSpy I can only invite you to make use of that infrastructure! As you may know, we added pervasive Dask array support in LiberTEM for such integrations: https://libertem.github.io/LiberTEM/dask.html

This is another reminder that I need to increase the pyxem documentation about transferring data to and from liberTEM :)

@sk1p
Contributor

sk1p commented Jan 29, 2024

Thank you for bringing this up.

In addition to what @uellue mentioned about using CSR/GCXS for storing STEM data, I have a few other comments:

It would be nice to develop a standard for saving event based data. This is usually saved in something like 4 columns: x, y, intensity, time but there are potentially other ways that this can be saved.

Is intensity here the "time over threshold" supported by some detectors, or is this a number of counted electrons from a time bin or similar? Having access to the ToT for each event is valuable, as it can enable a form of super-resolution in post-processing, just as one example.

There are not only "pixel" events, but on some detectors, also generic TDC events are supported (think: some electrical signal on an input of the detector board went high/low). That would be a secondary set of columns (timestamp, event_kind) in your scheme?

  1. The data is sorted by time, which means that searching the data should be O(N). (For lots of points this is probably too slow, and we want to pre-index the data somehow; see above for saving a time index or a constant-size index.)

On sorted data, you can do better than O(n), using binary search (average O(log n)) or interpolation search (average O(log(log(n))) with some caveats on distribution).
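For reference, a minimal interpolation search over sorted timestamps might look like this (a sketch; in practice np.searchsorted's binary search is usually fast enough):

```python
import numpy as np

def interpolation_search(t, target):
    """Insertion index of target in sorted array t.

    Average O(log log n) for roughly uniform values; worst case O(n),
    which is why a binary-search fallback is common in practice.
    """
    if len(t) == 0 or target <= t[0]:
        return 0
    if target > t[-1]:
        return len(t)
    lo, hi = 0, len(t) - 1
    while lo <= hi and t[lo] <= target <= t[hi]:
        if t[hi] == t[lo]:
            return lo
        # guess the position by linearly interpolating between the bounds
        pos = lo + int((target - t[lo]) * (hi - lo) / (t[hi] - t[lo]))
        if t[pos] < target:
            lo = pos + 1
        elif t[pos] > target:
            hi = pos - 1
        else:
            return pos
    return lo

timestamps = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
idx = interpolation_search(timestamps, 4.0)  # -> 2, same as np.searchsorted
```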

  1. This problem won't be embarrassingly parallel (unless we dictate something like a frame rate for the events), but it should be close, with each chunk only having to be read at most 2 times.

Can you clarify why chunks would need to be read twice?

There is also the question of saving and operating on the data (do we keep it as event based data or at some point during the analysis is the dataset converted to an integrated dataset?)

I think this would best be deferred to the application. For example, in LiberTEM the decision is taken at runtime: if the user's code supports a compatible sparse format, the most efficient conversion from the source format into the supported format is performed (which can be "zero-cost", for example when converting between the same or very similar formats in different libraries). If sparse computation is not supported, the data is "densified" in small chunks.

  1. Is prioritizing compression important for this kind of data? (What size of data can we expect? Are there substantial speed benefits from compression?)

As a real-world example, one of the RAW CSR data sets I have access to, shape (1017x1024x512x512), takes up about 20GiB on disk as sparse data. This is from prototype software, so in no way optimized for storage efficiency (selecting the right integer dtype etc.), so it's at least twice as large as it should be, if not more. I quickly threw the whole data set into tar+pbzip2 and got it compressed down to under 5GiB.

Often in CSR there are runs of many very similar if not identical numbers, which would benefit strongly from run-length encoding (which is used internally, for example in LZ4 or Zstd), and/or similar techniques like bit-packing.
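A quick way to see this effect: compress a run-heavy array (like the CSR data array of a counted detector, which is overwhelmingly ones) and compare against incompressible random bytes. zlib stands in here for LZ4/Zstd:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# CSR "data" for a counted detector is mostly ones -> long runs
data = np.where(rng.random(100_000) < 0.99, 1, 2).astype(np.uint8)
ratio_runs = len(zlib.compress(data.tobytes())) / data.nbytes

# incompressible baseline: uniformly random bytes
noise = rng.integers(0, 256, 100_000, dtype=np.uint8)
ratio_noise = len(zlib.compress(noise.tobytes())) / noise.nbytes
```

The run-heavy array compresses to a small fraction of its raw size, while the random baseline barely shrinks at all.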

I'm not sure if there are similar patterns in x/y/t/v style data, that's certainly something to look into.

So, it does make sense to look into compression, if only for storage efficiency.

@CSSFrancis
Member Author

Is intensity here the "time over threshold" supported by some detectors, or is this a number of counted electrons from a time bin or similar? Having access to the ToT for each event is valuable, as it can enable a form of super-resolution in post-processing, just as one example.

There are not only "pixel" events, but on some detectors, also generic TDC events are supported (think: some electrical signal on an input of the detector board went high/low). That would be a secondary set of columns (timestamp, event_kind) in your scheme?

I think we realistically want to be able to save any kind of reasonable data. So things like ToT or some electrical signal are what I was thinking. Number of counted electrons could also be relevant as there are multiple counting methods which operate on some integrating detector and result in a sparse dataset.

  1. The data is sorted by time, which means that searching the data should be O(N). (For lots of points this is probably too slow, and we want to pre-index the data somehow; see above for saving a time index or a constant-size index.)

On sorted data, you can do better than O(n), using binary search (average O(log n)) or interpolation search (average O(log(log(n))) with some caveats on distribution).

Ahh, yes you are correct, I don't think many people are using something like a bubble sort in practice :)

  1. This problem won't be embarrassingly parallel (unless we dictate something like a frame rate for the events), but it should be close, with each chunk only having to be read at most 2 times.

Can you clarify why chunks would need to be read twice?

It kind of depends on the scheme. If you wanted to compress every 1000 events you might have:

Chunk 1 : 0.0 sec-0.8 sec
Chunk 2 : 0.8 sec-2.0 sec

So if you want to create a 1 second frame rate you would have to read chunk 1 once and chunk 2 twice.

There is also the question of saving and operating on the data (do we keep it as event based data or at some point during the analysis is the dataset converted to an integrated dataset?)

I think this would best be deferred to the application. For example, in LiberTEM the decision is taken at runtime: if the user's code supports a compatible sparse format, the most efficient conversion from the source format into the supported format is performed (which can be "zero-cost", for example when converting between the same or very similar formats in different libraries). If sparse computation is not supported, the data is "densified" in small chunks.

I think this seems reasonable. I think that to start it would be good to identify a "best practice" for saving the data and then build the analysis tools around that. Most of the conversions should be fairly cheap.

  1. Is prioritizing compression important for this kind of data? (What size of data can we expect? Are there substantial speed benefits from compression?)

As a real-world example, one of the RAW CSR data sets I have access to, shape (1017x1024x512x512), takes up about 20GiB on disk as sparse data. This is from prototype software, so in no way optimized for storage efficiency (selecting the right integer dtype etc.), so it's at least twice as large as it should be, if not more. I quickly threw the whole data set into tar+pbzip2 and got it compressed down to under 5GiB.

Often in CSR there are runs of many very similar if not identical numbers, which would benefit strongly from run-length encoding (which is used internally, for example in LZ4 or Zstd), and/or similar techniques like bit-packing.

I'm not sure if there are similar patterns in x/y/t/v style data, that's certainly something to look into.

So, it does make sense to look into compression, if only for storage efficiency.

Thanks for trying that! Compression would help with both storage efficiency and I/O bandwidth limitations.

@uellue

uellue commented Jan 29, 2024

Interesting, is there any performance difference from this vs a GCXS type implementation?

LiberTEM uses a flat navigation dimension internally and folds back to the actual shape only at the end, so that part actually works to our advantage! Many operations, such as virtual detectors (dot product), sum, standard deviation etc., already process the data with a flat sig dimension, so there is no change in that case either. If a UDF wants a folded sig dimension it can request GCXS. The internal data structure of GCXS is the same as CSR/CSC; it just has metadata on which axes are folded how, meaning the conversion could be O(1). The performance of the scipy.sparse.csr_matrix dot product is much, much better than that of pydata sparse (sparse.pydata.org). For that reason we use it where possible. However, I think that's just an implementation difference and they could be nearly equally fast.

Have you given any thought to binary vs some sort of a zip like storage like hdf5 or zarr? I can see benefits to each method but would there be any specific reason against packaging things as 1 "file".

Being able to interact with the files directly, without involving any 3rd party libraries, gives the best portability and flexibility -- reading and writing with direct I/O or liburing, memory mapping, even embedded systems etc. With three files you can write the data sequentially without knowing the final size. Also, it removes free parameters such as chunking, internal paths etc, meaning one can write a simple file reader that will perform consistently with all files. I don't like the variability you can have with HDF5 and other containers that can be pretty difficult to support. With HDF5 it is "new data source, new troubles".

I'd assume that it's not in a canonical form because the data comes in as a stream based on the time the event occurred?

Exactly, the events are scrambled a bit. ASI sorts them by frame index, which is required for CSR, but not by column index, to keep up with the data rate. Also, they don't aggregate events, i.e. if two electrons hit the same pixel you get 2x "1" with the same column index.

I'd assume that under the hood the scipy.sparse.csr_matrix is sorting the data when casting?

Yes, there are methods in scipy.sparse for that, but of course that is extra work, so processing as-is is probably better. Didn't benchmark it, though! sparseconverter has a 'sort if GCXS needs it' function: https://github.com/LiberTEM/sparseconverter/blob/4cfc0ee2ad4c37b07742db8f3643bcbd858a4e85/src/sparseconverter/__init__.py#L186-L199
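For the record, canonicalizing such a stream with scipy.sparse is just two method calls (toy data mimicking a frame-sorted but column-unsorted stream with a duplicate hit, as described above):

```python
import numpy as np
from scipy import sparse

# frame-sorted but column-unsorted events, with pixel 5 hit twice in frame 0
indptr = np.array([0, 3, 4])
indices = np.array([5, 2, 5, 1])
data = np.ones(4, dtype=np.uint8)
m = sparse.csr_matrix((data, indices, indptr), shape=(2, 8))

m.sort_indices()     # canonical column order within each row
m.sum_duplicates()   # merge the two hits on pixel 5 into a single count of 2
```

As noted, whether paying for this canonicalization up front beats processing the stream as-is would need benchmarking.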

As you can imagine, I spent quite some time to iron out all the rough patches with these conversions... If you want to convert between formats, I'd recommend having a look at sparseconverter because it can get a bit messy with all the special cases and bugs to consider. We pretty much brute-force every possible conversion in CI, but I'm still not sure if there isn't something lurking in a corner... My favorite bug of all time is this one: cupy/cupy#7035 -- very specific, not depending on Python package version, but CUDA version, and potentially massively different results.

@sk1p
Contributor

sk1p commented Jan 29, 2024

On sorted data, you can do better than O(n), using binary search (average O(log n)) or interpolation search (average O(log(log(n))) with some caveats on distribution).

Ahh, yes you are correct, I don't think many people are using something like a bubble sort in practice :)

I understood the O(n) as the number for searching for an event (or slicing said array), actually sorting the data is a different discussion even, there you'd typically spend O(n log n), and only in the absolute best case you'd hit O(n) (think timsort with linear runs). Doing that while the data is acquired is then another can of worms, which we probably don't want to open here... :)

It kind of depends on the scheme. If you wanted to compress every 1000 events you might have:

Chunk 1 : 0.0 sec-0.8 sec Chunk 2 : 0.8 sec-2.0 sec

So if you want to create a 1 second frame rate you would have to read chunk 1 once and chunk 2 twice.

Ah yeah, if you add compression into the mix, you might run into these read amplification issues, that's true. Nowadays, there are also seekable formats. If they are efficient enough with small "frame sizes" (their term for chunks), that may be an option to allow for more fine-grained slicing with less read amplification, and efficiently finding offsets for slicing using e.g. binary search.

If you really only put 1k events into a chunk, it would just make sense to eat the read amplification costs, though :)

I think that to start it would be good to identify a "best-practice" for saving the data and then build the analysis tools around that.

Do you think there can be a single "best practice"? Such a format would need to be completely lossless and contain all information that was recorded, and possibly information used by the vendor's software. Otherwise, I'm concerned that this might result in having to store both the original data "as read from the detector" in a potentially vendor-specific format, in addition to the converted "best practice" format. (This is already something that is happening with some (non-sparse) vendor formats, meaning we need to archive two sets of potentially large data if we want to a) be FAIR w/ an open format, and b) still have all metadata and want to use the vendor's software to load it again.)

Most of the conversions should be fairly cheap.

We have a benchmark to put numbers to that - take these with a grain of salt, as they are mostly used for deciding which conversion "path" to take if there is no direct conversion possible. Also, these conversions handle quite small chunks of sparse data - conversion of a whole dataset might paint a different picture.

@bbammes

bbammes commented Jan 29, 2024

I did some quick calculations of file size for two detectors versus the exposure rate (in events per pixel per "frame"), assuming that a total of 20 events per pixel are accumulated during the acquisition (more or less total exposure should simply scale linearly). I used four ideas for encoding: (1) binary frame-based encoding, (2) an optimized run-length encoding scheme I've been considering, (3) frame-based CSR (where the ROW_INDEX array represents frames, and the COLUMN_INDEX array represents the pixel index within each full 2D frame), and (4) row-based CSR (where the ROW_INDEX array represents rows within each frame, and the COLUMN_INDEX array represents the column within each row).

  1. Results for a 1024 x 1024 detector without super-resolution (similar to our Celeritas camera).
  2. Results for a 4096 x 4096 detector with 2x2 super-resolution (similar to our Apollo camera).

For sparse electron microscopy data, we typically see a per-frame sparsity of ~0.001 to ~0.025. A lower exposure rate means that the frames are of limited utility because they are signal-starved. A higher exposure rate means that coincidence loss is significant, limiting data quality. Obviously, the file format should accommodate lower or higher exposure rates, but I'm interested in looking at the typical use cases.

Absent additional post-encoding compression, the run-length encoding (RLE) scheme results in the smallest file sizes. However, RLE seems best suited for binary data, and it is not clear how to extend it to include intensity, size, or other per-event data. Additionally, parallelizing input/output is not as straightforward as it is for CSR.

Frame-based CSR (which is what I think @uellue was advocating) is certainly much smaller than storing raw frames. However, it is noticeably larger than row-based CSR for all but the very lowest exposure rates. Therefore, in the practical cases I'm considering, row-based CSR seems best. But if the exposure rate is very low, it's better to group multiple rows or use entire frames instead of rows. Therefore, the file format should enable everything from row-based CSR up to frame-based CSR.

I'm envisioning a format that consists of the following combination of files (or these files could be grouped into a single HDF5 file, and additional compression (e.g., ZIP) could be added):

  • Required: Acquisition metadata in TOML format, including certain required fields such as (1) the dimensions of each frame, (2) the number of rows grouped together in the CSR encoding (setting this to 1 would be equivalent to row-based CSR, setting it equal to the total number of rows in a frame would be equivalent to frame-based CSR, and intermediate values would also be possible), (3) datatype and bit depth for each file, (4) default time difference between each frame.
  • Required: COLUMN_INDEX array (with length = [bit_depth_of_this_file] x [total_number_of_events]).
  • Required: ROW_INDEX array (with length = [bit_depth_of_this_file] x [total_number_of_rows/frames] + 1).
  • Optional: timestamp array (overwriting the default time difference between each frame, with length = [bit_depth_of_this_file] x [total_number_of_frames]).
  • Optional: Additional files with other data associated with the events, such as event intensity, event size, time above threshold, etc., (each file having length = [bit_depth_of_this_file] x [total_number_of_events]).

Initially, the optional additional files can probably be ignored during analysis, but this will be useful to add new features in the future. For example, our low-voltage (SEM) cameras record the energy of every detected electron. We would like to save the energy of each detection event along with its coordinates, so that users can perform a posteriori energy-filtering.
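A back-of-envelope version of the frame-based vs row-based size comparison above (the bit widths, sparsity, and rounding are assumptions for illustration, not measurements from the actual cameras):

```python
def csr_size_bytes(n_events, n_rows, index_bits, pointer_bits):
    """Approximate on-disk size of a CSR encoding: one COLUMN_INDEX entry
    per event plus one ROW_INDEX entry per row (+1)."""
    return n_events * index_bits // 8 + (n_rows + 1) * pointer_bits // 8

# 1024 x 1024 detector, 20 events/pixel total, 2000 frames (sparsity 0.01)
n_events = 20 * 1024 * 1024
n_frames = 2000

# frame-based CSR: a full frame has 2**20 pixels, so a 20-bit column index
# is needed; round up to 32 bits. One row pointer per frame.
frame_based = csr_size_bytes(n_events, n_frames, 32, 64)

# row-based CSR: a row has 1024 pixels, so 10 bits suffice; round up to
# 16 bits. One row pointer per detector row per frame.
row_based = csr_size_bytes(n_events, n_frames * 1024, 16, 64)
```

At this sparsity the smaller column index of row-based CSR outweighs the much longer ROW_INDEX array, matching the conclusion above; at very low exposure rates the balance flips.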

@CSSFrancis
Member Author

Okay, I think I'm starting to get the scope of the problem, but someone tell me if I am wrong:

Data Acquisition and storage:

There are potentially 2 (maybe 3) different types of data coming off of cameras.

  1. A stream of events which is likely x, y, time. This is basically a COO format with no compression applied. A general compression/chunking could be helpful here if you compressed along the time axis.
  2. A counting filter is applied to a detector. In this case you operate row by row, and this lends itself to CSR or run-length encoding.
  3. Some version of the first two with extra information about the ToT or intensity etc.

Data Processing

  1. In most cases we are transforming the data into a CSR format for processing, where the rows are [len(nav_dim1) × len(nav_dim2) × ...] and the columns are [len(sig_dim1) × len(sig_dim2) × ...]. This means that slicing is faster in the navigation dimension but slower in the signal dimension. For most operations we want to slice some navigation dimension and then operate on the signal dimension, so this makes sense.
  2. The CSR format doesn't (by default) include information about the time some event occurred, or any information beyond intensity. A fourth file similar to the intensity file could be used to include this information.
  3. The COO format can fairly easily be transformed into the CSR format/canonical CSR format. From a 4D STEM perspective you don't have to sort the entire dataset, as the data is already sorted in time. With a known frame exposure you can easily slice the data in time and then construct the CSR dataset by sorting the much smaller frame!
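A sketch of point 3: slicing a time-sorted stream into fixed-exposure frames with binary search and letting scipy canonicalize each frame (function and variable names are illustrative):

```python
import numpy as np
from scipy import sparse

def frames_from_stream(t, x, y, exposure, frame_shape):
    """Turn a time-sorted event stream into one CSR matrix with a flat
    navigation axis (one row per frame) and a flat signal axis."""
    n_frames = int(t[-1] // exposure) + 1
    # frame boundaries via binary search on the sorted timestamps
    indptr = np.searchsorted(t, np.arange(n_frames + 1) * exposure)
    flat_sig = y * frame_shape[1] + x
    m = sparse.csr_matrix(
        (np.ones(len(t), dtype=np.uint32), flat_sig, indptr),
        shape=(n_frames, frame_shape[0] * frame_shape[1]))
    # events within a frame arrive unsorted in sig; canonicalize per frame
    m.sum_duplicates()
    return m

t = np.array([0.1, 0.2, 0.5, 0.9, 1.1, 1.4])
x = np.array([0, 1, 1, 3, 2, 2])
y = np.array([0, 0, 0, 2, 1, 1])
m = frames_from_stream(t, x, y, exposure=1.0, frame_shape=(4, 4))
```

Only the events within each frame need sorting, never the full stream, which is the point made above.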

Compression and Storage

  1. This is still an active point of development. The CSR format could be stored as three chunked files. As long as the chunking lines up, it would be fairly trivial to read and write concurrently. This might provide a factor of 2-4 reduction in size per @sk1p's off-the-cuff testing.
  2. Per @bbammes, a better rate of compression might be possible using run-length encoding, but this is limited in flexibility for extending to different datatypes, so we would probably want to try something else.
  3. There is definitely interest in the larger data science community for supporting this with zarr. Given their track record for supporting fast parallel reads/writes, I'm inclined to push for getting involved there and supporting those efforts if we want to go that route.

Let me know if I have anything wrong! I might mess around and see how this might look using zarr. Concurrently writing three datasets shouldn't be a problem there like it is with hdf5 and it might make a simple compressed storage option with attached metadata.
