Support str, path object or file-like object on file read #302

hsuominen · 2024-08-16T13:20:11Z

Describe the functionality you would like to see.

For a number of applications it would be preferable if file reading supported file-like objects as well as strings or paths.

ericpre · 2024-08-16T18:00:55Z

Can you elaborate on the use case please? What type of format are you thinking of?

hsuominen · 2024-08-16T18:15:43Z

Appreciate the quick response. Basically I'm hoping that rosettasciio would support a similar interface to e.g. pandas or imageio:

https://imageio.readthedocs.io/en/stable/_autosummary/imageio.v3.imread.html#imageio.v3.imread
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv

This would enable smoother use in distributed applications, where the code that loads the file has no access to the filesystem on which the file is stored and the data is instead passed in as a file-like object. Quoting the pandas docs:

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
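
As a rough illustration of that definition, a reader could tell file-like objects apart from paths by checking for a `read()` method. This is only a sketch: `is_file_like` and `describe_source` are hypothetical names, not part of rosettasciio, pandas or imageio.

```python
import io
import os


def is_file_like(obj):
    """Return True if obj exposes a callable read() method."""
    return callable(getattr(obj, "read", None))


def describe_source(source):
    """Classify a reader input as 'file-like' or 'path' (hypothetical helper)."""
    if is_file_like(source):
        return "file-like"
    if isinstance(source, (str, os.PathLike)):
        return "path"
    raise TypeError(f"Unsupported source type: {type(source).__name__}")


print(describe_source("data.dm3"))           # path
print(describe_source(io.BytesIO(b"\x00")))  # file-like
```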

ericpre · 2024-08-17T08:37:01Z

Off the top of my head, there may already be a few formats that can do that, but I suspect that rosettasciio supports a wider variety of file types than imageio and pandas, and the behaviour may differ depending on the type.

Here is a list of the different types of files:

- binary file
- text file
- h5py file (see the caveats in h5py/h5py#1698, "Passing file objects into h5py.File is tricky")
- numpy file
- zarr file (maybe some zarr stores will work)
- multiple files, typically a binary file and a text file, for example the ripple format, where the metadata are in a separate file

There should be some low-hanging fruit, as this should be easy to implement for some of these types.
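
One reason the types above behave differently: a remote source usually hands over a binary stream, while a text-format parser expects `str`. A minimal sketch of bridging the two with the standard library follows; `read_text_rows` and the comma-separated layout are made up for illustration and do not correspond to any rosettasciio reader.

```python
import io


def read_text_rows(binary_stream, encoding="utf-8"):
    """Decode a binary file-like object and split each line on commas."""
    # TextIOWrapper turns a bytes stream into a text stream on the fly.
    text = io.TextIOWrapper(binary_stream, encoding=encoding)
    return [line.strip().split(",") for line in text if line.strip()]


# A binary stream, as it might arrive from remote storage.
raw = io.BytesIO("x,y\n1,2\n3,4\n".encode("utf-8"))
rows = read_text_rows(raw)
print(rows)  # [['x', 'y'], ['1', '2'], ['3', '4']]
```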

CSSFrancis · 2024-08-17T11:28:16Z

@hsuominen is the idea that you are loading data that isn't on the computer doing the operation?

I think zarr might be a good place to start. https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.LRUStoreCache

This store implementation uses an LRU cache over an S3 bucket, which might be interesting if AWS is hosting the data.

hsuominen · 2024-08-20T18:51:57Z

> @hsuominen is the idea that you are loading data that isn't on the computer doing the operation?

Yes, that's right.

Our intent is to get the data out of proprietary formats and into e.g. zarr (which looks great), but we need to run this extraction on compute that doesn't have the files sitting locally. There are fairly easy workarounds (e.g. using a temporary file), but I thought it would be good to get this discussion going, as I can see others eventually running into similar needs.
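
For reference, the temporary-file workaround could look roughly like the sketch below. It uses only the standard library; `as_temporary_path` is a hypothetical helper, and the call to an actual path-based reader is left as a comment.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def as_temporary_path(fileobj, suffix=""):
    """Copy a binary file-like object to a temporary file and yield its path.

    A suffix can be passed for readers that infer the format from the
    file extension.
    """
    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            shutil.copyfileobj(fileobj, tmp)
        yield tmp.name
    finally:
        os.remove(tmp.name)


payload = io.BytesIO(b"example bytes")
with as_temporary_path(payload, suffix=".dm3") as path:
    # A path-based reader would be called here with `path`.
    with open(path, "rb") as f:
        data = f.read()
print(data)  # b'example bytes'
```

The cost is a full copy of the data to local disk, which is why direct file-like support would be preferable for large files.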

Looking specifically at some of the file formats we are interested in, the changes needed in some cases would be pretty trivial (as @ericpre hinted):

rosettasciio/rsciio/digitalmicrograph/_api.py, lines 1278 to 1279 in e499110:

    with open(filename, "rb") as f:
        dm = DigitalMicrographReader(f)
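
Because the reader in that snippet already receives an open file object, one possible change is to open the file only when a path is given and pass file-like objects straight through. The sketch below shows the pattern with `contextlib.nullcontext`; `open_source` and `read_header` are hypothetical stand-ins, not the actual rosettasciio code.

```python
import io
import os
from contextlib import nullcontext


def open_source(source):
    """Return a context manager yielding a binary file object.

    Paths are opened (and closed on exit); file-like objects are passed
    through untouched, so the caller stays responsible for closing them.
    """
    if isinstance(source, (str, os.PathLike)):
        return open(source, "rb")
    return nullcontext(source)


def read_header(source, size=4):
    """Stand-in for a reader: return the first `size` bytes."""
    with open_source(source) as f:
        return f.read(size)


buffer = io.BytesIO(b"\x00\x00\x00\x03rest of file")
print(read_header(buffer))  # b'\x00\x00\x00\x03'
print(buffer.closed)        # False
```

Leaving caller-supplied objects open mirrors the usual convention that whoever opens a file is the one who closes it.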

but likely harder in others:

rosettasciio/rsciio/emd/_api.py, lines 171 to 173 in e499110:

    file = h5py.File(filename, "r")
    dictionaries = []
    try:
