How to handle empty `DataCollection` objects? #235

philsmt · 2021-10-29T14:11:26Z

While the construction methods mostly prevent an empty DataCollection object (i.e. one with no files) from existing, it is still possible to obtain one later through selection mechanisms such as DataCollection.deselect('*') or DataCollection.select_trains(np.s_[[]]).

Unfortunately, at least one public API, DataCollection.run_metadata(), now fails in such a case. The only other place I found that uses files[0] is SourceData.__getitem__, which seems impossible to reach in such a case.
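To make the failure mode concrete, here is a minimal sketch of the pattern described above. ToyCollection and its methods are hypothetical stand-ins written for illustration, not the real DataCollection implementation; the only thing taken from the discussion is that metadata is read from files[0].

```python
class ToyCollection:
    """Hypothetical stand-in for a file-backed collection (not the real DataCollection)."""

    def __init__(self, files):
        self.files = list(files)

    def deselect_all(self):
        # A selection returns a new object, which may end up holding no files.
        return ToyCollection([])

    def run_metadata(self):
        # Assumed pattern from the discussion: metadata comes from the first file.
        return self.files[0]["metadata"]


run = ToyCollection([{"metadata": {"runNumber": 1}}])
empty = run.deselect_all()

try:
    empty.run_metadata()
except IndexError as exc:
    # Indexing an empty list is what surfaces to the user.
    failure = str(exc)
```

In this toy version, the non-empty collection works, while the empty one fails with a bare IndexError that says nothing about the collection being empty.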

This raises the question: should such an object be allowed to exist, meaning all APIs must be able to handle it, or should we prevent its existence in the first place?


takluyver · 2021-11-11T10:11:57Z

@kakhahmed just hit this as well.

If there are sources selected but no trains, maybe we should keep one file for each source, so you can still use things like sel.get_run_value() and sel[src, key].dtype.

If there are no sources selected, it's less clear. Maybe just keep one arbitrary file open for .run_metadata()?
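The idea of keeping one file per source could look roughly like the sketch below. ToyFile and select_trains are hypothetical names invented for this illustration, and the real file-tracking logic in the library may work quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFile:
    """Hypothetical description of one file: its name, sources and train IDs."""
    name: str
    sources: frozenset
    train_ids: tuple


def select_trains(files, wanted):
    """Keep files with wanted trains, plus one reference file per uncovered source."""
    wanted = set(wanted)
    kept = [f for f in files if wanted.intersection(f.train_ids)]

    # Sources that still have at least one file after the train filter.
    covered = set()
    for f in kept:
        covered |= f.sources

    # For each source left without a file, keep the first file containing it,
    # so that dtypes, shapes and RUN values can still be inspected.
    for f in files:
        if f.sources - covered:
            kept.append(f)
            covered |= f.sources
    return kept


files = [
    ToyFile("a.h5", frozenset({"XGM"}), (1, 2)),
    ToyFile("b.h5", frozenset({"XGM"}), (3, 4)),
    ToyFile("c.h5", frozenset({"CAM"}), (1, 2, 3, 4)),
]
no_trains = [f.name for f in select_trains(files, [])]
train_3 = [f.name for f in select_trains(files, [3])]
```

In this sketch the reference files still contain trains outside the selection, so the selected train IDs would have to be tracked separately from the list of open files.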

philsmt · 2021-11-11T11:59:50Z

Given that the way to end up with a source-less DataCollection is a bit more obscure, maybe we should prevent that from ever happening?

I can see a train-less DataCollection occurring much more often; it also happened to me recently with the PPU device filtering, so it should be possible to keep a file reference for that case.

takluyver · 2021-11-11T16:14:00Z

We already raise an error for a glob pattern that doesn't match anything, but passing an empty list or dict to .select() will give you a DataCollection with no sources. I'd be a bit concerned about changing that, because code like this seems reasonable, even if there's a better way to do it:

sel = run.select([(s, '*') for s in run.all_sources if 'PNCCD' in s])

Maybe it's OK to allow a DataCollection with no sources - it's only .run_metadata() that breaks in that case. Having sources but no trains is messier, because it breaks any inspection of those sources (shapes, dtypes, values from RUN).

philsmt · 2021-11-12T11:03:04Z

My point is that a train-less DataCollection seems to be more of a normal operation than a source-less one.

In your example, I would rather say an exception should be raised if there is no pnCCD, rather than when you're narrowing down to trains and there just happen to be none. When I select sources, I expect them to be there and will quite likely hardcode access to them. It is different with trains, where most likely an iteration follows. The most frequent exception I can think of is something like .select_trains(np.s_[0]).ndarray()[0], but this could be neatly solved with the new .single_value() call.
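Independent of what the library decides, a caller who wants the strict behaviour described here can guard the list-comprehension selection themselves. The helper below is a hypothetical sketch; pairs_matching and the source names are made up for illustration.

```python
def pairs_matching(all_sources, fragment):
    """Build (source, '*') pairs for a selection, failing loudly if nothing matches."""
    selection = [(s, "*") for s in all_sources if fragment in s]
    if not selection:
        # Without this check, an empty list would be passed on silently.
        raise ValueError(f"No source name contains {fragment!r}")
    return selection


# Illustrative source names only, not taken from a real run.
sources = ["A/PNCCD/1", "B/XGM/2"]
matched = pairs_matching(sources, "PNCCD")

try:
    pairs_matching(sources, "AGIPD")
except ValueError as exc:
    message = str(exc)
```

The resulting list could then be passed to .select() as in the earlier example, with the emptiness check happening before the selection rather than inside it.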

takluyver · 2021-11-12T11:15:32Z

Sorry, I was writing too quickly. I think it would be reasonable in isolation to disallow making a DataCollection with no sources. But I think it's plausible that people are already doing that, and throwing an exception will break their code in some way, which I try fairly hard to avoid. It's not a totally hard rule, but I know that people lose trust fast when an update breaks what they're doing.

philsmt · 2021-11-12T11:48:43Z

Hmm, but that answers the initial question immediately: Make all access in DataCollection safe against both no sources and no trains.

takluyver · 2021-11-12T12:02:40Z

I think I'm sold that sources-but-no-trains should be valid & working. When there are no sources, I'm still undecided between:

1. Make it fully work
2. Leave it as is (.run_metadata() not working with no sources)
3. Disallow it (accepting some risk of breaking code)

I'm leaning towards 2: less risk of breaking things, and less special casing required. But I'm open to being persuaded either that this is a corner case which we can reasonably break, or that it's important enough that we should make it work properly.

philsmt · 2021-11-12T14:19:15Z

While thinking about other clever tricks to preserve the functionality, I was reminded of another unfortunate angle: there are files out in the wild which conform to the European XFEL file structure but are entirely empty of trains and sources (mostly legacy calibration files).

So yes, option 3 (disallowing it) is not an option, I fear. Option 2 makes sense 👍

kakhahmed · 2021-11-12T14:49:02Z

2 seems to be the safest option.

Would it make sense to have a better error message, e.g. one saying that the DataCollection is empty, instead of "list index out of range" when .run_metadata() is used on a DataCollection with no sources?
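A clearer error along these lines might look like the following sketch. EmptyCollectionError and this standalone run_metadata function are hypothetical and exist only to illustrate the idea; the wording of the message is likewise an assumption.

```python
class EmptyCollectionError(IndexError):
    """Hypothetical error type; subclasses IndexError so existing handlers still catch it."""


def run_metadata(files):
    """Return metadata from the first file, with an explicit message if there is none."""
    if not files:
        raise EmptyCollectionError(
            "Cannot read run metadata: this DataCollection contains no files "
            "(no sources selected)"
        )
    return files[0]["metadata"]


try:
    run_metadata([])
except IndexError as exc:
    # Code that previously caught IndexError keeps working.
    caught_type = type(exc).__name__
    caught_message = str(exc)
```

Subclassing IndexError is one possible design choice here: it changes only the message, so any code that already catches the old exception would not break.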

takluyver · 2021-11-12T15:37:11Z

#244 should resolve this for selecting 0 trains.

philsmt mentioned this issue Oct 29, 2021

Only merge in open_run(..., data='all') if raw is not a subset of proc #236

Merged

takluyver mentioned this issue Nov 12, 2021

Allow for usable DataCollection with no trains selected #244

Merged
