Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to handle empty DataCollection objects? #235

Open
philsmt opened this issue Oct 29, 2021 · 10 comments
Open

How to handle empty DataCollection objects? #235

philsmt opened this issue Oct 29, 2021 · 10 comments

Comments

@philsmt
Copy link
Contributor

philsmt commented Oct 29, 2021

While the construction methods mostly prevent an empty DataCollection object (i.e. no files) to exist, it is still possible to obtain it later through selection mechanisms such as DataCollection.deselect('*') or DataCollection.select_trains(np.s_[[]]).

Unfortunately there is now at least a single public API with DataCollection.run_metadata() that fails in such a case. The only other immediate point I found that uses files[0] is SourceData.__getitem__, which seems impossible to access in such a case.

This begs the question: Should such an object be allowed to exist, i.e. all APIs must be able to handle it, or should we prevent its existence in the first place?

@takluyver
Copy link
Member

@kakhahmed just hit this as well.

If there are sources selected but no trains, maybe we should keep one file for each source, so you can still use things like sel.get_run_value() and sel[src, key].dtype.

If there are no sources selected, it's less clear. Maybe just keep one arbitrary file open for .run_metadata()?

@philsmt
Copy link
Contributor Author

philsmt commented Nov 11, 2021

Given the mechanism to have a source-less DataCollection is a bit more obscure, maybe prevent that from ever happening?

I can see the train-less DataCollection to occur much more often, it also happened for me with the PPU device filtering recently, so it should be possible to keep a reference for that sake.

@takluyver
Copy link
Member

We already raise an error for a glob pattern that doesn't match anything, but passing an empty list or dict to .select() will give you a DataCollection with no sources. I'd be a bit concerned about changing that, because code like this seems reasonable, even if there's a better way to do it:

sel = run.select([(s, '*') for s in run.all_sources if 'PNCCD' in s])

Maybe it's OK to allow a DataCollection with no sources - it's only .run_metadata() that breaks in that case. Having sources but no trains is messier, because it breaks any inspection of those sources (shapes, dtypes, values from RUN).

@philsmt
Copy link
Contributor Author

philsmt commented Nov 12, 2021

My point is that a train-less DataCollection seems to be more of a normal operation than a source-less one.

In your example, I would rather say an exception should be raised if there is no pnCCD rather when you're matching down to trains and there just happens to be none. When I select sources, I expect them to be there and quite likely will hardcode access to them. It is different with trains, where most likely an iteration follows. The most frequent exception I can think of is something like .select_trains(np.s_[0]).ndarray()[0], but this could be neatly solved with the new .single_value() call.

@takluyver
Copy link
Member

Sorry, I was writing too quickly. I think it would be reasonable in isolation to disallow making a DataCollection with no sources. But I think it's plausible that people are already doing that, and throwing an exception will break their code in some way, which I try fairly hard to avoid. It's not a totally hard rule, but I know that people lose trust fast when an update breaks what they're doing.

@philsmt
Copy link
Contributor Author

philsmt commented Nov 12, 2021

Hmm, but that answers the initial question immediately: Make all access in DataCollection safe against both no sources and no trains.

@takluyver
Copy link
Member

I think I'm sold that sources-but-no-trains should be valid & working. When there are no sources, I'm still undecided between:

  1. Make it fully work
  2. Leave it as is (.run_metadata() not working with no sources)
  3. Disallow it (accepting some risk of breaking code)

I'm leaning towards 2 - less risk of breaking things, but less special casing required. But I'm open to being persuaded either that this is a corner case which we can reasonably break, or that it's important enough that we should make it work properly.

@philsmt
Copy link
Contributor Author

philsmt commented Nov 12, 2021

When thinking of any other clever tricks how to preserve the functionality, I was reminded of another unfortunate angle: There are files out there in the wild which conform to the European XFEL file structure, but are entirely empty of trains and sources (mostly legacy calibration files).

So yes, option 3. of disallowing is it not an option I fear. Option 2. makes sense 👍

@kakhahmed
Copy link

2 seems to be the safest option.

Does it make sense to have a better error message that the dataCollection is empty or something. Instead of list index out of range when .run_metadata() is used on DataCollection with no sources.

@takluyver
Copy link
Member

#244 should resolve this for selecting 0 trains.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants