RFC: add materialize to materialize lazy arrays #839
Comments
A question is whether it's appropriate for an array API consuming library to materialize a lazy graph "behind the user's back", as it were. Such an operation could be quite expensive in general, and a user might be surprised to find that a seemingly innocuous function from something like scipy is doing this. On the other hand, if an algorithm fundamentally depends on array values in a loop, there's no way it can be implemented without something like this. So maybe the answer is we should just provide guidance that functions that use |
If I understand https://data-apis.org/array-api/draft/design_topics/lazy_eager.html correctly, the primary APIs that require materialization are So another option here would be to add a |
A couple of thoughts:
|
Thank you for starting this discussion @lucascolley, and thanks for tagging me!
Note that the signature here should probably be more like
Adding some more here:
I don't think this is the primary API at all, it's just an interesting special case where the return type is out of our hands. The primary API is as @lucascolley says, a
I agree - in Xarray we very rarely use I also agree with the other 2 points @hameerabbasi just made. Xarray has a new abstraction over dask, cubed (and maybe soon JAX) called a " |
Thanks all for the comments! Just tagging @jakevdp also who can maybe shed some light on JAX. |
I don't think |
I'm +1 on an API that allows simultaneous materialisation of multiple arrays, although I'd spell it slightly differently.
With this in mind, the signature I'd propose is |
One thing I'm unclear on: what is the difference between materialized and non-materialized arrays in terms of the array API? What array API operations can you do on one, but not on the other? |
I feel this is more driven by use-cases and performance. Some of these are outlined in #748 (comment) and #728. |
One we bumped into in SciPy is |
As I mentioned above, Additionally, the APIs that have data-dependent shapes are |
OK, thanks for the clarification. In that case, |
I'd like to add a different perspective, based on execution models. I think we have fundamentally three kinds:
(1) Eager execution model
Examples of implementations:
Any "execute or materialize now" API would be a no-op.

(2) Fully lazy execution model
Examples of implementations:
Any "execute or materialize now" API would need to raise an exception.

(3) Hybrid lazy/eager execution model
Examples of implementations:
This is the only mode where an "execute or materialize now" API may be needed. This is not a given though, which is clear from PyTorch not having any such

As pointed out by @asmeurer above, there are only very few APIs that cannot be kept lazy (

For PyTorch, the way things work in hybrid mode is that if actual values are needed, the computation is done automatically. No syntax is needed for this. And there doesn't seem to be much of a downside to this. EDIT: see https://pytorch.org/docs/stable/export.html#existing-frameworks for a short summary of various PyTorch execution models.

MLX is in the middle: it does have syntax to trigger evaluation (

For Dask, it chooses to require

There is another important difference between PyTorch (and fully lazy libraries like JAX/ndonnx as well) vs. Dask I think:
My current assessment is:
Now we obviously do have an issue with Dask/Xarray/Cubed that we need to understand better and find a solution for. It's a hard puzzle. That seems to require more thought, and perhaps a higher-bandwidth conversation soon. The ad-hoc-ness is (as far as I understand it - I could well be missing something of course) going to remain a fundamental problem for any attempt at standardization. I'd be curious to hear from @TomNicholas or anyone else with more knowledge about Dask why something like a user opt-in to auto-trigger compute whenever possible isn't a good solution. |
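To make the three execution models above concrete, here is a minimal sketch of how a hypothetical materialize function could behave under each one: a no-op for eager arrays, an exception for fully lazy arrays, and an actual computation for hybrid arrays. ToyArray, ExecutionModel and this materialize are illustrative names invented for the sketch. They are not part of the array API standard or of any library mentioned in this thread.

```python
from enum import Enum


class ExecutionModel(Enum):
    EAGER = "eager"
    FULLY_LAZY = "fully lazy"
    HYBRID = "hybrid"


class ToyArray:
    """Hypothetical stand-in for an array object; not a real library type."""

    def __init__(self, thunk, model):
        self._thunk = thunk  # zero-argument callable that produces the values
        self.model = model
        # Only eager arrays hold concrete values right away.
        self.value = thunk() if model is ExecutionModel.EAGER else None


def materialize(x):
    """Hypothetical "execute now" function following the three cases above."""
    if x.model is ExecutionModel.EAGER:
        return x  # values already exist, so this is a no-op
    if x.model is ExecutionModel.FULLY_LAZY:
        raise RuntimeError("values are not available in a fully lazy model")
    # Hybrid: run the deferred computation and return an eager result.
    return ToyArray(x._thunk, ExecutionModel.EAGER)
```

In this sketch only the hybrid case does any work, which matches the observation that it is the only mode where such an API may be needed.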
@lithomas1 asked this in dask/dask#11356 and the response from @phofl was
@fjetter said in dask/dask#11298 (comment)
|
Thanks for the pointers @lucascolley. So that seems to be a fairly conclusive "we have some experience and won't do that" - which is fair enough. A few thoughts on those discussions:
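# Assumption: is_dask_array is the helper provided by array_api_compat.
from array_api_compat import is_dask_array

def compute(x):
    # compute() returns a new, concrete array, so its result must be returned.
    if is_dask_array(x):
        return x.compute()
    return x

def some_func(x):
    if compute(x).shape[0] > 5:
        # we couldn't avoid the `if` conditional in this logic
        ...
|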
Thanks Ralf, that makes sense. I'm pretty convinced that we don't want to add As @asmeurer mentioned previously, we still need to decide in SciPy whether we are comfortable with doing |
Thanks Lucas. I'll reopen this for now to signal we're not done with this discussion. I've given my input, but at least @hameerabbasi and @TomNicholas seem to have needs that perhaps aren't met yet. We may also want to improve the documentation around this topic. |
This aligns with the point I was trying to make above (#839 (comment)), which is that a library like scipy calling So I think that if scipy encounters this situation in one of its functions, it should either do nothing, i.e., require the user to materialize the array themselves before calling the function, or register the function itself as a lazy function (but how that would work would be array library dependent). |
I think it should be possible to use the introspection API to add different modes, where we raise errors by default but a user can opt-in to allowing us to force computation. The same can be said for device transfers via DLPack. |
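A minimal sketch of the "raise by default, opt in to forced computation" idea might look like the following. Everything here is an assumption made for illustration: the module-level flag stands in for whatever the introspection API would actually expose, and allow_forced_compute, LazyToy and needs_values are hypothetical names, not existing SciPy or array API functions.

```python
from contextlib import contextmanager

# Hypothetical library-level setting: forcing computation is off by default.
_allow_forced_compute = False


@contextmanager
def allow_forced_compute():
    """Hypothetical user opt-in that permits forcing lazy arrays to compute."""
    global _allow_forced_compute
    previous = _allow_forced_compute
    _allow_forced_compute = True
    try:
        yield
    finally:
        _allow_forced_compute = previous  # restore the earlier setting


class LazyToy:
    """Toy lazy container with a Dask-like compute() method."""

    def __init__(self, values):
        self._values = values

    def compute(self):
        return list(self._values)


def needs_values(x):
    """Library function whose control flow depends on concrete values."""
    if hasattr(x, "compute"):
        if not _allow_forced_compute:
            raise RuntimeError(
                "lazy input: compute it first or opt in with allow_forced_compute()"
            )
        x = x.compute()
    return len(x) > 5
```

With this shape, a call on a lazy input fails loudly unless the user wraps it in `with allow_forced_compute():`, so an expensive computation is never triggered without the user asking for it.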
Preface
I do not think that I am the best person to champion this effort, as I am far from the most informed person here on Lazy arrays. I'm probably missing important things, but I would like to start this discussion as I think that it is an important topic.
The problem
The problem of mixing computation requiring data-dependent properties with lazy execution is discussed in detail elsewhere:
A possible solution
Add the function materialize(x: Array) to the top level of the API.
Behaviour:
Prior art
Concerns
device kwargs in NumPy), but perhaps this is too obtrusive?
Alternatives
compute* or a method on the array object. Maybe with options for partial materialization (if that's a thing)?
cc @TomNicholas @hameerabbasi @rgommers