# Proposal: Report non-fatal errors from the WebNN timeline #778
## Comments
Thanks @a-sully for the proposal. A couple of thoughts.

Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU).

Which WebNN backend is expected to fail after `build()` but before execution? That seems undesirable. Even if we capture errors occurring between the build and dispatch phases, the web developer needs some guarantee about which state is affected before they can handle the error.
---

That's more or less what I've proposed :) See

The bigger problem we're seeing right now is backends failing during graph execution. That being said, there's a class of failures where an inconsistency in system state (or assumed system state, as in the example below) between `build()` and `dispatch()` can cause execution to fail.

I agree it's undesirable, but I argue that it's unavoidable:

Could you elaborate on what you mean by this?
---

@a-sully, thank you very much for putting this together.

Re: labeling objects. I am always in favor of giving web developers a way to label objects and have those labels used in subsequent diagnostic output or errors flagged by the browser. Should we derive

For the scenario you outlined where build succeeds but dispatch fails, is the failure a product of the input being bad or the graph being bad? Would failing dispatches subsequently succeed with an input containing different values, or is the input object doomed to fail no matter what graph you use it with? Knowing this would inform which object we should put into an error state, or propagate error state to.

When the errors happen, are they recoverable by retrying some or all of the previous steps the developer took to get to that point? What guidance should we provide as to what they should try next?
---

Yes, I should have been more explicit: I meant to propose this in the Tentative IDL section. What I proposed adding there should be augmented by this:

```webidl
interface mixin MLObjectBase {
  attribute USVString label;
};

MLTensor includes MLObjectBase;
MLGraph includes MLObjectBase;
```
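To illustrate the intent, here is a minimal sketch of how such labels might show up in practice. Only the `label` attribute comes from the mixin above; the descriptor fields and the error-message format are assumptions for illustration:

```js
// Hypothetical usage of the proposed MLObjectBase mixin; only `label`
// comes from the IDL above, the rest is illustrative.
const graph = await builder.build({output});
graph.label = 'text-encoder';

const tensor = await context.createTensor({
  dataType: 'float32',
  shape: [1, 512],
  writable: true,
});
tensor.label = 'prompt-embedding';

// A user agent could then produce diagnostics such as:
//   dispatch() failed: tensor 'prompt-embedding' (graph 'text-encoder')
//   is in an errored state
```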
The former is case 4 and the latter is case 3 from the State of the World section. Notably, it's hard to distinguish these cases at runtime:

I'm tempted to say we should always invalidate the `MLGraph`. So, in the example above we'd invalidate `graph1`.

This relates to this open question:
Ideally the error message should point to the cause of the failure. If we always invalidate the cause of a failure, then developers can use string-matching to identify that object.

The least-bad option I can think of (suggestions welcome!) is to store the "last error" on the `MLTensor`:

```webidl
partial interface MLTensor {
  attribute USVString causeOfLastError;
};
```

Alternatively, we could add some sort of error-checking getter:

```js
// If this dispatch fails, `graph1` is invalidated.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

// This cannot be known synchronously and seems likely to be misused.
// If this is async, it is unnecessarily expensive.
graph1.isValid();
```

WDYT?
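To make the `causeOfLastError` option concrete, here is a sketch of the string-matching flow it would enable. Note that `causeOfLastError` is only the tentative attribute proposed above, and the error-string convention is invented:

```js
// Sketch only: assumes the tentative `causeOfLastError` attribute and an
// invented convention that it carries the label of the failed dispatch.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

try {
  await context.readTensor(tensorA);
} catch (e) {
  if (tensorA.causeOfLastError?.includes('foo')) {
    // The errored state came from the 'foo' dispatch; react accordingly.
  }
}
```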
---

The "isValid" approach is similar to `getError()` in WebGL. That approach was abandoned in WebGPU because web developers often couldn't determine which object had caused an error (context or graph?). This uncertainty made it difficult for sites to respond appropriately and frequently led to excessive error checks scattered throughout the code.

Another common approach is to register a callback which is invoked for the specific errors the web developer can react to; if an error goes unhandled, it propagates to become a context loss:

```js
let errorQueue = ctx.createErrorQueue();
errorQueue.pushErrorFilter('internal');
ctx.dispatch(graph, inputs, outputs);
errorQueue.popErrorFilter(internalErrorHandler);
```
---

@a-sully, for dispatch-specific errors: having a pushErrorScope/popErrorScope as @bbernhar describes is similar to what WebGPU does and seems like it gives us the best of both worlds, though it can be improved by additionally providing the labels of the objects involved. The WebGPU Error Handling best practices article is a good read on the subject.

For case 4, is it possible that a particular

If there exist platforms where
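For reference, WebGPU's error-scope API works as below; the WebNN analogue sketched alongside it uses assumed method names on `MLContext`:

```js
// Real WebGPU API, for comparison:
device.pushErrorScope('internal');
device.queue.submit([commandBuffer]);
const gpuError = await device.popErrorScope(); // null if no error occurred

// Hypothetical WebNN analogue (pushErrorScope/popErrorScope on MLContext
// are assumptions, not spec'd):
context.pushErrorScope('internal');
context.dispatch(graph, inputs, outputs);
const mlError = await context.popErrorScope();
if (mlError) {
  console.warn(`dispatch failed: ${mlError.message}`);
}
```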
---

Yes. I didn't explain this well, but I've been using "invalid" and "errored" to represent different error states. This proposal was initially aimed at the latter, though the distinction got muddier once we started talking about invalidating objects.

"Invalid" objects could be specified to behave as if their respective `destroy()` method had been called.

Meanwhile, "errored" objects (for now, only `MLTensor`s)

I think this is true of all the WebNN backends in the current Chromium implementation? It's true of CoreML and TFLite, and I assume it's also true of DML for case 2 failures. Having a promise similar to what we currently have on the

...there's then a question of whether we need to care about "errored" objects at all. The original reasoning for using this cascading error failure mechanism was to:

If we invalidate the

The question is then whether we care to avoid exposing the contents of
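A short sketch of how the two states might behave from script, under the definitions above (none of this is specified; it only illustrates the distinction):

```js
// Assumed behavior, for illustration only.

// "Invalid": behaves as if destroy() had been called; every subsequent
// use fails, regardless of input values.
context.dispatch(invalidGraph, inputs, outputs); // always fails

// "Errored": the tensor's contents are unreliable until overwritten.
context.writeTensor(erroredTensor, newData); // clears the errored state
await context.readTensor(erroredTensor);     // well-defined again
```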
@bbernhar @RafaelCintron, what does WebGPU do to resources involved in non-fatal errors?
---

Good question. WebGPU resources can be invalidated if they cannot be created (e.g., due to OOM) or if the device is lost or destroyed. If the operation is non-fatal (i.e., a validation failure or OOM), they could also remain valid. However, "non-fatal" does not include internal errors raised during queue operations; pipeline creation is the only exception where an "internal error" is not considered a device loss, AFAIK. Similarly, `dispatch()` could raise an
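Concretely, that invalidation is observable in WebGPU without losing the device. This is standard WebGPU error-scope usage; the buffer size is chosen on the assumption that the allocation exhausts available memory:

```js
// Standard WebGPU: an OOM during createBuffer() invalidates the buffer
// but leaves the device usable.
device.pushErrorScope('out-of-memory');
const buffer = device.createBuffer({
  size: 256 * 1024 * 1024, // assume this exhausts available GPU memory
  usage: GPUBufferUsage.STORAGE,
});
const error = await device.popErrorScope(); // GPUOutOfMemoryError if it did
if (error) {
  // `buffer` is invalid here, but `device` is not lost.
}
```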
---

Nothing. If a validation error happens in the GPU process, the error is raised to the error scope and the call is ignored.

In Chromium, there are places in the DML backend where errors during graph building can cause the

If there exist platforms where an error during dispatch or readTensor/writeTensor results in undefined tensor output and the browser is able to detect that this has happened, we can have the browser clear the output tensors to defined values such as zeros. If we can subsequently determine with certainty that the graph will no longer produce valid output ever again, we should mark it as invalid and leave it in the same effective state as a destroyed graph.

If the system gets into a state where it is not clear whether forward progress can be made and random tensors may be in undefined states, then marking the context as "lost" and starting over might be the safest option.

WebNN has been good at surfacing errors as early as possible, during graph building. If that's not always possible due to platform limitations, then introducing an "errorScope" (like WebGPU's) or an error queue, with errors referring to labeled objects, seems like the best alternative.
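A sketch of the escalation ladder this comment describes; `context.lost` and `graph.destroy()` are in the current spec, while the graph-validity check is a placeholder for whatever signal gets specified:

```js
// Escalation ladder sketch. `graphIsKnownBad()`, `rebuildGraph()`, and
// `recreateContextAndGraphs()` are placeholders, not spec'd API.
context.lost.then(() => {
  // Not clear forward progress is possible: start over from scratch.
  recreateContextAndGraphs();
});

try {
  context.dispatch(graph, inputs, outputs);
  await context.readTensor(outputs.out);
} catch (e) {
  if (graphIsKnownBad(graph)) {
    // The graph will never produce valid output again: treat it like a
    // destroyed graph and rebuild.
    graph.destroy();
    graph = await rebuildGraph();
  }
  // Otherwise the browser may have zeroed the output tensors; retrying
  // the dispatch with the same graph may still succeed.
}
```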
---

## The Problem (see #477)

Our current method for surfacing `dispatch()` errors is to "lose" the `MLContext`. As I mentioned in #754 (comment), I don't think it makes sense for this to be the only option for surfacing errors from `dispatch()`:

- Losing the `MLContext` is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the `MLContext` is always the right blast radius for a `dispatch()` error.
- There is also no way whatsoever to surface an error from `writeTensor()`!

## State of the World
Here are examples of how I've observed `dispatch()` fail in the current Chromium implementation:

1. Catastrophic failures, where losing the `MLContext` may indeed be the only option.
2. Failures where the system is running out of resources (e.g. memory). Some thoughts on how to react:
   - Destroy the `MLContext`, e.g. if you assume an OOM is imminent.
   - Destroy the `MLGraph`, e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM.
3. Failures which ideally would be caught by `MLGraphBuilder.build()`, but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
   - Losing the `MLContext` is not a useful option.
   - Destroying the `MLGraph` is reasonable, especially if you're confident it will never execute successfully.
4. Failures which depend on input values, e.g. out-of-bounds indices passed to `gather` ops (see #486, "Add 'implementation consideration' about how out-of-bound indices of Gather/Scatter should be handled"). Chromium's TFLite backend must address this, which it does not (yet); a minimal sketch of this case follows the list. Some thoughts on how to react:
   - Losing the `MLContext` is not a useful option.
   - The caller may retry `dispatch()` with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the `MLGraph`.
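As referenced in case 4 above, a minimal sketch of an input-value-dependent failure; the graph, tensor names, and index values are invented for illustration:

```js
// Case 4 sketch: the same graph can succeed or fail depending on input
// *values* rather than shapes. All names and values are illustrative.
const indices = await context.createTensor({
  dataType: 'int32',
  shape: [4],
  writable: true,
});

// In-bounds indices: dispatch succeeds.
context.writeTensor(indices, new Int32Array([0, 1, 2, 3]));
context.dispatch(gatherGraph, {indices}, {out});

// Out-of-bounds indices: on some backends this dispatch fails at
// execution time, even though build() succeeded earlier.
context.writeTensor(indices, new Int32Array([0, 1, 2, 999999]));
context.dispatch(gatherGraph, {indices}, {out});
```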
## Observations

- Surfacing errors without losing the `MLContext` (or the entire GPU process) would be useful.
- The same underlying problem may surface differently (e.g. as `dispatch()` failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible.
- A `where` operator may fail to hit the affected branch(es).
- Destroying the `MLGraph` is a reasonable (though not strictly necessary) response to examples 2, 3, and 4.
- If `dispatch()` fails but its output tensors are never read back...
- If `dispatch()` fails but its output tensors are later overwritten by new data...
- `readTensor()`
- `importExternalBuffer()`
## Proposal

- If an operation (`writeTensor()`, `dispatch()`) catastrophically fails, continue to lose the `MLContext`.
- Otherwise, the affected `MLTensor`s (though possibly also an `MLGraph`, TBD) are put into an errored state.
- The errored state is cleared when `writeTensor()` writes new data.

Example:
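A minimal sketch of the proposed errored-state flow described by the bullets above; all behavior shown here is proposed, not shipped API:

```js
// Sketch of the proposal; nothing here is shipped behavior.
context.dispatch(graph, {input}, {out: tensorA});

// Suppose the dispatch fails non-fatally on the WebNN timeline: tensorA
// is now "errored", and reading it back surfaces the error...
try {
  await context.readTensor(tensorA);
} catch (e) {
  // ...while the MLContext itself remains usable (not lost).
}

// Per the last bullet, writing new data clears the errored state.
context.writeTensor(tensorA, new Float32Array(1024));
await context.readTensor(tensorA); // well-defined again
```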
## Open Questions

- Should `graph1` be put into an errored state, too?
- What if `graph1` will always fail to execute?
- What about the `importExternalBuffer()` method? `GPUError` scopes will be able to handle this case.
- Should `createBuffer()` be made synchronous and use this error reporting mechanism?
- The errored state is not synchronously observable from script (e.g. via an `MLTensor.error` attribute), since the errored state exists on the WebNN timeline. Is that sufficient?

## Tentative IDL