Cleanup TROUBLESHOOTING.md a bit (pytorch#1105)
asuhan authored and ailzhang committed Sep 26, 2019
1 parent bd868b3 commit bceac47
Showing 1 changed file (TROUBLESHOOTING.md) with 13 additions and 17 deletions.
@@ -12,7 +12,7 @@ However, constraints in XLA/hardware and the lazy evaluation model suggest certa

_Possible sources_:
* Direct or indirect uses of `nonzero` introduce dynamic shapes; for example, masked indexing `base[index]` where `index` is a mask tensor.
-* Loops with a different number of iterations between steps can result in different execution graphs thus require recompilations.
+* Loops with a different number of iterations between steps can result in different execution graphs, thus require recompilations.

_Solution_:
* Tensor shapes should be the same between iterations, or a low number of shape variations should be used.
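To illustrate the dynamic-shape issue above, here is a minimal sketch (plain PyTorch on CPU; the tensor values are made up for illustration) contrasting masked indexing with a shape-preserving `torch.where` rewrite:

```python
import torch

base = torch.tensor([1.0, -2.0, 3.0, -4.0])
mask = base > 0

# Masked indexing: the output size depends on how many elements of `mask`
# are True, so every new mask pattern yields a new shape and, on an XLA
# device, a fresh compilation.
dynamic = base[mask]  # shape (2,) for these values, but value-dependent

# Shape-preserving alternative: torch.where keeps the full (4,) shape no
# matter what the values are, so the compiled graph can be reused.
static = torch.where(mask, base, torch.zeros_like(base))

print(dynamic.shape)  # torch.Size([2])
print(static.shape)   # torch.Size([4])
```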
@@ -25,11 +25,11 @@ However, constraints in XLA/hardware and the lazy evaluation model suggest certa

_Possible sources_:

-- The `item()` operation explicitly asks for evaluating the result. Don't use it unless it's necessary.
+- The `item()` operation explicitly asks to evaluate the result. Don't use it unless it's necessary.

_Solution_:

-- For most ops we can lower them to XLA to fix it. Checkout [metrics report section](#metrics-report) to find out the missing ops and open a feature request on github.
+- For most ops we can lower them to XLA to fix it. Checkout [metrics report section](#metrics-report) to find out the missing ops and open a feature request on [GitHub](https://github.com/pytorch/xla/issues).
- Even when a PyTorch tensor is known as a scalar, avoid using `tensor.item()`. Keep it as a tensor and use tensor operations on it.
- Use `torch.where` to substitute control flow when applicable.
E.g., the control flow with `item()` used in [clip_grad_norm_](https://github.com/pytorch/pytorch/blob/de19eeee99a2a282fc441f637b23d8e50c75ecd1/torch/nn/utils/clip_grad.py#L33) can simply be replaced by `torch.where`, with a dramatic performance improvement.
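As a hedged illustration of that rewrite (not the actual `clip_grad_norm_` implementation; the toy gradient values and the simplified norm handling are made up), the branch-free formulation could look like:

```python
import torch

grads = torch.tensor([3.0, 4.0])  # toy gradient vector, norm 5.0
max_norm = 1.0
total_norm = grads.norm()

# item()-based control flow forces a device sync on XLA:
#   if total_norm.item() > max_norm:
#       grads.mul_(max_norm / total_norm)
# Equivalent branch-free formulation that stays on-device:
clip_coef = torch.where(total_norm > max_norm,
                        max_norm / total_norm,
                        torch.ones_like(total_norm))
clipped = grads * clip_coef
print(clipped)  # tensor([0.6000, 0.8000])
```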
@@ -58,11 +58,11 @@ However, constraints in XLA/hardware and the lazy evaluation model suggest certa

# Debugging

-Sometimes bad things happen and a deeper look into the _PyTorch/TPU_ stack is necessary.
+Sometimes, bad things happen and a deeper look into the _PyTorch/TPU_ stack is necessary.
In order to do that, _PyTorch/TPU_ has a series of environment variables and function calls
which can help in understanding its internal behavior.

-Note that the infromation in this section is subject to be removed in future releases of
+Note that the information in this section is subject to be removed in future releases of
the _PyTorch/TPU_ software, since many of them are peculiar to a given internal implementation
which might change.

@@ -77,13 +77,12 @@ torch_xla._XLAC._xla_metrics_report()

Printing out that information can help during the debug phases and while reporting issues.

-The information included within the metrics report include things like
+The information included within the metrics report includes things like:
- how many times we issue _XLA_ compilations and the time spent on issuing.
- how many times we execute and the time spent on execution.
-- how many device data handles we create/destroy etc...
+- how many device data handles we create/destroy etc.

-These information is reported in terms of percentiles of the samples.
-An example is:
+This information is reported in terms of percentiles of the samples. An example is:

```
Metric: CompileTime
@@ -94,9 +93,7 @@ Metric: CompileTime
Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us
```

-The _PyTorch/TPU_ stack also has counters, which are named integer variables tracks
-internal software status.
-Example:
+We also provide counters, which are named integer variables which track internal software status. For example:

```
Counter: CachedSyncTensors
@@ -107,17 +104,16 @@ In this report, any counter that starts with `aten::`
indicates a context switch between the XLA device and CPU, which can be a
potential performance optimization area in the model code.

-Counters are useful to understand which operations the _PyTorch/TPU_ stack is routing
-back to the CPU engine of _PyTorch_.
-Things which looks like a _C++_ namespace are part of this category:
+Counters are useful to understand which operations are routed back to the CPU engine of _PyTorch_.
+They are fully qualified with their C++ namespace:

```
Counter: aten::nonzero
Value: 33
```

If you see `aten::` ops other than `nonzero` and `_local_scalar_dense`, that usually means a missing
-lowering in PyTorch/XLA, feel free to open a feature request for it on github issues.
+lowering in PyTorch/XLA. Feel free to open a feature request for it on [GitHub issues](https://github.com/pytorch/xla/issues).
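For instance, the `aten::` entries can be pulled out of a report with a few lines of plain Python (the report text below is a made-up excerpt in the format shown above):

```python
# Scan a metrics report for aten:: counters, i.e. ops routed back to the
# PyTorch CPU engine and therefore candidates for a lowering request.
report = """\
Counter: CachedSyncTensors
  Value: 395
Counter: aten::nonzero
  Value: 33
"""

aten_ops = [line.split()[1] for line in report.splitlines()
            if line.startswith("Counter: aten::")]
print(aten_ops)  # ['aten::nonzero']
```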

## Environment Variables

@@ -170,7 +166,7 @@ only be enabled for debugging.
## Retrieving Stack Traces

In the event that the _PyTorch_ process is hanging, it might be useful to include the stack
-traces together with the _Github_ issue.
+traces together with the GitHub issue.

The first thing to do is find out which PID the _PyTorch_ process is associated with. Using the `ps`
command it is possible to find that information. It will be a _python_ process running your
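A possible sequence is sketched below; the script name `train.py` and the use of `gdb` for the actual trace dump are assumptions for illustration, not part of the official instructions:

```shell
# Find the PID of the python process running the (hypothetical) training
# script train.py; substitute your own script name.
PID=$(pgrep -f train.py | head -n 1)
echo "Training PID: ${PID:-not found}"

# With the PID in hand, native stack traces for all threads can be dumped
# with gdb, if it is installed:
#   gdb -p "$PID" -batch -ex 'thread apply all bt' > stacks.txt
```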
