Doc improvements suggested in the meeting. (pytorch#1107)
* Doc improvements suggested in the meeting.
ailzhang authored Sep 27, 2019
1 parent 67a0c03 commit 7ad7018
Showing 3 changed files with 111 additions and 96 deletions.
54 changes: 33 additions & 21 deletions CONTRIBUTING.md
@@ -1,23 +1,10 @@
## C++ Style Guide
# Contribute To PyTorch/XLA

`pytorch/xla` uses `clang-format-7` with a customized style config.
If your PR touches the C++ source files, please run the following commands before submitting a PR.

```Shell
# If your PR only changes foo.cpp, run the following in the xla/ folder
clang-format-7 -i -style=file /PATH/TO/foo.cpp
# To format all cpp files, run the following in the xla/ folder
find . -name '*.cpp' -o -name '*.h' | xargs clang-format-7 -i -style=file
```
We appreciate all contributions. If you are planning to contribute a bug fix for an open issue, please comment on the thread and we're happy to provide any guidance.
You are very welcome to pick issues from [good first issue](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and [help wanted](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels.

## Python Style Guide

`pytorch/xla` uses `yapf` with a customized style config.
If your PR touches the Python source files, please run the following command before submitting a PR.

```Shell
#TODO:
```
If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without prior discussion might result in a rejected PR, because we might be taking the core in a different direction than you are aware of.

## Building Manually

@@ -36,7 +23,7 @@ To build from source:
git clone --recursive https://github.com/pytorch/xla.git
```

## Building Docker Image
### Building Docker Image

* We provide a Dockerfile in `docker/` that you can use to build images as follows:
@@ -45,15 +32,15 @@ To build from source:
docker build -t torch-xla -f docker/Dockerfile .
```
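
Once the image is built, you can start a container from it; a minimal sketch (the image tag matches the build command above):

```Shell
# Start an interactive shell inside the freshly built image.
docker run --rm -it torch-xla /bin/bash
```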

## Building With Script
### Building With Script

* To build and install `torch` and `torch_xla`:

```Shell
xla/scripts/build_torch_wheels.sh
```

## Build From Source
### Build From Source

* Apply PyTorch patches:

@@ -101,4 +88,29 @@ To build from source:
python setup.py install
```

## Before Submitting A Pull Request

In the `pytorch/xla` repo we enforce coding style for both C++ and Python files. Please try to format
your code before submitting a pull request.

### C++ Style Guide

`pytorch/xla` uses `clang-format-7` with a customized style config.
If your PR touches the C++ source files, please run the following commands before submitting a PR.

```Shell
# If your PR only changes foo.cpp, run the following in the xla/ folder
clang-format-7 -i -style=file /PATH/TO/foo.cpp
# To format all cpp files, run the following in the xla/ folder
find . -name '*.cpp' -o -name '*.h' | xargs clang-format-7 -i -style=file
```

### Python Style Guide

`pytorch/xla` uses `yapf` with a customized style config.
If your PR touches the Python source files, please run the following command before submitting a PR.

```Shell
#TODO:
```
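
Until the exact command is documented here, a typical `yapf` invocation would look something like the following (an assumption on our part, not the project's official command; it relies on the style config at the repo root):

```Shell
# Reformat all Python files in place, recursively, from the xla/ folder.
yapf -i -r .
```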

17 changes: 6 additions & 11 deletions README.md
@@ -140,7 +140,7 @@ post](https://cloud.google.com/blog/products/ai-machine-learning/googles-scalabl

---

# Build Manually
## Build Manually

Please note that we have nightly releases available, so users usually don't have to build manually; building from source is mainly for OSS contributors.
Please refer to the [contribution guide](CONTRIBUTING.md) for instructions on building from source.
@@ -173,23 +173,18 @@ it is suggested for you to select the _Nightly_ builds when you create a Cloud T
Then run `test/run_tests.sh` and `test/cpp/run_tests.sh` to verify the setup is working.
## PyTorch/XLA API And Best Practice
# PyTorch/XLA API And Best Practice
Please check out the [API Guideline](API_GUIDE.md) for best practices when writing models to run on TPU & TPU Pod devices.
## Troubleshooting
# Troubleshooting
If you see bad performance when using PyTorch/XLA, please check out the [troubleshooting guide](TROUBLESHOOTING.md) for how to avoid common pitfalls and how to debug.
## Communication
# Communication
We use GitHub issues to communicate with users and open source contributors. Please file an issue for questions, bug reports, feature requests, install issues, RFCs, thoughts, etc.
## Contributing
# Contributing
We appreciate all contributions. If you are planning to contribute a bug fix for an open issue, please comment on the thread and we're happy to provide any guidance. You are very welcome to pick issues from the `good first issue` and `help wanted` labels.

If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without prior discussion might result in a rejected PR, because we might be taking the core in a different direction than you are aware of.

Please refer to [contribution guide](CONTRIBUTING.md) for detailed guidelines to submit PRs.
Please refer to [contribution guide](CONTRIBUTING.md) for detailed instructions.
136 changes: 72 additions & 64 deletions TROUBLESHOOTING.md
@@ -1,7 +1,68 @@
# Performance Caveats
# Troubleshooting

Note that the information in this section is subject to removal in future releases of the _PyTorch/XLA_ software,
since much of it is peculiar to a given internal implementation, which might change.

To diagnose issues, we can use the execution metrics and counters provided by _PyTorch/XLA_.
The **first thing** to check when a model is slow is to generate a metrics report.

A metrics report is extremely helpful in diagnosing issues. Please try to include it in the bug
report you send us if you have one.

## Get A Metrics Report

Put the following line in your program to generate a report:

```Python
print(torch_xla._XLAC._xla_metrics_report())
```
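
If you want to attach the report to a bug report, one option is to dump it to a file at the end of your script; a minimal sketch (the output path is just an example):

```Python
import torch_xla

# ... run the model for a number of steps first, so the report has data ...

# Write the metrics report to a file that can be attached to a bug report.
with open('/tmp/xla_metrics_report.txt', 'w') as f:
    f.write(torch_xla._XLAC._xla_metrics_report())
```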

## Understand The Metrics Report

The report includes things like:
- how many times we issue _XLA_ compilations and the time spent on issuing
- how many times we execute computations and the time spent on execution
- how many device data handles we create/destroy, etc.

This information is reported in terms of percentiles of the samples. An example is:

```
Metric: CompileTime
TotalSamples: 202
Counter: 06m09s401ms746.001us
ValueRate: 778ms572.062us / second
Rate: 0.425201 / second
Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us
```

We also provide counters, which are named integer variables that track internal software status. For example:

```
Counter: CachedSyncTensors
Value: 395
```

In this report, any counter that starts with `aten::`
indicates a context switch between the XLA device and CPU, which can be a
potential performance optimization area in the model code.

Counters are useful to understand which operations are routed back to the CPU engine of _PyTorch_.
They are fully qualified with their C++ namespace:

```
Counter: aten::nonzero
Value: 33
```

If you see `aten::` ops other than `nonzero` and `_local_scalar_dense`, that usually means a missing
lowering in PyTorch/XLA. Feel free to open a feature request for it on [GitHub issues](https://github.com/pytorch/xla/issues).
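
A quick way to spot such fallbacks is to scan the report text for `aten::` counters; a minimal sketch based on the report format shown above:

```Python
import torch_xla

report = torch_xla._XLAC._xla_metrics_report()

# Collect the names of counters that indicate a CPU fallback (aten:: ops).
fallback_ops = [line.split()[1]
                for line in report.splitlines()
                if line.strip().startswith('Counter: aten::')]
print(fallback_ops)
```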

## Known Performance Caveats

PyTorch/XLA behaves semantically like regular PyTorch, and XLA tensors share the full tensor interface with CPU & GPU tensors.
However, constraints in XLA/hardware and the lazy evaluation model suggest certain patterns might result in bad performance:
However, constraints in XLA/hardware and the lazy evaluation model suggest certain patterns might result in bad performance.

If your model shows bad performance, keep in mind the following caveats:

1. **XLA/TPU yields degraded performance with too many recompilations.**

@@ -56,68 +117,15 @@ However, constraints in XLA/hardware and the lazy evaluation model suggest certa
* When the dataset is small and there are too few steps, this may result in a no-op epoch. Therefore, it is better to use
  small batch sizes in those cases.

# Debugging

Sometimes, bad things happen and a deeper look into the _PyTorch/TPU_ stack is necessary.
In order to do that, _PyTorch/TPU_ has a series of environment variables and function calls
which can help in understanding its internal behavior.

Note that the information in this section is subject to be removed in future releases of
the _PyTorch/TPU_ software, since many of them are peculiar to a given internal implementation
which might change.

## Metrics Report

The _PyTorch/TPU_ stack keeps a series of metrics and counters during its execution, and
the following API returns a string representation of them:

```Python
torch_xla._XLAC._xla_metrics_report()
```

Printing out that information can help during the debug phases and while reporting issues.

The information included in the metrics report covers things like:
- how many times we issue _XLA_ compilations and the time spent on issuing
- how many times we execute computations and the time spent on execution
- how many device data handles we create/destroy, etc.

This information is reported in terms of percentiles of the samples. An example is:

```
Metric: CompileTime
TotalSamples: 202
Counter: 06m09s401ms746.001us
ValueRate: 778ms572.062us / second
Rate: 0.425201 / second
Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us
```

We also provide counters, which are named integer variables that track internal software status. For example:

```
Counter: CachedSyncTensors
Value: 395
```
## More Debugging Tools

In this report, any counter that starts with `aten::`
indicates a context switch between the XLA device and CPU, which can be a
potential performance optimization area in the model code.

Counters are useful to understand which operations are routed back to the CPU engine of _PyTorch_.
They are fully qualified with their C++ namespace:

```
Counter: aten::nonzero
Value: 33
```

If you see `aten::` ops other than `nonzero` and `_local_scalar_dense`, that usually means a missing
lowering in PyTorch/XLA. Feel free to open a feature request for it on [GitHub issues](https://github.com/pytorch/xla/issues).
We don't expect users to use the tools in this section to debug their models, but we might ask for
them when you submit a bug report, since they provide additional information that the metrics report
doesn't have.

## Environment Variables
### Environment Variables

There are also a number of environment variables which control the behavior of the _PyTorch/TPU_
There are also a number of environment variables which control the behavior of the _PyTorch/XLA_
software stack.

Setting such variables will cause different degrees of performance degradation, so they should
@@ -140,12 +148,12 @@ only be enabled for debugging.
* ```XLA_METRICS_FILE```: If set, the path to a local file where the internal metrics will be
saved at every step. Metrics will be appended to the file, if already existing.

* ```GET_TENSORS_OPBYOP```: Enables pure _OpByOp_ dispatch. The _PyTorch/TPU_ software tries to
* ```GET_TENSORS_OPBYOP```: Enables pure _OpByOp_ dispatch. The _PyTorch/XLA_ software tries to
fuse together many _PyTorch_ operations into a single computation graph, but sometimes, either
for debugging, or in case the _PyTorch_ code has a very dynamic nature (in shapes or graph
terms), it is better to force the execution in _OpByOp_ mode (every IR node is lowered into
a separate _XLA_ computation, and chain-executed). This environment variable, if set to 1,
enables _OpByOp_ during the "get tensors" operation (the operation used by _PyTorch/TPU_ to
enables _OpByOp_ during the "get tensors" operation (the operation used by _PyTorch/XLA_ to
fetch intermediate values back from the _TPU_ device into _PyTorch_ CPU tensors).

* ```SYNC_TENSORS_OPBYOP```: The same as _GET_TENSORS_OPBYOP_ but for "sync tensors" operation
@@ -163,7 +171,7 @@ only be enabled for debugging.
expensive, so setting this flag might help. The user should verify that truncating
to 32-bit values is a valid operation according to the use of _PyTorch_ _Long_ values in it.

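For example, to dump per-step metrics to a file and force _OpByOp_ dispatch when fetching tensors for a single run (`train.py` is just a placeholder for your own script):

```Shell
XLA_METRICS_FILE=/tmp/xla_metrics.txt GET_TENSORS_OPBYOP=1 python train.py
```
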
## Retrieving Stack Traces
### Retrieving Stack Traces

In the event that the _PyTorch_ process is hanging, it might be useful to include the stack
traces together with the GitHub issue.
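
One generic way to capture such traces (an assumption on our part, not necessarily the method described in the rest of this section; it requires `gdb` and the PID of the hanging process) is:

```Shell
# Attach to the hanging process and dump backtraces for all threads.
gdb -p <PID> -batch -ex "thread apply all bt"
```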
