Doc improvements suggested in the meeting. (pytorch#1107)
* Doc improvements suggested in the meeting.
ailzhang authored Sep 27, 2019
1 parent 67a0c03 commit 7ad7018
Showing 3 changed files with 111 additions and 96 deletions.
54 changes: 33 additions & 21 deletions CONTRIBUTING.md
@@ -1,23 +1,10 @@
## C++ Style Guide
# Contribute To PyTorch/XLA

`pytorch/xla` uses `clang-format-7` with a customized style config.
If your PR touches the C++ source files, please run the following commands before submitting a PR.

```Shell
# If your PR only changes foo.cpp, run the following in the xla/ folder
clang-format-7 -i -style=file /PATH/TO/foo.cpp
# To format all cpp files, run the following in the xla/ folder
find . -name '*.cpp' -o -name '*.h' | xargs clang-format-7 -i -style=file
```
We appreciate all contributions. If you are planning to contribute a bug fix for an open issue, please comment on the thread and we're happy to provide any guidance.
You are very welcome to pick issues from [good first issue](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and [help wanted](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels.

## Python Style Guide

`pytorch/xla` uses `yapf` with a customized style config.
If your PR touches the Python source files, please run the following command before submitting a PR.

```Shell
#TODO:
```
If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without prior discussion might result in a rejected PR, because we might be taking the core in a different direction than you are aware of.

## Building Manually

@@ -36,7 +23,7 @@ To build from source:
git clone --recursive https://github.com/pytorch/xla.git
```

## Building Docker Image
### Building Docker Image

* We provide a Dockerfile in `docker/` that you can use to build images as follows:
@@ -45,15 +32,15 @@ To build from source:
docker build -t torch-xla -f docker/Dockerfile .
```
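
Once the image is built, you can start a container from it; a minimal sketch (the image tag matches the build command above):

```Shell
# Start an interactive shell inside the freshly built image.
docker run --rm -it torch-xla /bin/bash
```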

## Building With Script
### Building With Script

* To build and install `torch` and `torch_xla`:

```Shell
xla/scripts/build_torch_wheels.sh
```

## Build From Source
### Build From Source

* Apply PyTorch patches:

@@ -101,4 +88,29 @@ To build from source:
python setup.py install
```

## Before Submitting A Pull Request

In the `pytorch/xla` repo we enforce coding style for both C++ and Python files. Please try to format
your code before submitting a pull request.

### C++ Style Guide

`pytorch/xla` uses `clang-format-7` with a customized style config.
If your PR touches the C++ source files, please run the following commands before submitting a PR.

```Shell
# If your PR only changes foo.cpp, run the following in the xla/ folder
clang-format-7 -i -style=file /PATH/TO/foo.cpp
# To format all cpp files, run the following in the xla/ folder
find . -name '*.cpp' -o -name '*.h' | xargs clang-format-7 -i -style=file
```

### Python Style Guide

`pytorch/xla` uses `yapf` with a customized style config.
If your PR touches the Python source files, please run the following command before submitting a PR.

```Shell
#TODO:
```
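
Until the exact command is documented here, a typical `yapf` invocation would look something like the following (an assumption on our part, not the project's official command; it relies on the style config at the repo root):

```Shell
# Reformat all Python files in place, recursively, from the xla/ folder.
yapf -i -r .
```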

17 changes: 6 additions & 11 deletions README.md
@@ -140,7 +140,7 @@ post](https://cloud.google.com/blog/products/ai-machine-learning/googles-scalabl

---

# Build Manually
## Build Manually

Please note that we have nightly releases available, so users usually don't have to build manually; building from source is mainly for OSS contributors.
Please refer to the [contribution guide](CONTRIBUTING.md) for instructions on building from source.
@@ -173,23 +173,18 @@ it is suggested for you to select the _Nightly_ builds when you create a Cloud T
Then run `test/run_tests.sh` and `test/cpp/run_tests.sh` to verify the setup is working.
## PyTorch/XLA API And Best Practice
# PyTorch/XLA API And Best Practice
Please check out the [API Guideline](API_GUIDE.md) for best practices when writing models to run on TPU & TPU Pod devices.
## Troubleshooting
# Troubleshooting
If you see bad performance when using PyTorch/XLA, please check out the [troubleshooting guide](TROUBLESHOOTING.md) for how to avoid common pitfalls and how to debug.
## Communication
# Communication
We use GitHub issues to communicate with users and open source contributors. Please file an issue for questions, bug reports, feature requests, install issues, RFCs, thoughts, etc.
## Contributing
# Contributing
We appreciate all contributions. If you are planning to contribute a bug fix for an open issue, please comment on the thread and we're happy to provide any guidance. You are very welcome to pick issues from the `good first issue` and `help wanted` labels.

If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without prior discussion might result in a rejected PR, because we might be taking the core in a different direction than you are aware of.

Please refer to [contribution guide](CONTRIBUTING.md) for detailed guidelines to submit PRs.
Please refer to [contribution guide](CONTRIBUTING.md) for detailed instructions.
136 changes: 72 additions & 64 deletions TROUBLESHOOTING.md
@@ -1,7 +1,68 @@
# Performance Caveats
# Troubleshooting

Note that the information in this section is subject to removal in future releases of the _PyTorch/XLA_ software,
since much of it is peculiar to a given internal implementation, which might change.

To diagnose issues, we can use the execution metrics and counters provided by _PyTorch/XLA_.
The **first thing** to check when a model is slow is to generate a metrics report.

A metrics report is extremely helpful in diagnosing issues. Please try to include it in the bug
report you send us if you have one.

## Get A Metrics Report

Put the following line in your program to generate a report:

```Python
print(torch_xla._XLAC._xla_metrics_report())
```
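
If you want to attach the report to a bug report, one option is to dump it to a file at the end of your script; a minimal sketch (the output path is just an example):

```Python
import torch_xla

# ... run the model for a number of steps first, so the report has data ...

# Write the metrics report to a file that can be attached to a bug report.
with open('/tmp/xla_metrics_report.txt', 'w') as f:
    f.write(torch_xla._XLAC._xla_metrics_report())
```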

## Understand The Metrics Report

The report includes things like:
- how many times we issue _XLA_ compilations and the time spent on issuing
- how many times we execute computations and the time spent on execution
- how many device data handles we create/destroy, etc.

This information is reported in terms of percentiles of the samples. An example is:

```
Metric: CompileTime
TotalSamples: 202
Counter: 06m09s401ms746.001us
ValueRate: 778ms572.062us / second
Rate: 0.425201 / second
Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us
```

We also provide counters, which are named integer variables that track internal software status. For example:

```
Counter: CachedSyncTensors
Value: 395
```

In this report, any counter that starts with `aten::`
indicates a context switch between the XLA device and CPU, which can be a
potential performance optimization area in the model code.

Counters are useful to understand which operations are routed back to the CPU engine of _PyTorch_.
They are fully qualified with their C++ namespace:

```
Counter: aten::nonzero
Value: 33
```

If you see `aten::` ops other than `nonzero` and `_local_scalar_dense`, that usually means a missing
lowering in PyTorch/XLA. Feel free to open a feature request for it on [GitHub issues](https://github.com/pytorch/xla/issues).
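
A quick way to spot such fallbacks is to scan the report text for `aten::` counters; a minimal sketch based on the report format shown above:

```Python
import torch_xla

report = torch_xla._XLAC._xla_metrics_report()

# Collect the names of counters that indicate a CPU fallback (aten:: ops).
fallback_ops = [line.split()[1]
                for line in report.splitlines()
                if line.strip().startswith('Counter: aten::')]
print(fallback_ops)
```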

## Known Performance Caveats

PyTorch/XLA behaves semantically like regular PyTorch, and XLA tensors share the full tensor interface with CPU & GPU tensors.
However, constraints in XLA/hardware and the lazy evaluation model suggest certain patterns might result in bad performance:
However, constraints in XLA/hardware and the lazy evaluation model suggest certain patterns might result in bad performance.

If your model shows bad performance, keep in mind the following caveats:

1. **XLA/TPU yields degraded performance with too many recompilations.**

@@ -56,68 +117,15 @@ However, constraints in XLA/hardware and the lazy evaluation model suggest certa
* When the dataset is small and there are too few steps, this may result in a no-op epoch. Therefore, it is better to use
  small batch sizes in those cases.

# Debugging

Sometimes, bad things happen and a deeper look into the _PyTorch/TPU_ stack is necessary.
In order to do that, _PyTorch/TPU_ has a series of environment variables and function calls
which can help in understanding its internal behavior.

Note that the information in this section is subject to be removed in future releases of
the _PyTorch/TPU_ software, since many of them are peculiar to a given internal implementation
which might change.

## Metrics Report

The _PyTorch/TPU_ stack keeps a series of metrics and counters during its execution, and
the following API returns a string representation of them:

```Python
torch_xla._XLAC._xla_metrics_report()
```

Printing out that information can help during the debug phases and while reporting issues.

The information included in the metrics report covers things like:
- how many times we issue _XLA_ compilations and the time spent on issuing
- how many times we execute computations and the time spent on execution
- how many device data handles we create/destroy, etc.

This information is reported in terms of percentiles of the samples. An example is:

```
Metric: CompileTime
TotalSamples: 202
Counter: 06m09s401ms746.001us
ValueRate: 778ms572.062us / second
Rate: 0.425201 / second
Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us
```

We also provide counters, which are named integer variables that track internal software status. For example:

```
Counter: CachedSyncTensors
Value: 395
```
## More Debugging Tools

In this report, any counter that starts with `aten::`
indicates a context switch between the XLA device and CPU, which can be a
potential performance optimization area in the model code.

Counters are useful to understand which operations are routed back to the CPU engine of _PyTorch_.
They are fully qualified with their C++ namespace:

```
Counter: aten::nonzero
Value: 33
```

If you see `aten::` ops other than `nonzero` and `_local_scalar_dense`, that usually means a missing
lowering in PyTorch/XLA. Feel free to open a feature request for it on [GitHub issues](https://github.com/pytorch/xla/issues).
We don't expect users to use the tools in this section to debug their models, but we might ask for
them when you submit a bug report, since they provide additional information that the metrics report
doesn't have.

## Environment Variables
### Environment Variables

There are also a number of environment variables which control the behavior of the _PyTorch/TPU_
There are also a number of environment variables which control the behavior of the _PyTorch/XLA_
software stack.

Setting such variables will cause different degrees of performance degradation, so they should
@@ -140,12 +148,12 @@ only be enabled for debugging.
* ```XLA_METRICS_FILE```: If set, the path to a local file where the internal metrics will be
saved at every step. Metrics will be appended to the file, if already existing.

* ```GET_TENSORS_OPBYOP```: Enables pure _OpByOp_ dispatch. The _PyTorch/TPU_ software tries to
* ```GET_TENSORS_OPBYOP```: Enables pure _OpByOp_ dispatch. The _PyTorch/XLA_ software tries to
fuse together many _PyTorch_ operations into a single computation graph, but sometimes, either
for debugging, or in case the _PyTorch_ code has a very dynamic nature (in shapes or graph
terms), it is better to force the execution in _OpByOp_ mode (every IR node is lowered into
a separate _XLA_ computation, and chain-executed). This environment variable, if set to 1,
enables _OpByOp_ during the "get tensors" operation (the operation used by _PyTorch/TPU_ to
enables _OpByOp_ during the "get tensors" operation (the operation used by _PyTorch/XLA_ to
fetch intermediate values back from the _TPU_ device into _PyTorch_ CPU tensors).

* ```SYNC_TENSORS_OPBYOP```: The same as _GET_TENSORS_OPBYOP_ but for "sync tensors" operation
@@ -163,7 +171,7 @@ only be enabled for debugging.
expensive, so setting this flag might help. The user should verify that truncating
to 32-bit values is a valid operation according to the use of _PyTorch_ _Long_ values in it.

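For example, to dump per-step metrics to a file and force _OpByOp_ dispatch when fetching tensors for a single run (`train.py` is just a placeholder for your own script):

```Shell
XLA_METRICS_FILE=/tmp/xla_metrics.txt GET_TENSORS_OPBYOP=1 python train.py
```
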
## Retrieving Stack Traces
### Retrieving Stack Traces

In the event that the _PyTorch_ process is hanging, it might be useful to include the stack
traces together with the GitHub issue.
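
One generic way to capture such traces (an assumption on our part, not necessarily the method described in the rest of this section; it requires `gdb` and the PID of the hanging process) is:

```Shell
# Attach to the hanging process and dump backtraces for all threads.
gdb -p <PID> -batch -ex "thread apply all bt"
```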
