Commit

Apply suggestions from code review
Co-authored-by: Sally <[email protected]>
sergiodj and s-makin authored Oct 7, 2024
1 parent 344f9fd commit d2d0cfd
Showing 1 changed file with 9 additions and 9 deletions.
18 changes: 9 additions & 9 deletions explanation/performance/perf-pgo.md
@@ -6,33 +6,33 @@

It can be hard to do profiling of real world applications. Ideally, the profile should be generated by a representative workload of the program, but it's not always possible to simulate a representative workload. Moreover, the built-in instrumentation impacts the overall performance of the binary which introduces a performance penalty.

-In order to address these problems, nowadays we use tools like `perf` to "observe" what the binary is doing externally (sampling it, by monitoring events using Linux kernel's PMU --- Performance Monitoring Unit), which makes the process more suitable to be used in production environments. This technique works better than the regular built-in instrumentation, but it still has a few drawbacks that we will expand later.
+In order to address these problems, nowadays we use tools like `perf` to "observe" what the binary is doing externally (sampling it, by monitoring events using Linux kernel's Performance Monitoring Unit -- PMU), which makes the process more suitable to be used in production environments. This technique works better than the regular built-in instrumentation, but it still has a few drawbacks that we will expand later.

## Caveats

-* The purpose of this guide is to provide some basic information about what PGO is. In order to do that, we will look at a simple example using OpenSSL (more specifically, the `openssl speed` command) and learn how to do basic profiling. We will not go into a deep dive on how to build the project, and it is assumed that the reader is comfortable with compilation, compiler flags and using the command line.
+* The purpose of this guide is to provide some basic information about what PGO is and how it works. In order to do that, we will look at a simple example using OpenSSL (more specifically, the `openssl speed` command) and learn how to do basic profiling. We will not go into a deep dive on how to build the project, and it is assumed that the reader is comfortable with compilation, compiler flags and using the command line.

-* Please note that, despite being a relatively popular technique, PGO is not always the best approach to optimize a program. The profiling data generated by the workload will be extremely tied to it, which means that the optimized program might actually have worse performance when other types of workloads are executed. There is not a one-size-fits-all solution for this problem, and sometimes the best approach might be to **not** use PGO after all.
+* Despite being a relatively popular technique, PGO is not always the best approach to optimize a program. The profiling data generated by the workload will be extremely tied to it, which means that the optimized program might actually have worse performance when other types of workloads are executed. There is not a one-size-fits-all solution for this problem, and sometimes the best approach might be to **not** use PGO after all.

-* If you plan to follow along, we recommend setting up a test environment for this experiment. The ideal setup involves using a bare metal machine because it's the more direct way to collect the performance metrics. If you would like to use a virtual machine (created using QEMU/libvirt, LXD, Multipass, etc.), it will likely only work on Intel-based processors due to how vPMU (Virtual Performance Monitoring Unit) works.
+* If you plan to follow along, we recommend setting up a test environment for this experiment. The ideal setup involves using a bare metal machine because it's the more direct way to collect the performance metrics. If you would like to use a virtual machine (created using QEMU/libvirt, LXD, Multipass, etc.), it will likely only work on Intel-based processors due to how Virtual Performance Monitoring Unit (vPMU) works.

## `perf` and AutoFDO

Using `perf` to monitor a process and obtain data about its runtime workload produces data files in a specific binary format that we will call `perfdata`. Unfortunately, GCC doesn't understand this file format; instead, it expects a profile file in a format called `gcov`. To convert a `perfdata` file into a `gcov` one, we need to use a software called [`autofdo`](https://github.com/google/autofdo). This software expects the binary being profiled to obey certain constraints:

-* The binary **cannot** be stripped of its debug symbols. `autofdo` does not support separate debug information files (i.e., it can't work with Ubuntu's `.ddeb` packages), and virtually all Ubuntu packages run `strip` doing their build in order to generate the `.ddeb` packages. If you intend to profile an Ubuntu package, please keep that in mind.
+* The binary **cannot** be stripped of its debug symbols. `autofdo` does not support separate debug information files (i.e., it can't work with Ubuntu's `.ddeb` packages), and virtually all Ubuntu packages run `strip` during their build in order to generate the `.ddeb` packages.

-* The debug information file(s) **cannot** be processed by `dwz`. This is a tool whose purpose is to compress the debug information generated when building a binary, and again, virtually all Ubuntu packages use it. For this reason, it is currently not possible to profile most of Ubuntu's packages without first rebuilding them to disable `dwz` from running.
+* The debug information file(s) **cannot** be processed by `dwz`. This tool's purpose is to compress the debug information generated when building a binary, and again, virtually all Ubuntu packages use it. For this reason, it is currently not possible to profile most Ubuntu packages without first rebuilding them to disable `dwz` from running.

-* We have to be mindful of the options we pass to `perf`, particularly when it comes to recording branch prediction events. The options will likely vary depending on whether you are using an Intel or AMD processor, for example.
+* We must be mindful of the options we pass to `perf`, particularly when it comes to recording branch prediction events. The options will likely vary depending on whether you are using an Intel or AMD processor, for example.

-On top of that, the current `autofdo` version in Ubuntu (`0.19-3build3`, at the time of this writing) is not recent enough to process the `perfdata` files we will generate. There is a PPA with a newer version of `autofdo` package for Ubuntu Noble [here](https://launchpad.net/~sergiodj/+archive/ubuntu/autofdo). If you are running another version of Ubuntu and want to install a newer version of `autofdo`, you will need to build the software manually (please refer to the [upstream repository](https://github.com/google/autofdo) for further instructions).
+On top of that, the current `autofdo` version in Ubuntu (`0.19-3build3`, at the time of this writing) is not recent enough to process the `perfdata` files we will generate. There is a PPA with a newer version of `autofdo` package [for Ubuntu Noble](https://launchpad.net/~sergiodj/+archive/ubuntu/autofdo). If you are running another version of Ubuntu and want to install a newer version of `autofdo`, you will need to build the software manually (please refer to the [upstream repository](https://github.com/google/autofdo) for further instructions).
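
As a rough illustration of the `perfdata` to `gcov` flow described in this section, the sketch below prints (rather than runs) the three commands involved: sampling with `perf`, converting with `autofdo`'s `create_gcov`, and rebuilding with GCC's `-fauto-profile`. It is not part of the commit or of the guide itself. The names `myprog`, `myprog.c`, `myprog.gcov` and `myprog.pgo` are hypothetical, and the event name is an assumption that applies to some Intel processors only; AMD processors use different events. `create_gcov` may also need extra options (for example, to select the gcov format version), so check the upstream `autofdo` documentation before relying on this.

```shell
#!/bin/sh
# Sketch of the perf -> autofdo -> GCC pipeline (not the guide's exact commands).
# Commands are only printed unless DRY_RUN=0 is set.

BINARY="${BINARY:-./myprog}"                 # hypothetical binary, built with -g and not stripped
EVENT="${EVENT:-br_inst_retired.near_taken}" # assumption: Intel event name; differs on AMD
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# 1. Sample the workload; -b asks perf to record branch stacks.
record_profile() {
    run perf record -b -e "$EVENT" -o perf.data -- "$BINARY"
}

# 2. Convert the perf data file into the gcov format that GCC expects.
convert_profile() {
    run create_gcov --binary="$BINARY" --profile=perf.data --gcov=myprog.gcov
}

# 3. Rebuild the program, letting GCC use the converted profile.
rebuild_with_profile() {
    run gcc -O2 -g -fauto-profile=myprog.gcov -o myprog.pgo myprog.c
}

record_profile
convert_profile
rebuild_with_profile
```

Running the script as-is only lists the commands, which makes it easy to review the `perf` options for your processor before setting `DRY_RUN=0`.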

## A simple PGO scenario: `openssl speed`

PGO makes more sense when your software is CPU-bound, i.e., when it performs CPU intensive work and is not mostly waiting on I/O, for example. Even if your software spends time waiting on I/O, using PGO might still be helpful; its effects would be less noticeable, though.

-OpenSSL has a built-in benchmark command called `openssl speed`, which tests the performance of its cryptographic algorithms. This is excellent for PGO because there is practically no I/O involved, and we are just constrained by how fast the CPU can run.
+OpenSSL has a built-in benchmark command called `openssl speed`, which tests the performance of its cryptographic algorithms. This is excellent for PGO because there is practically no I/O involved, and we are only constrained by how fast the CPU can run.

### Running OpenSSL tests
