Performance reporting #73

Open
Manfred opened this issue Nov 10, 2019 · 2 comments

Comments

Manfred (Collaborator) commented Nov 10, 2019

Currently there is a difference between PerfCheck and PerfCheck CI reporting.

PerfCheck requires at least a 10% change in latency before it reports something as faster or slower. PerfCheck CI just checks if it was faster or slower.

I believe it's important to have a solid computation and identical output between the tools. We want users to be able to trust the summary in the report at all times.

This issue will not contain actionable changes but is more of a summary of observations, insights, and ideas.

Let's start with some details about measuring requests in our current situation.

  1. The first request triggers source code loading, configuration loading, and other initialization in the application. It is almost always the slowest request. This measurement is not representative of the ‘speed’ of the Rails controller or action, but it is still relevant because it includes setup time, which also influences server load in production.
  2. After a few requests the query and code paths in the application start to warm up. This may reduce processing time, but it is not necessarily representative of the application in production.
  3. Because of variance we want to perform multiple requests. Theoretically this should cause the mean latency to converge to a representative value (see the measurement-loop sketch after this list).
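
For illustration, here is a minimal sketch of such a measurement loop. This is not perf_check's actual implementation; the URL, request count, and warm-up count are made up.

```ruby
# Minimal sketch: issue N requests against a local server, discard the first
# few warm-up requests, and report the mean latency of the rest.
require 'net/http'
require 'uri'

def measure_latencies(url, requests: 20, warmup: 2)
  uri = URI(url)
  latencies = Array.new(requests) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    Net::HTTP.get_response(uri)
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  # Drop the warm-up requests (boot, code loading, cold caches) before averaging.
  kept = latencies.drop(warmup)
  { all: latencies, mean: kept.sum / kept.size }
end

# measure_latencies('http://localhost:3000/projects', requests: 20, warmup: 2)
```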

Users want to be able to interpret outcomes when comparing performance between branches or paths.

  1. People need a better, same, or worse verdict. The verdict is aimed at people with less understanding of the technical details, or at showing results in a summary over multiple runs.
  2. Any automated interpretation is not going to be 100% accurate. A developer should be able to use the measurements and summary statistics to form their own opinion.
  3. A developer should ideally be able to see what contributed to the outcome. For example, they should be able to see measurements and summary statistics for database time and CPU load.

When comparing branches or paths, a user will want the report to answer a number of questions (a rough sketch of the computation behind the first three follows the list).

  1. Is the relative performance about the same, better, or worse?
  2. How much faster or slower is relative performance?
  3. How big is the absolute difference?
  4. Can all measurements be shown in a meaningful graphical way, so they can get a feel for the data?
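
To make the first three points concrete, here is a rough sketch of how the verdict, relative factor, and absolute difference could be computed from two sets of latency measurements. The numbers are made up and the 10% threshold mirrors current PerfCheck behaviour; this is not the actual perf_check code.

```ruby
# Rough sketch: derive a verdict, a relative factor, and an absolute
# difference (in seconds) from reference and experiment latencies.
def compare(reference, experiment, threshold: 0.10)
  ref_mean = reference.sum / reference.size
  exp_mean = experiment.sum / experiment.size

  factor   = exp_mean / ref_mean   # 2. relative performance
  absolute = exp_mean - ref_mean   # 3. absolute difference in seconds

  verdict =                        # 1. about the same, better, or worse?
    if factor > 1.0 + threshold then 'worse'
    elsif factor < 1.0 - threshold then 'better'
    else 'about the same'
    end

  { verdict: verdict, factor: factor.round(2), absolute_difference: absolute.round(3) }
end

compare([1.2, 1.1, 1.3], [1.5, 1.6, 1.4])
# => verdict "worse", factor 1.25, absolute difference 0.3 seconds
```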

Manfred (Collaborator, Author) commented Nov 10, 2019

Assuming the request latency has a normal distribution, we can show the median and standard deviation and compare those. The difference could be expressed in relation to the standard deviation. We can experiment with including or excluding the first few requests.
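
A rough sketch of expressing the difference between two branches in units of the reference branch's standard deviation. The helper names and sample numbers are illustrative, not part of perf_check.

```ruby
# Express how far the experiment's mean latency is from the reference mean,
# in units of the reference's (sample) standard deviation.
def mean(values)
  values.sum / values.size
end

def stddev(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

def difference_in_stddevs(reference, experiment)
  (mean(experiment) - mean(reference)) / stddev(reference)
end

difference_in_stddevs([1.0, 1.1, 0.9, 1.0], [1.3, 1.2, 1.4, 1.3])
# => roughly 3.7 (the experiment is about 3.7 standard deviations slower)
```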

One way to work around the variance without dismissing the first request (or first two requests) is to show the cumulative latency over all the requests and compare that to the other branch or request path. We can do the same for database time, query count, CPU time, etc.
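
A sketch of the cumulative-total idea, summing every measurement per branch and comparing the totals. The metric names and numbers are illustrative, not perf_check's actual fields.

```ruby
# Sum each metric over all requests for one branch, so branches can be
# compared on cumulative latency, database time, query count, and so on.
def cumulative_totals(measurements)
  measurements.each_with_object(Hash.new(0.0)) do |request, totals|
    request.each { |metric, value| totals[metric] += value }
  end
end

reference  = [{ latency: 2.5,  db_time: 0.75, query_count: 120 },
              { latency: 1.25, db_time: 0.5,  query_count: 118 }]
experiment = [{ latency: 2.25, db_time: 0.5,  query_count: 95 },
              { latency: 1.0,  db_time: 0.25, query_count: 96 }]

cumulative_totals(reference)   # latency 3.75, db_time 1.25, query_count 238
cumulative_totals(experiment)  # latency 3.25, db_time 0.75, query_count 191
```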

To show all measurements we can use a scatter plot (x = request index, y = latency) and show test cases in different colors.
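
One way to feed such a scatter plot, sketched here as a small CSV export with one row per request and a test-case column to drive the colors. The file name and numbers are made up.

```ruby
# Write (test_case, request_index, latency) rows to a CSV that a plotting
# tool can turn into a scatter plot with one color per test case.
require 'csv'

def write_scatter_csv(path, runs)
  CSV.open(path, 'w') do |csv|
    csv << %w[test_case request_index latency]
    runs.each do |test_case, latencies|
      latencies.each_with_index do |latency, index|
        csv << [test_case, index, latency]
      end
    end
  end
end

write_scatter_csv('latencies.csv',
                  'master'         => [2.4, 1.1, 1.0, 1.1],
                  'feature-branch' => [2.6, 1.3, 1.2, 1.3])
```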

sudara (Member) commented Nov 11, 2019

Appreciate all the thought here!

We should definitely start by delivering what has been working in recent history, which is a simple relative "change factor" like "1.2x slower" or "about the same." This was the end result of experimenting with showing different types of data and different thresholds, and it was relatively stable and matched user expectations.

> PerfCheck CI just checks if it was faster or slower.

I believe the daemon actually used 20% as the threshold because 10% was not enough, so we should use that as the starting point in CI: https://github.com/rubytune/perf_check_daemon/blob/master/config/daemon.yml.example#L20-L23

And we dropped the first two requests. Let's keep doing that for now, as that was driven by usage and experience.

The daemon (and PerfCheck CI) is intended to be configurable per app deployment in terms of what "meaningful" difference is needed to trigger a confident result, as this depends completely on what the datasets look like. On the existing target app we are dealing with endpoints taking 2-10+ seconds and issuing hundreds of queries, so the variance seen on an empty fast app is background noise, not signal. In other words, we need to be able to configure CI appropriately so that it highlights meaningful signal and guides people towards focusing on the broad strokes.
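
In that spirit, a small sketch of what a per-deployment threshold could look like. The YAML snippet and the change_threshold key are made up for illustration; the real configuration keys live in the daemon.yml.example linked above.

```ruby
# Load a per-deployment threshold and only report a confident verdict when the
# change exceeds it; everything inside the band is "about the same".
require 'yaml'

CONFIG = YAML.safe_load(<<~YAML)
  # 0.2 means a branch must be at least 20% faster or slower before CI
  # reports a confident verdict.
  change_threshold: 0.2
YAML

def ci_verdict(reference_mean, experiment_mean, threshold: CONFIG['change_threshold'])
  factor = experiment_mean / reference_mean
  return 'slower' if factor >= 1.0 + threshold
  return 'faster' if factor <= 1.0 - threshold
  'about the same'
end

# ci_verdict(4.0, 5.2) # => "slower"         (1.3x, outside the 20% band)
# ci_verdict(4.0, 4.3) # => "about the same" (only 7.5% slower)
```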

This doesn't mean we shouldn't improve accuracy. But it means further solutions and tweaks should be driven by a working user feedback loop on the target app — in other words, we need to wait to hear from users that they tested two equivalent branches and saw a differing number, and we need to have some data to play with before we invest into further calibration/optimization.

stddev is a great idea, especially to display in an "advanced" interpretation of the results down the line. 80% of people are going to be confused by the added information if we lead with it (people don't understand basic stats), so we should be careful in how we choose to present it. If it's just numerical, we will need to include some educational clarification about what that means.

Primary benefits of keeping the "headline result" simple (even at the cost of sometimes being less detailed/accurate):

  1. We want to guide people towards the 1-2 things to attend to and optimize.
  2. People will naturally pick the numbers that look best / least bad, so instead of presenting a buffet of viewpoints, we should very clearly have an opinion about what the "judgement" is. This is a very different type of work than, for example, publishing public benchmarks, where you want a complete picture. We want to "own" the fact that these measurements are rough, don't reflect production directly, and should just be a way to get very simple "guidance" on whether something needs more looking into.
  3. When presented with multiple views of the data, confidence in a simple end result can be eroded or "explained away".

In the end, we are saying the same things. This is important stuff, and we should experiment with potential improvements.
