Criterion gives highly precise measurements. Given two measurements A1 and A2 of a simple microbenchmark A, if:

- both were taken starting from a similar machine state,
- both report very high R^2 values, and
- both were run with a long time limit (`-L20` or higher),

then A1/A2 should be close estimates, right? No, unfortunately.
When measuring 1279 benchmarks on Stackage, we have found that it's very common to have greater than 10% variation between consecutive runs of the same, small, deterministic benchmark.
Anecdotally, we seem to get more stable numbers from individual high `--iters` runs than from linear regression. I don't have a good explanation yet. Perhaps the non-determinism in the selection of data points (on the X axis) is having more of an effect than we expected? Certainly, when there is a bad R^2, we've seen that exactly where it starts running has a big effect.
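To make the contrast concrete, here is a minimal simulation (a sketch, not criterion's code; `TRUE_COST`, `OVERHEAD`, and the noise level are invented) of the two ways of estimating time per iteration: an ordinary-least-squares slope over many iteration counts versus a single long `--iters`-style run:

```python
import random

random.seed(0)

TRUE_COST = 1e-6  # invented true time per iteration (seconds)
OVERHEAD = 5e-5   # invented fixed per-measurement overhead (seconds)

def measure(iters):
    """Simulate one timing sample: linear cost plus 2% multiplicative noise."""
    noise = 1 + random.gauss(0, 0.02)
    return (OVERHEAD + TRUE_COST * iters) * noise

def ols_slope(xs, ys):
    """Ordinary least-squares slope, the statistic a linear regression reports."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Regression over a geometric ladder of iteration counts
xs = [int(1.05 ** k) for k in range(100, 200)]
ys = [measure(x) for x in xs]
slope = ols_slope(xs, ys)

# Single long run: total time divided by the iteration count
long_iters = 10_000_000
single = measure(long_iters) / long_iters

print(f"regression estimate: {slope:.3e} s/iter")
print(f"single-run estimate: {single:.3e} s/iter")
```

Both estimators recover roughly the true cost here; the question in practice is which one varies less between consecutive runs.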
(On a related note, it would be great to have some assistance when using criterion with the kinds of things Krun controls, like waiting for the machine to cool down to a baseline temperature before starting a run.)
> we seem to get more stable numbers from individual high `--iters` runs, than from linear regression.
This matches my experience. I've done a simple experiment as an example: 60 benchmarks of `bash -c "a=0; for i in {1..500000}; do (( a += RANDOM )); done"` with the `bench` tool (which uses criterion under the hood). The code is on gist. Results from a computer without CPU frequency scaling, nonessential daemons, or a desktop environment running:
| Statistic | Interquartile range / Median | Range / Median |
|---|---|---|
| Least-squares slope | 0.3% | 1.7% |
| Theil-Sen slope | 0.3% | 1.2% |
| Mean | 0.2% | 0.9% |
| Median of means | 0.2% | 0.9% |
| Minimum of means | 0.1% | 0.5% |
| Quartile 1 of means | 0.1% | 0.5% |
| Quartile 3 of means | 0.3% | 1.0% |
(Note that the minimum, median and quartiles are not of individual runs, but of the mean loop iteration times. Getting the true quartiles is currently impossible — #165 tracks this.)
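The two spread measures in the table can be computed as follows (a sketch; the per-run estimates below are invented numbers, not data from the experiment):

```python
import statistics

def iqr_over_median(samples):
    """Interquartile range divided by the median."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return (q3 - q1) / statistics.median(samples)

def range_over_median(samples):
    """Full range divided by the median."""
    return (max(samples) - min(samples)) / statistics.median(samples)

# Invented per-run estimates (seconds) from repeated runs of one benchmark
runs = [1.00, 1.01, 0.99, 1.02, 1.00, 0.98, 1.01, 1.00]

print(f"IQR/median:   {iqr_over_median(runs):.2%}")
print(f"range/median: {range_over_median(runs):.2%}")
```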
I remember the relative reliability of various statistics in other, less contrived benchmarks being similar to this. So while R² can be useful for checking whether there are anomalies, the slope from linear regression seems useless since the mean provides the same information but with much less variation between benchmarks. Am I missing something?
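For reference, here is a sketch of two of the statistics from the table: the Theil-Sen slope (the median of all pairwise slopes, hence robust to outliers) next to the plain mean of per-iteration times, on an invented (iterations, seconds) sample:

```python
import statistics
from itertools import combinations

def theil_sen_slope(xs, ys):
    """Theil-Sen estimator: median of the slopes over all point pairs."""
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i, j in combinations(range(len(xs)), 2)]
    return statistics.median(slopes)

def mean_per_iter(xs, ys):
    """Mean loop-iteration time: each sample's time divided by its iterations."""
    return statistics.mean(y / x for x, y in zip(xs, ys))

# Invented sample: (iterations, measured seconds), roughly 1 microsecond/iter;
# the distinct x values keep the pairwise slopes well-defined
xs = [100, 200, 400, 800, 1600]
ys = [1.02e-4, 2.01e-4, 3.97e-4, 8.10e-4, 1.59e-3]

print(f"Theil-Sen slope:    {theil_sen_slope(xs, ys):.3e} s/iter")
print(f"mean per-iteration: {mean_per_iter(xs, ys):.3e} s/iter")
```

On clean data like this the two agree closely; the table above is about how much each statistic wobbles across repeated runs, not about their central values.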
@RyanGlScott and @vollmerm have been working on this.