Transient "Sample is empty" runtime error. #161

Closed
kindaro opened this issue Oct 5, 2017 · 18 comments

@kindaro
Contributor

kindaro commented Oct 5, 2017

Let me start with the terminal output:

% ./CriterionRight.hs
benchmarking const
time                 57.80 ns   (57.02 ns .. 58.37 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 NaN s      (NaN s .. NaN s)
std dev              0.0 s      (0.0 s .. 0.0 s)
variance introduced by outliers: -9223372036854775808% (severely inflated)

CriterionRight.hs: Statistics.Quantile.weightedAvg: Sample is empty
CallStack (from HasCallStack):
  error, called at ./Statistics/Quantile.hs:190:16 in statistics-0.14.0.2-FhOftJvY6wWEPvUcnpp3zH:Statistics.Quantile

This issue is transient: I launched my script about 10 times tonight, and 3 of those attempts ended like this.

The code I use can be reviewed here: github.com/kindaro/criterion-wut.

@RyanGlScott added the Bug label Oct 6, 2017
@RyanGlScott
Member

RyanGlScott commented Oct 6, 2017

I suspect this might have a symptom in common with #160, but again, it's hard for me to tell since I can't reproduce this on my machine (answers to the questions in #160 (comment) would be helpful).

One hunch is that there are NaNs in the results, and it only happens sporadically. There's a known issue where some computations can trigger a bug in statistics' linear regression sampling and return NaN. However, these are typically long-running computations that only return a few data points (all of which are identical) per run, so the fact that it would show up here (where the benchmarked function runs in a matter of nanoseconds) is rather unusual.
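
As an aside, that failure mode is easy to demonstrate in isolation. The following is a minimal, self-contained sketch (not statistics' or criterion's actual code) of how a least-squares slope degenerates to NaN when every sampled point is identical: the x-variance in the denominator is zero, and 0/0 is NaN.

```haskell
-- Minimal sketch: ordinary least-squares slope over a list of (x, y) points.
-- With identical points, both the covariance and the x-variance are 0,
-- so the slope evaluates to 0/0 = NaN.
slope :: [(Double, Double)] -> Double
slope pts = covXY / varX
  where
    n        = fromIntegral (length pts)
    (xs, ys) = unzip pts
    meanX    = sum xs / n
    meanY    = sum ys / n
    covXY    = sum [ (x - meanX) * (y - meanY) | (x, y) <- pts ]
    varX     = sum [ (x - meanX) ^ (2 :: Int)  | x <- xs ]

main :: IO ()
main = print (slope (replicate 5 (1.0, 2.0)))  -- prints NaN
```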

@patrickdoc
Contributor

I can replicate this bug consistently using the code here: https://github.com/patrickdoc/criterion-bug

I did some tracing and found the issue to be in this block of code:
https://github.com/bos/criterion/blob/9a2c3e1697434846778fdb313693425a1be40d0d/Criterion/Analysis.hs#L151-L154

When I run the benchmark, the overhead adjustment makes all of the measurements negative. They are then filtered out, leaving the classifyOutliers input empty.

The criterion I included in my repo is just current master with a few comments in the above file and a changed version number, because I am not yet savvy with new-build and the cabal.project file. It's not necessary, but I found it nice to be able to tweak the code quickly.
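
To make the failure concrete, here is a stripped-down sketch of what I'm describing (the names and numbers are made up; this is not the code from Criterion.Analysis): when the overhead estimate exceeds every measurement, the adjusted times all go negative, the positivity filter drops them all, and classifyOutliers is handed an empty vector.

```haskell
import qualified Data.Vector.Unboxed as U

-- Made-up numbers: each raw measurement is ~50 ns, but the estimated
-- overhead is 100 ns, so every adjusted time comes out negative.
overhead :: Double
overhead = 1.0e-7

measuredTimes :: U.Vector Double
measuredTimes = U.fromList [5.0e-8, 5.2e-8, 5.1e-8]

-- Mimic the analysis step: subtract the overhead, then keep only the
-- positive times.  Nothing survives, so any quantile taken from the
-- result fails with "Sample is empty".
adjusted :: U.Vector Double
adjusted = U.filter (> 0) (U.map (subtract overhead) measuredTimes)

main :: IO ()
main = print (U.length adjusted)  -- 0
```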

@RyanGlScott
Member

Thanks for digging into this, @patrickdoc. I wasn't able to reproduce the bug using your repo with an unmodified criterion, but one thing that did cause the bug to trigger was changing threshold:

https://github.com/bos/criterion/blob/9a2c3e1697434846778fdb313693425a1be40d0d/Criterion/Measurement.hs#L220-L231

From 0.03 to 0.0:

$ ./dist/build/criterion-bug/criterion-bug
benchmarking first
time                 7.525 ns   (1.998 ns .. 12.20 ns)
                     0.224 R²   (0.017 R² .. 0.570 R²)
mean                 NaN s      (NaN s .. NaN s)
std dev              0.0 s      (0.0 s .. 0.0 s)
variance introduced by outliers: -9223372036854775808% (severely inflated)

criterion-bug: Statistics.Quantile.weightedAvg: Sample is empty
CallStack (from HasCallStack):
  error, called at ./Statistics/Quantile.hs:190:16 in statistics-0.14.0.2-7Y0lYtBgBYcB9s5M0K3L16:Statistics.Quantile

@RyanGlScott
Member

I might be a bit out of my element here in trying to debug this, so I'll ask for help.

@harendra-kumar, I noticed that you've changed this sort of code in vincenthz/hs-gauge@92436a4#diff-5ac48df8dda31e28f3b275e7dd76177c. Would it make sense to adopt the same change in criterion (or does this only make sense in the context of gauge)?

@harendra-kumar

@patrickdoc is correct: this error is due to all of the samples getting filtered out. There are two issues here. The first is the incorrect use of overhead; it does not make sense and is perhaps a leftover from some old code. We do not need it, which is why I removed it in that gauge commit.

The second is the use of G.tail here, which filters out one more sample. I think this may have been introduced to remove the deviation caused by the first iteration: the first iteration usually has a large deviation due to evaluations that occur once and are then reused in subsequent iterations. But notice that there is also code in runBenchmark (https://github.com/bos/criterion/blob/9a2c3e1697434846778fdb313693425a1be40d0d/Criterion/Measurement.hs#L271) to get rid of the first iteration. So we are doing this twice; we could do it in just one place instead.

To answer your question: I think it makes sense to get rid of the overhead stuff, and I guess you will have to make changes in runBenchmark as well; I have almost forgotten this stuff now. If we do not filter out any samples during analysis (discarding at measurement time instead, if we have to), it will ensure that we never get an empty list here. Though we can always check whether we have valid samples and bail out if we don't.
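
Something like the following is what I mean by bailing out (hypothetical names, not criterion's actual API): check the sample up front instead of letting Statistics.Quantile call error on an empty vector.

```haskell
import qualified Data.Vector.Unboxed as U

-- Hypothetical sketch of the "bail out" idea, not criterion's API:
-- refuse to analyse an empty sample instead of crashing later inside
-- Statistics.Quantile.  (Here the "analysis" is just the mean, as a
-- stand-in for the real computation.)
analyseSample :: U.Vector Double -> Either String Double
analyseSample sample
  | U.null sample = Left "no valid measurements survived filtering"
  | otherwise     = Right (U.sum sample / fromIntegral (U.length sample))

main :: IO ()
main = print (analyseSample U.empty)  -- Left "no valid measurements ..."
```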

@patrickdoc
Contributor

patrickdoc commented Jan 15, 2018

Although I can't comment on the correctness of removing the overhead code, I can report that it seems to make the results more consistent. I've added a branch called no-overhead that essentially just replicates the gauge patch. It seems that merely commenting out G.map fixTime is not enough.

I also removed G.tail. Removing it seemed to improve the R² value over 3 or 4 runs, but my results are mostly anecdotal.

On master (fixTime commented out)

benchmarking first
time                 18.02 ns   (17.66 ns .. 18.35 ns)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 17.88 ns   (17.52 ns .. 18.30 ns)
std dev              813.1 ps   (645.3 ps .. 1.001 ns)
variance introduced by outliers: 67% (severely inflated)

$ ./run.sh
Up to date
benchmarking first
time                 17.63 ns   (17.37 ns .. 17.90 ns)
                     0.999 R²   (0.997 R² .. 0.999 R²)
mean                 17.66 ns   (17.42 ns .. 17.87 ns)
std dev              481.0 ps   (373.7 ps .. 649.8 ps)
variance introduced by outliers: 42% (moderately inflated)

On the no-overhead branch

$ ./run.sh
benchmarking first
time                 16.62 ns   (16.42 ns .. 16.83 ns)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 16.50 ns   (16.28 ns .. 16.72 ns)
std dev              469.3 ps   (366.5 ps .. 622.2 ps)
variance introduced by outliers: 44% (moderately inflated)

$ ./run.sh 
Up to date
benchmarking first
time                 16.61 ns   (16.43 ns .. 16.78 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 16.60 ns   (16.41 ns .. 16.77 ns)
std dev              394.7 ps   (336.5 ps .. 470.3 ps)
variance introduced by outliers: 36% (moderately inflated)

I don't really have an explanation, but this does seem like an improvement. The no-overhead branch is quite consistent on my computer.

@harendra-kumar

Yes, it is expected to make results more consistent, because we were incorrectly deducting the overhead from the timing measurements (using fixTime), which made them wrong. Removing it makes them sane and should reduce the variance. The mean also becomes saner with this fix: without it, the mean reported by criterion does not always make sense and can sometimes even fall outside the reported range. This overhead stuff is absolutely nonsensical.

@patrickdoc
Contributor

patrickdoc commented Jan 17, 2018

@RyanGlScott I have good news! This fix solves all three of #160, #161, and #162. It turns out they are all special cases of this overhead code. In short, it comes down to how many samples are left after the (incorrect) filtering.

When 0 samples are left, you get this issue.

When 1 sample is left, you get #160. Because there is only 1 sample, the resample just copies it 1000 times, causing the standard deviation to be 0. Additionally, you can see in that issue that the mean is also just a single value. (You'll also see that the mean is nowhere near the time value listed; this is because of the overhead calculation.)

When 2 samples are left, the bootstrap algorithm breaks due to this bug in Statistics and you get #162. It tries to index a vector with INT_MIN, causing the Vector (!): index out of bounds error. This crashes the thread and leads to the "thread blocked indefinitely" error.

When more than 2 samples remain, things operate normally. This would also explain why you had trouble reproducing it: it is very tight on timings, which will differ from computer to computer.

Patching out the overhead code solves all of these issues. I'm getting around 15 samples where the old code produced 0/1/2. And the mean is much more reasonable.
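
To make the 1-sample case concrete, here's a toy sketch (made-up code, not criterion's actual bootstrap): resampling with replacement from a single surviving measurement can only ever reproduce that measurement, so every bootstrap replicate is identical and the standard deviation collapses to zero.

```haskell
import qualified Data.Vector.Unboxed as U

-- With one surviving sample there is only one value to draw, so a
-- "bootstrap resample" of size 1000 is just 1000 copies of it.
bootstrapOnce :: U.Vector Double -> U.Vector Double
bootstrapOnce sample = U.replicate 1000 (U.head sample)

-- Population standard deviation of a vector of identical values is 0.
stdDev :: U.Vector Double -> Double
stdDev xs = sqrt (U.sum (U.map (\x -> (x - mean) ^ (2 :: Int)) xs) / n)
  where
    n    = fromIntegral (U.length xs)
    mean = U.sum xs / n

main :: IO ()
main = print (stdDev (bootstrapOnce (U.singleton 1.7e-8)))  -- 0.0
```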

@RyanGlScott
Member

Wonderful! Can you open a pull request for this?

@patrickdoc
Contributor

Done!

RyanGlScott pushed a commit that referenced this issue Jan 18, 2018
* Remove bad overhead code, see #161

* Update changelog
@RyanGlScott
Member

This should be fixed now that #178 has been merged. @kindaro, can you verify whether the issue still persists with criterion HEAD?

I'll optimistically close this issue, since @patrickdoc reports not being able to reproduce it anymore, but feel free to re-open if that is not the case.

kapralVV added a commit to kapralVV/hs3 that referenced this issue Jan 31, 2018
- using `cabal sandbox add-source criterion` to
- get the fix from master branch
- see haskell/criterion#161
- should be removed after the new criterion release
@kapralVV

The issue has not been fixed for me.
I've used cabal sandbox add-source ... to add the HEAD criterion code from the master branch, but it's still failing:
...: Statistics.Quantile.weightedAvg: Sample is empty

@RyanGlScott
Member

Good to know, @kapralVV. It seems that there is more to the story, which @patrickdoc has been investigating in #179 and #180.

@patrickdoc
Contributor

@kapralVV I really don't think it should be possible for this to happen with HEAD. There are checks in place that should make everything run smoothly (though not 100% correctly).

I took a glance at your Travis logs (like this most recent failure) and I am worried that your script somehow isn't using the downloaded Criterion. Perhaps you are running into this?

You might be able to check by adding cabal install criterion like the author of that issue did. If it fails then you are still using the old version. I'm not super experienced with sandboxes though, so it could certainly be working correctly. If it is using the new version, then this is a new issue.

I tried running your benchmarks locally, but got:

hs3bench: QuickCheck.oneof used with empty list

@kapralVV

kapralVV commented Feb 1, 2018

@patrickdoc
I've added code to show that criterion is from git HEAD.
See the last commit.

  • changing the criterion version on the fly in the fetched git repo to 1.3.0.1
  • listing all the packages installed in the sandbox.

So you can see that criterion-1.3.0.1 is installed, which shows that it IS the version from git HEAD.

I'm facing the issue as well:
./Data/Vector/Generic.hs:245 ((!)): index out of bounds

@kapralVV

kapralVV commented Feb 1, 2018

@patrickdoc

The last build (44) showed the following issues:

hs3bench: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (-9223372036854775808,1000)
hs3bench: thread blocked indefinitely in an MVar operation
hs3bench: Statistics.Quantile.weightedAvg: Sample is empty
CallStack (from HasCallStack):
  error, called at ./Statistics/Quantile.hs:190:16 in statistics-0.14.0.2-KNbAiGxqUXS3OrLnbRkS6L:Statistics.Quantile

@patrickdoc
Contributor

@kapralVV

Aha, I was able to reproduce it on my machine. It turns out the check I was referring to is actually slightly different from what I had thought. Thank you for the report; it also turned up another bug!

@RyanGlScott

I can submit a band-aid patch for this, but I also have bigger changes coming down the pike that will cover this. If you would like one now, though, let me know.

@RyanGlScott
Member

> I can submit a band-aid patch for this, but I also have bigger changes coming down the pike that will cover this. If you would like one now, though, let me know.

I'd be fine with whatever is more convenient. I'm not in a rush to put out a new release, given the multitude of other bugs you've found :)

kapralVV added a commit to kapralVV/hs3 that referenced this issue Feb 2, 2018
- `criterion` bench will always be `green` even if it fails
- should be removed later
- Issues : haskell/criterion#161