Transient "Sample is empty" runtime error. #161

Closed
kindaro opened this issue Oct 5, 2017 · 18 comments

@kindaro
Contributor

kindaro commented Oct 5, 2017

Let me start with the terminal output:

% ./CriterionRight.hs
benchmarking const
time                 57.80 ns   (57.02 ns .. 58.37 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 NaN s      (NaN s .. NaN s)
std dev              0.0 s      (0.0 s .. 0.0 s)
variance introduced by outliers: -9223372036854775808% (severely inflated)

CriterionRight.hs: Statistics.Quantile.weightedAvg: Sample is empty
CallStack (from HasCallStack):
  error, called at ./Statistics/Quantile.hs:190:16 in statistics-0.14.0.2-FhOftJvY6wWEPvUcnpp3zH:Statistics.Quantile

This issue is transient: I launched my script about 10 times tonight, and 3 of those attempts ended like this.

The code I use can be reviewed here: github.com/kindaro/criterion-wut.

@RyanGlScott added the Bug label Oct 6, 2017
@RyanGlScott
Member

RyanGlScott commented Oct 6, 2017

I suspect this might have a symptom in common with #160, but again, it's hard for me to tell since I can't reproduce this on my machine (answers to the questions in #160 (comment) would be helpful).

One hunch is that there are NaNs in the results, and it only happens sporadically. There's a known issue where some computations can trigger a bug in statistics' linear regression sampling and return NaN. However, these are typically long-running computations that only return a few data points (all of which are identical) per run, so the fact that it would show up here (where the benchmarked function runs in a matter of nanoseconds) is rather unusual.
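
As an aside, that failure mode is easy to demonstrate in isolation. The following is a minimal, self-contained sketch (not statistics' or criterion's actual code) of how a least-squares slope degenerates to NaN when every sampled point is identical: the x-variance in the denominator is zero, and 0/0 is NaN.

```haskell
-- Minimal sketch: ordinary least-squares slope over a list of (x, y) points.
-- With identical points, both the covariance and the x-variance are 0,
-- so the slope evaluates to 0/0 = NaN.
slope :: [(Double, Double)] -> Double
slope pts = covXY / varX
  where
    n        = fromIntegral (length pts)
    (xs, ys) = unzip pts
    meanX    = sum xs / n
    meanY    = sum ys / n
    covXY    = sum [ (x - meanX) * (y - meanY) | (x, y) <- pts ]
    varX     = sum [ (x - meanX) ^ (2 :: Int)  | x <- xs ]

main :: IO ()
main = print (slope (replicate 5 (1.0, 2.0)))  -- prints NaN
```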

@patrickdoc
Contributor

I can replicate this bug consistently using the code here: https://github.com/patrickdoc/criterion-bug

I did some tracing and found the issue to be in this block of code:
https://github.com/bos/criterion/blob/9a2c3e1697434846778fdb313693425a1be40d0d/Criterion/Analysis.hs#L151-L154

When I run the benchmark, the overhead adjustment makes all of the measurements negative. They are then filtered out, leaving the classifyOutliers input empty.

The criterion I included in my repo is just current master with a few comments in the above file and a changed version number, because I am not yet savvy with new-build and the cabal.project file. It's not necessary, but I found it nice to be able to tweak the code quickly.
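
To make the failure concrete, here is a stripped-down sketch of what I'm describing (the names and numbers are made up; this is not the code from Criterion.Analysis): when the overhead estimate exceeds every measurement, the adjusted times all go negative, the positivity filter drops them all, and classifyOutliers is handed an empty vector.

```haskell
import qualified Data.Vector.Unboxed as U

-- Made-up numbers: each raw measurement is ~50 ns, but the estimated
-- overhead is 100 ns, so every adjusted time comes out negative.
overhead :: Double
overhead = 1.0e-7

measuredTimes :: U.Vector Double
measuredTimes = U.fromList [5.0e-8, 5.2e-8, 5.1e-8]

-- Mimic the analysis step: subtract the overhead, then keep only the
-- positive times.  Nothing survives, so any quantile taken from the
-- result fails with "Sample is empty".
adjusted :: U.Vector Double
adjusted = U.filter (> 0) (U.map (subtract overhead) measuredTimes)

main :: IO ()
main = print (U.length adjusted)  -- 0
```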

@RyanGlScott
Member

Thanks for digging into this, @patrickdoc. I wasn't able to reproduce the bug using your repo with an unmodified criterion, but one thing that did cause the bug to trigger was changing threshold:

https://github.com/bos/criterion/blob/9a2c3e1697434846778fdb313693425a1be40d0d/Criterion/Measurement.hs#L220-L231

From 0.03 to 0.0:

$ ./dist/build/criterion-bug/criterion-bug
benchmarking first
time                 7.525 ns   (1.998 ns .. 12.20 ns)
                     0.224 R²   (0.017 R² .. 0.570 R²)
mean                 NaN s      (NaN s .. NaN s)
std dev              0.0 s      (0.0 s .. 0.0 s)
variance introduced by outliers: -9223372036854775808% (severely inflated)

criterion-bug: Statistics.Quantile.weightedAvg: Sample is empty
CallStack (from HasCallStack):
  error, called at ./Statistics/Quantile.hs:190:16 in statistics-0.14.0.2-7Y0lYtBgBYcB9s5M0K3L16:Statistics.Quantile

@RyanGlScott
Member

I might be a bit out of my element here in trying to debug this, so I'll ask for help.

@harendra-kumar, I noticed that you've changed this sort of code in vincenthz/hs-gauge@92436a4#diff-5ac48df8dda31e28f3b275e7dd76177c. Would it make sense to adopt the same change in criterion (or does this only make sense in the context of gauge)?

@harendra-kumar

@patrickdoc is correct: this error is due to all of the samples getting filtered out. There are two issues here. The first is the incorrect use of overhead; it does not make sense and is perhaps a leftover from some old code. We do not need it, which is why I removed it in that gauge commit.

The second is the use of G.tail here, which filters out one more sample. I think this may have been introduced to remove the deviation caused by the first iteration: the first iteration usually has a large deviation due to evaluations that occur once and are then reused in subsequent iterations. But notice that there is also code in runBenchmark (https://github.com/bos/criterion/blob/9a2c3e1697434846778fdb313693425a1be40d0d/Criterion/Measurement.hs#L271) to get rid of the first iteration. So we are doing this twice; we could do it in just one place instead.

To answer your question: I think it makes sense to get rid of the overhead stuff, and I guess you will have to make changes in runBenchmark as well; I have almost forgotten this stuff now. If we do not filter out any samples during analysis (discarding at measurement time instead, if we have to), it will ensure that we never get an empty list here. Though we can always check whether we have valid samples and bail out if we don't.
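
Something like the following is what I mean by bailing out (hypothetical names, not criterion's actual API): check the sample up front instead of letting Statistics.Quantile call error on an empty vector.

```haskell
import qualified Data.Vector.Unboxed as U

-- Hypothetical sketch of the "bail out" idea, not criterion's API:
-- refuse to analyse an empty sample instead of crashing later inside
-- Statistics.Quantile.  (Here the "analysis" is just the mean, as a
-- stand-in for the real computation.)
analyseSample :: U.Vector Double -> Either String Double
analyseSample sample
  | U.null sample = Left "no valid measurements survived filtering"
  | otherwise     = Right (U.sum sample / fromIntegral (U.length sample))

main :: IO ()
main = print (analyseSample U.empty)  -- Left "no valid measurements ..."
```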

@patrickdoc
Contributor

patrickdoc commented Jan 15, 2018

Although I can't comment on the correctness of removing the overhead code, I can report that it seems to make the results more consistent. I've added a branch called no-overhead that essentially just replicates the gauge patch. It seems that merely commenting out G.map fixTime is not enough.

I also removed G.tail. Removing it seemed to improve the R² value over 3 or 4 runs, but my results are mostly anecdotal.

On master (fixTime commented out)

benchmarking first
time                 18.02 ns   (17.66 ns .. 18.35 ns)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 17.88 ns   (17.52 ns .. 18.30 ns)
std dev              813.1 ps   (645.3 ps .. 1.001 ns)
variance introduced by outliers: 67% (severely inflated)

$ ./run.sh
Up to date
benchmarking first
time                 17.63 ns   (17.37 ns .. 17.90 ns)
                     0.999 R²   (0.997 R² .. 0.999 R²)
mean                 17.66 ns   (17.42 ns .. 17.87 ns)
std dev              481.0 ps   (373.7 ps .. 649.8 ps)
variance introduced by outliers: 42% (moderately inflated)

On the no-overhead branch

$ ./run.sh
benchmarking first
time                 16.62 ns   (16.42 ns .. 16.83 ns)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 16.50 ns   (16.28 ns .. 16.72 ns)
std dev              469.3 ps   (366.5 ps .. 622.2 ps)
variance introduced by outliers: 44% (moderately inflated)

$ ./run.sh 
Up to date
benchmarking first
time                 16.61 ns   (16.43 ns .. 16.78 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 16.60 ns   (16.41 ns .. 16.77 ns)
std dev              394.7 ps   (336.5 ps .. 470.3 ps)
variance introduced by outliers: 36% (moderately inflated)

I don't really have an explanation, but this does seem like an improvement. The no-overhead branch is quite consistent on my computer.

@harendra-kumar

Yes, it is expected to make results more consistent, because we were incorrectly deducting the overhead from the timing measurements (using fixTime), which made them wrong. Removing it makes them sane and should reduce the variance. The mean also becomes saner with this fix: without it, the mean reported by criterion does not always make sense and can sometimes even fall outside the reported range. This overhead stuff is absolutely nonsensical.

@patrickdoc
Contributor

patrickdoc commented Jan 17, 2018

@RyanGlScott I have good news! This fix solves all three of #160, #161, and #162. It turns out they are all special cases of this overhead code. In short, it comes down to how many samples are left after the (incorrect) filtering.

When 0 samples are left, you get this issue.

When 1 sample is left, you get #160. Because there is only 1 sample, the resample just copies it 1000 times, causing the standard deviation to be 0. Additionally, you can see in that issue that the mean is also just a single value. (You'll also see that the mean is nowhere near the time value listed; this is because of the overhead calculation.)

When 2 samples are left, the bootstrap algorithm breaks due to this bug in Statistics and you get #162. It tries to index a vector with INT_MIN, causing the Vector (!): index out of bounds error. This crashes the thread and leads to the "thread blocked indefinitely" error.

When more than 2 samples remain, things operate normally. This would also explain why you had trouble reproducing it: it is very tight on timings, which will differ from computer to computer.

Patching out the overhead code solves all of these issues. I'm getting around 15 samples where the old code produced 0/1/2. And the mean is much more reasonable.
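
To make the 1-sample case concrete, here's a toy sketch (made-up code, not criterion's actual bootstrap): resampling with replacement from a single surviving measurement can only ever reproduce that measurement, so every bootstrap replicate is identical and the standard deviation collapses to zero.

```haskell
import qualified Data.Vector.Unboxed as U

-- With one surviving sample there is only one value to draw, so a
-- "bootstrap resample" of size 1000 is just 1000 copies of it.
bootstrapOnce :: U.Vector Double -> U.Vector Double
bootstrapOnce sample = U.replicate 1000 (U.head sample)

-- Population standard deviation of a vector of identical values is 0.
stdDev :: U.Vector Double -> Double
stdDev xs = sqrt (U.sum (U.map (\x -> (x - mean) ^ (2 :: Int)) xs) / n)
  where
    n    = fromIntegral (U.length xs)
    mean = U.sum xs / n

main :: IO ()
main = print (stdDev (bootstrapOnce (U.singleton 1.7e-8)))  -- 0.0
```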

@RyanGlScott
Member

Wonderful! Can you open a pull request for this?

@patrickdoc
Contributor

Done!

RyanGlScott pushed a commit that referenced this issue Jan 18, 2018
* Remove bad overhead code, see #161

* Update changelog
@RyanGlScott
Member

This should be fixed now that #178 has been merged. @kindaro, can you verify whether the issue still persists with criterion HEAD?

I'll optimistically close this issue, since @patrickdoc reports not being able to reproduce it anymore, but feel free to re-open if that is not the case.

kapralVV added a commit to kapralVV/hs3 that referenced this issue Jan 31, 2018
- using `cabal sandbox add-source criterion` to
- get the fix from master branch
- see haskell/criterion#161
- should be removed after the new criterion release
@kapralVV

The issue has not been fixed for me.
I've used cabal sandbox add-source ... to add the HEAD criterion code from the master branch, but it's still failing:
...: Statistics.Quantile.weightedAvg: Sample is empty

@RyanGlScott
Member

Good to know, @kapralVV. It seems that there is more to the story, which @patrickdoc has been investigating in #179 and #180.

@patrickdoc
Contributor

@kapralVV I really don't think it should be possible for this to happen with HEAD. There are checks in place that should make everything run smoothly (though not 100% correctly).

I took a glance at your Travis logs (like this most recent failure) and I am worried that your script somehow isn't using the downloaded Criterion. Perhaps you are running into this?

You might be able to check by adding cabal install criterion like the author of that issue did. If it fails then you are still using the old version. I'm not super experienced with sandboxes though, so it could certainly be working correctly. If it is using the new version, then this is a new issue.

I tried running your benchmarks locally, but got:

hs3bench: QuickCheck.oneof used with empty list

@kapralVV

kapralVV commented Feb 1, 2018

@patrickdoc
I've added code to show that criterion is from git HEAD.
See the last commit.

  • changing the criterion version on the fly in the fetched git repo to 1.3.0.1
  • listing all the packages installed in the sandbox.

So you can see that criterion-1.3.0.1 is installed, which shows that it IS the version from git HEAD.

I'm facing the issue as well:
./Data/Vector/Generic.hs:245 ((!)): index out of bounds

@kapralVV

kapralVV commented Feb 1, 2018

@patrickdoc

The last build (44) showed the following issues:

hs3bench: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (-9223372036854775808,1000)
hs3bench: thread blocked indefinitely in an MVar operation
hs3bench: Statistics.Quantile.weightedAvg: Sample is empty
CallStack (from HasCallStack):
  error, called at ./Statistics/Quantile.hs:190:16 in statistics-0.14.0.2-KNbAiGxqUXS3OrLnbRkS6L:Statistics.Quantile

@patrickdoc
Contributor

@kapralVV

Aha, I was able to reproduce it on my machine. It turns out the check I was referring to is actually slightly different from what I had thought. Thank you for the report; it also turned up another bug!

@RyanGlScott

I can submit a band-aid patch for this, but I also have bigger changes coming down the pike that will cover this. If you would like one now, though, let me know.

@RyanGlScott
Member

> I can submit a band-aid patch for this, but I also have bigger changes coming down the pike that will cover this. If you would like one now, though, let me know.

I'd be fine with whatever is more convenient. I'm not in a rush to put out a new release, given the multitude of other bugs you've found :)

kapralVV added a commit to kapralVV/hs3 that referenced this issue Feb 2, 2018
- `criterion` bench will always be `green` even if it fails
- should be removed later
- Issues : haskell/criterion#161