Binary histogram over-counts entries #4

Open
ssimeonov opened this issue Apr 13, 2015 · 2 comments

@ssimeonov

In the following output, note that the total printed by the binary histogram (35,869) does not match the value returned by stats.count (36,175). The linear histogram produces an accurate count.

irb(main):097:0> puts stats.to_s
value |------------------------------------------------------------------| count
    1 |                                                                  |    23
      ~
    8 |                                                                  |     3
   16 |                                                                  |    81
   32 |@@@@@                                                             |   703
   64 |@@@@@@@@@@@@@@                                                    |  1850
  128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                                   |  4086
  256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                               |  4630
  512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |  6332
 1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|  8682
 2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |  5536
 4096 |@@@@@@@@@@@@@@@@@@@@@                                             |  2788
 8192 |@@@@@@@                                                           |   939
16384 |@                                                                 |   199
32768 |                                                                  |    14
65536 |                                                                  |     3
      ~
Total |------------------------------------------------------------------| 35869
=> nil
irb(main):098:0> stats.count
=> 36175
@halorgium
Collaborator

The issue is likely due to how binary histograms handle 0 values.

> data = [0, 1]
=> [0, 1]
> a = Aggregate.new(0, 2, 1); data.each {|d| a << d}; puts a.to_s
value |------------------------------------------------------------------| count
    0 |@                                                                 |     1
    1 |@                                                                 |     1
Total |------------------------------------------------------------------|     2
> a = Aggregate.new; data.each {|d| a << d}; puts a.to_s
value |------------------------------------------------------------------| count
    1 |@                                                                 |     1
      ~
Total |------------------------------------------------------------------|     1

@josephruscio
Owner

I think @halorgium is right. For better or for worse, the binary histogram uses a low value of 1, so 0s are counted as outliers. Outliers are included in the count instance variable, but the to_s method keeps its own local count, which it tallies as it runs through the buckets. As you can see above, to_s doesn't take outliers into account, either when printing or in its locally tracked count. Ideally this would be more consistent; I don't think the data sets I was analyzing when I wrote this had many (any?) zero values.
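To make the mismatch concrete, here is a minimal toy model of the behavior described above. It is a sketch, not the gem's actual code: the class name ToyBinaryHistogram and the bucket_total method are hypothetical and exist only for illustration. It assumes, as described, that values below 1 are treated as low outliers, that count includes them, and that the printed total is the sum of the buckets only. If that explanation holds for the original report, the gap of 36,175 - 35,869 = 306 would be the number of outliers in that data set.

```ruby
# Toy model of the behavior described in this thread.
# NOT the aggregate gem's implementation; all names here are hypothetical.
class ToyBinaryHistogram
  attr_reader :count, :outliers_low

  def initialize
    @count = 0             # every sample seen, including outliers
    @outliers_low = 0      # samples below the lowest bucket (value < 1)
    @buckets = Hash.new(0) # bucket lower bound (power of two) => tally
  end

  def <<(value)
    @count += 1
    if value < 1
      # Zeros (and anything else below 1) never land in a bucket.
      @outliers_low += 1
    else
      # Lower bound of the power-of-two bucket that contains value.
      @buckets[1 << (value.to_i.bit_length - 1)] += 1
    end
    self
  end

  # What a to_s-style walk over the buckets would report as "Total".
  def bucket_total
    @buckets.values.sum
  end
end

h = ToyBinaryHistogram.new
[0, 1, 2, 3, 0, 900].each { |v| h << v }

puts "count:        #{h.count}"
puts "bucket total: #{h.bucket_total}"
puts "outliers_low: #{h.outliers_low}"
```

In this model the two totals always differ by exactly the number of outliers, so adding the outlier tally to the bucket total recovers count.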

Sorry I missed this for so long. If we wanted to "fix" it I think it would need a major version bump as this would break any code depending on the now 10-year-old (yikes) behavior, however convoluted it is in this case.
