nextmat/hetchy

Hetchy

A high-performance, thread-safe reservoir sampler for Ruby with snapshot and percentile support.

Benefits/Goals

  • Generate statistics for large datasets with a very small in-memory footprint
  • Generate percentiles/quantiles for a known set of numbers or for a real-time streaming set
  • Ability to capture sample state at a moment in time for further analysis
  • Speed
  • Thorough test suite
  • No dependencies

Installation

Add this line to your application's Gemfile:

gem 'hetchy'

And then run:

$ bundle

Or install it yourself with:

$ gem install hetchy

Usage

Create a reservoir and designate how big you want it to be:

reservoir = Hetchy::Reservoir.new(size: 1000)

Add samples as they arrive or are generated:

reservoir << 123

You can add sets of samples as well:

reservoir << [45,89,124,96]

You can calculate a percentile at any time:

reservoir.percentile(95)

Hetchy supports high-resolution percentiles as well:

reservoir.percentile(99.99)

NOTE: You may need to increase the reservoir size for very high-resolution percentiles. Experiment to see what works for your dataset.

For threaded applications where the reservoir is accepting samples rapidly, you can increase performance by taking a snapshot before running a series of calculations:

dataset  = reservoir.snapshot

perc_95  = dataset.percentile(95)
perc_99  = dataset.percentile(99)
perc_999 = dataset.percentile(99.9)
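
The README does not describe how snapshots are implemented. As a rough illustration only, the sketch below assumes the benefit comes from taking a lock once and copying the samples, so later calculations never compete with writers. LockedSamples is a hypothetical class, not part of Hetchy, and Hetchy's internals may differ.

```ruby
# Hypothetical sketch, NOT Hetchy's implementation: a mutex-guarded sample
# store whose snapshot is an immutable copy taken under a single lock.
class LockedSamples
  def initialize
    @lock    = Mutex.new
    @samples = []
  end

  # Writers take the lock briefly for each append.
  def <<(value)
    @lock.synchronize { @samples << value }
    self
  end

  # One lock acquisition and one copy; the caller then reads without locking.
  def snapshot
    @lock.synchronize { @samples.dup.freeze }
  end
end

store = LockedSamples.new

# Four writer threads append 250 distinct values each.
writers = 4.times.map do |t|
  Thread.new { 250.times { |i| store << (t * 250 + i) } }
end
writers.each(&:join)

frozen = store.snapshot
store << 9_999        # a later write does not change the earlier snapshot
puts frozen.size      # 1000
```

Under this assumption, running several percentile calculations against the copy costs one lock acquisition instead of one per calculation.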

Clear the reservoir to reset it:

reservoir.clear

Datasets

If you have an existing series, you can use Dataset to generate stats for it:

my_series = Array(1..1000)
dataset = Hetchy::Dataset.new(my_series)

perc_95 = dataset.percentile(95)   #=> 950.95
median  = dataset.median           #=> 500.5
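
The two results above are consistent with an interpolated percentile whose fractional rank is p/100 * (n + 1). The sketch below is a standalone illustration of that formula, not Hetchy's source; weighted_percentile is a hypothetical helper, and the choice of (n + 1) is an assumption inferred only from the figures shown.

```ruby
# Hypothetical helper, NOT part of Hetchy's API. Assumes the rank formula
# rank = pct/100 * (n + 1), inferred from the example values in this README.
def weighted_percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  rank   = (pct / 100.0) * (sorted.size + 1) # 1-based fractional rank
  lower  = rank.floor
  frac   = rank - lower

  return sorted.first.to_f if lower < 1            # rank falls before the first sample
  return sorted.last.to_f  if lower >= sorted.size # rank falls at or past the last sample

  below = sorted[lower - 1]
  above = sorted[lower]
  # Weighted average of the two neighboring samples.
  below + frac * (above - below)
end

series = Array(1..1000)
puts weighted_percentile(series, 50) # 500.5
puts weighted_percentile(series, 95)
```

A nearest-neighbor method would return one of the stored samples (950 or 951 for the 95th percentile here), whereas the weighted average lands between them.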

Stats Details

For those interested:

  • Reservoir sampling is based on Vitter's Algorithm R, which ensures a uniform sampling probability for every entry in the series
  • Percentile calculations use weighted averages of neighboring samples, not nearest neighbor
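
For readers unfamiliar with Algorithm R, here is a minimal standalone sketch of the technique. SketchReservoir is a hypothetical class written for illustration; it is not Hetchy's code and omits thread safety entirely.

```ruby
# Hypothetical sketch of Vitter's Algorithm R, NOT Hetchy's implementation.
class SketchReservoir
  attr_reader :samples, :seen

  def initialize(size:, random: Random.new)
    @size    = size
    @random  = random
    @samples = []
    @seen    = 0 # total number of values offered so far
  end

  def <<(value)
    @seen += 1
    if @samples.size < @size
      # Fill phase: keep every value until the reservoir is full.
      @samples << value
    else
      # Replacement phase: pick a slot uniformly from 0...@seen and
      # overwrite only if it falls inside the reservoir.
      slot = @random.rand(@seen)
      @samples[slot] = value if slot < @size
    end
    self
  end
end

reservoir = SketchReservoir.new(size: 100)
(1..10_000).each { |n| reservoir << n }
puts reservoir.samples.size # 100
puts reservoir.seen         # 10000
```

After n values have been offered to a reservoir of size k (with n >= k), each value has probability k/n of being retained, which is what keeps memory use fixed regardless of stream length.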

Contributing

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
  • Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
  • Fork the project and submit a pull request from a feature or bugfix branch.
  • Please include tests. This is important so we don't break your changes unintentionally in a future version.
  • Please don't modify the gemspec, Rakefile, version, or changelog. If you do change these files, please isolate those changes in a separate commit so we can cherry-pick around it.

Credits

Parts of Hetchy are inspired by Eric Lindvall's metriks gem.

Copyright

Copyright (c) 2015 Matt Sanders. See LICENSE for details.
