Detect keyboard mashing and other junk in your data.
For example, you might allow user-entered tags but want to hide bad ones. Or you might want to detect user frustration while filling out a particular field, and do something about it!
Dejunk uses a variety of heuristics, the most sophisticated being a comparison of the bigrams in the input against their frequencies in a "known-good" corpus vs. their proximity on a keyboard. It achieves pretty good precision on Academia.edu's data, but might need adjustment for yours.
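To make the bigram idea concrete, here is a minimal sketch of that kind of comparison. It is not the gem's actual implementation: the constants, the tiny COMMON_BIGRAMS list (a stand-in for real corpus frequencies), and the mashing_score method are all hypothetical names invented for illustration, and only same-row QWERTY neighbours are treated as "close" on the keyboard.

```ruby
# Illustrative sketch only; none of these names come from the gem itself.
QWERTY_ROWS = %w[qwertyuiop asdfghjkl zxcvbnm].freeze

# Bigrams made of horizontally adjacent keys, in both directions.
ADJACENT_BIGRAMS = QWERTY_ROWS.flat_map { |row|
  row.chars.each_cons(2).flat_map { |a, b| [a + b, b + a] }
}.freeze

# Assumed stand-in for bigrams that are frequent in a known-good corpus.
COMMON_BIGRAMS = %w[th he in er an re on at en nd ll lo el or wo ld].freeze

# Split a string into lowercase letter bigrams, ignoring everything else.
def bigrams(text)
  text.downcase.gsub(/[^a-z]/, '').chars.each_cons(2).map(&:join)
end

# Share of keyboard-adjacent bigrams minus share of corpus-common bigrams.
# Higher values look more like mashing; lower values look more like language.
def mashing_score(text)
  grams = bigrams(text)
  return 0.0 if grams.empty?

  adjacent = grams.count { |g| ADJACENT_BIGRAMS.include?(g) }
  common = grams.count { |g| COMMON_BIGRAMS.include?(g) }
  (adjacent - common).to_f / grams.size
end

puts mashing_score('asdfasdf')    # positive: mostly adjacent keys
puts mashing_score('Hello World') # negative: mostly common bigrams
```

A real detector would weigh bigrams by actual corpus frequencies and a fuller keyboard model; the point here is only the shape of the comparison.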
Add this line to your application's Gemfile:
gem 'dejunk'
And then execute:
$ bundle
Or install it yourself as:
$ gem install dejunk
The main interface is Dejunk.is_junk?. Pass a string, and get a truthy value if it looks junky, and false otherwise.
$ Dejunk.is_junk?('Hello World')
=> false
$ Dejunk.is_junk?('qwefqwef')
=> :mashing_bigrams
$ Dejunk.is_junk?('asdf')
=> :asdf_row
$ Dejunk.is_junk?('fads')
=> false
$ Dejunk.is_junk?('Hi')
=> :too_short
$ Dejunk.is_junk?('Hi', whitelist_regexes: [/\Ahi\z/i])
=> false
When junk is detected, the return value is a reason, to aid in debugging. Optional parameters are min_alnum_chars (defaults to 3), and whitelist_strings and whitelist_regexes (both default to none, but you'll likely want some domain-specific strings here, which you might discover by checking against a sample from your existing corpus).
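One way to find whitelist candidates is to run a sample of your corpus through the checker and group the flagged strings by reason. The sketch below assumes a hypothetical helper, flagged_by_reason, that accepts any callable returning a truthy reason or false, so it runs without the gem installed. With the gem loaded you would pass something like ->(s) { Dejunk.is_junk?(s) } instead of the stand-in checker, which only roughly imitates the :too_short reason.

```ruby
# Hypothetical helper: groups flagged samples by the reason the checker gave.
# `checker` is any callable returning a truthy reason for junk, false otherwise.
def flagged_by_reason(samples, checker)
  samples.each_with_object(Hash.new { |h, k| h[k] = [] }) do |sample, acc|
    reason = checker.call(sample)
    acc[reason] << sample if reason
  end
end

# Stand-in checker (an assumption, not the gem's logic): flags strings with
# fewer than three alphanumeric characters.
stub_checker = lambda do |s|
  s.gsub(/[^[:alnum:]]/, '').length < 3 ? :too_short : false
end

report = flagged_by_reason(['Hi', 'AI', 'Hello World'], stub_checker)
report.each { |reason, strings| puts "#{reason}: #{strings.inspect}" }
```

Legitimate strings that show up in the report (here, perhaps 'AI') are the ones to consider adding to whitelist_strings or whitelist_regexes.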
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/academia-edu/dejunk
Apache 2.0