TactfulTokenizer

TactfulTokenizer is a Ruby library for high-quality sentence tokenization. It uses a Naive Bayes statistical model and is based on Splitta, with added support for ‘?’ and ‘!’ as sentence-ending punctuation and primitive handling of XHTML markup. Better support for XHTML parsing is coming shortly.

It also supports tokenization of Unicode text.

Usage

require "tactful_tokenizer"
m = TactfulTokenizer::Model.new
m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way? <em>Yes.</em> <em>Maybe</em>!")
#=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way?", "<em>Yes.</em>", "<em>Maybe</em>!"]

The input text is expected to consist of paragraphs delimited by line breaks.
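If the source text is hard-wrapped or uses blank lines between paragraphs, it may help to normalize it first. The following is a minimal sketch, assuming that "paragraphs delimited by line breaks" means one paragraph per line; `normalize_paragraphs` is a hypothetical helper and not part of TactfulTokenizer.

```ruby
# Hypothetical helper (not part of TactfulTokenizer): joins hard-wrapped
# lines so that each paragraph occupies exactly one line.
# Assumption: blank lines separate paragraphs in the raw input.
def normalize_paragraphs(text)
  text
    .gsub(/\r\n?/, "\n")   # normalize Windows and old Mac line endings
    .split(/\n\s*\n/)      # split on blank (or whitespace-only) lines
    .map { |para| para.split("\n").map(&:strip).reject(&:empty?).join(" ") }
    .reject(&:empty?)      # drop paragraphs that were only whitespace
    .join("\n")            # one paragraph per line
end

raw = "Here in the U.S. Senate we prefer\nto eat our friends.\n\nIs it easier that way?"
prepared = normalize_paragraphs(raw)
```

The resulting string can then be passed to the tokenizer as in the usage example above, for instance `m.tokenize_text(prepared)`.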

Installation

gem install tactful_tokenizer
