
Other classification/nlp tools #88

Open
Ch4s3 opened this issue Jan 2, 2017 · 13 comments

@Ch4s3
Member

Ch4s3 commented Jan 2, 2017

We already do a bag of words and word counts. Would it be useful to anyone to expose this functionality for other classification uses?
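For context, a bag of words is just a map from each token to its count. A minimal pure-Ruby sketch (illustrative only, independent of classifier-reborn's internal implementation) might look like:

```ruby
# Minimal bag-of-words sketch: tokenize, then map each word to its count.
# Illustrative only; not classifier-reborn's actual code.
def bag_of_words(text)
  text.downcase.scan(/[a-z']+/).tally
end

bag_of_words("the cat sat on the mat")
# => {"the"=>2, "cat"=>1, "sat"=>1, "on"=>1, "mat"=>1}
```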

Some other things to consider:

  • N-grams
  • Levenshtein distance
  • Sentiment analysis
  • Term frequency–inverse document frequency (TF–IDF)
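The first two items in that list are small enough to sketch in plain Ruby (illustrative sketches only, not classifier-reborn API):

```ruby
# Word-level n-grams: consecutive runs of n tokens.
def ngrams(tokens, n)
  tokens.each_cons(n).to_a
end

# Levenshtein distance via the classic dynamic-programming table,
# keeping only the previous row to stay O(len(b)) in space.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

ngrams(%w[the quick brown fox], 2)
# => [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
levenshtein("kitten", "sitting")
# => 3
```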
@Ch4s3
Member Author

Ch4s3 commented Jan 6, 2017

@tra38 could you elaborate on which part(s) you're interested in?

@tra38
Contributor

tra38 commented Jan 6, 2017

Sure. Since classifier-reborn already collects a bunch of data, it makes sense to publicly expose the data it gathers, so that a programmer can feed it into other gems that handle different classification/NLP tasks.

For example, I've been clustering articles using classifier-reborn and kmeans-clusterer with the following code snippet:

```ruby
require 'classifier-reborn'
require 'kmeans-clusterer'

lsi = ClassifierReborn::LSI.new

strings = ["example string a", "example string b", "example string c"]

strings.each do |x|
  lsi.add_item(x)
end

# Save transformed ClassifierReborn content nodes into a new hash
string_data = lsi.instance_variable_get(:@items)

# Process the information for use in kmeans-clusterer
data = strings.map do |string|
  string_data[string].lsi_norm.to_a
end

clusters = 2 # must not exceed the number of data points
kmeans = KMeansClusterer.run clusters, data, labels: strings, runs: 10
```

And obviously, it's kinda hacky to dig out the lsi_norm for each individual content node just so you can do some k-means clustering, which is why I gave a "thumbs up" to exposing this data more directly. (And if I'm using some aspect of classifier-reborn strangely here, then some other programmer will use bags of words and word counts strangely as well. Expose all the data, trust the programmer.)

@Ch4s3
Member Author

Ch4s3 commented Jan 6, 2017

@tra38 I think we could expose the LSI data. It'll probably take some careful refactoring, but should be doable.

@Looooong
Contributor

Will there be multiple classification? For example: given an input, classify it into more than one category.

@Ch4s3
Member Author

Ch4s3 commented Jan 11, 2017

@Looooong Not with Bayes; that's not really how it works.

@ibnesayeed
Contributor

@Looooong: Will there be multiple classification? For example: given an input, classify it into more than one category.

You can get the raw score of each category against a given text in Bayes. This way you can decide to get top-K relevant categories, if that is what you are after.
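That top-K idea can be sketched generically. The score hash below is an illustrative stand-in for whatever per-category scores the classifier returns (the exact shape and values here are assumptions, not classifier-reborn output):

```ruby
# Illustrative per-category scores (log-likelihood-style values);
# in practice these would come from the classifier's raw scoring.
scores = { "Sports" => -12.1, "Politics" => -15.7, "Tech" => -11.4, "Arts" => -19.9 }

# Top-K relevant categories: the K highest-scoring entries, best first.
top_k = scores.max_by(2) { |_category, score| score }.map(&:first)
# => ["Tech", "Sports"]
```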

@ibnesayeed
Contributor

Should we also consider adding ruby-fann (Fast Artificial Neural Network)? It won't be good for text data, I guess, but for numeric stuff it would be great.

@Looooong
Contributor

@ibnesayeed Yes, I am planning to compute multiple scores with Bayes, but I guess it will take up a large amount of storage space.

@ibnesayeed
Contributor

@Looooong it really depends on the amount of training data. Between Bayes and LSI, the first one would take relatively less space. If you have a huge amount of data, then here are a few things you can do:

  • Use the newly introduced Redis backend for storage, which would still take the required amount of memory, but it can be off-loaded to a remote machine that has high memory. Additionally, it will persist the data on the disk in case of any sudden crashes.
  • Use a sample of the training data, not the whole of it. Then throw a bunch of test data at it to see how well it is performing and how many false positives and false negatives you are getting. If the classifier is giving satisfactory results, there is no need to train further; otherwise, train with more data and measure the results again. This way you can find the right balance between how much memory you can afford and the minimum accuracy you can accept as the trade-off.
  • If you really want to use all the training data but can't afford enough memory, then you can implement an ORM backend and save the model in your favorite database. This would be terribly slow compared with the Memory backend, but you can train with petabytes of data. Implementing that won't be difficult, as the storage stuff was abstracted recently.
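The second point — training on a sample and measuring false positives/negatives on held-out data — can be sketched generically. The classifier here is a trivial keyword stub (a stand-in, not a real Bayes model); any callable that returns a category works the same way:

```ruby
# Generic evaluation sketch: count false positives/negatives for one
# target category over labeled held-out data.
def evaluate(classifier, test_data, target)
  fp = fn = 0
  test_data.each do |text, actual|
    predicted = classifier.call(text)
    fp += 1 if predicted == target && actual != target
    fn += 1 if predicted != target && actual == target
  end
  { false_positives: fp, false_negatives: fn }
end

# Trivial keyword stub standing in for a trained classifier.
stub = ->(text) { text.include?("ball") ? "Sports" : "Other" }

test_data = [
  ["the ball game", "Sports"],
  ["stock markets", "Other"],
  ["crystal ball readings", "Other"], # stub misfires here
  ["tennis match", "Sports"],         # stub misses here
]

evaluate(stub, test_data, "Sports")
# => {:false_positives=>1, :false_negatives=>1}
```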

@Ch4s3
Member Author

Ch4s3 commented Jan 17, 2017

Since Hasher is basically a bag-of-words implementation, it might make sense to rename it accordingly, make it public, and document it as such. Thoughts? @ibnesayeed?

@ibnesayeed
Contributor

Since Hasher is basically a bag-of-words implementation, it might make sense to rename it accordingly, make it public, and document it as such. Thoughts? @ibnesayeed?

I agree. However, I would note one thing I encountered today while writing tests for stopwords: the hasher needs to be instantiated and passed in as a dependency during classifier initialization, so that one classifier does not step on another's state through shared data. Currently, if two classifiers are instantiated in a single program with different configurations, and one of them changes the set of stopwords, the other classifier is also affected. To work around this in the tests, I had to store the original stopwords in an instance variable in the setup method and restore them in the teardown method; otherwise many other tests failed.
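A minimal sketch of that dependency-injection shape (all names here are hypothetical, not classifier-reborn's actual API): each classifier owns its own hasher and stopword set, so mutating one never affects another.

```ruby
# Hypothetical shape for injecting a hasher with its own stopwords.
# Names and structure are illustrative only.
class WordHasher
  attr_reader :stopwords

  def initialize(stopwords: %w[the a an of])
    @stopwords = stopwords.dup # per-instance state, never shared
  end

  def word_hash(text)
    text.downcase.scan(/[a-z']+/).reject { |w| stopwords.include?(w) }.tally
  end
end

class TinyClassifier
  def initialize(hasher: WordHasher.new)
    @hasher = hasher # injected dependency instead of module-level state
  end

  def features(text)
    @hasher.word_hash(text)
  end
end

a = TinyClassifier.new(hasher: WordHasher.new(stopwords: []))
b = TinyClassifier.new # default stopwords
a.features("the cat") # => {"the"=>1, "cat"=>1}
b.features("the cat") # => {"cat"=>1}
```

Because each hasher carries its own copy of the stopword list, tests no longer need setup/teardown gymnastics to restore shared state.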

@Ch4s3
Member Author

Ch4s3 commented Jan 18, 2017

To work around this in the tests, I had to store the original stopwords in an instance variable in the setup method and restore them in the teardown method; otherwise many other tests failed.

Yeah I noticed that. I think DI is the way to go then.

@ibnesayeed
Contributor

Yeah I noticed that. I think DI is the way to go then.

In fact, I had many other test cases in mind around stopwords that I could not put in place because they were seemingly very difficult (if not impossible) to implement. Similarly, some test cases could have been combined as assertions within a single test, but I had to separate them and duplicate most of the logic because of this stepping-over behavior.
