
Other classification/nlp tools #88

Open
Ch4s3 opened this issue Jan 2, 2017 · 13 comments

@Ch4s3
Member

Ch4s3 commented Jan 2, 2017

We already do a bag of words and word counts. Would it be useful to anyone to expose this functionality for other classification uses?
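For context, a bag of words is just a map from each token to its count. A minimal pure-Ruby sketch (illustrative only, independent of classifier-reborn's internal implementation) might look like:

```ruby
# Minimal bag-of-words sketch: tokenize, then map each word to its count.
# Illustrative only; not classifier-reborn's actual code.
def bag_of_words(text)
  text.downcase.scan(/[a-z']+/).tally
end

bag_of_words("the cat sat on the mat")
# => {"the"=>2, "cat"=>1, "sat"=>1, "on"=>1, "mat"=>1}
```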

Some other things to consider:

  • N-grams
  • Levenshtein distance
  • Sentiment analysis
  • Term frequency–inverse document frequency (TF–IDF)
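The first two items in that list are small enough to sketch in plain Ruby (illustrative sketches only, not classifier-reborn API):

```ruby
# Word-level n-grams: consecutive runs of n tokens.
def ngrams(tokens, n)
  tokens.each_cons(n).to_a
end

# Levenshtein distance via the classic dynamic-programming table,
# keeping only the previous row to stay O(len(b)) in space.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

ngrams(%w[the quick brown fox], 2)
# => [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
levenshtein("kitten", "sitting")
# => 3
```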
@Ch4s3
Member Author

Ch4s3 commented Jan 6, 2017

@tra38 could you elaborate on which part(s) you're interested in?

@tra38
Contributor

tra38 commented Jan 6, 2017

Sure. Since classifier-reborn already collects a bunch of data, it makes sense to publicly expose the data it gathers, so that a programmer can feed it into other gems that handle different classification/NLP tasks.

For example, I've been clustering articles using classifier-reborn and kmeans-clusterer with the following code snippet:

```ruby
require 'classifier-reborn'
require 'kmeans-clusterer'

lsi = ClassifierReborn::LSI.new

strings = ["example string a", "example string b", "example string c"]

strings.each do |x|
  lsi.add_item(x)
end

# Save transformed ClassifierReborn content nodes into a new hash
string_data = lsi.instance_variable_get(:@items)

# Process the information for use in kmeans-clusterer
data = strings.map do |string|
  string_data[string].lsi_norm.to_a
end

clusters = 2 # must not exceed the number of data points
kmeans = KMeansClusterer.run clusters, data, labels: strings, runs: 10
```

And obviously, it's kinda hacky to dig out the lsi_norm for each individual content node just so you can do some k-means clustering, which is why I gave a "thumbs up" to exposing this data more directly. (And if I'm using some aspect of classifier-reborn strangely here, then some other programmer will use bags of words and word counts strangely as well. Expose all the data, trust the programmer.)

@Ch4s3
Member Author

Ch4s3 commented Jan 6, 2017

@tra38 I think we could expose the LSI data. It'll probably take some careful refactoring, but should be doable.

@Looooong
Contributor

Will there be multiple classification? For example: given an input, classify it into more than one category.

@Ch4s3
Member Author

Ch4s3 commented Jan 11, 2017

@Looooong Not with Bayes; that's not really how it works.

@ibnesayeed
Contributor

@Looooong: Will there be multiple classification? For example: given an input, classify it into more than one category.

You can get the raw score of each category against a given text in Bayes. This way you can decide to get top-K relevant categories, if that is what you are after.
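That top-K idea can be sketched generically. The score hash below is an illustrative stand-in for whatever per-category scores the classifier returns (the exact shape and values here are assumptions, not classifier-reborn output):

```ruby
# Illustrative per-category scores (log-likelihood-style values);
# in practice these would come from the classifier's raw scoring.
scores = { "Sports" => -12.1, "Politics" => -15.7, "Tech" => -11.4, "Arts" => -19.9 }

# Top-K relevant categories: the K highest-scoring entries, best first.
top_k = scores.max_by(2) { |_category, score| score }.map(&:first)
# => ["Tech", "Sports"]
```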

@ibnesayeed
Contributor

Should we also consider adding ruby-fann (Fast Artificial Neural Network)? It won't be good for text data, I guess, but for numeric stuff it would be great.

@Looooong
Contributor

@ibnesayeed Yes, I am planning to compute multiple scores with Bayes, but I guess it will take up a large amount of storage space.

@ibnesayeed
Contributor

@Looooong it really depends on the amount of training data. Between Bayes and LSI, the first one would take relatively less space. If you have a huge amount of data, then here are a few things you can do:

  • Use the newly introduced Redis backend for storage, which would still take the required amount of memory, but it can be off-loaded to a remote machine that has high memory. Additionally, it will persist the data on the disk in case of any sudden crashes.
  • Use a sample of the training data, not the whole of it. Then throw a bunch of test data at it to see how well it is performing and how many false positives and false negatives you are getting. If the classifier is giving satisfactory results, there is no need to train further; otherwise, train with more data and measure the results again. This way you can find the right balance between how much memory you can afford and the minimum accuracy you can accept as the trade-off.
  • If you really want to use all the training data but can't afford enough memory, then you can implement an ORM backend and save the model in your favorite database. This would be terribly slow compared with the Memory backend, but you can train with petabytes of data. Implementing that won't be difficult, as the storage stuff was abstracted recently.
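The second point — training on a sample and measuring false positives/negatives on held-out data — can be sketched generically. The classifier here is a trivial keyword stub (a stand-in, not a real Bayes model); any callable that returns a category works the same way:

```ruby
# Generic evaluation sketch: count false positives/negatives for one
# target category over labeled held-out data.
def evaluate(classifier, test_data, target)
  fp = fn = 0
  test_data.each do |text, actual|
    predicted = classifier.call(text)
    fp += 1 if predicted == target && actual != target
    fn += 1 if predicted != target && actual == target
  end
  { false_positives: fp, false_negatives: fn }
end

# Trivial keyword stub standing in for a trained classifier.
stub = ->(text) { text.include?("ball") ? "Sports" : "Other" }

test_data = [
  ["the ball game", "Sports"],
  ["stock markets", "Other"],
  ["crystal ball readings", "Other"], # stub misfires here
  ["tennis match", "Sports"],         # stub misses here
]

evaluate(stub, test_data, "Sports")
# => {:false_positives=>1, :false_negatives=>1}
```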

@Ch4s3
Member Author

Ch4s3 commented Jan 17, 2017

Since Hasher is basically a bag-of-words implementation, it might make sense to rename it accordingly, make it public, and document it as such. Thoughts? @ibnesayeed?

@ibnesayeed
Contributor

Since Hasher is basically a bag-of-words implementation, it might make sense to rename it accordingly, make it public, and document it as such. Thoughts? @ibnesayeed?

I agree. However, I would note one thing I encountered today while writing tests for stopwords: the hasher needs to be instantiated and passed in as a dependency during classifier initialization, so that one classifier does not step on another's state through shared data. Currently, if two classifiers are instantiated in a single program with different configurations, and one of them changes the set of stopwords, the other classifier is also affected. To work around this in the tests, I had to store the original stopwords in an instance variable in the setup method and restore them in the teardown method; otherwise many other tests failed.
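A minimal sketch of that dependency-injection shape (all names here are hypothetical, not classifier-reborn's actual API): each classifier owns its own hasher and stopword set, so mutating one never affects another.

```ruby
# Hypothetical shape for injecting a hasher with its own stopwords.
# Names and structure are illustrative only.
class WordHasher
  attr_reader :stopwords

  def initialize(stopwords: %w[the a an of])
    @stopwords = stopwords.dup # per-instance state, never shared
  end

  def word_hash(text)
    text.downcase.scan(/[a-z']+/).reject { |w| stopwords.include?(w) }.tally
  end
end

class TinyClassifier
  def initialize(hasher: WordHasher.new)
    @hasher = hasher # injected dependency instead of module-level state
  end

  def features(text)
    @hasher.word_hash(text)
  end
end

a = TinyClassifier.new(hasher: WordHasher.new(stopwords: []))
b = TinyClassifier.new # default stopwords
a.features("the cat") # => {"the"=>1, "cat"=>1}
b.features("the cat") # => {"cat"=>1}
```

Because each hasher carries its own copy of the stopword list, tests no longer need setup/teardown gymnastics to restore shared state.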

@Ch4s3
Member Author

Ch4s3 commented Jan 18, 2017

To work around this in the tests, I had to store the original stopwords in an instance variable in the setup method and restore them in the teardown method; otherwise many other tests failed.

Yeah I noticed that. I think DI is the way to go then.

@ibnesayeed
Contributor

Yeah I noticed that. I think DI is the way to go then.

In fact, I had many other test cases in mind around stopwords that I could not put in place because they were seemingly very difficult (if not impossible) to implement. Similarly, some test cases could have been combined as assertions within a single test, but I had to separate them and duplicate most of the logic because of this stepping-over behavior.
