Remove enable_threshold config from Bayes classifier #123

ibnesayeed · 2017-01-14T03:17:39Z

Currently, the threshold option of the Bayes classifier only makes sense if the enable_threshold option is also set to true (at initialization or later dynamically). This tight interdependence makes me think that the threshold enabler flag is an overhead. However, by making the default value of the threshold property as Float::INFINITY we can cleverly get rid of the enable_threshold option. Let's have a look at the current implementation of the classify method:

def classify(text)
  result, score = classify_with_score(text)
  result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?
  result
end

If we change it to something like this:

def classify(text)
  result, score = classify_with_score(text)
  score < @threshold ? nil : result
end

The modified method will behave exactly the same as the current implementation with threshold_enabled set to true if the default value of threshold is Float::INFINITY.

However, there is a case where it will not behave the same as the current implementation. Suppose that the threshold_enabled is set to false and classify_with_score returned a class with score Infinity. In this case the current implementation would return that class while the modified method will return nil instead. This is obviously a breaking change. However, I think the current implementation, without the threshold in action, is broken anyway, so we might as well get rid of that behavior completely. Here is what I mean by broken:

[1] pry(main)> require 'classifier-reborn'
[2] pry(main)> b = ClassifierReborn::Bayes.new 'Foo', 'Bar'
[3] pry(main)> b.classifications("Baz")
=> {"Foo"=>Infinity, "Bar"=>Infinity}
[4] pry(main)> b.classify_with_score("Baz")
=> ["Foo", Infinity]

As illustrated above, for the text "Baz", the category Foo is as "irrelevant" as Bar, so why should return one, but not the other. Hence the nil is a better choice here in my opinion. One could argue that the score collision could happen even if the scores are not Infinity, but the keyword here is "irrelevant". In case of a real score value the distance of the classification text vector is same from the competing category vectors, so they are equally relevant and picking one or the other is equally good. However, if the score is Infinity then saying one Infinity is equal to the other Infinity is mathematically incorrect.

With this proposal in place, we can get rid of many unnecessary attributes and convenience methods (including threshold_enabled, threshold, threshold=, threshold_enabled?, threshold_disabled?, enable_threshold, and disable_threshold). I am not sure how useful these dynamic enablers and disablers are if the user has access to raw scores for making custom decisions. This will make the classifier API simpler for consumers, easier to document, and easier to maintain.

The text was updated successfully, but these errors were encountered:

ibnesayeed · 2017-01-14T03:22:49Z

I will add #47 here for reference.

Ch4s3 · 2017-01-14T16:29:59Z

Yeah, I remember #47 pretty well. I agree that it's a bit broken, but I'd like to fix it rather than removing it. Perhaps we can fix it for 3.0, and make the false flag work correctly.

Looking at this line:
result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?

I feel like we do need a fix. I'm not opposed to cleaning up the API though.

ibnesayeed · 2017-01-14T16:48:46Z

I think you can add a label like 3.x so that we can resurface it when the times comes.

I agree that it's a bit broken, but I'd like to fix it rather than removing it.

The issue I described, if it is fixed, then I think enable_threshold will become completely irrelevant that can be achieved by setting the default threshold to Infinity. However, for the sake of backward compatibility, we can perhaps keep it for 2.x.

Ch4s3 · 2017-01-14T16:51:34Z

Agreed. Let's discuss this further once we make plans for 3.0

Ch4s3 added bug discussion labels Jan 14, 2017

Ch4s3 added the 3.0 label Jan 14, 2017

ibnesayeed mentioned this issue Jan 17, 2017

don't train or untrain empty word hashes #132

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove enable_threshold config from Bayes classifier #123

Remove enable_threshold config from Bayes classifier #123

ibnesayeed commented Jan 14, 2017

ibnesayeed commented Jan 14, 2017

Ch4s3 commented Jan 14, 2017

ibnesayeed commented Jan 14, 2017 •

edited

Loading

Ch4s3 commented Jan 14, 2017

Remove enable_threshold config from Bayes classifier #123

Remove enable_threshold config from Bayes classifier #123

Comments

ibnesayeed commented Jan 14, 2017

ibnesayeed commented Jan 14, 2017

Ch4s3 commented Jan 14, 2017

ibnesayeed commented Jan 14, 2017 • edited Loading

Ch4s3 commented Jan 14, 2017

ibnesayeed commented Jan 14, 2017 •

edited

Loading