Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove enable_threshold config from Bayes classifier #123

Open
ibnesayeed opened this issue Jan 14, 2017 · 4 comments
Open

Remove enable_threshold config from Bayes classifier #123

ibnesayeed opened this issue Jan 14, 2017 · 4 comments

Comments

@ibnesayeed
Copy link
Contributor

Currently, the threshold option of the Bayes classifier only makes sense if the enable_threshold option is also set to true (at initialization or later dynamically). This tight interdependence makes me think that the threshold enabler flag is an overhead. However, by making the default value of the threshold property as Float::INFINITY we can cleverly get rid of the enable_threshold option. Let's have a look at the current implementation of the classify method:

def classify(text)
  result, score = classify_with_score(text)
  result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?
  result
end

If we change it to something like this:

def classify(text)
  result, score = classify_with_score(text)
  score < @threshold ? nil : result
end

The modified method will behave exactly the same as the current implementation with threshold_enabled set to true if the default value of threshold is Float::INFINITY.

However, there is a case where it will not behave the same as the current implementation. Suppose that the threshold_enabled is set to false and classify_with_score returned a class with score Infinity. In this case the current implementation would return that class while the modified method will return nil instead. This is obviously a breaking change. However, I think the current implementation, without the threshold in action, is broken anyway, so we might as well get rid of that behavior completely. Here is what I mean by broken:

[1] pry(main)> require 'classifier-reborn'
[2] pry(main)> b = ClassifierReborn::Bayes.new 'Foo', 'Bar'
[3] pry(main)> b.classifications("Baz")
=> {"Foo"=>Infinity, "Bar"=>Infinity}
[4] pry(main)> b.classify_with_score("Baz")
=> ["Foo", Infinity]

As illustrated above, for the text "Baz", the category Foo is as "irrelevant" as Bar, so why should return one, but not the other. Hence the nil is a better choice here in my opinion. One could argue that the score collision could happen even if the scores are not Infinity, but the keyword here is "irrelevant". In case of a real score value the distance of the classification text vector is same from the competing category vectors, so they are equally relevant and picking one or the other is equally good. However, if the score is Infinity then saying one Infinity is equal to the other Infinity is mathematically incorrect.

With this proposal in place, we can get rid of many unnecessary attributes and convenience methods (including threshold_enabled, threshold, threshold=, threshold_enabled?, threshold_disabled?, enable_threshold, and disable_threshold). I am not sure how useful these dynamic enablers and disablers are if the user has access to raw scores for making custom decisions. This will make the classifier API simpler for consumers, easier to document, and easier to maintain.

@ibnesayeed
Copy link
Contributor Author

I will add #47 here for reference.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 14, 2017

Yeah, I remember #47 pretty well. I agree that it's a bit broken, but I'd like to fix it rather than removing it. Perhaps we can fix it for 3.0, and make the false flag work correctly.

Looking at this line:
result = nil if score < @threshold || score == Float::INFINITY if threshold_enabled?

I feel like we do need a fix. I'm not opposed to cleaning up the API though.

@ibnesayeed
Copy link
Contributor Author

ibnesayeed commented Jan 14, 2017

I think you can add a label like 3.x so that we can resurface it when the times comes.

I agree that it's a bit broken, but I'd like to fix it rather than removing it.

The issue I described, if it is fixed, then I think enable_threshold will become completely irrelevant that can be achieved by setting the default threshold to Infinity. However, for the sake of backward compatibility, we can perhaps keep it for 2.x.

@Ch4s3 Ch4s3 added the 3.0 label Jan 14, 2017
@Ch4s3
Copy link
Member

Ch4s3 commented Jan 14, 2017

Agreed. Let's discuss this further once we make plans for 3.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants