Stopwords and relevance

The last topic to cover before moving on from stopwords is that of relevance. Leaving stopwords in your index can potentially make the relevance calculation less accurate, especially if your documents are very long.

As we have already discussed in [bm25-saturation], the reason for this is that Term frequency/Inverse document frequency doesn’t impose an upper limit on the impact of term frequency. Very common words may have a low weight because of inverse document frequency but, in long documents, the sheer number of occurrences of stopwords in a single document may lead to their weight being artificially boosted.

You may want to consider using the Okapi BM25 similarity on long fields which include stopwords instead of the default Lucene similarity.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

70_Relevance.asciidoc

70_Relevance.asciidoc

Stopwords and relevance

Files

70_Relevance.asciidoc

Latest commit

History

70_Relevance.asciidoc

File metadata and controls

Stopwords and relevance