Skip to content

Latest commit

 

History

History
235 lines (188 loc) · 8.3 KB

15_Practical_scoring.asciidoc

File metadata and controls

235 lines (188 loc) · 8.3 KB

Lucene’s Practical Scoring Function

For multi-term queries, Lucene takes the Boolean Model, TF/IDF, and the Vector Space Model and combines them in a single efficient package which collects matching documents and scores them as it goes.

A multi-term query like:

GET /my_index/doc/_search
{
  "query": {
    "match": {
      "text": "quick fox"
    }
  }
}

is rewritten internally to look like this:

GET /my_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": { "text": "quick" }},
        {"term": { "text": "fox"   }}
      ]
    }
  }
}

The bool query implements the Boolean Model and, in this example, will only include documents which contain either the term quick or the term fox or both.

As soon as a document matches a query, Lucene calculates its score for that query, combining the scores of each matching term. The formula used for scoring is called the Practical Scoring Function. It looks intimidating, but don’t be put off — most of the components you already know. It introduces a few new elements which we discuss below.

score(q,d)  =  (1)
            queryNorm(q)  (2)
          · coord(q,d)    (3)
          · ∑ (           (4)
                tf(t in d)   (5)
              · idf(t)²      (6)
              · t.getBoost() (7)
              · norm(t,d)    (8)
            ) (t in q)    (4)
  1. score(q,d) is the relevance score of document d for query q.

  2. queryNorm(q) is the query normalization factor. [NEW]

  3. coord(q,d) is the coordination factor. [NEW]

  4. The sum of the weights for each term t in the query q for document d.

  5. tf(t in d) is the term frequency for term t in document d.

  6. idf(t) is the inverse document frequency for term t.

  7. t.getBoost() is the boost that has been applied to the query. [NEW]

  8. norm(t,d) is the field length norm, combined with the index-time field-level boost, if any. [NEW]

You should recognise score, tf, and idf. The queryNorm, coord, t.getBoost and norm are new.

We will talk more about query-time boosting later in this chapter, but first let’s get query normalization, coordination, and index-time field-level boosting out of the way.

Query normalization factor

The query normalization factor (queryNorm) is an attempt to ``normalize'' a query so that the results from one query may be compared with the results of another.

Tip

Even though the intent of the query norm is to make results from different queries comparable, it doesn’t work very well. Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries.

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved but a typical implementation would be:

queryNorm = 1 / √sumOfSquaredWeights (1)
  1. The sumOfSquaredWeights is calculated by adding together the IDF of each term in the query, squared.

Tip
The same query normalization factor is applied to every document and you have no way of changing it. For all intents and purposes, it can be ignored.

Query coordination

The coordination factor (coord) is used to reward documents which contain a higher percentage of the query terms — the more query terms that appear in the document, the greater the chances that the document is a good match for the query.

Imagine that we have a query for quick brown fox, and that the weight for each term is 1.5. Without the coordination factor, the score would just be the sum of the weights of the terms in a document. For instance:

  • Document with fox → score: 1.5

  • Document with quick fox → score: 3.0

  • Document with quick brown fox → score: 4.5

The coordination factor multiplies the score by the number of matching terms in the document, and divides it by the total number of terms in the query. With the coordination factor, the scores would be as follows:

  • Document with fox → score: 1.5 * 1 / 3 = 0.5

  • Document with quick fox → score: 3.0 * 2 / 3 = 2.0

  • Document with quick brown fox → score: 4.5 * 3 / 3 = 4.5

The coordination factor results in the document that contains all 3 terms being much more relevant than the document that contains just 2 of them.

Remember that the query for quick brown fox is rewritten into a bool query like this:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "brown" }},
        { "term": { "text": "fox"   }}
      ]
    }
  }
}

The bool query uses query coordination by default for all should clauses, but it does allow you to disable coordination. Why might you want to do this? Well, usually the answer is: you don’t. Query coordination is usually a good thing. When you use a bool query to wrap several high-level queries like the match query, it also makes sense to leave coordination enabled. The more clauses that match, the higher the degree of overlap between your search request and the documents that are returned.

However, in some advanced use cases it might make sense to disable coordination. Imagine that you are looking for synonyms: jump, leap, and hop. You don’t care how many of these synonyms are present as they all represent the same concept. In fact, only one of the synonyms is likely to be present. This would be a good case for disabling the coordination factor:

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "jump" }},
        { "term": { "text": "hop"  }},
        { "term": { "text": "leap" }}
      ]
    }
  }
}

In fact, when you use synonyms (see [synonyms]), this is exactly what happens internally — the rewritten query disables coordination for the synonyms. Most use cases for disabling coordination are handled automatically — you don’t need to worry about it.

Index time field-level boosting

We will talk about boosting a field — making it more important than other fields — at query time in [query-time-boosting] below. It is also possible to apply a boost to a field at index time. Actually, this boost is applied to every term in the field, rather than to the field itself.

In order to store this boost value in the index without using up more space than necessary, this field-level index-time boost is combined with the field length norm (see [field-norm]) and stored in the index as a single byte. This is the value returned by norm(t,d) in the formula above.

Warning

We strongly recommend against using field-level index-time boosts for a few reasons:

  1. Combining the boost with the field length norm and storing it in a single byte means that the field length norm loses precision. The result is that Elasticsearch is unable to distinguish between a field containing 3 words and a field containing 5 words.

  2. To change an index-time boost, you have to reindex all of your documents. A query-time boost, on the other hand, can be changed with every query.

  3. If a field with index-time boost has multiple values, the boost is multiplied with itself for every value, dramatically increasing the weight for that field.

Query-time boosting is a much simpler, cleaner, more flexible option.

With query normalization, coordination and index-time boosting out of the way, we can now move on to the most useful tool for influencing the relevance calculation: query-time boosting.