Finding Distinct Counts

The first approximate aggregation provided by Elasticsearch is the cardinality metric. This provides the cardinality of a field, also called a distinct or unique count. You may be familiar with the SQL version:

SELECT COUNT(DISTINCT color)
FROM cars

Distinct counts are a common operation, and answer many fundamental business questions:

  • How many unique visitors have come to my website?

  • How many unique cars have we sold?

  • How many distinct users purchased a product each month?

We can use the cardinality metric to determine the number of car colors being sold at our dealership:

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color"
            }
        }
    }
}

Which returns a minimal response showing that we have sold three different-colored cars:

...
"aggregations": {
  "distinct_colors": {
     "value": 3
  }
}
...

We can make our example more useful: how many colors were sold each month? For that metric, we just nest the cardinality metric under a date_histogram:

GET /cars/transactions/_search?search_type=count
{
  "aggs" : {
      "months" : {
        "date_histogram": {
          "field": "sold",
          "interval": "month"
        },
        "aggs": {
          "distinct_colors" : {
              "cardinality" : {
                "field" : "color"
              }
          }
        }
      }
  }
}

Understanding the Tradeoffs

As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm. HLL works by hashing your input and using the bits from the hash to make probabilistic estimations on the cardinality.
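
If you are curious how hash bits can yield a count, here is an illustrative Python sketch of the same idea. The hash function, the precision p, and the omission of HLL's bias corrections are simplifications of our own; Elasticsearch's HLL++ implementation is more sophisticated:

import hashlib

def estimate_cardinality(values, p=14):
    """Rough HyperLogLog-style estimate -- illustrative only."""
    m = 1 << p                        # number of registers (2^p)
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value (md5 is just a convenient stand-in)
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)           # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # raw harmonic-mean estimate (omits HLL's small/large-range corrections)
    alpha = 0.7213 / (1 + 1.079 / m)
    return int(alpha * m * m / sum(2.0 ** -r for r in registers))

print(estimate_cardinality(f"user-{i}" for i in range(100000)))  # roughly 100000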

You don’t need to understand the technical details (although if you’re interested, the paper is a great read!), but you should be aware of the properties of the algorithm:

  • Configurable precision, which controls memory usage (more precise == more memory)

  • Excellent accuracy on low-cardinality sets

  • Fixed memory usage. No matter if there are thousands or billions of unique values, memory usage only depends on the configured precision.

To configure the precision, you must specify the precision_threshold parameter. This threshold defines the point under which cardinalities are expected to be very close to accurate. So for example:

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color",
              "precision_threshold" : 100 (1)
            }
        }
    }
}
  1. precision_threshold accepts a number from 0 to 40,000. Larger values are treated as equivalent to 40,000.

This example will ensure that fields with 100 or fewer distinct values will be extremely accurate. Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric.

For a given threshold, the HLL data-structure will use about precision_threshold * 8 bytes of memory. So you must balance how much memory you are willing to sacrifice for additional accuracy.

Practically speaking, a threshold of 100 maintains an error under 5% even when counting millions of unique values.
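
As a back-of-the-envelope illustration of that formula (the exact per-bucket overhead may differ slightly in practice):

# Approximate HLL memory per bucket: precision_threshold * 8 bytes
for threshold in (100, 1000, 40000):
    print(f"precision_threshold={threshold:>6}: ~{threshold * 8 / 1024:.1f} KB")

# precision_threshold=   100: ~0.8 KB
# precision_threshold=  1000: ~7.8 KB
# precision_threshold= 40000: ~312.5 KB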

Optimizing for Speed

If you want a distinct count, you usually want to query your entire data-set (or nearly all of it). Any operation on all your data needs to execute quickly, for obvious reasons. HyperLogLog is very fast already — it simply hashes your data and does some bit-twiddling.

But if speed is very important to you, we can optimize it a little bit further. Since HLL simply needs the hash of the field, we can pre-compute that hash at index time. When the query executes, we can skip the hash computation and load the value directly out of fielddata.

Note

Pre-computing hashes is only useful on very large and/or high-cardinality fields. Calculating the hash on these fields is non-negligible at query-time.

However, numeric fields hash very quickly and storing the original numeric often requires the same (or less) memory. This is also true on low-cardinality string fields — there are internal optimizations which guarantee that hashes are calculated only once per unique value.

Basically, pre-computing hashes is not guaranteed to make all fields faster — only those that have high cardinality and/or large strings. And remember, pre-computing simply shifts the cost to index-time. You still pay the price, you just choose when to pay it.

To do this, we need to add a new multi-field to our data. We’ll delete our index, add a new mapping which includes the hashed field, then reindex:

DELETE /cars/

PUT /cars/
{
  "mappings": {
    "transactions": {
      "properties": {
        "color": {
          "type": "string",
          "fields": {
            "hash": {
              "type": "murmur3" (1)
            }
          }
        }
      }
    }
  }
}

POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
  1. This multi-field is of type murmur3, which is a hashing function

Now when we run an aggregation, we use the "color.hash" field instead of the "color" field:

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color.hash" (1)
            }
        }
    }
}
  1. Notice that we specify the hashed multi-field, rather than the original

Now the cardinality metric will load the values (that is, the pre-computed hashes) from "color.hash" and use those in place of dynamically hashing the original value.

The savings per document are small, but if hashing each field adds an additional 10ns and your aggregation touches 100 million documents, that adds an additional 1s per query. If you find yourself using cardinality across many documents, perform some profiling to see if pre-computing hashes makes sense for your deployment.