As useful as phrase and proximity queries can be, they still have a downside.
They are overly strict: all terms must be present for a phrase query to match,
even when using slop
.
The flexibility in word ordering that you gain with slop
also comes at a
price, because you lose the association between word pairs. While you can
identify documents where sue
, alligator
and ate
occur close together,
you can’t tell whether Sue ate…'' or the
alligator ate…''.
When words are used in conjunction with each other, they express an idea that
is bigger or more meaningful than each word in isolation. The two clauses
I’m not happy…'' and
Happy I’m not…'' contain the sames words, in
close proximity, but have quite different meanings.
If, instead of indexing each word independently, we were to index pairs of words then we could retain more of the context in which the words were used.
For the sentence "Sue ate the alligator"
, we would not only index each word
(or unigram) as a term:
["sue", "ate", "the", "alligator"]
but also each word and its neighbour as single terms:
["sue ate", "ate the", "the alligator"]
These word pairs (or bigrams) are known as shingles.
Tip
|
Shingles are not restricted to being pairs of words; you could index word triplets (trigrams) as well: ["sue ate the", "ate the alligator"] Trigrams give you a higher degree of precision, but greatly increases the number of unique terms in the index. Bigrams are sufficient for most use cases. |
Of course, shingles are only useful if the user enters their query in the same
order as in the original document; a query for "sue alligator"
would match
the individual words but none of our shingles.
Fortunately, users tend to express themselves using similar constructs to those that appear in the data they are searching. But this point is an important one: it is not enough to index just bigrams; we still need unigrams, but we can use matching bigrams as a signal to increase the relevance score.
Shingles need to be created at index time as part of the analysis process. We could index both unigrams and bigrams into a single field, but it is cleaner to keep unigrams and bigrams in separate fields which can be queried independently. The unigram field would form the basis of our search, with the bigram field being used to boost relevance.
First, we need to create an analyzer which uses the shingle
token filter:
DELETE /my_index
PUT /my_index
{
"settings": {
"number_of_shards": 1, (1)
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2, (2)
"max_shingle_size": 2, (2)
"output_unigrams": false (3)
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter" (4)
]
}
}
}
}
}
-
The default min/max shingle size is
2
so we don’t really need to set these. -
The
shingle
token filter outputs unigrams by default, but we want to keep unigrams and bigrams separate. -
The
my_shingle_analyzer
uses our custommy_shingles_filter
token filter.
First, let’s test that our analyzer is working as expected with the analyze
API:
GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator
Sure enough, we get back three terms:
-
"sue ate"
-
"ate the"
-
"the alligator"
Now we can proceed to setting up a field to use the new analyzer.
We said that it is cleaner to index unigrams and bigrams separately, so we
will create the title
field as a multi-field (see Multi-fields):
PUT /my_index/_mapping/my_type
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
With this mapping, values from our JSON document in the field title
will be
indexed both as unigrams (title
) and as bigrams (title.shingles
), meaning
that we can query these fields independently.
And finally, we can index our example documents:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }
To understand the benefit that the shingles
field adds, let’s first look at
the results from a simple match
query for ``The hungry alligator ate Sue'':
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": "the hungry alligator ate sue"
}
}
}
This query returns all three documents, but note that documents 1
and 2
have the same relevance score because they contain the same words:
{
"hits": [
{
"_id": "1",
"_score": 0.44273707, (1)
"_source": {
"title": "Sue ate the alligator"
}
},
{
"_id": "2",
"_score": 0.44273707, (1)
"_source": {
"title": "The alligator ate Sue"
}
},
{
"_id": "3", (2)
"_score": 0.046571054,
"_source": {
"title": "Sue never goes anywhere without her alligator skin purse"
}
}
]
}
-
Both documents contain
the
,alligator
andate
and so have the same score. -
We could have excluded document
3
by setting theminimum_should_match
parameter. See [match-precision].
Now let’s add the shingles
field into the query. Remember that we want
matches on the shingles
field to act as a signal — to increase the
relevance score — so we still need to include the query on the main title
field:
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": "the hungry alligator ate sue"
}
},
"should": {
"match": {
"title.shingles": "the hungry alligator ate sue"
}
}
}
}
}
We still match all three documents, but document 2
has now been bumped into
first place because it matched the shingled term "ate sue"
.
{
"hits": [
{
"_id": "2",
"_score": 0.4883322,
"_source": {
"title": "The alligator ate Sue"
}
},
{
"_id": "1",
"_score": 0.13422975,
"_source": {
"title": "Sue ate the alligator"
}
},
{
"_id": "3",
"_score": 0.014119488,
"_source": {
"title": "Sue never goes anywhere without her alligator skin purse"
}
}
]
}
Even though our query included the word "hungry"
, which doesn’t appear in
any of our documents, we still managed to use word proximity to return the
most relevant document first.
Not only are shingles more flexible than phrase queries, they perform better
as well. Instead of paying the price of a phrase query every time you search,
queries for shingles are just as efficient as a simple match
query.
There is a small cost that is payed at index time because more terms need to be indexed, which also means that fields with shingles use more disk space. However, most applications write once and read many times, so it makes sense to optimize for fast queries.
This is a theme that you will encounter frequently in Elasticsearch: it makes it possible to achieve a lot with your existing data, without requiring any setup. But once you understand your requirements better it is worth putting in the extra effort to model your data at index time. A little bit of preparation will help you to achieve better results with better performance.