Skip to content

How Search in 1.6 works

Chris Li edited this page Jun 26, 2016 · 21 revisions

Now imagine you have 1 zim file and its idx folder. You can perform 2 type of search, title search (searchSuggestionSmart) and indexed search (xapian). You may get very different results from them, since they work very differently.

So the question is, how do we get the best of both worlds?

Step 1: Perform both searches

Perform both searches and get an array of results from each search. So now you have two arrays, both containing objects like this:

class SearchResult {
    let bookID: String
    let title: String
    let path: String?
    let snippet: String?
    let probability: Double?
}

Step 2: Create hash table

It is likely that you may get the same article twice from the search above. We create a [title: SearchResult] or [path: SearchResult] dic/hash to get rid of article that appeared twice.

In 1.6 BUILD 705, I use [title: SearchResult]. And choose to get rid of title search result and keep xapian search result if needed, since xapian results have snippets.

Step 3: Calculate Levenshtein distance

Now we need a proper way to sort SearchResult objects from the dic/hash above. We don't want to use caseInsensitiveCompare on title, that'd be the same as title search. So we need to create a new metric.

In 1.6 I calculate Levenshtein distance on lower case string of search term and result titles. It describes similarity between search term & title.

You may have noticed a big flaw in our system till this step. Our system gives xapian results a great disadvantage. The most relevant xapian result may have much longer titles than the search term, but Levenshtein distance of such result is long and the rank is low. We will address it in the next step!

Step 4: Utilize xapian probability

If xapian gives a search result 100% probability, what does it mean? Well, it means xapian is quite sure this is exactly the article you want. It is very confident. So, we need to give such an article a boost, despite they may have (very) long Levenshtein distance to our search term. On the other hand, if xapian gives an article very low probability, then this article may have nothing to do with our search term. Wen may need to give such articles a penalty, even though their title may share great similarity with the search term. (much less likely to happen)

A lot of times, xapian results congregate on the high end of the probability range. In other words, you get a lot of articles with 100%, 99%, etc., but not a lot with 75%, 66%, 45% and things like that. To better differentiate them, we need a non-liner map from probability to a boost/penalty factor. This also allow us not to give too much penalty on xapian result with low probability.

In 1.6, I use f(x) = ln(n - m * prob) as this boost factor, the default value is derived from:

  • ln(n - m * 1.00) = 0.1
  • ln(n - m * 0.75) = 1.0

Solve for m & n, you get m = 6.4524 and n = 7.5576, see plot here.

The first order derivative of f(x) is f'(x) = 1 / (x - 1.17129). f(x) decrease faster when x = 1, slower when x = 0.

Step 5: Calculate rank score

  • For indexed search results: RankScore = Levenshtein distance * f(x)
  • For title search results: RankScore = Levenshtein distance * 1

Sort ascending by RankScore gives you final search ranks!

Note: In version 1.6 alphas, you can head over to Setting -> Boost Factor 🚀 to set your own (x1, y1) & (x2, y2), tap calculate, the app will solve m and n for you and use it in the next search.