Improving search with Levenshtein distance or similar algorithms #5325

Open
KuramaSyu opened this issue Nov 15, 2024 · 3 comments

@KuramaSyu
Describe the feature you'd like

When searching for something, I sometimes don't find what I need because I have a typo or named the title slightly differently. An example would be searching for settings.json when the title is settings json, or the other way around. Another example would be searching for Linxu instead of Linux, where I currently wouldn't get any results.

I would be happy to contribute this, but first I want to ask whether it is even wanted.

Describe the benefits this would bring to existing BookStack users

A better search feature, and hence a better overall experience, since search is in my opinion one of the most important features.

Can the goal of this request already be achieved via other means?

Yes, there is an issue that proposes an AI-based implementation, but that would be far more expensive than using a Levenshtein distance algorithm, even if that is not the main purpose of issue #5318.

Have you searched for an existing open/closed issue?

  • I have searched for existing issues and none cover my fundamental request

How long have you been using BookStack?

1 to 5 years

Additional context

No response

@KuramaSyu KuramaSyu changed the title Improving search with Levenshtein distance Improving search with Levenshtein distance or similar algorithms Nov 15, 2024
@ssddanbrown
Member

Hi @KuramaSyu,
Thanks for offering to contribute.

I would be happy to consider changes if they can fit into the current general structure & scope of how search has been implemented, where added complexity is minimal and where existing efficiency strategies (like use of database indexes for normal search terms) can remain.

Even if a solution fits the above, we'd still need to evaluate the end result.
Early on I made use of MySQL fulltext indexes, which bring a level of their own fuzzy logic, but that kind of logic brings its own challenges & complexities. That led me to switch to the simpler & more predictable exact-prefix match we have now, rather than delve into exploring and supporting the various levels of fine-tuning, across different content and environments, that fuzzier searches tend to call for.
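For illustration, a minimal sketch of that difference, assuming a simplified search_terms table with term and entity_id columns (not the exact BookStack schema):

```sql
-- Fulltext matching (requires a FULLTEXT index on `term`) brings its own
-- relevance/fuzziness behaviour:
SELECT entity_id FROM search_terms
WHERE MATCH(term) AGAINST('linux' IN NATURAL LANGUAGE MODE);

-- Exact-prefix matching can use a plain B-tree index on `term`, keeping
-- lookups simple and predictable:
SELECT entity_id FROM search_terms
WHERE term LIKE 'linu%';
```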

For anything too complex (in terms of added technology/dependency requirements or logical implementation) I'd view it like LLMs, where I'd rather look to provide interfaces for external options instead of supporting it ourselves.

@KuramaSyu
Author

KuramaSyu commented Nov 17, 2024

@ssddanbrown could you tell me where I would need to look in the repo? I found the search directory, but I have no idea what the "start point" there is and where the SQL statements are built.

Since MySQL is used, I looked into what it offers. One option is MySQL's SOUNDEX, which determines whether two strings sound similar; the other is manually adding a MySQL function for Levenshtein distance. We would of course also need to check whether Levenshtein is fast enough and even feasible when comparing single words against full titles, which are naturally not similar.
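A quick sketch of the SOUNDEX idea (the search_terms table and columns here are simplified assumptions, not the actual schema):

```sql
-- Both spellings should produce the same phonetic code:
SELECT SOUNDEX('Linux'), SOUNDEX('Linxu');

-- SOUNDS LIKE is MySQL shorthand for comparing SOUNDEX() values:
SELECT entity_id FROM search_terms
WHERE term SOUNDS LIKE 'linxu';

-- Note: wrapping the column in a function means the normal index on `term`
-- cannot be used; an indexed generated column storing SOUNDEX(term) would be
-- one way to keep such lookups efficient.
```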

Another option would be using PostgreSQL, since it has a similarity function built in which works quite well. But I guess that isn't possible here.
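For reference, that PostgreSQL capability comes from the pg_trgm extension, roughly:

```sql
-- PostgreSQL only: trigram similarity from the pg_trgm extension,
-- returning a value between 0 and 1.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT similarity('Linux', 'Linxu');
```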

@ssddanbrown
Member

@KuramaSyu

Indexing is done here: https://github.com/BookStackApp/BookStack/blob/development/app/Search/SearchIndex.php
Searching is run here: https://github.com/BookStackApp/BookStack/blob/development/app/Search/SearchRunner.php

Logic Summary

During indexing, content for an entity (book/chapter/page) is split into words, with words reduced down to a score per entity per word; frequency and location (titles and headings are boosted, for example) impact that score. This is all stored in a search_terms table.
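As a rough illustration (hypothetical values and simplified column names rather than the exact schema), the stored rows for one page might look like:

```sql
-- One row per word per entity; each score already folds in frequency and
-- location boosts (e.g. title/heading occurrences score higher).
INSERT INTO search_terms (term, entity_type, entity_id, score) VALUES
  ('installing', 'page', 42, 5),   -- title word, boosted
  ('linux',      'page', 42, 7),   -- in the title and repeated in the body
  ('kernel',     'page', 42, 1);   -- single body occurrence
```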

When a search is performed, we split out normal search terms, then query against search_terms, summing the combined score (incoming term score and matched term score) for matches to use as an order/filter for the search results.

The incoming (search query) terms also have their own score adjustments made based on their frequency, to bump the score of less common words. This acts as a cheap, runtime approximation of tf–idf.
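A rough sketch of the query shape this implies (simplified; the real logic lives in SearchRunner.php and the exact score combination may differ):

```sql
-- For an incoming term like "linu", prefix-match against indexed terms and
-- sum scores per entity to order results. The 1.5 multiplier stands in for
-- the frequency-based adjustment applied to the incoming term.
SELECT entity_type, entity_id, SUM(score * 1.5) AS total_score
FROM search_terms
WHERE term LIKE 'linu%'
GROUP BY entity_type, entity_id
ORDER BY total_score DESC;
```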
