Improving search with Levenshtein distance or similar algorithms #5325

Open
KuramaSyu opened this issue Nov 15, 2024 · 3 comments

@KuramaSyu
Describe the feature you'd like

When searching for something, I sometimes don't find what I need because I have a typo or named the title slightly differently. An example would be searching for settings.json when the title is settings json, or the other way around. Another example would be searching for Linxu instead of Linux, where I currently wouldn't get any results.

I would be happy to contribute this, but first I want to ask whether it is even wanted.

Describe the benefits this would bring to existing BookStack users

A better search feature, and hence a better overall experience, since search is in my opinion one of the most important features.

Can the goal of this request already be achieved via other means?

Yes, there is an issue that proposes an AI-based implementation, but that would be far more expensive than using a Levenshtein distance algorithm, even if that is not the main purpose of issue #5318.

Have you searched for an existing open/closed issue?

  • I have searched for existing issues and none cover my fundamental request

How long have you been using BookStack?

1 to 5 years

Additional context

No response

@KuramaSyu KuramaSyu changed the title Improving search with Levenshtein distance Improving search with Levenshtein distance or similar algorithms Nov 15, 2024
@ssddanbrown
Member

Hi @KuramaSyu,
Thanks for offering to contribute.

I would be happy to consider changes if they can fit into the current general structure & scope of how search has been implemented, where added complexity is minimal and where existing efficiency strategies (like use of database indexes for normal search terms) can remain.

Even if a solution fits the above, we'd still need to evaluate the end result.
Early on I made use of MySQL fulltext indexes, which bring a level of their own fuzzy logic, but that kind of logic brings its own challenges & complexities. That led me to switch to the simpler & more predictable exact-prefix match we have now, rather than delve into exploring and supporting the various levels of fine-tuning, across different content and environments, that fuzzier searches tend to call for.
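For illustration, a minimal sketch of that difference, assuming a simplified search_terms table with term and entity_id columns (not the exact BookStack schema):

```sql
-- Fulltext matching (requires a FULLTEXT index on `term`) brings its own
-- relevance/fuzziness behaviour:
SELECT entity_id FROM search_terms
WHERE MATCH(term) AGAINST('linux' IN NATURAL LANGUAGE MODE);

-- Exact-prefix matching can use a plain B-tree index on `term`, keeping
-- lookups simple and predictable:
SELECT entity_id FROM search_terms
WHERE term LIKE 'linu%';
```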

For anything too complex (in terms of added technology/dependency requirements or logical implementation) I'd view it like LLMs, where I'd rather look to provide interfaces for external options instead of supporting it ourselves.

@KuramaSyu
Author

KuramaSyu commented Nov 17, 2024

@ssddanbrown could you tell me where I would need to look in the repo? I found the search directory, but I have no idea what the "start point" there is and where the SQL statements are built.

Since MySQL is used, I looked into what it offers. One option is MySQL's SOUNDEX, which determines whether two strings sound similar; the other is manually adding a MySQL function for Levenshtein distance. We would of course also need to check whether Levenshtein is fast enough and even feasible when comparing single words against full titles, which are naturally not similar.
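A quick sketch of the SOUNDEX idea (the search_terms table and columns here are simplified assumptions, not the actual schema):

```sql
-- Both spellings should produce the same phonetic code:
SELECT SOUNDEX('Linux'), SOUNDEX('Linxu');

-- SOUNDS LIKE is MySQL shorthand for comparing SOUNDEX() values:
SELECT entity_id FROM search_terms
WHERE term SOUNDS LIKE 'linxu';

-- Note: wrapping the column in a function means the normal index on `term`
-- cannot be used; an indexed generated column storing SOUNDEX(term) would be
-- one way to keep such lookups efficient.
```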

Another option would be using PostgreSQL, since it has a similarity function built in which works quite well. But I guess that isn't possible here.
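For reference, that PostgreSQL capability comes from the pg_trgm extension, roughly:

```sql
-- PostgreSQL only: trigram similarity from the pg_trgm extension,
-- returning a value between 0 and 1.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT similarity('Linux', 'Linxu');
```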

@ssddanbrown
Member

@KuramaSyu

Indexing is done here: https://github.com/BookStackApp/BookStack/blob/development/app/Search/SearchIndex.php
Searching is run here: https://github.com/BookStackApp/BookStack/blob/development/app/Search/SearchRunner.php

Logic Summary

During indexing, content for an entity (book/chapter/page) is split into words, with words reduced down to a score per entity per word; frequency and location (titles and headings are boosted, for example) impact that score. This is all stored in a search_terms table.
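As a rough illustration (hypothetical values and simplified column names rather than the exact schema), the stored rows for one page might look like:

```sql
-- One row per word per entity; each score already folds in frequency and
-- location boosts (e.g. title/heading occurrences score higher).
INSERT INTO search_terms (term, entity_type, entity_id, score) VALUES
  ('installing', 'page', 42, 5),   -- title word, boosted
  ('linux',      'page', 42, 7),   -- in the title and repeated in the body
  ('kernel',     'page', 42, 1);   -- single body occurrence
```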

When a search is performed, we split out normal search terms, then query against search_terms, summing the combined score (incoming term score and matched term score) for matches to use as an order/filter for the search results.

The incoming (search query) terms also have their own score adjustments made based on their frequency, to bump the score of less common words. This acts as a cheap, runtime approximation of tf–idf.
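A rough sketch of the query shape this implies (simplified; the real logic lives in SearchRunner.php and the exact score combination may differ):

```sql
-- For an incoming term like "linu", prefix-match against indexed terms and
-- sum scores per entity to order results. The 1.5 multiplier stands in for
-- the frequency-based adjustment applied to the incoming term.
SELECT entity_type, entity_id, SUM(score * 1.5) AS total_score
FROM search_terms
WHERE term LIKE 'linu%'
GROUP BY entity_type, entity_id
ORDER BY total_score DESC;
```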
