
Decompound Words For German Language #81

Open
keywan-ghadami-oxid opened this issue Jan 31, 2022 · 15 comments
Labels
enhancement New feature or request

Comments

@keywan-ghadami-oxid

In the German language it is common to combine nouns without whitespace.
e.g.

apple => Apfel
tree => Baum
apple tree => Apfelbaum (no white space between the two words)

That said, searching for "Baum" (tree) should also return a hit for the apple tree. If there are documents with "Baum" and "Apfelbaum", users may expect the document with "Baum" to be ranked higher, but they also expect to find "Apfelbaum" within the results.

In Elasticsearch there is a HyphenationCompoundWordTokenFilter that splits words using a hyphenation ruleset and a word list. The hyphenation ruleset helps to avoid splitting words in the wrong places and may speed up the search for words within other words.

In any case, any simple tokenizer that uses a word list to split compound words would help a lot.
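
As a rough sketch of that idea, a greedy longest-match splitter over a word list could look something like the following (plain Rust against a tiny hypothetical dictionary; this is not an existing lnx or Elasticsearch API, and a real implementation would need backtracking and a proper dictionary):

use std::collections::HashSet;

/// Greedy longest-match decompounding against a word list.
/// Returns None if the input cannot be fully covered by dictionary words.
fn decompound(word: &str, dict: &HashSet<&str>) -> Option<Vec<String>> {
    let lower = word.to_lowercase();
    let mut rest = lower.as_str();
    let mut parts = Vec::new();

    while !rest.is_empty() {
        // Candidate end positions on char boundaries, longest prefix first.
        let end = rest
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .rev()
            .find(|&end| dict.contains(&rest[..end]))?;
        parts.push(rest[..end].to_string());
        rest = &rest[end..];
    }
    Some(parts)
}

fn main() {
    let dict: HashSet<&str> = ["apfel", "baum", "nuss", "schale"].into_iter().collect();
    // "Apfelbaum" -> Some(["apfel", "baum"]), so a query for "baum" can hit it.
    println!("{:?}", decompound("Apfelbaum", &dict));
    // No full dictionary cover -> None; keep the original token as-is.
    println!("{:?}", decompound("Zierleiste", &dict));
}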

@ChillFish8 ChillFish8 added the enhancement New feature or request label Jan 31, 2022
@ChillFish8
Collaborator

ChillFish8 commented Jan 31, 2022

This should actually already be done to some extent via the fast-fuzzy system (although this could definitely be added to the traditional fuzzy system).

In theory, the system should split apfelbaum into apfel baum provided the documents contain a high frequency count of apfel or baum and not apfelbaum (if it matches exactly, the system won't try to segment it), unless something else is closer / occurs more commonly.

The second part of the idea you mentioned, i.e. joining parts back up, isn't currently implemented automatically, although this is partially possible via synonyms by adding apfel:apfelbaum as a mapping. I can see the possible use for doing this automatically (although maybe this could be an opt-in feature and/or only done when the system has started to run out of options with the original method).

You can also manually tell the system to segment while keeping the original word by adding the synonym mapping apfelbaum:apfel,baum.

@keywan-ghadami-oxid
Author

The fast-fuzzy system without synonyms does not give good results in my tests. Using the synonyms the way you suggested also didn't work for me:

Searching for "Baum" in a dataset containing "Apfelbaum" only returned the documents with the exact match "Baum":

{
    "query": [
        {
            "fuzzy": {
                "ctx": "Baum"
            },
            "occur": "must"
        }
    ]
}

with synonyms (already added before indexing)

{"apfelbaum":["baum","apfel"],"apfelschale":["schale","apfel"],"nussbaum":["baum","nuss"]}

Using synonyms the other way round improved the results a lot:
Apfel,Baum:Apfelbaum

That way, searching for Baum or Apfel also returns the document "Apfelbaum" as a result.

search:

{
    "query": [
        {
            "fuzzy": {
                "ctx": "Baum"
            },
            "occur": "must"
        }
    ]
}

with synonyms:
{"schale":["apfelschale"],"apfel":["apfelschale","apfelbaum"],"baum":["apfelbaum","nussbaum"],"nuss":["nussbaum"]}

result:

with score:410.13638 title:Apfelbaum
with score:410.13638 title:Nussbaum
with score:410.13638 title:Baum
with score:23.718506 title:Zierleiste

Sadly, in that case "Apfelbaum" ranks before the exact match, which I think could be improved in LNX, because an exact match should always rank higher than synonym matches.

But even if this could be improved, I am not sure how to create these synonyms automatically.

I guess without this feature LNX cannot be used with the German language to produce good results.

@ChillFish8
Collaborator

Yeah, I think we should do some sort of boost adjustment to have the system display the results better.

Searching for "Baum" in a dataset containing "Apfelbaum" only returned the documents with the exact match "Baum"

This is quite a difficult problem to solve in terms of deciding how the system should match close terms, or whether we should try something more radical like modifying the tokenized text in such a way that it matches those sorts of things.
But for now I'm not sure how easy improving it would realistically be. I don't think you'll match Apfelbaum for the term Baum in many existing systems right now anyway, because most are prefix-based (although I'm not saying that's ideal).

@ChillFish8
Collaborator

I am not sure how to create these synonyms automatically.

That is ultimately the biggest issue plaguing relevancy with stuff like this: how and when should things be considered 'similar'/related without human intervention?

@michaelgrigoryan25

This is the case with some Armenian words too.

Maybe we could try using something like Hugging Face Tokenizers or Hugging Face Transformers and use NLP models (in this case stop words) distributed online, instead of creating new ones from scratch (and if they do not exist, we would create them separately for each missing language, of course)? There is a library I found called rust_bert, which is essentially a port of Hugging Face's official transformer API to Rust, and it seems to be very well maintained.

This might heavily impact the performance though. What do you think?

@ChillFish8
Collaborator

Hugging Face has all their tokenizers in Rust to begin with; it's what the backbone of their system is written in.

But yes, I'm slightly concerned about the performance impact it would have (which I imagine is a lot).

This is the case with some Armenian words too.

Armenian words also suffer from the system having slightly conflicting methods of normalizing the Unicode right now, so it's largely trying its hardest, but it's difficult when two systems are potentially producing different conversions (for fast-fuzzy, that is).
This issue also applies to CJK languages.

@ChillFish8
Collaborator

I think the tokenizing system could do with being experimented with to try to improve relevancy in places, especially for non-whitespace-separated languages.

@michaelgrigoryan25

I see. Let's say that .. (or something similar) is an optional whitespace specifier. So, in the stop words file, we would just need to specify apfel..baum and let the tokenizer do its thing.

What do you think about drafting a protocol for this? Or is there no need to do so?
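
Purely as a sketch of how such an entry might be interpreted (assuming ".." as the separator; nothing like this exists in lnx today):

/// Illustration only: "apfel..baum" marks an optional word boundary,
/// so a single entry yields both the compound form and its parts.
fn expand_entry(entry: &str) -> (String, Vec<String>) {
    let parts: Vec<String> = entry.split("..").map(|p| p.to_string()).collect();
    let compound = parts.concat();
    (compound, parts)
}

fn main() {
    let (compound, parts) = expand_entry("apfel..baum");
    assert_eq!(compound, "apfelbaum");
    assert_eq!(parts, vec!["apfel", "baum"]);
}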

@ChillFish8
Collaborator

That would be an idea, although I'm not sure how well it would work in practice; if you have hundreds of these words, it's going to get pretty tedious.
I think we should start by trying to improve the automatic tokenization first before doing manual special-casing (although that would certainly still be a good idea).

@ChillFish8
Collaborator

ChillFish8 commented Feb 1, 2022

@keywan-ghadami-oxid @michaelgrigoryan25 I have an experimental branch relevancy-tests if you fancy trying it out and giving feedback on how the relevancy is in your respective languages. Although it's slightly out of tune atm, it might be a possible solution.

This does tank the English relevancy however.

@michaelgrigoryan25

@keywan-ghadami-oxid @michaelgrigoryan25 I have an experimental branch relevancy-tests if you fancy trying it out and giving feedback on how the relevancy is in your respective languages. Although it's slightly out of tune atm, it might be a possible solution.

This does tank the English relevancy however.

Sure!

@keywan-ghadami-oxid
Author

Your change improves the relevance a lot. I did not need any synonym settings, and I have already done a lot of testing.
One downside of the change is that indexing is now a lot slower; maybe which tokenizer to use should be configurable.
Anyway, thank you already for this great improvement.

@ChillFish8
Collaborator

Yeah... It is considerably slower and also affects the English relevancy quite negatively, although it seems like a step in the right direction.

@keywan-ghadami-oxid
Author

I guess indexing all n-grams is quite heavy and produces much more than it needs (sometimes it's even bad). I was wondering whether using a Rust hyphenation library to first find good positions to split the words, in combination with matching the tokens against a language-specific dictionary, could drastically reduce the number of tokens. See this:
https://github.com/uschindler/german-decompounder

Another (maybe stupid or genius) idea would be to use the word list (together with some stemming and hyphenation rules) to build one giant regular expression that can split any string into words in linear time.
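
As a sketch of that regex idea, assuming the regex crate and a tiny hard-coded word list (a real version would generate the alternation from the full dictionary plus stemming/hyphenation rules; the regex crate scans in linear time, with no backtracking):

use regex::Regex;

fn main() {
    // The regex crate uses leftmost-first alternation, so longer dictionary
    // words must come before their prefixes in the pattern.
    let mut words = vec!["apfel", "baum", "nuss", "schale"];
    words.sort_by_key(|w| std::cmp::Reverse(w.len()));

    let pattern = format!("(?i)({})", words.join("|"));
    let splitter = Regex::new(&pattern).expect("valid pattern");

    // Scan the compound once and emit every dictionary word it contains.
    let parts: Vec<&str> = splitter
        .find_iter("Apfelbaum")
        .map(|m| m.as_str())
        .collect();
    println!("{:?}", parts); // ["Apfel", "baum"]
}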

@ChillFish8
Collaborator

ChillFish8 commented Feb 4, 2022

So I've worked out a system that should work about as well as we can do in reality without compromising on relevancy for other languages or on indexing time.

Hopefully I should have a test system available soon, but for the most part it brings prefix searching for free, with no additional overhead over what already exists in lnx, i.e. apple will match appletree. It also adds the ability to opt into suffix support. The reason this is opt-in is that it adds additional memory load when processing a commit and can potentially increase commit times (although not by much); if enabled, it allows you to match appletree when you search for the query tree.

Note there's one real caveat:
Prefix and suffix search only works for words that are under 7 characters long. In theory this can be increased, but it comes at the cost of additional processing time, slower searches, and less relevancy for most other words. I.e. words like apple/Apfel will match appletree/Apfelbaum respectively, but a word like wonderer won't get a prefix search, potentially missing terms like wonderers. Realistically though, most words where you would have a prefix situation probably fit in under 6 characters. The same logic applies to suffix searching.

The plus side, though, is that this is essentially free for us to use (minus some commit time for suffix search) and should hopefully drastically improve relevancy for situations like this.

Note: when I say it will increase commit time, I mean it goes from about 8s to process 400,000 unique words to 17s when suffix processing is enabled. (400,000 unique words is roughly a ~5 million document database of arbitrary user data.)
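
Purely to illustrate the caveat above (this is not lnx's actual implementation, and the cap value is just the one mentioned in this comment), the length-capped prefix/suffix check being described amounts to something like:

/// Illustration only: a query term is treated as a prefix/suffix candidate
/// only if it is short enough; the cap keeps the extra matching cheap.
const MAX_AFFIX_LEN: usize = 7;

fn prefix_match(query: &str, indexed: &str) -> bool {
    query.chars().count() < MAX_AFFIX_LEN && indexed.starts_with(query)
}

fn suffix_match(query: &str, indexed: &str) -> bool {
    query.chars().count() < MAX_AFFIX_LEN && indexed.ends_with(query)
}

fn main() {
    assert!(prefix_match("apfel", "apfelbaum"));     // "apfel" is short enough
    assert!(suffix_match("baum", "apfelbaum"));      // suffix support, if opted in
    assert!(!prefix_match("wonderer", "wonderers")); // 8 chars: over the cap
}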
