Match based on query with tags/placeables removed #1967

Open
transl8bzimport opened this issue Jul 1, 2011 · 3 comments
Comments

@transl8bzimport

Originally posted by Marce van Velden:

Consider the following source text:
This is a sample text with a link

If this is sent to the tmserver, I would like it to search for "This is a sample text with a link" (the source without tags) and apply a quality penalty if the tags don't match.
For example, the tmserver might have a match like:
This is a sample text with a link

This requires saving both the complete source text and the source text with tags removed in the TM database.

What do you think about this? I made a sample implementation of this for tmserver (sqlite) in the past, and it worked perfectly for us. One open question, though, is which tags/placeables to filter out. Possibly we could make this configurable.
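To make the two-column idea concrete, here is a minimal sketch. It is not the sample implementation mentioned above, only an illustration under assumptions: the table name `units`, the column `source_stripped`, the pattern `TAG_RE`, and the helper names are all hypothetical, and the regex stands in for whatever configurable tag/placeable rule is chosen.

```python
import re
import sqlite3

# Hypothetical rule for what counts as a tag; a real setup would make this configurable.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove tags and collapse any whitespace left behind."""
    return " ".join(TAG_RE.sub("", text).split())


def create_db():
    """Create an in-memory TM table holding both the full and the stripped source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE units ("
        " source TEXT NOT NULL,"
        " source_stripped TEXT NOT NULL,"
        " target TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX units_stripped_idx ON units (source_stripped)")
    return conn


def add_unit(conn, source, target):
    """Store a unit together with its tag-free source."""
    conn.execute(
        "INSERT INTO units VALUES (?, ?, ?)",
        (source, strip_tags(source), target),
    )


def lookup(conn, query):
    """Return (source, target, tags_match) for units whose stripped source
    equals the stripped query."""
    rows = conn.execute(
        "SELECT source, target FROM units WHERE source_stripped = ?",
        (strip_tags(query),),
    ).fetchall()
    query_tags = TAG_RE.findall(query)
    return [(src, tgt, TAG_RE.findall(src) == query_tags) for src, tgt in rows]
```

A caller would apply the quality penalty whenever `tags_match` is false. This sketch only does exact matching on the stripped text; fuzzy matching would still need a distance measure on top.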

@friedelwolff
Member

I have similar ideas that I want to investigate, but haven't had time to realise any of them yet.

One possibility that someone mentioned is to filter out certain token types when indexing. We already filter out tokens with tokid=12. I don't know what the best reference is, but you can have a look here:

http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html
and in tmdb.py you can look for "tokid".

We will need to work out how to take this into account for the weighting. For a start, we can still just use the normal Levenshtein distance as we do now, or consider weighted averages of different runs on the full text vs. the reduced/stripped version, or maybe use the database rank to affect our own rank. I don't know if this makes sense, so feel free to discuss further :-)
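As a rough illustration of the weighted-average option, assuming a simple regex as the tag rule and a 0 to 100 similarity derived from the Levenshtein distance. The names `weighted_quality` and `stripped_weight` and the default weight of 0.7 are made up for this sketch and are not taken from tmdb.py:

```python
import re

# Hypothetical tag pattern; which tokens to strip is still an open question.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove tags and collapse any whitespace left behind."""
    return " ".join(TAG_RE.sub("", text).split())


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Similarity in percent: 100 for identical strings, 0 for nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def weighted_quality(query, candidate, stripped_weight=0.7):
    """Blend the similarity of the stripped texts with that of the full texts."""
    full = similarity(query, candidate)
    reduced = similarity(strip_tags(query), strip_tags(candidate))
    return stripped_weight * reduced + (1 - stripped_weight) * full
```

With this blend, two strings that differ only in their tags score below 100 but above the plain full-text similarity, so the tag difference acts as a soft penalty whose size depends on how long the tags are.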

@transl8bzimport
Author

Originally posted by Marce van Velden:

I think we should use the Levenshtein distance based on the stripped versions of source and target, and have a fixed but configurable penalty for tag/token/placeable mismatches, applied when the stripped sources are equal but the tags/tokens/placeables do not match.
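Roughly what I have in mind, as a sketch only. The regex tag rule, the name `penalised_quality`, and the default penalty of 5 points are placeholders for illustration; the sketch also subtracts the penalty on any tag mismatch, not only when the stripped sources are identical, which is an assumption on top of the proposal:

```python
import re

# Hypothetical tag pattern; the real rule would come from configuration.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove tags and collapse any whitespace left behind."""
    return " ".join(TAG_RE.sub("", text).split())


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def penalised_quality(query, candidate, tag_penalty=5.0):
    """Score the stripped texts, then subtract a fixed penalty if the tags differ."""
    q, c = strip_tags(query), strip_tags(candidate)
    longest = max(len(q), len(c))
    quality = 100.0 if longest == 0 else 100.0 * (1 - levenshtein(q, c) / longest)
    if TAG_RE.findall(query) != TAG_RE.findall(candidate):
        quality -= tag_penalty
    return max(quality, 0.0)
```

Unlike a weighted average, the penalty here does not depend on how long the tags are, so a one-character tag change and a long attribute change cost the same.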

@friedelwolff
Member

Yes, I agree. I think the main issue will be figuring out which tags/placeables we consider relevant. Doing a weighted average of Levenshtein distances on the full text and the reduced text is one way of applying the penalties, I guess.

For the implementation I guess we might want to either keep both the original and reduced version in the database, or simply rely on the database rank more. The value from the ranking function is not easy to work with directly (as in, its inherent meaning isn't obvious), but we mostly get it for free, as far as I know.
