Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect altnames that are a substring of name.default #548

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

orangejulius
Copy link
Member

@orangejulius orangejulius commented Feb 4, 2021

This change is an attempt to mitigate scoring penalties applied to documents with alternate names (#507).

It handles the case where an alt name is merely a substring of the main name, for example on the Union Square subway stop in OSM:

image

Alt names like this don't add much value: they don't allow searching on any new terms, but do throw off the scoring. Even when we fix the scoring issue, duplicate alt names that add no value still take up space, so this change should be useful for a while.

The change comes in 2 parts, each in their own commit:

  • what is intended to be a pure refactoring of the tag mapping logic. Rather than loop through all tags in whatever order they come, start with the tags that can go into name.default, then the other name.* fields, and finally the rest. This makes it easier to write logic that looks for duplicates
  • the actual change to check for the substring match, which is quite small

I'd be happy to extend this in the future with other near-identical alt names, such as handling Mc Donalds vs McDonalds or ignoring quotes or other special characters like in pelias/api#1488.

@@ -892395,8 +892389,7 @@
"_type": "_doc",
"data": {
"name": {
"default": "IPOH Asian House"
},
"default": "IPOH Asian House" },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-issue: this is weirdly indented

Copy link
Member Author

@orangejulius orangejulius Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, oops! My fault! I had a try at editing the fixture manually since there were only a few changes.

Should have left it to the machines.

Copy link
Member

@missinglink missinglink Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, it's really not an issue, I just noticed it a few times in the PR, like the trim( tags[key])) catches my eye.

the Pelias code styling used to be more hit-and-miss, I've been using autoformatting in my editor for a while now which hopefully fixes a bunch of the little ones.

at some point I'd still love to fully adopt standardJS.

This makes it easier to add custom logic by working through the tags in
a specified order.
This handles the case where one alt name is a substring fully contained
in another.
@orangejulius
Copy link
Member Author

I came across this PR today and wanted to see if it still made a difference, so I've rebased it and kicked off a planet build to test things out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants