Reconsider using exact name matching in some circumstances #2689

daveverwer · 2023-11-01T14:38:02Z

daveverwer
Nov 1, 2023
Maintainer

We have "exact name matching" enabled for our search results so that packages with names that would not match well with Postgres text matching are still easy to find.

This approach works great right up until the point that it doesn't work great. 😂 It solves the problem of searching for something you know exists, especially if that name is not a dictionary word and won't do well with free text search.

Unfortunately, it does not work so well when the name is also a dictionary word. For example, searching for networking, logger, or markdown brings back exact-match packages that are unmaintained or experimental above much higher-quality packages whose names are not an exact match.

I wonder if we should consider running search queries through a dictionary filter before applying exact match logic to them?

I have an idea how to do this without needing to embed (and parse/search) a full-blown dictionary into the app, which would be overkill for what we need. It could look something like this:

1. We combine a list of keywords with a list of package names to get the “SPI dictionary”, which is a more relevant list of the words we see package authors use. We might also consider splitting keywords on hyphens and underscores, and package names on the same plus camel-case boundaries.
2. We then intersect that list with an actual dictionary to find the common set of dictionary words that might exhibit the problem described above.

There’s a freely usable dictionary (with a very reasonable license) in WordNet, from Princeton University, and we could keep things fast by doing the queries offline and generating a source file so that the comparison would not impose much (if any) performance penalty on searches. We do a similar thing with the GitHub community sponsors. We could run a tool like that once a month to keep the word list up-to-date.
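As a rough illustration of the offline step described above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names `split_identifier` and `common_word_list`, the sample package names and keywords, and the tiny `english_words` set standing in for a real dictionary such as WordNet.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split on hyphens, underscores, whitespace, and camel-case boundaries."""
    parts: list[str] = []
    for chunk in re.split(r"[-_\s]+", name):
        # "SwiftMarkdown" -> ["Swift", "Markdown"], "URLSession" -> ["URL", "Session"]
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]


def common_word_list(package_names, keywords, english_words) -> list[str]:
    """Return the words used by package authors that are also dictionary words."""
    spi_words: set[str] = set()
    for term in list(package_names) + list(keywords):
        spi_words.add(term.lower())            # the whole name or keyword
        spi_words.update(split_identifier(term))  # plus its component words
    return sorted(spi_words & {w.lower() for w in english_words})


# Hypothetical sample data; a real run would read package metadata and a
# dictionary export, then write the result to a generated source file.
package_names = ["Networking", "swift-log", "Alamofire", "SwiftMarkdown"]
keywords = ["markdown", "logger", "http-client"]
english_words = {"networking", "logger", "markdown", "swift", "log", "client"}

word_list = common_word_list(package_names, keywords, english_words)
```

With this sample data, `word_list` is `['client', 'log', 'logger', 'markdown', 'networking', 'swift']`, while a non-dictionary name such as "alamofire" is left out and would keep its exact-match behaviour.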

I know this is straying into “you’re manipulating search results” territory, but there would be no choices being made by humans with this plan. The “manipulation” would only be based on package metadata. We’re also trying to create a search engine that gives good-quality results, and we are failing at that right now for quite common searches.

I’m open to all feedback, of course, but as grandmaster of the package search algorithm, I’d also love to hear if @heckj has any thoughts on this!

heckj · 2023-11-01T17:40:28Z

heckj
Nov 1, 2023
Collaborator Sponsor

👻 I HAVE BEEN SUMMONED!!! 👻

The idea is brilliant - and it matches a very common (OG) technique that's been used for years - "stopwords". There are words that have relatively little semantic meaning that appear SUPER frequently, and can "bog up" the search results for any sort of search implementation, but which are meaningless (on their own) to return (A, AND, THE kind of thing), so those have frequently been purged BEFORE an inverted index is created, removing them from impacting search results. As this went on, and search narrowed to specific domains (medical research), the same thing happened if there was an overabundance of certain terms - so people started adding stop words based on their local content indexes to improve the relevancy of the results. That's exactly what you're proposing here. The Postgres vector/search implementation even has some support for stopwords, and it looks like it's editable with some SQL commands to twiddle that dictionary configuration (built-in English stopwords already in place).

I'm not sure if it'll be more effective to adjust Postgres's STOPWORDS dictionary, or if it'll make more sense to just use a set of chosen stopwords as a filter on the exact-name-match portion, which is external to the ranking/relevancy thing. Since that in itself is an override on the plain ranking, I suspect filtering just that structure based on a set of keywords would make a lot more sense (and be a LOT easier to maintain).
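To make the second option concrete, here is a minimal Python sketch of a stopword filter applied only to the exact-name-match step, leaving the underlying ranking untouched. The names `COMMON_WORDS`, `should_apply_exact_match`, and `order_results` are hypothetical, and the word list simply reuses the three examples from the top of this thread.

```python
# Hypothetical, hand-picked list; in practice this might be generated offline.
COMMON_WORDS = frozenset({"networking", "logger", "markdown"})


def should_apply_exact_match(query: str, common_words=COMMON_WORDS) -> bool:
    """Skip the exact-match override when the query is a common word."""
    normalized = query.strip().lower()
    return bool(normalized) and normalized not in common_words


def order_results(query: str, ranked_packages: list[str]) -> list[str]:
    """ranked_packages is assumed to be ordered by text-search relevance already."""
    if not should_apply_exact_match(query):
        return list(ranked_packages)
    q = query.strip().lower()
    exact = [p for p in ranked_packages if p.lower() == q]
    rest = [p for p in ranked_packages if p.lower() != q]
    return exact + rest  # exact name matches first, everything else unchanged
```

In this sketch, a query for "Alamofire" still moves the exact match to the front, while a query for "markdown" leaves the relevance ordering as it came back from the text search.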

Big question to the theory - how to choose the stopwords. I suspect you've got a pretty good list in your head already (maybe just the three examples above), but aside from explicit user feedback about the relevancy of the searches, it's a pretty manual task to know if you're serving good or bad results, and is super-heavily influenced by how people search and what terms they use. In an ideal world, we'd do some analysis on the poor relevancy results to consider additional edits to the StopWords filter (or any other tweaks that would appear to be needed for better relevancy).

Google used "click through" early on (maybe still? Not sure where they've gone these days) as a statistical feedback measure. It might be a lot of additional overhead that's not worth the time and effort - but if we're going to keep running with search as a core feature, it would be worth building in some logistics to get some measures (appropriately anonymized, of course) that could lead us to where we need to adjust. The parts that would be valuable here are the search terms captured, paired with the results supplied. Potentially even asking people to assert ("Nope, I didn't want to see this") in the results list to get that explicit feedback. (It'll be better to capture what was presented rather than re-running the search queries alone later, as the index is always growing and changing, so it's hard to get a useful snapshot in time using the search terms a couple of weeks or a month later.)
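One possible shape for such a record, sketched in Python under stated assumptions: the `SearchSnapshot` type, its field names, and the JSON-lines format are all invented for illustration. The record stores the query, the results in the order shown, and any explicit "not what I wanted" flags, and deliberately holds no user identifier.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SearchSnapshot:
    query: str
    results: tuple[str, ...]        # package identifiers, in the order presented
    rejected: tuple[str, ...] = ()  # results the searcher flagged as unwanted
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(snapshot: SearchSnapshot) -> str:
    """Serialise one snapshot as a single JSON line for later review."""
    return json.dumps(asdict(snapshot), sort_keys=True)
```

Because the results are stored at search time, a later relevancy review sees what was actually shown, even if the index has changed since.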

Relevancy scores are, unfortunately, subjective - so we're talking about a review by some human, somewhere in the pipeline, to maintain this sort of thing. When I worked at a media search engine, it was a regular weekly task for some of the engineers - we'd generate a random collection of the paired search terms and results, and rate them (using the same people again and again - quite intentionally) and then base our changes on the monthly fluctuations in the relevancy ranking reports those folks created each week. Needless to say, it was time intensive - and that was with a company of ~65 people (not 2) whose sole purpose was building and maintaining the search index.

0 replies

finestructure · 2023-11-06T09:22:15Z

finestructure
Nov 6, 2023
Maintainer

I'm not convinced this is fixable via dictionaries/stop words unless we do a lot of SPI-specific curation. For example, one of the reasons we introduced this change was the "Publish" package. I'm sure the word "publish" would feature in any dictionary or stop word list, and so we'd effectively undo that original change.

Maybe it's worth undoing it, but I think that's what this needs to be weighed against. If so, perhaps the better (certainly easier) solution is to simply remove the "boost to top" aspect of the exact match.

However, I'd also offer another, simpler solution. Right now we boost an exact name match straight to the top of the results. I think instead we could give it a score boost. That would still ensure that exact matches have a good chance to go to the top compared against "score peers", or at least pull it to the first page if it's a bit lacking in terms of score without bringing it all the way to the front (depending on the value we assign, of course).
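A minimal Python sketch of the difference, assuming an additive boost on top of a relevance score. The function name `rank`, the constant `EXACT_MATCH_BOOST` and its value, and all package names and scores in the example are invented; none of this reflects the actual search implementation.

```python
EXACT_MATCH_BOOST = 0.25  # hypothetical value; would need tuning against real data


def rank(query: str, scored: list[tuple[str, float]]) -> list[str]:
    """scored holds (package name, relevance score) pairs in any order."""
    q = query.strip().lower()

    def boosted(item: tuple[str, float]) -> float:
        name, score = item
        # An exact name match gets extra score instead of an automatic first place.
        return score + (EXACT_MATCH_BOOST if name.lower() == q else 0.0)

    return [name for name, _ in sorted(scored, key=boosted, reverse=True)]


# Invented scores: a weak exact match moves up, but not past a much stronger result.
weak_exact = rank("markdown", [("BigMarkdownLib", 0.90), ("Markdown", 0.40), ("OtherParser", 0.60)])

# Invented scores: among "score peers", the exact match wins.
peer_exact = rank("publish", [("PublishKit", 0.80), ("Publish", 0.70)])
```

Here `weak_exact` is `['BigMarkdownLib', 'Markdown', 'OtherParser']` and `peer_exact` is `['Publish', 'PublishKit']`, which is the behaviour described above: the size of the boost decides how far an exact match can climb.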

2 replies

daveverwer Nov 7, 2023
Maintainer Author

> However, I'd also offer another, simpler solution. Right now we boost an exact name match straight to the top of the results. I think instead we could give it a score boost. That would still ensure that exact matches have a good chance to go to the top compared against "score peers", or at least pull it to the first page if it's a bit lacking in terms of score without bringing it all the way to the front (depending on the value we assign, of course).

This is a great idea, and should be our first step with this.

heckj Nov 7, 2023
Collaborator Sponsor

I think even for score boosting/fusion/manipulation, you're likely going to have to do some specific SPI curation. It's the nature of the beast for a relatively limited set of data (and you're not in any way unique in this problem). Removing the boost on direct name match will help in a majority of cases, but it'll raise up others that get harder to find.

The fusion option (boosting the postgres FTS ranking score based on outside data) is totally possible, but it'll take some "magic number" tuning to get it so that it has some influence, but not so much that you'll be really out of whack.
