Reconsider using exact name matching in some circumstances #2689
-
👻 I HAVE BEEN SUMMONED!!! 👻
The idea is brilliant - and it matches a very common (OG) technique that's been used for years - "stopwords". There are words with relatively little semantic meaning that appear SUPER frequently and can "bog up" the results of any sort of search implementation, but which are meaningless (on their own) to return.
I'm not sure if it'll be more effective to adjust Postgres's STOPWORDS dictionary, or if it'll make more sense to just use a set of chosen stopwords as a filter on the exact-name-match portion, which is external to the ranking/relevancy thing. Since that in itself is an override on the plain ranking, I suspect filtering just that structure based on a set of keywords would make a lot more sense (and be a LOT easier to maintain).
Big question for the theory - how to choose the stopwords. I suspect you've got a pretty good list in your head already (maybe just the three examples above), but aside from explicit user feedback about the relevancy of the searches, it's a pretty manual task to know if you're serving good or bad results, and it's super-heavily influenced by how people search and what terms they use. In an ideal world, we'd do some analysis on the poor-relevancy results to consider additional edits to the stopwords filter (or any other tweaks that appear to be needed for better relevancy).
Google used "click through" early on (maybe still? Not sure where they've gone these days) as a statistical feedback measure. It might be a lot of additional overhead that's not worth the time and effort to screw around with - but if we continue to run with search as a core feature, it could be worth building in some logistics to get some measures (appropriately anonymized, of course) that could lead us to where we need to adjust. The parts that would be valuable here are the search terms captured, paired with the results supplied. We could potentially even ask people to flag entries in the results list ("Nope, I didn't want to see this") to get that explicit feedback. (It'll be better to capture the results presented rather than just re-running the search queries later, as the index is always growing and changing, so it's hard to get a useful snapshot in time using the search terms a couple of weeks or a month later.)
Relevancy scores are, unfortunately, subjective - so we're talking about a review by some human, somewhere in the pipeline, to maintain this sort of thing. When I worked at a media search engine, it was a regular weekly task for some of the engineers - we'd generate a random collection of the paired search terms and results, rate them (using the same people again and again - quite intentionally), and then base our changes on the monthly fluctuations of the relevancy ranking reports those folks created each week. Needless to say, it was time intensive - and that was with a company of ~65 people (not 2) whose sole purpose was building and maintaining the search index.
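To make the "capture the terms paired with the results" idea a bit more concrete, here is a rough Python sketch. Everything in it (`SearchSnapshot`, `capture`, `sample_for_review`, the in-memory list used as a log) is a hypothetical name invented for illustration, not anything that exists in the project. It stores each query together with the results as they were shown at the time, stores no user identifiers, and draws a random sample for a human to rate.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SearchSnapshot:
    """One search as it was served. Deliberately holds no user identifiers."""
    query: str
    result_names: tuple   # package names in the order they were shown
    captured_at: datetime


def capture(log, query, result_names, now=None):
    """Record the query together with the results shown at that moment."""
    timestamp = now or datetime.now(timezone.utc)
    log.append(SearchSnapshot(query, tuple(result_names), timestamp))


def sample_for_review(log, k, seed=None):
    """Pick up to k snapshots at random for a human reviewer to rate."""
    rng = random.Random(seed)
    return rng.sample(log, min(k, len(log)))
```

Because the snapshot keeps the results themselves, a reviewer sees what the searcher saw, even if the index has changed since. A real version would need persistent storage and a retention policy, which this sketch leaves out.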
-
I'm not convinced this is fixable via dictionaries/stop words unless we do a lot of SPI-specific curation. For example, one of the reasons we introduced this change was the "Publish" package. I'm sure the word "publish" would feature in any dictionary or stop word list, and so we'd effectively undo that original change. Maybe it's worth undoing, but that's what this needs to be weighed against. If it is, perhaps the better (certainly easier) solution is to simply remove the "boost to top" aspect of the exact match.
However, I'd also offer another, simpler solution. Right now we boost an exact name match straight to the top of the results. I think we could instead give it a score boost. That would still give exact matches a good chance of going to the top when compared against "score peers", or at least pull a match onto the first page if its score is a bit lacking, without bringing it all the way to the front (depending on the value we assign, of course).
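As a rough illustration of the difference between the two behaviours, here is a small Python sketch. The `Result` type, the function names, the scores, and the boost value of 0.25 are all assumptions made up for the example. The real ranking lives in the Postgres query, and this is not how it is implemented.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    score: float  # relevance score from the text search; higher is better


def rank_pin_to_top(results, query):
    """Boost-to-top as described in this thread: exact name matches sort first."""
    q = query.casefold()
    return sorted(results, key=lambda r: (r.name.casefold() != q, -r.score))


def rank_with_boost(results, query, boost=0.25):
    """Alternative: add a fixed bonus to exact name matches, then sort by score."""
    q = query.casefold()

    def boosted(r):
        return r.score + (boost if r.name.casefold() == q else 0.0)

    return sorted(results, key=boosted, reverse=True)
```

With a low-scoring exact match (say 0.10 against a peer at 0.90), the boost is not enough to reach the top. With a close peer (0.60 against 0.70), the exact match wins. Picking the boost value is the part that would need tuning.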
-
We have "exact name matching" enabled for our search results so that packages with names that would not match well with Postgres text matching are still easy to find.
This approach works great right up until the point that it doesn't work great. 😂 It solves the problem of searching for something you know exists, especially if that name is not a dictionary word and won't do well with free text search.
Unfortunately, it does not work so well when the name is also a dictionary word. For example, searching for networking, logger, or markdown brings back exact-match packages that are unmaintained or experimental above much higher-quality matches that don't have an exact match on the package name.
I wonder if we should consider running search queries through a dictionary filter before applying exact match logic to them?
I have an idea how to do this without needing to embed (and parse/search) a full-blown dictionary into the app, which would be overkill for what we need. It could look something like this:
WordNet, from Princeton University, is a freely usable dictionary (with a very reasonable license), and we could keep things fast by doing the dictionary lookups offline and generating a source file, so that the comparison would not impose much (if any) performance penalty on searches. We do a similar thing with the GitHub community sponsors. We could run a tool like that once a month to keep the word list up-to-date.
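A minimal sketch of that two-step idea, written in Python purely for illustration: an offline step renders a word list into a generated source file, and a cheap set lookup at search time decides whether exact-match logic applies. The names `generate_word_list_source`, `DICTIONARY_WORDS`, and `should_apply_exact_match` are hypothetical, and how the words would be extracted from WordNet is not shown.

```python
def generate_word_list_source(words):
    """Offline step: render a source file that defines the word set."""
    cleaned = sorted({w.strip().casefold() for w in words if w.strip()})
    lines = [
        "# Generated file, do not edit by hand.",
        "DICTIONARY_WORDS = frozenset([",
    ]
    lines += [f"    {w!r}," for w in cleaned]
    lines.append("])")
    return "\n".join(lines) + "\n"


def should_apply_exact_match(query, dictionary_words):
    """Search-time step: skip exact-name matching for plain dictionary words."""
    term = query.strip().casefold()
    return bool(term) and term not in dictionary_words
```

The search-time cost is a single set membership check. Note that a multi-word query never appears in a list of single words, so it would still get exact-match treatment under this sketch.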
I know this is straying into “you’re manipulating search results” territory, but there would be no choices being made by humans with this plan. The “manipulation” would only be based on package metadata. We’re also trying to create a search engine that gives good-quality results, and we are failing at that right now for quite common searches.
I’m open to all feedback, of course, but as grandmaster of the package search algorithm, I’d also love to hear if @heckj has any thoughts on this!