Update stopwords.txt #14075
base: main
Conversation
In Brazilian Portuguese, the contraction "em" (preposition) + <a|o|as|os> (article) takes the forms "na, nas, no, nos", which are common stop words. For some reason the "nas" contraction appears twice and "no" is nowhere to be found; I think it was probably a mistake. This pull request adds the missing word to the list and removes the duplicate.
While it's a simple change, it does change the analysis chain. I wonder if it should stick to Lucene 11 (admittedly, that will not be shipped for a LONG time). I wonder what others think about whether we should backport to 10.2?
Is there something I can help investigate further? I tried parsing the error and it seems to relate to a mn_MN dictionary that refers to Mongolian 🤔
I am unable to understand how my change could impact the analysis on the Mongolian language.
I don't understand the connection to your change either, but it looks to me as if Mongolian is coming in via the randomized test framework, which sets the active Java Locale to something random. If you were to capture the full "reproduce with" command line that is output by the test framework when there is a failure, we could tell if that's what's going on. Are you able to reproduce the failure?
I reran that check to see whether it reproduces. This "extra regressions" check is also doing an unpinned shallow clone. I guess it is also possible it could fail (e.g. network issue) in such a way that the clone is unsuccessful? It is unclear to me if it is really cloning with the git binary, or if it is doing some pure-Java (JGit etc.) clone that might be less robust. And yeah, it could also be a randomization issue such as locale/charset.
It fails again. Maybe the problem comes from LibreOffice/dictionaries@d169602?
And yeah, I see the commit date, but that's not the push date. So I suspect this issue has nothing to do with your PR and may fail all PRs until we address it.
You mean this?
I'm happy to try to debug this but it might be a few days. The issue may be with the REP rules in the referenced commit.
If the count (3619) is incorrect, then the parser might try to parse a rule (such as It could also be some other bug in the code, this is just what it looks like to me.
Yes, that's it. the
I will send them a one-liner PR explaining the situation, we can take it from there. We may want to separately try to be more lenient about this part of the parsing. Have not looked at the C implementation, but I'm guessing the numbers might be used as more of a "hint" to size hashtables or something, and not really a hard rule?
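To illustrate the strict-versus-lenient question raised above, here is a minimal Python sketch of parsing a count-prefixed REP block, where a header line `REP <count>` is followed by rule lines of the form `REP <from> <to>`. This is not Lucene's or Hunspell's actual parser; the function name `parse_rep_rules`, the `lenient` flag, and the sample lines are hypothetical and exist only to show how a wrong count can trip a strict reader while a lenient one treats the count as a hint.

```python
def parse_rep_rules(lines, lenient=False):
    """Parse a count-prefixed REP block (hypothetical sketch, not Lucene code).

    lines[0] is the header 'REP <count>'; each rule is 'REP <from> <to>'.
    strict:  read exactly <count> lines after the header, fail on a non-rule.
    lenient: treat <count> as a hint and read every consecutive rule line.
    Returns (declared_count, rules).
    """
    header = lines[0].split()
    if len(header) != 2 or header[0] != "REP" or not header[1].isdigit():
        raise ValueError(f"bad REP header: {lines[0]!r}")
    declared = int(header[1])

    rules = []
    for i, line in enumerate(lines[1:]):
        parts = line.split()
        is_rule = len(parts) == 3 and parts[0] == "REP"
        if lenient:
            if not is_rule:
                break  # the block ends where the rules end
        else:
            if i >= declared:
                break  # trust the declared count
            if not is_rule:
                raise ValueError(f"expected REP rule, got: {line!r}")
        rules.append((parts[1], parts[2]))

    if not lenient and len(rules) < declared:
        raise ValueError(f"declared {declared} rules, found {len(rules)}")
    return declared, rules


# Hypothetical input: the header overstates the number of rules by one.
sample = ["REP 3", "REP a b", "REP c d", "TRY abc"]
print(parse_rep_rules(sample, lenient=True))
```

With the sample input, the strict mode raises a `ValueError` when it reaches the `TRY` line, whereas the lenient mode stops at the first non-rule line and returns the two rules it found along with the declared count.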
Their contribution process seems to be elsewhere, but I could not understand it fully 😅 I tried sending the issue to their IRC channel, hope someone sees it.
Thank you @eusousu. I tried to make some progress; for now at least I have an open bug report: https://bugs.documentfoundation.org/show_bug.cgi?id=164366
I took a stab at hacking around this on our side as well: #14079
Description
In Brazilian Portuguese, the contraction "em" (preposition) + <a|o|as|os> (article) takes the forms "na, nas, no, nos", which are common stop words.
For some reason the "nas" contraction appears twice and "no" is nowhere to be found; I think it was probably a mistake.
This pull request adds the missing word to the list and removes the duplicate.
For reference, the same word can be found in the Portuguese stopwords.txt on line 34.
Fix #14065
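The kind of problem described here (one entry duplicated, another missing) can be checked mechanically. The following is a minimal Python sketch, not part of this pull request or of Lucene; the function name `audit_stopwords` and the four-word excerpt are hypothetical, and it assumes `#` starts a comment in the stopword file.

```python
from collections import Counter


def audit_stopwords(lines, expected):
    """Return (duplicates, missing) for a stopword list.

    lines:    raw lines of a stopword file; '#' is ASSUMED to start a comment.
    expected: words that should each appear in the list.
    """
    words = []
    for line in lines:
        word = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if word:
            words.append(word)

    counts = Counter(words)
    duplicates = sorted(w for w, n in counts.items() if n > 1)
    missing = sorted(w for w in expected if w not in counts)
    return duplicates, missing


# Hypothetical excerpt mirroring the state described above:
# "nas" is listed twice and "no" is absent.
before = ["na", "nas", "nas", "nos"]
print(audit_stopwords(before, ["na", "nas", "no", "nos"]))
# → (['nas'], ['no'])
```

Running the same check on a corrected list (`na`, `nas`, `no`, `nos`) returns two empty lists.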