
Update stopwords.txt #14075

Open · wants to merge 1 commit into main
Conversation

@eusousu eusousu commented Dec 17, 2024

Description

In Brazilian Portuguese, the contraction of "em" (preposition) with the articles "a", "o", "as", "os" takes the forms "na", "nas", "no", "nos", which are common stop words.

For some reason "nas" appears twice and "no" is nowhere to be found; I think it was probably a mistake.

This pull request adds the missing word to the list and removes the duplicate.

For reference, the same word can be found in the Portuguese stopwords.txt on line 34.

Fix #14065

@benwtrent benwtrent added this to the 11.0.0 milestone Dec 17, 2024
@benwtrent
Member

While it's a simple change, it does change the analysis chain. I wonder if it should stick to Lucene 11 (admittedly, that will not ship for a LONG time).

I wonder what others think about whether we should backport to 10.2?

@eusousu
Author

eusousu commented Dec 17, 2024

Is there something I can help investigate further? I tried parsing the error and it seems to relate to a mn_MN dictionary that refers to Mongolian 🤔

While checking /home/runner/work/lucene/lucene/lucene/analysis/common/build/hunspell-regressions/libreoffice/mn_MN/mn_MN.aff:
  2> java.lang.NumberFormatException: For input string: "ээв$"
  2> 	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  2> 	at java.base/java.lang.Integer.parseInt(Integer.java:662)
  2> 	at java.base/java.lang.Integer.parseInt(Integer.java:778)
  2> 	at org.apache.lucene.analysis.hunspell.Dictionary.parseNum(Dictionary.java:594)
  2> 	at org.apache.lucene.analysis.hunspell.Dictionary.readAffixFile(Dictionary.java:408)
  2> 	at org.apache.lucene.analysis.hunspell.Dictionary.<init>(Dictionary.java:264)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries$1.<init>(TestAllDictionaries.java:75)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries.loadDictionary(TestAllDictionaries.java:75)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries.lambda$testDictionariesLoadSuccessfully$7(TestAllDictionaries.java:165)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries.lambda$testDictionariesLoadSuccessfully$8(TestAllDictionaries.java:187)
  2> 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  2> 	at java.base/java.lang.Thread.run(Thread.java:1583)

I am unable to understand how my change could impact the analysis on the Mongolian language.

@msokolov
Contributor

I am unable to understand how my change could impact the analysis on the Mongolian language.

I don't understand the connection to your change either, but it looks to me as if Mongolian is coming in via the randomized test framework, which sets the active Java Locale to something random. If you were to capture the full "reproduce with" command line that is output by the test framework when there is a failure, we could tell if that's what's going on. Are you able to reproduce the failure?

@rmuir
Member

rmuir commented Dec 17, 2024

I reran that check to see if it reproduces.

This "extra regressions" check also does an unpinned shallow git clone of external dictionary repositories, so someone could (in theory) make a commit to a dictionary in those repos that causes a failure.

I guess it is also possible it could fail in such a way that the clone is unsuccessful (e.g. a network issue)? It is unclear to me whether it is really cloning with the git binary, or doing some pure-Java clone (jgit etc.) that might be less robust.

And yeah, it could also be a randomization issue such as locale/charset.

@rmuir
Member

rmuir commented Dec 17, 2024

It fails again. Maybe the problem comes from LibreOffice/dictionaries@d169602?

@rmuir
Member

rmuir commented Dec 17, 2024

And yeah, I see the commit date, but that's not the push date. So I suspect this issue has nothing to do with your PR and may fail all PRs until we address it.

@eusousu
Author

eusousu commented Dec 17, 2024

If you were to capture the full "reproduce with" command line that is output by the test framework

You mean this?

Reproduce with: gradlew :lucene:analysis:common:test --tests "org.apache.lucene.analysis.hunspell.TestAllDictionaries.testDictionariesLoadSuccessfully" -Ptests.jvms=2 -Ptests.jvmargs= -Ptests.seed=18B87D355DA5CF0F -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=default -Ptests.forceintegervectors=false

@rmuir
Member

rmuir commented Dec 17, 2024

I'm happy to try to debug this, but it might be a few days. The issue may be with the REP rules in the referenced commit.
The way these rules work is:

REP 3619
REP a а
REP c с
... (3617 more times)

If the count (3619) is incorrect, the parser might try to parse a rule (such as REP ээв$ ээ_вэ) as another count, leading to the NumberFormatException. Maybe the C hunspell parser is lenient, so nobody has noticed yet. In my experience these dictionaries often have such problems, hence the testing here.

It could also be some other bug in the code, this is just what it looks like to me.
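For illustration, here is a minimal sketch (my own toy example, not Lucene's actual Dictionary code) of how a strict reader that trusts the declared count ends up feeding a leftover REP rule line back into Integer.parseInt:

```java
import java.util.List;

// Sketch of a strict REP-table reader: it trusts the declared count, so any
// leftover "REP <from> <to>" line is later mistaken for a new "REP <count>" header.
public class RepCountSketch {
    // Parses the count from a "REP <count>" header line.
    static int declaredCount(String headerLine) {
        return Integer.parseInt(headerLine.split("\\s+")[1]);
    }

    public static void main(String[] args) {
        // Declared count (1) is smaller than the real number of rules (2).
        List<String> lines = List.of("REP 1", "REP a а", "REP ээв$ ээ_вэ");

        int count = declaredCount(lines.get(0)); // 1: only one rule is consumed
        // The affix reader then sees the leftover "REP ээв$ ээ_вэ" line,
        // treats it as a fresh header, and tries to parse "ээв$" as a number:
        try {
            declaredCount(lines.get(1 + count));
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage()); // For input string: "ээв$"
        }
    }
}
```

This reproduces the same NumberFormatException message seen in the CI log above.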

@rmuir
Member

rmuir commented Dec 17, 2024

Yes, that's it. The REP 3619 should be changed to REP 3621. I guess we could send the PR to LibreOffice, since the "upstream" dictionary looks totally different here: https://github.com/bataak/dict-mn/blob/main/mn_MN/mn_MN.aff#L49

@rmuir
Member

rmuir commented Dec 17, 2024

I will send them a one-liner PR explaining the situation; we can take it from there.

We may want to separately try to be more lenient about this part of the parsing. I have not looked at the C implementation, but I'm guessing the numbers might be used more as a "hint" to size hash tables or something, and not really a hard rule?
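One possible shape for that leniency (a hypothetical sketch, not Lucene's actual parser or a description of what the C implementation does): use the declared count only as a sizing hint, and keep consuming lines for as long as they still look like REP rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lenient REP-table reader: the declared count is only a hint for
// pre-sizing; parsing stops at the first line that is not a "REP <from> <to>" rule.
public class LenientRep {
    static List<String[]> parseRep(List<String> lines) {
        int hint = Integer.parseInt(lines.get(0).split("\\s+")[1]); // sizing hint only
        List<String[]> rules = new ArrayList<>(Math.max(hint, 0));
        for (int i = 1; i < lines.size(); i++) {
            String[] parts = lines.get(i).split("\\s+");
            if (parts.length != 3 || !parts[0].equals("REP")) break; // end of table
            rules.add(new String[] {parts[1], parts[2]});
        }
        return rules;
    }

    public static void main(String[] args) {
        // mn_MN-style input: the header undercounts, but both rules are kept
        // and parsing stops cleanly at the next (unrelated) directive.
        List<String> lines = List.of("REP 1", "REP a а", "REP ээв$ ээ_вэ", "SFX A Y 1");
        System.out.println(parseRep(lines).size()); // 2
    }
}
```

With this approach the miscounted mn_MN.aff would load instead of throwing, at the cost of silently accepting a malformed count.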

@rmuir
Member

rmuir commented Dec 17, 2024

@eusousu
Author

eusousu commented Dec 17, 2024

LibreOffice/dictionaries#46

Their contribution process seems to be elsewhere, but I could not fully understand it 😅

I tried sending the issue to their IRC channel; I hope someone sees it.

@rmuir
Member

rmuir commented Dec 18, 2024

Thank you @eusousu

I tried to make some progress, for now at least I have an open bug report: https://bugs.documentfoundation.org/show_bug.cgi?id=164366

@rmuir
Member

rmuir commented Dec 18, 2024

I took a stab at hacking around this on our side as well: #14079

Development

Successfully merging this pull request may close these issues.

Missing word on Brazillian stop word list
4 participants