
Update stopwords.txt #14075

Open · wants to merge 1 commit into main
Conversation

@eusousu eusousu commented Dec 17, 2024

Description

In Brazilian Portuguese, the contraction of "em" (preposition) with the articles "a", "o", "as", "os" takes the forms "na", "nas", "no", "nos", which are common stop words.

For some reason "nas" appears twice and "no" is nowhere to be found; I think it was probably a mistake.

This pull request adds the missing word to the list and removes the duplicate.

For reference, the same word can be found in the Portuguese stopwords.txt on line 34.

Fix #14065

@benwtrent benwtrent added this to the 11.0.0 milestone Dec 17, 2024
@benwtrent
Member

While it's a simple change, it does change the analysis chain. I wonder if it should stick to Lucene 11 (admittedly, that will not ship for a LONG time).

I wonder what others think about whether we should backport to 10.2?

@eusousu
Author

eusousu commented Dec 17, 2024

Is there something I can help investigate further? I tried parsing the error and it seems to relate to a mn_MN dictionary that refers to Mongolian 🤔

While checking /home/runner/work/lucene/lucene/lucene/analysis/common/build/hunspell-regressions/libreoffice/mn_MN/mn_MN.aff:
  2> java.lang.NumberFormatException: For input string: "ээв$"
  2> 	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  2> 	at java.base/java.lang.Integer.parseInt(Integer.java:662)
  2> 	at java.base/java.lang.Integer.parseInt(Integer.java:778)
  2> 	at org.apache.lucene.analysis.hunspell.Dictionary.parseNum(Dictionary.java:594)
  2> 	at org.apache.lucene.analysis.hunspell.Dictionary.readAffixFile(Dictionary.java:408)
  2> 	at org.apache.lucene.analysis.hunspell.Dictionary.<init>(Dictionary.java:264)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries$1.<init>(TestAllDictionaries.java:75)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries.loadDictionary(TestAllDictionaries.java:75)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries.lambda$testDictionariesLoadSuccessfully$7(TestAllDictionaries.java:165)
  2> 	at org.apache.lucene.analysis.hunspell.TestAllDictionaries.lambda$testDictionariesLoadSuccessfully$8(TestAllDictionaries.java:187)
  2> 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  2> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  2> 	at java.base/java.lang.Thread.run(Thread.java:1583)

I am unable to understand how my change could impact the analysis on the Mongolian language.

@msokolov
Contributor

I am unable to understand how my change could impact the analysis on the Mongolian language.

I don't understand the connection to your change either, but it looks to me as if Mongolian is coming in via the randomized test framework, which sets the active Java Locale to something random. If you were to capture the full "reproduce with" command line that is output by the test framework when there is a failure, we could tell if that's what's going on. Are you able to reproduce the failure?

@rmuir
Member

rmuir commented Dec 17, 2024

I reran that check to see if it reproduces.

This "extra regressions" check also does an unpinned shallow git clone of external dictionary repositories, so someone could (in theory) make a commit to a dictionary in those repos that causes a failure.

I guess it is also possible it could fail in such a way that the clone is unsuccessful (e.g. a network issue)? It is unclear to me whether it is really cloning with the git binary, or doing some pure-Java clone (jgit etc.) that might be less robust.

And yeah, it could also be a randomization issue such as locale/charset.

@rmuir
Member

rmuir commented Dec 17, 2024

It fails again. Maybe the problem comes from LibreOffice/dictionaries@d169602?

@rmuir
Member

rmuir commented Dec 17, 2024

And yeah, I see the commit date, but that's not the push date. So I suspect this issue has nothing to do with your PR and may fail all PRs until we address it.

@eusousu
Author

eusousu commented Dec 17, 2024

If you were to capture the full "reproduce with" command line that is output by the test framework

You mean this?

Reproduce with: gradlew :lucene:analysis:common:test --tests "org.apache.lucene.analysis.hunspell.TestAllDictionaries.testDictionariesLoadSuccessfully" -Ptests.jvms=2 -Ptests.jvmargs= -Ptests.seed=18B87D355DA5CF0F -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=default -Ptests.forceintegervectors=false

@rmuir
Member

rmuir commented Dec 17, 2024

I'm happy to try to debug this, but it might be a few days. The issue may be with the REP rules in the referenced commit.
The way these rules work is:

REP 3619
REP a а
REP c с
... (3617 more times)

If the count (3619) is incorrect, the parser might try to parse a rule (such as REP ээв$ ээ_вэ) as another count, leading to the NumberFormatException. Maybe the C hunspell parser is lenient, so nobody has noticed yet. In my experience these dictionaries often have such problems, hence the testing here.

It could also be some other bug in the code, this is just what it looks like to me.
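For illustration, here is a minimal sketch (my own toy example, not Lucene's actual Dictionary code) of how a strict reader that trusts the declared count ends up feeding a leftover REP rule line back into Integer.parseInt:

```java
import java.util.List;

// Sketch of a strict REP-table reader: it trusts the declared count, so any
// leftover "REP <from> <to>" line is later mistaken for a new "REP <count>" header.
public class RepCountSketch {
    // Parses the count from a "REP <count>" header line.
    static int declaredCount(String headerLine) {
        return Integer.parseInt(headerLine.split("\\s+")[1]);
    }

    public static void main(String[] args) {
        // Declared count (1) is smaller than the real number of rules (2).
        List<String> lines = List.of("REP 1", "REP a а", "REP ээв$ ээ_вэ");

        int count = declaredCount(lines.get(0)); // 1: only one rule is consumed
        // The affix reader then sees the leftover "REP ээв$ ээ_вэ" line,
        // treats it as a fresh header, and tries to parse "ээв$" as a number:
        try {
            declaredCount(lines.get(1 + count));
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage()); // For input string: "ээв$"
        }
    }
}
```

This reproduces the same NumberFormatException message seen in the CI log above.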

@rmuir
Member

rmuir commented Dec 17, 2024

Yes, that's it. The REP 3619 should be changed to REP 3621. I guess we could send the PR to LibreOffice, since the "upstream" dictionary looks totally different here: https://github.com/bataak/dict-mn/blob/main/mn_MN/mn_MN.aff#L49

@rmuir
Member

rmuir commented Dec 17, 2024

I will send them a one-liner PR explaining the situation; we can take it from there.

We may want to separately try to be more lenient about this part of the parsing. I have not looked at the C implementation, but I'm guessing the numbers might be used more as a "hint" to size hash tables or something, and not really a hard rule?
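One possible shape for that leniency (a hypothetical sketch, not Lucene's actual parser or a description of what the C implementation does): use the declared count only as a sizing hint, and keep consuming lines for as long as they still look like REP rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lenient REP-table reader: the declared count is only a hint for
// pre-sizing; parsing stops at the first line that is not a "REP <from> <to>" rule.
public class LenientRep {
    static List<String[]> parseRep(List<String> lines) {
        int hint = Integer.parseInt(lines.get(0).split("\\s+")[1]); // sizing hint only
        List<String[]> rules = new ArrayList<>(Math.max(hint, 0));
        for (int i = 1; i < lines.size(); i++) {
            String[] parts = lines.get(i).split("\\s+");
            if (parts.length != 3 || !parts[0].equals("REP")) break; // end of table
            rules.add(new String[] {parts[1], parts[2]});
        }
        return rules;
    }

    public static void main(String[] args) {
        // mn_MN-style input: the header undercounts, but both rules are kept
        // and parsing stops cleanly at the next (unrelated) directive.
        List<String> lines = List.of("REP 1", "REP a а", "REP ээв$ ээ_вэ", "SFX A Y 1");
        System.out.println(parseRep(lines).size()); // 2
    }
}
```

With this approach the miscounted mn_MN.aff would load instead of throwing, at the cost of silently accepting a malformed count.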

@rmuir
Member

rmuir commented Dec 17, 2024

@eusousu
Author

eusousu commented Dec 17, 2024

LibreOffice/dictionaries#46

Their contribution process seems to be elsewhere, but I could not fully understand it 😅

I tried sending the issue to their IRC channel; I hope someone sees it.

@rmuir
Member

rmuir commented Dec 18, 2024

Thank you @eusousu

I tried to make some progress, for now at least I have an open bug report: https://bugs.documentfoundation.org/show_bug.cgi?id=164366

@rmuir
Member

rmuir commented Dec 18, 2024

I took a stab at hacking around this on our side as well: #14079

Development

Successfully merging this pull request may close these issues.

Missing word on Brazillian stop word list
4 participants