Portuguese words starting with a capital letter are not correctly lemmatized #7638
-
Hi! I'll move this to the discussion board - this issue will be closed but there will be a link/forward to the open thread.
-
It's 2023 and I came across this problem as well, apparently unresolved. Sentence to reproduce
-
Please consider that spaCy is an open-source toolbox, provided to you for free for any usage (commercial, research or otherwise), and that we spend a LOT of time going through our discussion forums and issue tracker to help users with questions (again, for free) and resolve bugs. Your original post "It's 2023 and I ..." comes across as condescending, which generally is not helpful if you want someone to help you. Open-source maintenance is a lot of work, and we ask you to be mindful and respectful of that. To your questions:
This thread wasn't closed. It was moved to a discussion because it's not a bug: errors made by an ML model are not considered bugs, as ML models can never be 100% correct, or fully predictable after retraining. On GitHub, moving a thread from the Issue Tracker to the Discussion forum can only be done by closing the issue and creating a new thread on the Discussion forum. That's what we did.
This thread was tagged as "feat / lemmatizer", indicating this thread is about spaCy's lemmatization feature, which is correct. It helps us group and find similar issues using this internal classification system.
Unfortunately, ML models will always make mistakes. Fixing such behaviour often requires changing, extending or correcting the training data. Let's have a look at one of the original examples provided by the person who started this topic:
With spaCy 3.7.2, this gives me lemma "Feito" using
If you want to have more control over the lemmatization, you can consider using a rule-based lemmatizer instead:
And follow the advice from the thread you linked. For the future, incorrect ML predictions can be reported in the following master thread: #3052
-
Thanks for your reply. My apologies, it certainly wasn't intended to be condescending; I was just expressing confusion about why the issue had been unresolved since 2021.
Using pt_core_news_lg, the lemma I receive for "Trabalharam" is the word unchanged, "Trabalharam". If I lowercase this word, the correct lemma "trabalhar" is given. Similarly, for "Fale" the lemma is given as "Fale", and lowercasing gives the correct lemma "falar".
I am already using pt_core_news_lg, as I mentioned in the original post, and I get these errors.
P.S. Thank you for the pointer to the master thread. This comment demonstrates the same problem in Portuguese with a verb with a capitalized first letter (in this case, "Reserve").
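For anyone hitting the same thing, here is a minimal sketch of a workaround based on the lowercasing observation above. It is not part of spaCy: `lemma_with_lowercase_fallback` and `stub_lemmatize` are hypothetical names, and the stub dictionary only stands in for the model so the example runs on its own. The idea is to retry on the lowercased form only when a capitalized word comes back unchanged.

```python
from typing import Callable

def lemma_with_lowercase_fallback(word: str, lemmatize: Callable[[str], str]) -> str:
    """Lemmatize `word`; if a capitalized word comes back unchanged,
    retry on its lowercased form and use that result if it differs."""
    lemma = lemmatize(word)
    if lemma == word and word[:1].isupper():
        lowered = word.lower()
        retry = lemmatize(lowered)
        # Only accept the retry if the lemmatizer actually changed something;
        # otherwise keep the original (e.g. for proper nouns).
        if retry != lowered:
            return retry
    return lemma

# Hypothetical stand-in for a real model: it only knows lowercase forms,
# mimicking the behaviour described in this thread.
_STUB_LEMMAS = {"trabalharam": "trabalhar", "fale": "falar"}

def stub_lemmatize(word: str) -> str:
    return _STUB_LEMMAS.get(word, word)

print(lemma_with_lowercase_fallback("Trabalharam", stub_lemmatize))
print(lemma_with_lowercase_fallback("Fale", stub_lemmatize))
print(lemma_with_lowercase_fallback("Maria", stub_lemmatize))
```

With a loaded pipeline, `lemmatize` could be something like `lambda w: nlp(w)[0].lemma_`, with the caveat that a single word processed out of context may be tagged differently than it would be inside a sentence, so this is a heuristic rather than a fix.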
-
How to reproduce the behaviour
And it is not exclusively when starting a sentence:
It should return "Fazer" as in:
Your Environment
My example includes a verb, but I've also tested using nouns and the same bug occurs.