nb: lemmatization of copula AUX #4735
Replies: 3 comments
-
Some intertwined issues here:
I think we'll have to plan the changes to spaCy itself for v3, when we hope to update the UD training data to UD v2.5, which will bring up the AUX issues in other languages, too. If you want to work with spaCy v2 but use newer or different data right now, you can modify the lemmatizer data in spacy-lookups-data (https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data). It's not an elegant solution, but I think you could temporarily add the AUX forms as exceptions, and then it would behave similarly to English. The lemmatizer tables are serialized with the vocab when you save a trained model, so you can make local changes to |
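A rough sketch of the "add the AUX forms as exceptions" idea, working on a plain dict rather than on spaCy itself. The table layout (lowercase POS key mapping to form-to-lemma-list entries), the "aux" key, the helper name add_aux_exceptions and the toy entries are all assumptions made for illustration. Whether the lemmatizer actually consults an "aux" entry depends on the spaCy version and is not verified here.

```python
import copy


def add_aux_exceptions(lemma_exc, forms=None):
    """Return a copy of a lemma_exc-style table with VERB exceptions mirrored under AUX.

    Assumed layout (hypothetical): {"verb": {"er": ["være"], ...}, ...}.
    If `forms` is given, only those surface forms are copied.
    """
    updated = copy.deepcopy(lemma_exc)  # leave the caller's table untouched
    verb_exc = updated.get("verb", {})
    aux_exc = updated.setdefault("aux", {})
    for form, lemmas in verb_exc.items():
        if forms is None or form in forms:
            # Keep any AUX entry that already exists instead of overwriting it.
            aux_exc.setdefault(form, list(lemmas))
    return updated


# Toy data for illustration only, not the real lookup table.
table = {"verb": {"er": ["være"], "var": ["være"], "gikk": ["gå"]}}
patched = add_aux_exceptions(table, forms={"er", "var"})
print(patched["aux"])
```

Restricting the copy to an explicit set of forms keeps the change small, so it is easy to revert once the lemmatizer handles AUX itself.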
-
I see! Thanks for clarifying. Maybe I'll just stick to the older UD data to keep things simple for now. If not, are there any scripts available I could use to generate the files for |
-
There are no scripts, but the data should basically already be there, either in Actually, the pronoun lemma above is different, too, so I'm not 100% sure what's going on. How and where lemmas are set gets pretty complicated... |
-
If you train an nb model on the latest NorNE/UD data, spaCy gets the lemmatization of the copula er ("is") wrong, since the data now (correctly) tags it as AUX and not VERB.
The current nb_core_news_sm model does not have this problem, since it was trained on an older version of the UD data, from before copulas were changed to use the AUX tag.
How to reproduce the behaviour
Example sentence: Hun er statsminister ("She is prime minister")
The correct lemma for copula er ("is") would be være ("to be").
The current nb_core_news_sm model finds the correct lemma, since it tags er as VERB and the form is then presumably found in the lemma_exc lookup table:
Using a model trained on the updated data, the lemma becomes incorrect.
We can also see that the lemmatizer still gets the lemma of the VERB correctly, but not AUX:
Solution?
What is the best way to solve this? The English model seems to get it right for some reason, even though the lemmatizer has the same problem with AUX vs VERB:
A naive solution would be to simply treat AUX as a VERB in the lemmatizer, but it feels a bit invasive.
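To make the naive option concrete, here is a minimal sketch of an exception lookup that falls back from AUX to VERB. It is not spaCy's lemmatizer. The function name lookup_lemma, the pos_fallback parameter and the toy table are hypothetical, and the table layout (lowercase POS key mapping to form-to-lemma-list entries) is an assumption.

```python
def lookup_lemma(form, pos, lemma_exc, pos_fallback=None):
    """Look up `form` in a lemma_exc-style table, retrying under a fallback POS.

    By default AUX falls back to VERB. Returns the lowercased form if no
    exception entry is found.
    """
    if pos_fallback is None:
        pos_fallback = {"aux": "verb"}
    key = pos.lower()
    form_l = form.lower()
    # Try the token's own POS first, then the fallback POS (if any).
    for candidate in (key, pos_fallback.get(key)):
        if candidate is None:
            continue
        lemmas = lemma_exc.get(candidate, {}).get(form_l)
        if lemmas:
            return lemmas[0]
    return form_l


# Toy data for illustration only, not the real lookup table.
toy_exc = {"verb": {"er": ["være"]}}
print(lookup_lemma("er", "VERB", toy_exc))
print(lookup_lemma("er", "AUX", toy_exc))
print(lookup_lemma("er", "AUX", toy_exc, pos_fallback={}))
```

Keeping the fallback in an explicit mapping makes it easy to switch off, which may soften the "invasive" concern compared with hard-coding AUX as VERB.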
Your Environment
Info about spaCy