nb: lemmatization of copula AUX #4735

jarib · 2019-11-30T14:38:15Z

If you train an nb model on the latest NorNE/UD data, spaCy gets the lemma of the copula er ("is") wrong, since the data now (correctly) tags it as AUX rather than VERB.

The current nb_core_news_sm model does not have this problem, since it was trained on an older version of the UD data, before copulas were changed to use the AUX tag.

How to reproduce the behaviour

Example sentence: Hun er statsminister ("She is prime minister")
The correct lemma for copula er ("is") would be være ("to be").

The current nb_core_news_sm model finds the correct lemma, since it is tagged as VERB and presumably it's then found in the lemma_exc lookup table:

>>> import spacy
>>> nlp = spacy.load('nb_core_news_sm')
>>> [(t.text, t.lemma_, t.pos_) for t in nlp("Hun er statsminister")]
[('Hun', '-PRON-', 'PRON'),
 ('er', 'være', 'VERB'),
 ('statsminister', 'statsminister', 'NOUN')]
>>> nlp.vocab.morphology.lemmatizer('er', 'VERB')
['være']

Using a model trained on the updated data, the lemma becomes incorrect.

>>> nlp = spacy.load('./data/nb-sm-training-norne/model-best')
>>> [(t.text, t.lemma_, t.pos_) for t in nlp("Hun er statsminister")]
[('Hun', 'hun', 'PRON'),
 ('er', 'er', 'AUX'),
 ('statsminister', 'statsminister', 'NOUN')]

We can also see that the lemmatizer still returns the correct lemma when called with VERB, but not when called with AUX:

>>> nlp.vocab.morphology.lemmatizer('er', 'VERB')
['være']
>>> nlp.vocab.morphology.lemmatizer('er', 'AUX')
['er']

Solution?

What is the best way to solve this? The English model seems to get it right for some reason, even though the lemmatizer has the same problem with AUX vs VERB:

>>> nlp = spacy.load('en_core_web_sm')
>>> [(t.text, t.lemma_, t.pos_) for t in nlp("She is prime minister")]
[('She', '-PRON-', 'PRON'),
 ('is', 'be', 'AUX'),
 ('prime', 'prime', 'PROPN'),
 ('minister', 'minister', 'PROPN')]
>>> nlp.vocab.morphology.lemmatizer('is', 'AUX')
['is']
>>> nlp.vocab.morphology.lemmatizer('is', 'VERB')
['be']

A naive solution would be to simply treat AUX as a VERB in the lemmatizer, but it feels a bit invasive.

diff --git a/spacy/lemmatizer.py b/spacy/lemmatizer.py
index d70e4cfc4..d90aa0fb2 100644
--- a/spacy/lemmatizer.py
+++ b/spacy/lemmatizer.py
@@ -3,7 +3,7 @@ from __future__ import unicode_literals
 
 from collections import OrderedDict
 
-from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN
+from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN, AUX
 from .errors import Errors
 from .lookups import Lookups
 
@@ -45,7 +45,7 @@ class Lemmatizer(object):
             return [lookup_table.get(string, string)]
         if univ_pos in (NOUN, "NOUN", "noun"):
             univ_pos = "noun"
-        elif univ_pos in (VERB, "VERB", "verb"):
+        elif univ_pos in (VERB, "VERB", "verb", AUX, "AUX", "aux"):
             univ_pos = "verb"
         elif univ_pos in (ADJ, "ADJ", "adj"):
             univ_pos = "adj"
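
To make the effect of the patch concrete, here is a minimal standalone sketch of the same idea. It is not spaCy's code: `EXCEPTIONS`, `normalize_pos`, `lemmatize` and the `treat_aux_as_verb` flag are hypothetical names, and the fallback of returning the lowercased string for unhandled tags is an assumption about how the real lemmatizer behaves.

```python
# Toy stand-in for the POS dispatch in a rule-based lemmatizer.
# EXCEPTIONS and the fallback behaviour are assumptions for illustration,
# not spaCy's actual tables or code.
EXCEPTIONS = {"verb": {"er": ["være"]}}


def normalize_pos(univ_pos, treat_aux_as_verb=False):
    """Map a UD POS tag onto the key used by the exception tables."""
    verb_tags = {"VERB", "verb"}
    if treat_aux_as_verb:
        # This mirrors the one-line change in the diff above.
        verb_tags |= {"AUX", "aux"}
    if univ_pos in verb_tags:
        return "verb"
    if univ_pos in {"NOUN", "noun"}:
        return "noun"
    if univ_pos in {"ADJ", "adj"}:
        return "adj"
    return None  # no table for this tag


def lemmatize(string, univ_pos, treat_aux_as_verb=False):
    """Look up an exception for the form, or fall back to the form itself."""
    key = normalize_pos(univ_pos, treat_aux_as_verb)
    if key is None:
        # Assumed fallback: tags without a table get the lowercased form.
        return [string.lower()]
    return EXCEPTIONS.get(key, {}).get(string, [string])


print(lemmatize("er", "VERB"))                          # ['være']
print(lemmatize("er", "AUX"))                           # ['er']
print(lemmatize("er", "AUX", treat_aux_as_verb=True))   # ['være']
```

With the flag off, AUX falls through to the fallback branch, which matches the `['er']` result shown earlier; with it on, AUX shares the verb table.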

Your Environment

Info about spaCy

spaCy version: 2.2.3
Platform: Linux-4.15.0-1054-aws-x86_64-with-debian-buster-sid
Python version: 3.6.5

adrianeboyd · 2019-12-02T08:46:49Z

Some intertwined issues here:

In general, it looks like the lemmatizer should be updated to allow rules for more/all UD POS tags present in the lookup rules. This is an unnecessarily limited list and the proper noun exception is oversimplified.
The current version of the code is intended to work with the current training data for the provided models, so we generally don't want to make changes like this unless the training data and models are being updated, too. If we make these kinds of changes now, it could cause problems for existing v2 models.

I think we'll have to plan the changes to spacy itself for v3, when we hope to update the UD training data to UD v2.5. That update will bring up the AUX issue in other languages, too.

If you want to work with spacy v2 but use newer/different data right now, you can modify the lemmatizer in the nb files from this repo:

https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data

It's not an elegant solution, but I think you could temporarily add the AUX forms as exceptions and then it would be similar to English. The lemmatizer tables are serialized with the vocab when you save a trained model, so you can make local changes to spacy-lookups-data that will end up saved in your trained models -- no need to wait for the spacy packages to be updated to produce models that you can share with others. (Same goes for the tag maps, too.)
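
As a rough illustration of the "AUX forms as exceptions" idea, the sketch below applies a small hand-maintained exception table to already-tagged output. Note that this is a different technique from editing the spacy-lookups-data files: it is a post-processing override, and `AUX_EXCEPTIONS` and `apply_aux_exceptions` are hypothetical names. Only the `er` → `være` pair comes from this thread; how such entries would be laid out in the actual lookup files is not shown here.

```python
# Hypothetical hand-maintained table: (lowercased form, UD POS) -> lemma.
# Only "er" -> "være" comes from the thread; other copula forms would have
# to be added by hand.
AUX_EXCEPTIONS = {("er", "AUX"): "være"}


def apply_aux_exceptions(tokens, exceptions=AUX_EXCEPTIONS):
    """Return (text, lemma, pos) triples with exception lemmas substituted."""
    fixed = []
    for text, lemma, pos in tokens:
        # Keep the model's lemma unless the (form, POS) pair is listed.
        lemma = exceptions.get((text.lower(), pos), lemma)
        fixed.append((text, lemma, pos))
    return fixed


# Output of the retrained model as reported earlier in the thread.
tagged = [
    ("Hun", "hun", "PRON"),
    ("er", "er", "AUX"),
    ("statsminister", "statsminister", "NOUN"),
]
print(apply_aux_exceptions(tagged))
```

Keying on both form and POS keeps the override from touching tokens that happen to share a surface form but carry a different tag.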

jarib · 2019-12-02T09:18:02Z

I see! Thanks for clarifying.

Maybe I'll just stick to the older UD data to keep things simple for now. If not, are there any scripts available I could use to generate the files for spacy-lookups-data from the updated UD data, instead of hand-editing?

adrianeboyd · 2019-12-03T11:42:15Z

There are no scripts, but the data should basically already be there: in the lookup or exc tables (the lookup entries may be of poor quality), or in the lemma column of the UD data (which may not contain the full list of forms for every paradigm).

Actually, the pronoun lemma above is different, too, so I'm not 100% sure what's going on. How and where lemmas are set gets pretty complicated...
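
For the "lemma column of the UD data" route, a minimal sketch of pulling form-to-lemma pairs for one POS tag out of CoNLL-U text might look like the following. The function name `collect_lemmas` is hypothetical, the sample rows are hand-written for illustration rather than copied from NorNE, and turning the result into the format expected by spacy-lookups-data is left open.

```python
from collections import Counter, defaultdict


def collect_lemmas(conllu_text, upos="AUX"):
    """Map each lowercased form with the given UPOS to its most frequent lemma."""
    counts = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            # Skip malformed lines, multiword ranges (1-2) and empty nodes (1.1).
            continue
        form, lemma, tag = cols[1], cols[2], cols[3]
        if tag == upos:
            counts[form.lower()][lemma] += 1
    # Keep the most frequent lemma per form.
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


# Hand-written sample in CoNLL-U layout (ID, FORM, LEMMA, UPOS, then six
# unused columns filled with "_"); not copied from the NorNE data.
rows = [
    ("1", "Hun", "hun", "PRON"),
    ("2", "er", "være", "AUX"),
    ("3", "statsminister", "statsminister", "NOUN"),
]
sample = "# text = Hun er statsminister\n" + "\n".join(
    "\t".join(row + ("_",) * 6) for row in rows
)
print(collect_lemmas(sample))  # {'er': 'være'}
```

As noted above, a treebank will only contain the forms that happen to occur in it, so the result may need to be completed by hand.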
