How to improve POS tagging and lemmatizing? #12431

KenSentMe · 2023-03-16T16:06:55Z

Our tool takes (Dutch) texts as input. These texts are processed with spaCy for POS tagging and lemmatizing, using the nl_core_news_lg language model. While POS tagging and lemmatizing work perfectly in most cases, we sometimes see tokens that are tagged and/or lemmatized incorrectly. Because lemmatizing is important for the further use of our tool, I currently keep a Python dictionary that maps each original word (which the model lemmatizes incorrectly) to the correct lemma. I can imagine there are better ways to correct or even improve the POS tagging and lemmatizing, but I haven't figured out what a solid workflow for this would look like. If I understand correctly, the nl_core_news_lg model uses a trained lemmatizer. One approach I thought about is adding a rule-based or lookup-based lemmatizer for cases where the trained lemmatizer fails, but I don't know whether this makes sense or if there are better approaches. Any suggestions?

danieldk · 2023-03-17T13:10:49Z

Thanks for your question! The nl_core_news_lg model indeed uses the trainable lemmatizer. If you want to override some incorrect lemmatizations, adding a lookup or rule-based lemmatizer is a good solution. Note though that by default, the lemmatizer pipe will set the lemma to the word form if there is no matching lookup/rule. So, you could subclass the lemmatizer class and add the custom behavior that you need. For example:

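from typing import Callable, List, Optional

from spacy.language import Language
from spacy.pipeline import Lemmatizer
from spacy.tokens import Token
from thinc.api import Model

@Language.factory(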
    "fixup_lemmatizer",
    assigns=["token.lemma"],
    default_config={
        "model": None,
        "mode": "fixup",
        "overwrite": True,
        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
    },
    default_score_weights={"lemma_acc": 1.0},
)
def make_fixup_lemmatizer(
    nlp: Language,
    model: Optional[Model],
    name: str,
    mode: str,
    overwrite: bool,
    scorer: Optional[Callable],
):
    return FixupLemmatizer(
        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
    )

class FixupLemmatizer(Lemmatizer):
    def fixup_lemmatize(self, token: Token) -> List[str]:
        # Your custom lemmatization logic. It must return a non-empty list;
        # the pipe assigns the first element as the token's lemma.
        ...

You should then be able to configure the lemmatizer as follows:

pipeline = ["tok2vec","morphologizer","tagger","parser","lemmatizer","senter","attribute_ruler","ner", "fixup_lemmatizer"]

[components.fixup_lemmatizer]
factory = "fixup_lemmatizer"
mode = "fixup"
overwrite = true

When the lemmatizer mode is not lookup or rule, the Lemmatizer class will try to use a method <mode>_lemmatize, which is why setting the mode to fixup here will call the fixup_lemmatize method in the derived class. You could look at the existing lookup_lemmatize method as an example of what the fixup_lemmatize method could look like.
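As a rough illustration of what such a method could look like, here is a minimal sketch that applies a dictionary of corrections and otherwise keeps whatever lemma an earlier component already assigned. The names LEMMA_FIXES, DictFixupLemmatizer and dict_fixup_lemmatizer, as well as the example entries, are hypothetical and only chosen for this sketch. It uses a blank Dutch pipeline so that it runs without downloading nl_core_news_lg; in a real setup the component would be added after the trained lemmatizer.

```python
from typing import Callable, Dict, List, Optional

import spacy
from spacy.language import Language
from spacy.pipeline import Lemmatizer
from spacy.tokens import Token
from thinc.api import Model

# Hypothetical override table: lowercased word form -> corrected lemma.
LEMMA_FIXES: Dict[str, str] = {"liep": "lopen", "huizen": "huis"}


class DictFixupLemmatizer(Lemmatizer):
    def fixup_lemmatize(self, token: Token) -> List[str]:
        fix = LEMMA_FIXES.get(token.lower_)
        if fix is not None:
            return [fix]
        # No override: keep the lemma set by an earlier component, or
        # fall back to the word form if no lemma has been set yet.
        return [token.lemma_ or token.text]


@Language.factory(
    "dict_fixup_lemmatizer",
    assigns=["token.lemma"],
    default_config={
        "model": None,
        "mode": "fixup",
        "overwrite": True,
        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
    },
    default_score_weights={"lemma_acc": 1.0},
)
def make_dict_fixup_lemmatizer(
    nlp: Language,
    model: Optional[Model],
    name: str,
    mode: str,
    overwrite: bool,
    scorer: Optional[Callable],
):
    return DictFixupLemmatizer(
        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
    )


# Blank pipeline for demonstration only (assumption: no trained model installed).
nlp = spacy.blank("nl")
nlp.add_pipe("dict_fixup_lemmatizer", last=True)
doc = nlp("Hij liep naar huis")
lemmas = [token.lemma_ for token in doc]
```

Because overwrite is true, the method is called for every token, so returning the existing lemma for words without an override is what keeps the trained lemmatizer's output intact.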

For more details, you can also consult this earlier discussion.
