How to improve POS tagging and lemmatizing? #12431
Replies: 1 comment
-
Thanks for your question! The [...]

    from typing import Callable, List, Optional

    from spacy.language import Language
    from spacy.pipeline import Lemmatizer
    from spacy.tokens import Token
    from thinc.api import Model


    @Language.factory(
        "fixup_lemmatizer",
        assigns=["token.lemma"],
        default_config={
            "model": None,
            "mode": "fixup",
            "overwrite": True,
            "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
        },
        default_score_weights={"lemma_acc": 1.0},
    )
    def make_fixup_lemmatizer(
        nlp: Language,
        model: Optional[Model],
        name: str,
        mode: str,
        overwrite: bool,
        scorer: Optional[Callable],
    ):
        return FixupLemmatizer(
            nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
        )


    class FixupLemmatizer(Lemmatizer):
        def fixup_lemmatize(self, token: Token) -> List[str]:
            # Your custom lemmatization logic.
            ...

You should then be able to configure the lemmatizer as follows:

    pipeline = ["tok2vec","morphologizer","tagger","parser","lemmatizer","senter","attribute_ruler","ner","fixup_lemmatizer"]

    [components.fixup_lemmatizer]
    factory = "fixup_lemmatizer"
    mode = "fixup"
    overwrite = true

When the lemmatizer mode is not [...]

For more information, you can also consult this earlier discussion.
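To make the `FixupLemmatizer` skeleton above more concrete, here is a minimal sketch of what the custom logic could look like when the corrections live in a plain dictionary. It relies on my understanding that, in spaCy v3, `Lemmatizer` resolves any mode other than `lookup` or `rule` to a method named `{mode}_lemmatize`, so `mode = "fixup"` ends up calling `fixup_lemmatize`. The names `LEMMA_FIXES`, `DictFixupLemmatizer` and `dict_fixup_lemmatizer` are hypothetical, and the dictionary entry is a made-up example rather than an observed model error. A blank Dutch pipeline and a hand-built `Doc` stand in for `nl_core_news_lg` so the sketch runs without downloading a model.

```python
from typing import Callable, Dict, List, Optional

import spacy
from spacy.language import Language
from spacy.pipeline import Lemmatizer
from spacy.tokens import Doc, Token
from thinc.api import Model

# Hypothetical corrections: lowercased surface form -> desired lemma.
LEMMA_FIXES: Dict[str, str] = {"huizen": "huis"}


class DictFixupLemmatizer(Lemmatizer):
    def fixup_lemmatize(self, token: Token) -> List[str]:
        # Use the correction if one exists; otherwise keep the lemma that
        # an earlier component (e.g. the trained lemmatizer) assigned.
        return [LEMMA_FIXES.get(token.lower_, token.lemma_)]


@Language.factory(
    "dict_fixup_lemmatizer",
    assigns=["token.lemma"],
    default_config={
        "model": None,
        "mode": "fixup",
        "overwrite": True,
        "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
    },
    default_score_weights={"lemma_acc": 1.0},
)
def make_dict_fixup_lemmatizer(
    nlp: Language,
    model: Optional[Model],
    name: str,
    mode: str,
    overwrite: bool,
    scorer: Optional[Callable],
):
    return DictFixupLemmatizer(
        nlp.vocab, model, name, mode=mode, overwrite=overwrite, scorer=scorer
    )


# Stand-in for the trained pipeline: a blank Dutch pipeline plus a Doc
# whose lemmas are set by hand, as if an earlier lemmatizer had run.
nlp = spacy.blank("nl")
fixer = nlp.add_pipe("dict_fixup_lemmatizer")
doc = Doc(nlp.vocab, words=["Twee", "huizen"], lemmas=["twee", "huizen"])
doc = fixer(doc)
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Because `overwrite` is true, the component rewrites every token's lemma, so returning `token.lemma_` for unknown words is what preserves the trained lemmatizer's output. Keying on the lowercased form is a simplification; if a word needs different lemmas depending on its part of speech, the lookup key could include `token.pos_` as well.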
-
Our tool uses (Dutch) texts as input. These texts are processed with spaCy for POS tagging and lemmatization, using the nl_core_news_lg language model. While POS tagging and lemmatization work well in most cases, we sometimes see tokens that are tagged and/or lemmatized incorrectly. Because lemmatization is important for the further use of our tool, I currently maintain a Python dictionary that maps each original word (which the model lemmatizes incorrectly) to its correct lemma.

I can imagine there are better ways to correct or even improve the POS tagging and lemmatization, but I haven't figured out what a solid workflow for this would look like. If I understand correctly, the nl_core_news_lg model uses a trained lemmatizer. One approach I thought about is adding a rule-based or lookup-based lemmatizer as a fallback for when the trained lemmatizer fails, but I don't know whether this makes sense or whether there are better approaches. Any suggestions?
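For illustration, a correction dictionary like the one described here could also be expressed as patterns for spaCy's `attribute_ruler` component, which can override both the lemma and the POS tag of matched tokens. This is only a sketch under assumptions: the `CORRECTIONS` mapping and its single entry are hypothetical, and a blank Dutch pipeline is used instead of nl_core_news_lg so that it runs without a model download.

```python
import spacy

# Hypothetical corrections: lowercased word -> attributes to force.
CORRECTIONS = {
    "huizen": {"LEMMA": "huis", "POS": "NOUN"},
}

# Blank pipeline as a stand-in; with a trained pipeline you would
# probably fetch its existing ruler via nlp.get_pipe("attribute_ruler").
nlp = spacy.blank("nl")
ruler = nlp.add_pipe("attribute_ruler")

for word, attrs in CORRECTIONS.items():
    # Each pattern matches a single token by its lowercased text.
    ruler.add(patterns=[[{"LOWER": word}]], attrs=attrs)

doc = nlp("Twee huizen")
for token in doc:
    print(token.text, token.lemma_, token.pos_)
```

The ruler only changes tokens that match a pattern, so everything else keeps whatever the earlier components assigned. Whether the override sticks in a trained pipeline depends on the ruler running after the tagger and lemmatizer, which is worth checking in `nlp.pipe_names`.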