Custom POS pipeline #5417

AstralWatcher · 2020-05-08T02:25:45Z


Greetings,
This is my first time using spaCy.
I was looking through the documentation because I want to change the POS tag of every token in a Doc object. I first tried changing it directly, but that is not allowed (I later found out why). I then found a workaround here that creates a new Doc object.
I am using a custom library for POS tagging and just want to set token.pos_ for every token.
So far I have tried the code below:

import numpy
from spacy.symbols import POS
from spacy.tokens import Doc

import your_custom_model


def custom_model_wrapper(doc):
    words = [token.text for token in doc]
    pos = your_custom_model(words)
    # Convert the strings to integers and add them to the string store
    pos = [doc.vocab.strings.add(label) for label in pos]
    # Create a new Doc from a numpy array
    attrs = [POS]
    arr = numpy.array(list(zip(pos)), dtype="uint64")
    new_doc = Doc(doc.vocab, words=words).from_array(attrs, arr)
    return new_doc

After that I tried to access the tags like this:

for token in new_doc:
    print(token.pos_)

With the code above I get the error KeyError: 1524697805 in File "token.pyx", line 879, in spacy.tokens.token.Token.pos_.get
It seems the dictionary does not contain the integer value of the label? I thought it would be added by the call doc.vocab.strings.add(label), or am I missing something? How is the POS value mapped to a particular token? I also checked the lengths of the words and pos lists, and they are the same.
I added the custom wrapper with:

nlp.add_pipe(custom_model_wrapper, name="custom_pos_tagger", first=True)

The model is a blank xx_ent_wiki_sm that I then trained with POS+WORD for NER; besides the custom component, the pipeline contains only ner.

An additional question: do I need a tokenizer in the pipeline for NER to work properly?

Environment Information:

spaCy version: 2.2.4
Platform: Windows-10-10.0.17763-SP0
Python version: 3.7.4*
Dev platform: Pycharm 2019.3.3

Answered by adrianeboyd


adrianeboyd · 2020-05-08T07:49:20Z


A spaCy token has two attributes that store POS information: the fine-grained tag (tag) and the coarse-grained Universal POS (pos). The tag can be any label, but pos is restricted to a UPOS tag from this tag set: https://universaldependencies.org/u/pos/index.html. (I looked again through the related docs and this should be explained more clearly!)

For custom tags, you want to use tag instead of pos. Trying to use Doc.from_array as a workaround just masks the underlying problem. As a check, any attribute you want to load with Doc.from_array should also be settable directly on the Doc.
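To make the tag/pos distinction concrete, here is a minimal sketch. It assumes an empty Vocab and a hand-built Doc, and the label MY_PROPER_NOUN is made up for illustration; it is not taken from the original poster's tag set.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a Doc by hand; no trained model is needed for this check.
doc = Doc(Vocab(), words=["Berlin", "is", "big"])

# tag_ accepts an arbitrary label (MY_PROPER_NOUN is a made-up example).
doc[0].tag_ = "MY_PROPER_NOUN"

# pos_ only accepts labels from the UPOS tag set, such as PROPN.
doc[0].pos_ = "PROPN"

print(doc[0].text, doc[0].tag_, doc[0].pos_)
```

Assigning a label outside the UPOS set to pos_ is rejected, which is the restriction described above.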


AstralWatcher · 2020-05-08T18:32:40Z


Thanks for the quick reply,
I was using custom POS tags, which do not conform to the UPOS tag set, but I didn't know that was the problem. Now I know.
It worked when I changed it to a for loop:

pos = [doc.vocab.strings.add(label) for label in pos]
# token.tag takes the integer ID; token.tag_ would take the string label
for index, token in enumerate(doc):
    token.tag = pos[index]
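For reference, the whole wrapper with this fix applied might look roughly like the sketch below. The function my_tagger is a hypothetical stand-in for the custom POS library, and the labels X_CAP and X_LOW are invented. The sketch uses the string setter token.tag_ rather than the integer token.tag, which saves the explicit doc.vocab.strings.add call.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def my_tagger(words):
    # Hypothetical stand-in for the custom POS library: one label per word.
    return ["X_CAP" if w[:1].isupper() else "X_LOW" for w in words]


def custom_model_wrapper(doc):
    labels = my_tagger([token.text for token in doc])
    # Assigning a string to tag_ also adds the label to the string store.
    for token, label in zip(doc, labels):
        token.tag_ = label
    return doc  # the same Doc, modified in place


# Called directly on a hand-built Doc so the sketch needs no trained model.
doc = custom_model_wrapper(Doc(Vocab(), words=["Custom", "tags", "work"]))
print([(token.text, token.tag_) for token in doc])
```

In the spaCy 2.2.4 setup from the original post, the function would be registered with nlp.add_pipe as shown there instead of being called by hand.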
