Unable to retrieve "ent_type" for named entity when word is lowercase #2801
Description
Hello, I'm having an issue with spaCy. This is proving an issue when trying to work with a paragraph of text that includes brand names or the names of celebrities, for example, which are not identified as such because they are lowercase. It feels that a word like …
How to reproduce the behaviour
Output
Your Environment
Replies: 1 comment
I think what you're experiencing here comes down to the different predictions the model makes based on the capitalization of the word. The ent_type_ is the value of the token's label that's predicted by the named entity recognizer. So if a token is not part of a recognized entity, it also won't have an ent_type or ent_type_.

Entity types are not stored with the vocab, because they're context-dependent, so it's definitely possible that a word is part of the vocab but is not recognised as an entity in its current context. The process of recognising named entities is statistical, so the model has no deeper knowledge of what an "organization" is or how it's defined. It only predicts which label is most likely in this context and which span to apply it to, based on the data the model was trained on. Entities are also recognized in context, so the surrounding text matters.

It's also important to keep in mind that the pre-trained models distributed with the library are baseline models that were tuned for the best possible compromise of speed, size, and accuracy, and that make it easy to get started building your own systems. You almost always want to adjust the model to your specific domain if extracting named entities is important to you. You can find more details on this in the documentation on training and updating models.

So if you're dealing with a lot of lowercase text, you probably want to update the model with more lowercase texts that contain entities. This should be pretty easy to do, because you can create lots of training data programmatically: parse a lot of regular text and extract the entity spans, filter out the entity types you need, copy the data and lowercase it all. Then train the model with both types of examples to make it less sensitive to capitalisation.
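The data-creation recipe from the reply (parse regular text, extract the entity spans, keep the entity types you need, then copy and lowercase) could be sketched roughly as below. This is only an illustration, not the library's own tooling: the helper names `extract_examples` and `add_lowercased_copies` and the `(text, spans)` tuple format are hypothetical, and `nlp` is assumed to be an already loaded spaCy pipeline whose `doc.ents` spans expose `start_char`, `end_char` and `label_`.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Assumed, illustrative format: (text, [(start_char, end_char, label), ...])
Example = Tuple[str, List[Tuple[int, int, str]]]


def extract_examples(
    nlp: Callable, texts: Iterable[str], keep_labels: Set[str]
) -> List[Example]:
    """Run a loaded pipeline over texts and keep spans with the wanted labels."""
    examples = []
    for text in texts:
        doc = nlp(text)
        spans = [
            (ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents
            if ent.label_ in keep_labels
        ]
        if spans:  # skip texts without any entity of interest
            examples.append((text, spans))
    return examples


def add_lowercased_copies(examples: List[Example]) -> List[Example]:
    """Return the original examples plus a lowercased copy of each one."""
    augmented = list(examples)
    for text, spans in examples:
        lowered = text.lower()
        # Skip texts that are already lowercase, and texts whose length
        # changes when lowercased (a few Unicode characters do this),
        # because the character offsets would no longer line up.
        if lowered == text or len(lowered) != len(text):
            continue
        augmented.append((lowered, list(spans)))
    return augmented
```

The resulting list mixes cased and lowercased versions of the same sentences with identical character offsets. It would still need to be converted into whatever training format your spaCy version expects, and the automatically extracted spans inherit any mistakes the original model made, so spot-checking a sample is worthwhile.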