Unable to retrieve "ent_type" for named entity when word is lowercase #2801

mrjamesriley · 2018-09-25T18:42:53Z

Description

Hello, I'm having an issue with spaCy tokens missing the ent_type when a word is lowercase. In the example below, we can see how sony is a word within the NLP vocab, but we only get the ent_type returned when the word is fed to spaCy in title case.

This is proving to be an issue when working with a paragraph of text that includes brand names or the names of celebrities, for example, which are not identified as such because they are lowercase. It feels like a word like sony or eminem should either not be considered part of the vocab in its lowercase form or, if it is in the vocab, the lowercase form should return the correct ent_type.

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_lg')

word = 'sony'
word_capitalised = 'Sony'

print("Lower case in vocab: " + str((word in nlp.vocab)))
print("Title case in vocab: " + str((word_capitalised in nlp.vocab)))

print("Lower case ent type: " + nlp(word)[0].ent_type_)
print("Title case ent type: " + nlp(word_capitalised)[0].ent_type_)

Output

Lower case in vocab: True
Title case in vocab: True
Lower case ent type:
Title case ent type: ORG

Your Environment

Operating System: Mac OSX 10.13.6
Python Version Used: 3.6.5
spaCy Version Used: 2.0.12

ines · 2018-09-27T10:23:44Z

Maintainer

I think what you're experiencing here comes down to the different predictions the model makes based on the capitalization of the word. The ent_type_ is the value of the token's label that's predicted by the named entity recognizer. So if a token is not part of a recognized entity, it also won't have an ent_type or ent_type_.

Entity types are not stored with the vocab, because they're context-dependent – so it's definitely possible that a word is part of the vocab but in its current context, it's not recognised as an entity. The process of recognising named entities is statistical – so the model has no deeper knowledge of what an "organization" is or how it's defined. It only predicts what label is most likely in this context and which span to apply it to, based on the data the model was trained on. Entities are also recognized in context – so the surrounding text matters and nlp(word) usually won't produce good results, especially considering the model was trained on general news and web text.

It's also important to keep in mind that the pre-trained models distributed with the library are baseline models that were tuned for the best possible compromise of speed, size, and accuracy and make it easy to get started building your own systems. You almost always want to adjust the model to your specific domain if extracting named entities is important to you. You can find more details on this in the documentation on training and updating models.

So if you're dealing with a lot of lowercase text, you probably want to update the model to include more lowercase texts with entities. This should be pretty easy to do, because you can create lots of training data programmatically: parse a lot of regular text and extract the entity spans, filter out the entity types you need, copy the data and lowercase it all. Then train the model with both types of examples to make it less sensitive to capitalisation.
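
The data-generation steps described above could be sketched roughly as follows. This is only an illustration, not an official spaCy recipe: the helper names extract_examples and lowercase_copies are made up for this example, and the (text, {"entities": [(start, end, label)]}) tuples assume the simple training format used by spaCy 2.x. Lowercased copies are only kept when lowercasing leaves every character's position unchanged, so the character offsets remain valid.

```python
def extract_examples(nlp, texts, keep_labels):
    """Run an existing pipeline over regular text and collect entity spans.

    `nlp` is assumed to be a loaded spaCy pipeline with an entity recognizer;
    `keep_labels` is a set of entity labels to keep, e.g. {"ORG", "PERSON"}.
    (Hypothetical helper, named for this sketch only.)
    """
    examples = []
    for doc in nlp.pipe(texts):
        spans = [
            (ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents
            if ent.label_ in keep_labels
        ]
        if spans:
            examples.append((doc.text, {"entities": spans}))
    return examples


def lowercase_copies(examples):
    """Return the original examples plus a lowercased copy of each one.

    (Hypothetical helper, named for this sketch only.)
    """
    augmented = []
    for text, annotations in examples:
        augmented.append((text, annotations))
        lowered = text.lower()
        # Nothing to add if the text is already lowercase.
        if lowered == text:
            continue
        # Some characters grow when lowercased (e.g. "İ"), which would shift
        # the character offsets, so such texts are skipped rather than fixed.
        if not all(len(ch.lower()) == 1 for ch in text):
            continue
        augmented.append((lowered, {"entities": list(annotations["entities"])}))
    return augmented
```

For instance, passing [("Sony released a new console.", {"entities": [(0, 4, "ORG")]})] to lowercase_copies should give back that example followed by an all-lowercase copy carrying the same (0, 4, "ORG") span. The combined list could then be used to update the model as described in the training documentation.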
