Recognizing concatenated entities #13202
-
This looks like it's primarily a tokenization problem. If you have "EdinburghLondon" as one token, spaCy can only assign one entity label to this token. When training the NER model, it can also only know about one label on this token, so there's no technical way for the NER model to predict two labels or split the token.

I don't know how realistic these examples are vs. your real data, but you might be able to add a rule to the tokenizer to split e.g. in between (see spaCy/spacy/lang/fr/punctuation.py, lines 49 to 51 in 764be10). If the real data has much more difficult tokenization issues, you might consider preprocessing your texts outside of spaCy (whitespace or punctuation normalization, truecasing, etc.) or using an alternate tokenizer instead of trying to solve this with the rule-based tokenizer.

Once you fix the tokenization problem, you might also run into an annotation scheme problem, since an NER scheme with |
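As a rough illustration of the "preprocess outside of spaCy" option, here is a minimal sketch using only Python's standard `re` module. The function name `split_concatenated` and the rule itself (a lowercase ASCII letter directly followed by an uppercase one marks a missing space) are assumptions made for this example, not something spaCy provides. The rule will also split legitimate mixed-case words such as "iPhone" or "McDonald", and every inserted space shifts the character offsets, so character-based annotations would have to be created on (or remapped to) the preprocessed text.

```python
import re

# Assumed heuristic: a lowercase ASCII letter directly followed by an
# uppercase one marks a missing space. The pattern is zero-width
# (lookbehind + lookahead), so no characters are consumed or dropped.
MISSING_SPACE = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_concatenated(text: str) -> str:
    """Insert a space at every lowercase-to-uppercase boundary."""
    return MISSING_SPACE.sub(" ", text)


print(split_concatenated("Hello Mike, I am doing the EdinburghLondon trail, care to join ?"))
print(split_concatenated("Hello Mike, how is the weather inLondon ?"))
```

After this step, "Edinburgh" and "London" reach the tokenizer as separate words, so each one can carry its own entity label.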
-
Hello,
My team is using Prodigy for labelling, and we label data based on strict character offsets rather than whole tokens. Spelling mistakes sometimes appear in our data, most often as words concatenated together:
Example:
Hello Mike, how is the weather inLondon ?
or even more often:
Hello Mike, I am doing the EdinburghLondon trail, care to join ?
where we have two entities written together.
Is there a way to tell the NER to be stricter about those examples? I ran a test where I purposely concatenated words and trained an NER model to recognize them. My thinking was that the model would learn if I gave it enough examples, but unfortunately that was not the case: it always picks the first location. So for "EdinburghLondon", where we have LOC1 and LOC2, the NER outputs LOC1: "EdinburghLondon" and no LOC2.