Recognizing concatenated entities #13202

whranclast · 2023-12-18T12:12:52Z

whranclast
Dec 18, 2023

Hello,

My team is using prodigy for labelling, and we do label data based on strict characters rather whole tokens. Sometimes spelling mistakes appear in our data as most often words are concatenated together:

Example:

Hello Mike, how is the weather inLondon ?

or even more often:

Hello Mike, I am doing the EdinburghLondon trail, care to join ?

where we have two entities written together.

Is there a way to tell the NER to be stricter about those examples ? I've done a test where I've purposely concatenated words and train a NER model for recognition, my thought was that if I give the model enough examples it will learn, but unfortunately that was not the case, it always picks the first location so in the case of "EdinburghLondon" we have LOC1 and LOC2 but the NER will output LOC1: "EdinburghLondon" and no LOC2.

adrianeboyd · 2023-12-19T07:13:04Z

adrianeboyd
Dec 19, 2023

This looks like it's primarily a tokenization problem. If you have "EdinburghLondon" as one token, spacy can only assign one entity label to this token. When training the NER model, it can also only know about one label on this token, so there's no technical way for the NER model to predict two labels or split the token.

I don't know how realistic these examples are vs. your real data, but you might able to add a rule to the tokenizer to split e.g. in between [a-z][A-Z] in the middle of a token, which you could do with an infix rule with a zero-width infix. Here's an example for French that splits after the ' for [A-z]'[A-z]:

spaCy/spacy/lang/fr/punctuation.py

Lines 49 to 51 in 764be10

    
           _infixes = TOKENIZER_INFIXES + [ 
        
               r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION) 
        
           ]

If the real data has much more difficult tokenization issues, you might consider preprocessing your texts outside of spacy (whitespace or punctuation normalization, truecasing, etc.) or using an alternate tokenizer instead of trying to solve this with the rule-based tokenizer.

Once you fix the tokenization problem, you might also run into an annotation scheme problem, since an NER scheme with LOC1 and LOC2 might not be the best design for this task. The NER model is designed to be able to identify LOC vs. not LOC, but identifying the first or second occurrence in a sentence or identifying the role in the sentence (like DESTINATION) doesn't always work well. Typically you'd have the NER model find all LOC instances and another component could classify these based on their role in each particular sentence or text. But I've only seen the two examples above and maybe this scheme works fine in practice, it's just something to keep in mind as you go forward.

1 reply

whranclast Dec 21, 2023
Author

Thank you very much @adrianeboyd I believe that was sufficient enough. I had taken into consideration the two model approach, was just generally asking if with enough example and stricter labelling the model itself would recognize the two LOC.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recognizing concatenated entities #13202

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Recognizing concatenated entities #13202

whranclast Dec 18, 2023

Replies: 1 comment · 1 reply

adrianeboyd Dec 19, 2023

whranclast Dec 21, 2023 Author

whranclast
Dec 18, 2023

Replies: 1 comment 1 reply

adrianeboyd
Dec 19, 2023

whranclast Dec 21, 2023
Author