Tokenizing named entities as a single token #3259
Feature description
Hi there, I'm looking to tokenize documents so that named entities appear as single tokens for onward vector representation. For example:
doc = 'New York is a city in the United States of America'
would be tokenized as:
['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']
Do you have a way of doing this, please?
Steve
Replies: 1 comment
Named entities are Span objects, so you can iterate over doc.ents and merge each one into a single token. spaCy also ships with a handy component you can plug into your pipeline that takes care of this automatically:

import spacy
from spacy.pipeline import merge_entities
nlp = spacy.load("en_core_web_sm")  # or any other model
nlp.add_pipe(merge_entities)  # spaCy v2 style; in spaCy v3, use nlp.add_pipe("merge_entities")
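For the manual route (iterating over doc.ents and merging each span yourself), here is a minimal sketch using doc.retokenize(), which is available from spaCy v2.1 onwards. To keep it runnable without downloading a trained model, it assumes a blank English pipeline and sets the two entity spans by hand; with a trained pipeline, doc.ents would already be populated by the NER component and the hand-set step would not be needed.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
doc = nlp("New York is a city in the United States of America")

# Assumption for this sketch: the entity spans are set by hand so the
# example runs without a trained model. Token indices are end-exclusive.
doc.ents = [
    Span(doc, 0, 2, label="GPE"),   # "New York"
    Span(doc, 7, 11, label="GPE"),  # "United States of America"
]

# Merge every entity span into a single token. The merges are queued
# inside the context manager and applied together when it exits.
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

tokens = [token.text for token in doc]
print(tokens)
# → ['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']
```

Doing it by hand like this is mainly useful if you only want to merge some entities (for example, filtering on ent.label_ inside the loop); otherwise the built-in component is less code.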