Extracting and Identifying locations #13359

vrunm · 2024-02-29T08:39:29Z

vrunm
Feb 29, 2024

My goal is to be able to recognize (aka identify) and identify (aka name, retrieve an ID) locations from text using NLP. I'm using Spacy specifically.

There are about 1,000 possible locations, but the difficulty is that they are unlikely to be written in a fully qualified way, not to mention spelling mistakes and aliases. For example, the Mission neighborhood in San Francisco written in a fully-qualified way might be (1) Mission (2) City of San Francisco (3) San Francisco County (4) California (5) US. (The numbers are just to illustrate the separate pieces.) However, many people might write it as (1) Mission (2) City of San Francisco, or (1) Mission, or (1) Mission (2) City of San Francisco (4) California. (Not to mention that #1 might be called "Mission District", #2 might be called "San Francisco", #4 might be "CA", etc.)

So my goal is to be able have an ID for "Mission" and all other neighborhoods, and ID for California and some other states, etc. If the text is like Mission, San Francisco, CA then I get the Mission ID. If the text is like San Francisco, CA then I get the San Francisco ID.

It's also easy to create synthetic training data by creating aliases of the individual location pieces (e.g., (a) "City of San Francisco", (b) "San Francisco", (c) "San Francisco City") and permutations of the "name chain" (e.g, 1 + 2 + 3 + 4 + 5, 1 + 2, 1 + 2 + 5, etc) for each alias. Rough estimate is about 50 combinations of alias and name chain per location, or about O(50,000) total values.

So, extraction seems to be a good job for NER. The surrounding text usually has a bit of context (e.g., "Location: ...." or "Comes from ...".

However, I'm unsure about the ability to do identification. My understanding is that much of the NER identification (e.g., Spacy's EntityLinker, which I planned on using) relies on surrounding context. I expect that there will be very little surrounding context that would help disambiguate one of the O(1000) locations from others. I also understand that EntityLinker matches on the token is lookup and not statistical (in other words, the value is from disambiguating when you have multiple exact-string matches and not from disambiguating from multiple very-fuzzy matches).

The KnowledgeBase / LookupDB does have a mechanism for setting aliases, so I could add each permutation as an alias. But at that point I feel like I'm not getting any value out of the EntityLinker's statistical models.

If I have to create a gazetteer for the identification aspect, then maybe it makes sense to put all my effort into the gazetteer and skip the NER?

svlandeg · 2024-02-29T13:59:27Z

svlandeg
Feb 29, 2024
Maintainer

Hi!

in other words, the value [of spaCy's EntityLinker] is from disambiguating when you have multiple exact-string matches and not from disambiguating from multiple very-fuzzy matches

That's right, and while I was reading about your use-case I came to the same idea that perhaps the EL, as implemented in spaCy's core, is not exactly what you need. It sounds to me like you'll want to leverage more fuzzy-based matching while at the same time maximising the probability of certain terms occurring together, and exploiting the relationship between the different parts of your location, e.g. "state should be in country". These are constraints that you could enforce for this specific use-case, and perhaps then you'd really need to run some kind of multi-constraint optimization framework to find the most optimal and coherent interpretation for each specific location occurrence.

then maybe it makes sense to put all my effort into the gazetteer and skip the NER?

The way you've described the NER step, I do think that you've overcomplicated it. Like you say, there is minimal context like "Location: ...". But for an NER system, especially spaCy's transition-based model, there's not that whole lot of a difference between

Location: Mission, San Francisco

and

Location: San Francisco, CA

Whether a specific part is a street, a state or a country, feels more like a dictionary-based lookup to me, rather than a pure NER challenge. I think I would personally advice to just tag Mission, San Francisco and CA as LOC, then deal with the combination of various LOC entities in post-processing.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extracting and Identifying locations #13359

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Extracting and Identifying locations #13359

vrunm Feb 29, 2024

Replies: 1 comment

svlandeg Feb 29, 2024 Maintainer

vrunm
Feb 29, 2024

svlandeg
Feb 29, 2024
Maintainer