Matcher on lower lemma #5630

sachinruk · 2020-06-23T01:28:17Z


How to reproduce the behaviour

This is a crosspost from SO.

I want to match the phrase "Education program", allowing any number of words in between. Suppose I have the following texts:

text1 = "Education is a way to program life. This sentence has nothing to do with education"
text2 = 'This account was created by a prior staff member for our county Tobacco Education Program.'

And I set the pattern to be:

pattern = [{'LOWER': {'LEMMA': 'education'}}, {'IS_SENT_START': False, 'OP': '*'},{'LOWER': {'LEMMA': 'program'}}]

However, when I do the following I get an insane amount of matches for the above:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
matcher.add('edu', None, pattern)
doc = nlp(text1)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(f"String ID: {string_id}\nText: {span.text}")

Just wondering what I'm doing wrong here? I tried switching LOWER and LEMMA in pattern as well, without any luck.

Proposed Solution

As suggested in the SO answer, if I simply use

pattern = [{'LEMMA': 'education'}, {'IS_SENT_START': False, 'OP': '*'}, {'LEMMA': 'program'}]

it works for text1 but not for text2. The only way I can get text2 to match is to call text2.lower() first. However, most of my texts are longer than one sentence, so I'm worried that this workaround (which is OK for now) will affect performance.

Your Environment

Operating System: macOS
Python Version Used: 3.7
spaCy Version Used: 2.2.4

Answered by adrianeboyd


adrianeboyd · 2020-06-25T12:19:29Z


This is more of a usage question, so let's keep this on SO. My answer is copied below for reference:

{'LOWER': {'LEMMA': 'education'}} isn't a valid pattern. Unless you turn on validation (see below), the Matcher silently discards ill-formed attributes, so in effect this token pattern is treated like {}, which matches any token. That is why you get so many results.
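To see why an empty token pattern produces so many results, here is a minimal sketch. It assumes a blank English pipeline (tokenizer only, no model download) and a spaCy version that accepts the list form of Matcher.add (on older 2.x releases, matcher.add(key, None, pattern) as in the question would be the equivalent call). The key name "any_token" is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough here, since {} needs no tagger or lemmatizer.
nlp = spacy.blank("en")
doc = nlp("Education is a way to program life.")

matcher = Matcher(nlp.vocab)
# A single empty token pattern: no attributes are constrained.
matcher.add("any_token", [[{}]])

matches = matcher(doc)
# Every token on its own satisfies {}, so there is one match per token.
print(len(matches), len(doc))
```

With the wildcard token in the middle of the original pattern as well, every start and end position becomes a candidate, which is where the flood of matches comes from.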

You can use either

{'LOWER': 'education'}
{'LEMMA': 'education'}

but they can't be nested.
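As a rough illustration of the flat form, the sketch below uses LOWER on both ends so that "Education Program" in text2 matches regardless of case. It is only a sketch under some assumptions: it runs on a blank English pipeline, so the IS_SENT_START condition from the question is left out (a blank pipeline sets no sentence boundaries) and the wildcard can cross sentences; it also uses the list form of Matcher.add. The helper matched_spans is hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; LOWER does not need a trained model

text1 = "Education is a way to program life. This sentence has nothing to do with education"
text2 = "This account was created by a prior staff member for our county Tobacco Education Program."

# Flat attributes only: each token dict maps an attribute directly to a value.
# The sentence-start check from the question is deliberately omitted here.
pattern = [{"LOWER": "education"}, {"OP": "*"}, {"LOWER": "program"}]

matcher = Matcher(nlp.vocab)
matcher.add("edu", [pattern])


def matched_spans(text):
    """Hypothetical helper: return the text of every matched span."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


spans1 = matched_spans(text1)
spans2 = matched_spans(text2)
print(spans1)
print(spans2)
```

Note that LOWER compares the lowercased token text, not the lemma, so inflected forms such as "programs" would need their own handling.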

Use Matcher(nlp.vocab, validate=True) for more thorough validation when writing patterns. (It's off by default because it makes adding patterns a lot slower.)
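A small sketch of what validation catches, assuming a blank English pipeline and the list form of Matcher.add. The exact exception class may differ between versions; the code assumes it is a subclass of ValueError, and on spaCy 2.x validation may additionally require the jsonschema package to be installed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# The nested pattern from the question: LOWER is given a dict with an unsupported key.
nested = [{"LOWER": {"LEMMA": "education"}}]

strict_matcher = Matcher(nlp.vocab, validate=True)
try:
    strict_matcher.add("edu", [nested])
    rejected = False
except ValueError as err:
    # Assumption: the pattern validation error derives from ValueError.
    rejected = True
    print(f"Pattern rejected: {err}")

print(rejected)
```

Turning validation on while developing patterns, and off again once they are stable, avoids the slowdown mentioned above.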
