Matcher on lower lemma #5630
How to reproduce the behaviour
This is a crosspost from SO. I wish to match the phrase "Education program" but with any word in between. So suppose I have the following text:

And I set the pattern to be:
pattern = [{'LOWER': {'LEMMA': 'education'}}, {'IS_SENT_START': False, 'OP': '*'}, {'LOWER': {'LEMMA': 'program'}}]

However, when I do the following I get an insane amount of matches for the above:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
matcher.add('edu', None, pattern)
doc = nlp(text1)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(f"String ID: {string_id}\nText: {span.text}")

Just wondering what I'm doing wrong here? I tried switching LOWER and LEMMA in

Proposed Solution
So as suggested in the answer, if I simply do

Your Environment
Replies: 1 comment
This is more of a usage question, so let's keep this on SO. My answer is copied below for reference:
{'LOWER': {'LEMMA': 'education'}} isn't a valid pattern, and unless you turn on validation (see below), the Matcher silently discards ill-formed attributes, so in effect this pattern is treated like {}, which matches any token. That is why you get so many results.

You can use either attribute on its own, but they can't be nested.
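A flat pattern, with each attribute mapped directly to a value rather than to another attribute, might look like the sketch below. It makes a few assumptions that are not part of the original post: the sample sentence is made up, a blank English pipeline is used because LOWER needs only the tokenizer, and the IS_SENT_START constraint from the question is left out so the example does not depend on sentence segmentation.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here: LOWER only needs tokenization.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Flat attributes: every token description maps an attribute to a plain value.
pattern = [
    {"LOWER": "education"},  # "education" in any casing
    {"OP": "*"},             # zero or more tokens of any kind
    {"LOWER": "program"},    # "program" in any casing
]

# List-of-patterns form; the question uses the older add('edu', None, pattern) form.
matcher.add("edu", [pattern])

# Hypothetical sample text, not the text from the question.
doc = nlp("The Education and training program starts soon.")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

To match on lemmas instead, the same flat shape should work with "LEMMA" in place of "LOWER", provided the pipeline assigns lemmas (for example the en_core_web_lg pipeline loaded in the question).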
Use Matcher(nlp.vocab, validate=True) for more thorough validation when writing patterns. (It's off by default because it makes adding patterns a lot slower.)
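As a rough illustration of what validation buys you, the sketch below adds the nested pattern from the question to a validating Matcher and records whether it is rejected. The exact exception type and message may differ between spaCy versions, so the example catches a generic Exception; the blank pipeline and the variable names are assumptions made for the illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# validate=True checks each pattern against the pattern schema when it is added.
strict_matcher = Matcher(nlp.vocab, validate=True)

# The nested form from the question: an attribute mapped to another attribute.
nested = [{"LOWER": {"LEMMA": "education"}}]

try:
    strict_matcher.add("edu", [nested])
    rejected = False
except Exception as err:  # exact error class is version-dependent
    rejected = True
    print(f"Pattern rejected: {err}")

print(rejected)
```

With validation on, the mistake should surface when the pattern is added, instead of showing up later as a flood of unexpected matches.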