How to implement a custom component for multi-phrase matcher? #5010
Replies: 15 comments
-
Hi, you want to use your category as the
This is just to show the labels more clearly; for real use you'd have this in a loop, more like:
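A sketch of what such a loop might look like, assuming a hypothetical `term_dict` mapping each category label to its terms (using the `matcher.add(key, patterns)` signature of spaCy v2.2.2+/v3):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical dictionary mapping each category label to its terms
term_dict = {
    "ANIMAL": ["dog", "pig", "goat"],
    "PLANT": ["apple", "grass"],
}

# One matcher.add() call per label; the label doubles as the match key
for label, terms in term_dict.items():
    patterns = [nlp.make_doc(term) for term in terms]
    matcher.add(label, patterns)
```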
It could also make sense to use a callback to set the custom extension; here's a fairly similar example: https://spacy.io/usage/rule-based-matching#example3
-
Then, in the `__call__` method:
How do I get the feature type? I can't hard-code 'ANIMAL'. One way I can think of is to use each label as the match_id, create a match_id-to-label map, and then retrieve the label through that map in the `__call__` method. Is there a better way to do this?
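For what it's worth, a separate map shouldn't be needed: the match_id the matcher returns is the hash of the key passed to `matcher.add()`, so if the label itself is used as the key, `nlp.vocab.strings[match_id]` recovers it. A minimal sketch (v3-style `add` signature):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
# Use the label itself as the match key
matcher.add("ANIMAL", [nlp.make_doc("dog")])

doc = nlp("This is a dog")
for match_id, start, end in matcher(doc):
    # match_id is the hash of "ANIMAL"; look the string back up
    label = nlp.vocab.strings[match_id]
    print(label, doc[start:end].text)  # prints "ANIMAL dog"
```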
-
I think the emoji example resolved my further question above. Thanks.
-
Hi Adriane, I have never been super clear about how the vocab gets initialized. I thought it was initialized from a model, in which all the vocabulary of a language is stored and hashed so it can be looked up. But in this code:
Why does this line work? When and how was the string 'ORG' stored in the vocab so that it can be retrieved? I thought you would have to call nlp.vocab.strings.add(label) first before you could look it up. In my own implementation of the DictionaryFeatureComponent (I am not using the default English model), I have to add the labels to the string store first before I can get an integer back. How does this example work?
-
What is a bit confusing here is that
-
If nlp.vocab.strings[label] can always generate a hash value for a string, then why do we need to add it first?
Japanese:
-
This section of the docs might be helpful: https://spacy.io/usage/spacy-101#vocab
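The asymmetry that section describes can be seen directly: string-to-hash always works, because the hash is computed from the text itself, while hash-to-string only works once the string has been stored (a trained pipeline will typically already contain labels like 'ORG' because its components added them). A small sketch:

```python
import spacy

nlp = spacy.blank("en")

# String -> hash always succeeds: the hash is computed, not looked up
h = nlp.vocab.strings["MY_CUSTOM_LABEL"]  # hypothetical label

# Hash -> string requires the string to have been stored first
nlp.vocab.strings.add("MY_CUSTOM_LABEL")
assert nlp.vocab.strings[h] == "MY_CUSTOM_LABEL"
```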
-
That link helps. Still, a related issue with the token attribute extension is this code, which generates an error message:
Two lines are relevant here: I intended to create a set of user features for a token; for example, 'dog' is associated with a number of features. I want to use a set instead of a list to store the features because that allows for faster lookup. However, this gives an error message; if I change 'features' to a list, the issue is gone. The error is below:
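Without seeing the traceback it's hard to be sure, but one common reason a set fails where a list works is serialization: extension values live in the doc's user data, which spaCy serializes with msgpack (via the srsly library), and msgpack supports lists but has no set type. A sketch of the difference:

```python
import srsly  # spaCy's serialization helper library

# Lists survive a msgpack round-trip
data = {"features": ["animal", "four_legs"]}
assert srsly.msgpack_loads(srsly.msgpack_dumps(data)) == data

# Sets do not: msgpack cannot represent them
try:
    srsly.msgpack_dumps({"features": {"animal", "four_legs"}})
except TypeError:
    print("sets are not msgpack-serializable")
```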
-
Continuing with my need above, my custom component works:
To test it:
This will print out ['animal', 'four_legs'] for the token 'dog'. Now I want to make use of these features to write patterns, as simple as the following:
I supposed the callback would print out the word 'dog', since it has the feature 'animal'. However, it prints the following error:
The error occurs at this line, which comes from line 287 of matcher.pyx:
I printed these out and found that i, nr_extra_attr, and index are all integers, and value is a list (['animal']), which is my custom attribute extension. I don't understand why this gives an error.
-
However, I suspect a deeper problem is that features as a list or dict isn't going to be supported by the Matcher.
I think you'll need values that are boolean, integer, or string for this to be supported by the Matcher.
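Following that advice, one possible restructuring is a boolean flag per feature instead of a single list-valued attribute; extension attributes go under the `"_"` key in a token pattern. A sketch with hypothetical flag names:

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# One boolean flag per feature, rather than one list-valued extension
Token.set_extension("is_animal", default=False, force=True)

doc = nlp("This is a dog")
doc[3]._.is_animal = True  # normally set by the custom component

matcher = Matcher(nlp.vocab)
# Extension attributes go under the "_" key in a token pattern
matcher.add("ANIMAL_FEATURE", [[{"_": {"is_animal": True}}]])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints "dog"
```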
-
It would be very nice if the extension and the matcher could support something like this:
meaning that a token can have either the 'animal' or the 'four_legs' feature. For now, I can use multiple patterns with default=False for the extensions, and it works:
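The multiple-pattern workaround might look like the following sketch (hypothetical flag names; note that a token carrying both flags may be reported once per matching pattern):

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.blank("en")
# One boolean flag per feature, default False
Token.set_extension("animal", default=False, force=True)
Token.set_extension("four_legs", default=False, force=True)

doc = nlp("dog and table")
doc[0]._.animal = True
doc[0]._.four_legs = True

matcher = Matcher(nlp.vocab)
# Two patterns under the same key: a token matching either flag is reported
matcher.add("FEATURE", [
    [{"_": {"animal": True}}],
    [{"_": {"four_legs": True}}],
])

matches = matcher(doc)
```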
-
So far it has worked all right. However, I just found that, when testing on a file, span.merge() makes the pipeline very slow.
If I suspend this part:
the time drops from 2.55 s to 0.28 s, 10 times faster, which would be the normal speed. It feels like this part doesn't take advantage of the multiprocessing capabilities. I tested many times, and the bottleneck is this merge function. There are only 1 or 2 matches per doc, so the merge should be very light. I don't understand.
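One thing worth trying: since spaCy v2.1, doc.retokenize() is the recommended replacement for span.merge(), and it applies all merges in one batched pass instead of rewriting the doc's token array once per merge. A sketch:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is in the United States")

# Collect all spans first, then merge them in a single retokenize() block
spans = [doc[0:2], doc[5:7]]
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([t.text for t in doc])  # ["New York", "is", "in", "the", "United States"]
```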
-
Another thing I found is that extensions work differently with multi- and single-processing:
For the document extension, if 'spans' above is a string, it works for both; if it is a list, it doesn't work with multiprocessing in spaCy:
I also tested using: It works with single processing, but not with multiprocessing in batch mode.
-
The
If you want to have custom attributes that can be serialized for multiprocessing, I'd recommend using something like a list of tuples with the required
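A sketch of that suggestion: store plain (start, end, label) tuples on a Doc extension rather than Span objects, since plain ints and strings serialize cleanly for `nlp.pipe(..., n_process=...)` while spaCy container objects do not (the attribute name `spans_info` is hypothetical):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Plain ints and strings serialize for multiprocessing; Span objects do not
Doc.set_extension("spans_info", default=None, force=True)  # hypothetical name

doc = nlp("This is a dog")
span = doc[3:4]
# Store (start, end, label) instead of the Span itself
doc._.spans_info = [(span.start, span.end, "ANIMAL")]

# Rebuild the Span on the consumer side when needed
start, end, label = doc._.spans_info[0]
rebuilt = doc[start:end]
print(rebuilt.text)  # prints "dog"
```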
-
The IN predicate works, I found, though the syntax could be simplified. I will try the (start, end, label) tuple option.
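For reference, a sketch of the IN predicate applied to an extension attribute holding a single string value (the attribute name `feature` is hypothetical):

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.blank("en")
# A single string-valued feature per token
Token.set_extension("feature", default="", force=True)

doc = nlp("This is a dog")
doc[3]._.feature = "animal"  # normally set by the custom component

matcher = Matcher(nlp.vocab)
# IN matches if the attribute equals any value in the list
matcher.add("FEATURE", [[{"_": {"feature": {"IN": ["animal", "four_legs"]}}}]])

matches = matcher(doc)
```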
-
I have a large dictionary of words with their features (or entity types), like below:

```python
term_dict = {'ANIMAL': ['dog', 'pig', 'goat'],
             'PLANT': ['apple', 'grass'],
             'OBJECT': ['table', 'book', 'pole']}
```
Given a sentence like "This is a dog and that is a table", I want to assign 'ANIMAL' and 'OBJECT' to the tokens 'dog' and 'table' respectively, via a custom token attribute 'features'. In the documentation, there is an example that assigns a single entity type to a token with a custom pipeline component. In my case, there are three types of features (entities) to be recognized and assigned. How can I make this possible?
The above code works for a single entity type, i.e. 'ANIMAL', if the term_dict only stores animal terms. But how do I make it work for all three at the same time? The issue I am having is that, in the initializer, it seems each matcher can only work with one type of pattern. I store the 3 types in a dict, but then how do I make that type information available in the `__call__` method?
Ideally, I don't want to create a custom component for each entity type. That wouldn't be feasible for a large number of entity types.
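Pulling the replies above together, one sketch of a single component covering all three categories: one PhraseMatcher, each label used as the match key, and the label recovered via vocab.strings in `__call__` (spaCy v3 factory registration; component and attribute names are hypothetical):

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

Token.set_extension("features", default=None, force=True)

term_dict = {
    "ANIMAL": ["dog", "pig", "goat"],
    "PLANT": ["apple", "grass"],
    "OBJECT": ["table", "book", "pole"],
}

class DictionaryFeatureComponent:
    def __init__(self, nlp, term_dict):
        # One matcher for all categories; the label is the match key
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        for label, terms in term_dict.items():
            self.matcher.add(label, [nlp.make_doc(t) for t in terms])

    def __call__(self, doc):
        for match_id, start, end in self.matcher(doc):
            label = doc.vocab.strings[match_id]  # hash -> "ANIMAL" etc.
            for token in doc[start:end]:
                token._.features = label
        return doc

@Language.factory("dictionary_features")
def create_dictionary_features(nlp, name):
    return DictionaryFeatureComponent(nlp, term_dict)

nlp = spacy.blank("en")
nlp.add_pipe("dictionary_features")
doc = nlp("This is a dog and that is a table")
```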