How to implement a custom component for multi-phrase matcher? #5010
Replies: 15 comments
-
Hi, you want to use your category as the
This is just to show the labels more clearly; for real use you'd have this in a loop, more like:
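A sketch of what such a loop might look like, assuming a hypothetical `term_dict` mapping each category label to its terms (using the `matcher.add(key, patterns)` signature of spaCy v2.2.2+/v3):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical dictionary mapping each category label to its terms
term_dict = {
    "ANIMAL": ["dog", "pig", "goat"],
    "PLANT": ["apple", "grass"],
}

# One matcher.add() call per label; the label doubles as the match key
for label, terms in term_dict.items():
    patterns = [nlp.make_doc(term) for term in terms]
    matcher.add(label, patterns)
```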
It could also make sense to use a callback to set the custom extension; here's a fairly similar example: https://spacy.io/usage/rule-based-matching#example3
-
Then, in the `__call__` method:
How do I get the feature type? I can't hard-code 'ANIMAL'. One way I can think of is to use each label as the match_id, create a match_id-to-label map, and then retrieve the label through that map in the `__call__` method. Is there a better way to do this?
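For what it's worth, a separate map shouldn't be needed: the match_id the matcher returns is the hash of the key passed to `matcher.add()`, so if the label itself is used as the key, `nlp.vocab.strings[match_id]` recovers it. A minimal sketch (v3-style `add` signature):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
# Use the label itself as the match key
matcher.add("ANIMAL", [nlp.make_doc("dog")])

doc = nlp("This is a dog")
for match_id, start, end in matcher(doc):
    # match_id is the hash of "ANIMAL"; look the string back up
    label = nlp.vocab.strings[match_id]
    print(label, doc[start:end].text)  # prints "ANIMAL dog"
```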
-
I think the emoji example resolved my further question above. Thanks.
-
Hi Adriane, I have never been super clear about how the vocab gets initialized. I thought it was initialized from a model, in which all the vocabulary of a language is stored and hashed so it can be looked up. But in this code:
Why does this line work? When and how was the string 'ORG' stored in the vocab so that it can be retrieved? I thought you would have to call nlp.vocab.strings.add(label) first before you could look it up. In my own implementation of the DictionaryFeatureComponent (I am not using the default English model), I have to add the labels to the string store first before I can get an integer back. How does this example work?
-
What is a bit confusing here is that
-
If nlp.vocab.strings[label] can always generate a hash value for a string, then why do we need to add it first?
Japanese:
-
This section of the docs might be helpful: https://spacy.io/usage/spacy-101#vocab
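The asymmetry that section describes can be seen directly: string-to-hash always works, because the hash is computed from the text itself, while hash-to-string only works once the string has been stored (a trained pipeline will typically already contain labels like 'ORG' because its components added them). A small sketch:

```python
import spacy

nlp = spacy.blank("en")

# String -> hash always succeeds: the hash is computed, not looked up
h = nlp.vocab.strings["MY_CUSTOM_LABEL"]  # hypothetical label

# Hash -> string requires the string to have been stored first
nlp.vocab.strings.add("MY_CUSTOM_LABEL")
assert nlp.vocab.strings[h] == "MY_CUSTOM_LABEL"
```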
-
That link helps. Still, a related issue with the token attribute extension is this code, which generates an error message:
Two lines are relevant here: I intended to create a set of user features for a token; for example, 'dog' is associated with a number of features. I want to use a set instead of a list to store the features because that allows for faster lookup. However, this gives an error message; if I change 'features' to a list, the issue is gone. The error is below:
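Without seeing the traceback it's hard to be sure, but one common reason a set fails where a list works is serialization: extension values live in the doc's user data, which spaCy serializes with msgpack (via the srsly library), and msgpack supports lists but has no set type. A sketch of the difference:

```python
import srsly  # spaCy's serialization helper library

# Lists survive a msgpack round-trip
data = {"features": ["animal", "four_legs"]}
assert srsly.msgpack_loads(srsly.msgpack_dumps(data)) == data

# Sets do not: msgpack cannot represent them
try:
    srsly.msgpack_dumps({"features": {"animal", "four_legs"}})
except TypeError:
    print("sets are not msgpack-serializable")
```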
-
Continuing with my need above, my custom component works:
To test it:
This will print out ['animal', 'four_legs'] for the token 'dog'. Now I want to make use of these features to write patterns, as simple as the following:
I supposed the callback would print out the word 'dog', since it has the feature 'animal'. However, it prints the following error:
The error occurs at this line, which comes from line 287 of matcher.pyx:
I printed these out and found that i, nr_extra_attr, and index are all integers, and value is a list (['animal']), which is my custom attribute extension. I don't understand why this gives an error.
-
However, I suspect a deeper problem is that features as a list or dict isn't going to be supported by the Matcher.
I think you'll need values that are boolean, integer, or string for this to be supported by the Matcher.
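Following that advice, one possible restructuring is a boolean flag per feature instead of a single list-valued attribute; extension attributes go under the `"_"` key in a token pattern. A sketch with hypothetical flag names:

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# One boolean flag per feature, rather than one list-valued extension
Token.set_extension("is_animal", default=False, force=True)

doc = nlp("This is a dog")
doc[3]._.is_animal = True  # normally set by the custom component

matcher = Matcher(nlp.vocab)
# Extension attributes go under the "_" key in a token pattern
matcher.add("ANIMAL_FEATURE", [[{"_": {"is_animal": True}}]])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints "dog"
```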
-
It would be very nice if the extension and the matcher could support something like this:
meaning that a token can have either the 'animal' or the 'four_legs' feature. For now, I can use multiple patterns with default=False for the extensions, and it works:
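The multiple-pattern workaround might look like the following sketch (hypothetical flag names; note that a token carrying both flags may be reported once per matching pattern):

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.blank("en")
# One boolean flag per feature, default False
Token.set_extension("animal", default=False, force=True)
Token.set_extension("four_legs", default=False, force=True)

doc = nlp("dog and table")
doc[0]._.animal = True
doc[0]._.four_legs = True

matcher = Matcher(nlp.vocab)
# Two patterns under the same key: a token matching either flag is reported
matcher.add("FEATURE", [
    [{"_": {"animal": True}}],
    [{"_": {"four_legs": True}}],
])

matches = matcher(doc)
```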
-
So far it has worked all right. However, I just found that, when testing on a file, span.merge() makes the pipeline very slow.
If I suspend this part:
the time drops from 2.55 s to 0.28 s, 10 times faster, which would be the normal speed. It feels like this part doesn't take advantage of the multiprocessing capabilities. I tested many times, and the bottleneck is this merge function. There are only 1 or 2 matches per doc, so the merge should be very light. I don't understand.
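One thing worth trying: since spaCy v2.1, doc.retokenize() is the recommended replacement for span.merge(), and it applies all merges in one batched pass instead of rewriting the doc's token array once per merge. A sketch:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is in the United States")

# Collect all spans first, then merge them in a single retokenize() block
spans = [doc[0:2], doc[5:7]]
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([t.text for t in doc])  # ["New York", "is", "in", "the", "United States"]
```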
-
Another thing I found is that extensions work differently with multi- and single-processing:
For the document extension, if 'spans' above is a string, it works for both; if it is a list, it doesn't work with multiprocessing in spaCy:
I also tested using: It works with single processing, but not with multiprocessing in batch mode.
-
The
If you want to have custom attributes that can be serialized for multiprocessing, I'd recommend using something like a list of tuples with the required
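A sketch of that suggestion: store plain (start, end, label) tuples on a Doc extension rather than Span objects, since plain ints and strings serialize cleanly for `nlp.pipe(..., n_process=...)` while spaCy container objects do not (the attribute name `spans_info` is hypothetical):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Plain ints and strings serialize for multiprocessing; Span objects do not
Doc.set_extension("spans_info", default=None, force=True)  # hypothetical name

doc = nlp("This is a dog")
span = doc[3:4]
# Store (start, end, label) instead of the Span itself
doc._.spans_info = [(span.start, span.end, "ANIMAL")]

# Rebuild the Span on the consumer side when needed
start, end, label = doc._.spans_info[0]
rebuilt = doc[start:end]
print(rebuilt.text)  # prints "dog"
```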
-
The IN predicate works, I found, though the syntax could be simplified. I will try the (start, end, label) tuple option.
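For reference, a sketch of the IN predicate applied to an extension attribute holding a single string value (the attribute name `feature` is hypothetical):

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.blank("en")
# A single string-valued feature per token
Token.set_extension("feature", default="", force=True)

doc = nlp("This is a dog")
doc[3]._.feature = "animal"  # normally set by the custom component

matcher = Matcher(nlp.vocab)
# IN matches if the attribute equals any value in the list
matcher.add("FEATURE", [[{"_": {"feature": {"IN": ["animal", "four_legs"]}}}]])

matches = matcher(doc)
```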
-
I have a large dictionary of words with their features (or entity types), like below:

```python
term_dict = {'ANIMAL': ['dog', 'pig', 'goat'],
             'PLANT': ['apple', 'grass'],
             'OBJECT': ['table', 'book', 'pole']}
```
Given a sentence like "This is a dog and that is a table", I want to assign 'ANIMAL' and 'OBJECT' to the tokens 'dog' and 'table' respectively, via a custom token attribute 'features'. In the documentation, there is an example that assigns a single entity type to a token with a custom pipeline component. In my case, there are three types of features (entities) to be recognized and assigned. How can I make this possible?
The above code works for a single entity type, i.e. 'ANIMAL', if the term_dict only stores animal terms. But how do I make it work for all three at the same time? The issue I am having is that, in the initializer, it seems each matcher can only work with one type of pattern. I store the 3 types in a dict, but then how do I make that type information available in the `__call__` method?
Ideally, I don't want to create a custom component for each entity type. That wouldn't be feasible for a large number of entity types.
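Pulling the replies above together, one sketch of a single component covering all three categories: one PhraseMatcher, each label used as the match key, and the label recovered via vocab.strings in `__call__` (spaCy v3 factory registration; component and attribute names are hypothetical):

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Token

Token.set_extension("features", default=None, force=True)

term_dict = {
    "ANIMAL": ["dog", "pig", "goat"],
    "PLANT": ["apple", "grass"],
    "OBJECT": ["table", "book", "pole"],
}

class DictionaryFeatureComponent:
    def __init__(self, nlp, term_dict):
        # One matcher for all categories; the label is the match key
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
        for label, terms in term_dict.items():
            self.matcher.add(label, [nlp.make_doc(t) for t in terms])

    def __call__(self, doc):
        for match_id, start, end in self.matcher(doc):
            label = doc.vocab.strings[match_id]  # hash -> "ANIMAL" etc.
            for token in doc[start:end]:
                token._.features = label
        return doc

@Language.factory("dictionary_features")
def create_dictionary_features(nlp, name):
    return DictionaryFeatureComponent(nlp, term_dict)

nlp = spacy.blank("en")
nlp.add_pipe("dictionary_features")
doc = nlp("This is a dog and that is a table")
```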