Getting inflected forms using Spacy #3171
Replies: 11 comments
-
This would be useful, but we don't have a function for it currently. I'd love to see an extension published that did this.
-
Thank you for your reply. It would be nice if this option were available.
-
So, I actually generated a file that does something like this, based on the Automatically Generated Inflection Database (AGID):
{"be": {"VB": ["be"], "VBP": ["am", "are"], "VBN": ["been"], "VBD": ["was", "were"], "VBG": ["being"], "VBZ": ["is"]}}
The idea was to use the lemma + POS to get the right form. I can maybe put it online in a repository and you can use it?
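A minimal sketch of the lemma + tag lookup described above, assuming the generated file is loaded into a nested dict shaped like the sample entry. The names INFLECTIONS and get_forms are illustrative and not taken from the actual file; only the "be" entry comes from this thread.

```python
# Toy table in the shape shown above; only the "be" entry is from the thread.
INFLECTIONS = {
    "be": {
        "VB": ["be"], "VBP": ["am", "are"], "VBN": ["been"],
        "VBD": ["was", "were"], "VBG": ["being"], "VBZ": ["is"],
    }
}


def get_forms(lemma, tag):
    """Return every known form of `lemma` for a Penn Treebank tag, or [] if unknown."""
    return INFLECTIONS.get(lemma, {}).get(tag, [])


print(get_forms("be", "VBD"))  # ['was', 'were']
print(get_forms("be", "NNS"))  # []
```

Returning a list rather than a single string keeps alternatives such as "was"/"were" visible to the caller.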
-
@chrisjbryant This could be a nice use case for a plugin exposing a custom extension attribute or method. Here's a quick proof of concept:
import spacy
from spacy.tokens import Token

# Assumption: any English pipeline with a tagger and lemmatizer will do here
nlp = spacy.load("en_core_web_sm")

# YOUR_DICTIONARY is a placeholder for a {lemma: {tag: [forms]}} mapping
def inflect(token, pos):
    # This extension method will receive the token it's added on,
    # as well as any additional arguments it's passed
    forms = YOUR_DICTIONARY.get(token.lemma_, {})
    return forms.get(pos, [])

Token.set_extension('inflect', method=inflect)
doc = nlp("I am a cat")
doc[1]._.inflect("VBD")  # ["was", "were"]
doc[3]._.inflect("NNS")  # ["cats"]
This will add a method inflect, callable on any token as token._.inflect(pos).
-
Sounds good. I'm super busy right now but will get to it at some point and let you know!
-
I had a need for something similar, so I went ahead and created the extension. See pyInflect. While this generally works, there are a number of issues. I haven't created the pip installer yet. I thought I'd get your feedback first and spend some time seeing if I can fix a few of the outliers before posting it for general consumption.
There is a lexical resource from NIH which includes inflection rules for irregular verbs and generic rules for other words. It may be more accurate to use this as a resource and inflect words on the fly rather than use the AGID, which appears to have some issues.
I'm also thinking that it may be a good idea to overload the "getInflection" method so the user can supply a string with an enum for the type of inflection, rather than just the treebank tag. Your thoughts/feedback are appreciated.
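One way the enum idea could look, as a hedged sketch only: the Inflection enum, its member names, and to_tag are hypothetical and are not part of pyInflect. The tag values themselves are standard Penn Treebank tags.

```python
from enum import Enum


class Inflection(Enum):
    """Hypothetical readable names for a few Penn Treebank tags."""
    BASE = "VB"
    PAST = "VBD"
    GERUND = "VBG"
    PAST_PARTICIPLE = "VBN"
    THIRD_PERSON_SINGULAR = "VBZ"
    PLURAL = "NNS"
    COMPARATIVE = "JJR"
    SUPERLATIVE = "JJS"


def to_tag(kind):
    """Accept an Inflection member, its name as a string, or a raw treebank tag."""
    if isinstance(kind, Inflection):
        return kind.value
    try:
        return Inflection[kind.upper()].value
    except KeyError:
        # Not a known name: assume the caller already passed a treebank tag.
        return kind


print(to_tag("past"))             # VBD
print(to_tag(Inflection.PLURAL))  # NNS
print(to_tag("VBG"))              # VBG
```

A lookup function could call to_tag on its argument first, so existing callers that pass treebank tags keep working unchanged.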
-
Looks good! Some feedback:
1 - In this case, nothing is returned because the system only has the ability to convert lemmas into inflected forms and so can't handle alternatives. I tried
2 - I notice from the code you allow the "RP" tag for particles. This tag is for particles/prepositions in phrasal verbs (e.g. "at" in "look at"), amongst others, so I'm not sure you want it as a valid option here.
3 - I think you already noticed most of the problem cases, but here are some others I found too:
4 - Relating to the above, I think you also forgot to remove some of the comments from the AGID:
There aren't many of these, but some entries have curly brackets to differentiate between meanings.
5 - I see you also just took the first entry as the main form. It's true the first entry is normally the most likely, but I think it'd also be nice to have a method
In the version I was developing, I was also considering a function that returned all the known forms of a lemma as a dictionary: {lemma: {POS: form, ...}}
Hope these comments are helpful and thanks for doing this!
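A rough sketch of the two behaviours discussed here: returning only the first entry versus returning every known form of a lemma. The function names and the toy table are assumptions for illustration, not the extension's actual API or AGID data.

```python
# Toy data; "dreamed"/"dreamt" is only there to show alternative spellings.
FORMS = {
    "dream": {
        "VBD": ["dreamed", "dreamt"],
        "VBG": ["dreaming"],
        "VBZ": ["dreams"],
    },
}


def get_inflection(lemma, tag):
    """Return the first listed form for the tag, or None if there is none."""
    forms = FORMS.get(lemma, {}).get(tag)
    return forms[0] if forms else None


def get_all_inflections(lemma):
    """Return {lemma: {tag: [forms]}} with alternatives included."""
    by_tag = FORMS.get(lemma, {})
    return {lemma: {tag: list(forms) for tag, forms in by_tag.items()}}


print(get_inflection("dream", "VBD"))       # dreamed
print(get_all_inflections("dream")["dream"]["VBD"])  # ['dreamed', 'dreamt']
```

Keeping both functions lets callers who want a single answer stay simple while others can inspect every alternative.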
-
1 - From your reply, "the system only has the ability to convert lemmas into
For the case of
2 - Good catch on particles. Wikipedia even says... "Particles are never inflected."
4 - Thanks. I missed removing the explanation between the brackets.
5 - I agree it would be good to return multiple spellings/forms when available. This doesn't fit
-
Nice - definitely looking forward to the changes you mention in 5! Regarding 1, it was mainly a comment that the inflection in spaCy only works from the base form, so you can't uninflect/reinflect already inflected forms. It would be nice to be able to do:
But this would require a more complicated data structure to link between different forms. The main advantage would be that we'd be able to bypass the lemmatiser, but I similarly can't see an easy way to do it without enumerating all the possibilities in the file.
-
I'm curious if you have specific cases where the spaCy lemmatizer is creating issues. Creating a dictionary that maps every inflection back to its base form at load time is fairly trivial. Unfortunately, in the AGID there are about 2,300 inflections that map back to more than one base word, so it's not simple to use that mapping. For example the inflection
I think the above is an example where the AGID has invalid data since
If there are cases where the spaCy lemmatizer isn't working correctly, it might be possible to create some logic that would overcome the issue using this reverse mapping.
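To make the reverse mapping concrete, here is a small sketch on toy data. The table, the "leaves" example, and the names build_reverse and reinflect are assumptions for illustration; they are not AGID entries or the example elided above. It shows why an inflection that maps back to more than one base word cannot be re-inflected without extra information.

```python
from collections import defaultdict

# Toy forward table: lemma -> tag -> forms. Not AGID data.
FORWARD = {
    "leave": {"VBD": ["left"], "VBZ": ["leaves"]},
    "leaf": {"NNS": ["leaves"]},
    "cat": {"NNS": ["cats"]},
}


def build_reverse(forward):
    """Map every inflected form back to the set of lemmas that produce it."""
    reverse = defaultdict(set)
    for lemma, by_tag in forward.items():
        for forms in by_tag.values():
            for form in forms:
                reverse[form].add(lemma)
    return reverse


REVERSE = build_reverse(FORWARD)

# Forms that point at more than one lemma cannot be resolved by lookup alone.
ambiguous = {form: sorted(lemmas) for form, lemmas in REVERSE.items() if len(lemmas) > 1}


def reinflect(form, tag):
    """Re-inflect an inflected form, but only when its lemma is unambiguous."""
    lemmas = REVERSE.get(form, set())
    if len(lemmas) != 1:
        return []
    (lemma,) = lemmas
    return FORWARD[lemma].get(tag, [])


print(ambiguous)                  # {'leaves': ['leaf', 'leave']}
print(reinflect("left", "VBZ"))   # ['leaves']
print(reinflect("leaves", "VBD")) # []
```

In practice the source tag of the token (for example NNS versus VBZ) could be used to pick between candidate lemmas, at the cost of storing tags in the reverse index as well.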
-
I uploaded a new version with a fair number of changes, including adding two methods
-
Is there any way to get different forms of a word using spacy?
e.g.,
Input: VBZ form of the term are
Output: is
Input: NNS form of the term cat
Output: cats