Getting inflected forms using Spacy #3171

aafreenart · 2019-01-19T05:49:12Z

aafreenart
Jan 19, 2019

Is there any way to get different forms of a word using spacy?
e.g.,
Input: VBZ form of the term are
Output: is

Input: NNS form of the term cat
Output: cats

honnibal · 2019-01-21T12:37:15Z

honnibal
Jan 21, 2019
Maintainer

This would be useful, but we don't have a function for it currently. I'd love to see an extension published that did this.

aafreenart · 2019-01-21T12:43:28Z

aafreenart
Jan 21, 2019
Author

> This would be useful, but we don't have a function for it currently. I'd love to see an extension published that did this.

Thank you for your reply. It would be nice if this option were available.

chrisjbryant · 2019-02-19T14:13:12Z

chrisjbryant
Feb 19, 2019

So, I actually generated a file that does something like this based on the Automatically Generated Inflection Database.
Entries in the AGID are classified by POS and ordered by form, so I was able to produce something like the following (JSON format).

{"be": {"VB": ["be"], "VBP": ["am", "are"], "VBN": ["been"], "VBD": ["was", "were"], "VBG": ["being"], "VBZ": ["is"]}}
{"beach": {"VBN": ["beached"], "VB": ["beach"], "VBG": ["beaching"], "NN": ["beach"], "VBP": ["beach"], "VBZ": ["beaches"], "VBD": ["beached"], "NNS": ["beaches"]}}

The idea was to use the lemma + POS to get the right form. I can maybe put it online in a repository and you can use it?
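
A minimal sketch of that lemma + POS lookup, assuming the file stores one JSON object per line as in the two sample entries above. The helper names `load_forms` and `lookup` are hypothetical and only for illustration:

```python
import json

# The two sample entries from the comment above, one JSON object per line.
SAMPLE = """\
{"be": {"VB": ["be"], "VBP": ["am", "are"], "VBN": ["been"], "VBD": ["was", "were"], "VBG": ["being"], "VBZ": ["is"]}}
{"beach": {"VBN": ["beached"], "VB": ["beach"], "VBG": ["beaching"], "NN": ["beach"], "VBP": ["beach"], "VBZ": ["beaches"], "VBD": ["beached"], "NNS": ["beaches"]}}
"""


def load_forms(text):
    """Merge one-object-per-line JSON into a single {lemma: {tag: [forms]}} dict."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            table.update(json.loads(line))
    return table


def lookup(table, lemma, tag):
    """Return the forms listed for lemma + tag, or [] if either is unknown."""
    return table.get(lemma, {}).get(tag, [])


FORMS = load_forms(SAMPLE)
print(lookup(FORMS, "be", "VBZ"))     # ['is']
print(lookup(FORMS, "beach", "NNS"))  # ['beaches']
print(lookup(FORMS, "cat", "NNS"))    # [] (lemma not in the sample data)
```

Returning an empty list for unknown lemmas or tags keeps the caller from having to handle a `KeyError`.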

ines · 2019-02-19T14:56:28Z

ines
Feb 19, 2019
Maintainer

@chrisjbryant This could be a nice use case for a plugin exposing a custom extension attribute or method – maybe something like token._.inflect(pos)?

Here's a quick proof of concept:

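import spacy
from spacy.tokens import Token

# Assumes an English pipeline is installed; the model name is only an example
nlp = spacy.load("en_core_web_sm")

def inflect(token, pos):
    # This extension method will receive the token it's added on,
    # as well as any additional arguments it's passed.
    # YOUR_DICTIONARY is a placeholder for a {lemma: {tag: [forms]}} mapping
    forms = YOUR_DICTIONARY.get(token.lemma_, {})
    return forms.get(pos, [])

Token.set_extension('inflect', method=inflect)

doc = nlp("I am a cat")
doc[1]._.inflect("VBD")  # ["was", "were"]
doc[3]._.inflect("NNS")  # ["cats"]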

This will add a method ._.inflect on each token that takes a part-of-speech tag. When called, it looks up the token's lemma in the dictionary and returns entries for the given POS tag, if available.

chrisjbryant · 2019-02-22T00:31:51Z

chrisjbryant
Feb 22, 2019

Sounds good. I'm super busy right now but will get to it at some point and let you know!

bjascob · 2019-04-06T15:43:24Z

bjascob
Apr 6, 2019

I had need for something similar so I went ahead and created the extension. See pyInflect.

While this generally works, there are a number of issues.
1 - Ambiguous was/were and am/are
2 - Missing or incorrect AGID lemmas
3 - Inaccurate AGID forms of words
There are a few specifics listed in the KnownIssues.txt file. You can also run the script 12_RunCorpusAutoTest.py to have it scroll through a small corpus in NLTK and print out potential issues.

I haven't created the pip installer yet. I thought I'd get your feedback first and spend some time seeing if I can fix a few of the outliers before posting it for general consumption.

There is a lexical resource from NIH which includes inflection rules for irregular verbs and generic rules for other words. It may be more accurate to use this as a resource and inflect words on the fly rather than use the AGID which appears to have some issues.

I'm also thinking that it may be a good idea to overload the "getInflection" method so the user can supply a string with an enum for the type of inflection, rather than just the treebank tag.

Your thoughts and feedback are appreciated.

chrisjbryant · 2019-04-06T23:12:49Z

chrisjbryant
Apr 6, 2019

Looks good!

Some feedback:
1. I currently find the standalone case more useful than the spacy extension because it takes strings as input rather than spacy tokens. Consider the following:

> import pyinflect
> doc = nlp('These examples.')
> doc[1]._.inflect('NN')

In this case, nothing is returned because the system only has the ability to convert lemmas into inflected forms and so can't handle alternatives. I tried doc[1].lemma_._.inflect('NN'), but it didn't work unless I also converted the lemma to a spacy token which seems a bit unnecessary.

2. I notice from the code you allow the "RP" tag for particles. This tag is for particles/prepositions in phrasal verbs (e.g. "at" in "look at"), amongst others, so I'm not sure you want it as a valid option here.

3. I think you already noticed most of the problem cases, but here are some others I found too:
a. "only": only has the superlative form "onliest" but no comparative form.
b. "methinks": only has the form "methought" in the past (VBD).
c. "be" and "wit" are special as you already noticed.
d. Modal verbs are also exceptional:
"may": may, mayst, might
"shall": shall, shalt, should
"can": can, canst, could
"will": will, wilt, would, wouldst
"can" and "will" are also problematic as they have other non-modal senses so "can, canned, canning, cans" and "will, willed, willing, wills" are also acceptable.

4. Relating to the above, I think you also forgot to remove some of the comments from the AGID:

>>> from pyinflect import InflectionEngine
>>> InflectionEngine().getInflection("can", "VBZ")
'canbeable'

There aren't many of these, but some entries have curly brackets to differentiate between meanings.
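
One way to drop those annotations before storing a form is a small regex pass. This is only a sketch: the raw shape `"can {be able}"` is an assumption inferred from the `'canbeable'` output above, and `strip_annotation` is a hypothetical helper, not part of pyInflect:

```python
import re

# A brace-delimited annotation, plus any whitespace that precedes it.
_ANNOTATION = re.compile(r"\s*\{[^}]*\}")


def strip_annotation(raw_form):
    """Remove {...} sense annotations from a raw form string."""
    return _ANNOTATION.sub("", raw_form).strip()


print(strip_annotation("can {be able}"))  # can
print(strip_annotation("cans"))           # cans
```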

5. I see you also just took the first entry as the main form. It's true the first entry is normally the most likely, but I think it'd also be nice to have a method getAllInflections which returns the alternative forms too. For example:

>>> from pyinflect import InflectionEngine
>>> InflectionEngine().getAllInflections("formula", "NNS")
['formulas', 'formulae']

In the version I was developing, I was also considering a function that returned all the known forms of a lemma as a dictionary: {lemma: {POS: form, ...}}

Hope these comments are helpful and thanks for doing this!

bjascob · 2019-04-07T14:18:15Z

bjascob
Apr 7, 2019

1 - From your reply, "the system only has the ability to convert lemmas into
inflected forms and so can't handle alternatives", I'm not sure if you're suggesting
a change or just commenting on the state of things. If you have a change in mind, please explain.

For the case of doc[1].lemma_._.inflect('NN'), I don't see any easy change since lemma_
is a simple python str and won't have these methods without some significant changes.

2 - Good catch on particles. Wikipedia even says... "Particles are never inflected."

4 - Thanks. I missed removing the explanation between the brackets.

5 - I agree it would be good to return multiple spellings/forms when available. This doesn't fit
very well with the Spacy Token._.inflect('xx') but would work with a separate function.
I think I'll make the following...
* Make .inflect always return a single string (no list, even in the case of 'be')
* Make getInflections return a list of spellings for the given tag, even if only 1 spelling exists.
I'll also set tag=None and if the tag isn't supplied, the list of all forms/spellings is returned.
* I'll also remove the need for a call to "InflectionEngine()" and just put functions in the __init__.py
that wrap the class instance.
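
The plan above could be sketched roughly as follows. The table contents and the snake_case names `get_inflection` and `get_all_inflections` are illustrative assumptions, not the actual pyInflect API, and returning a dict when no tag is given is one possible reading of "all forms/spellings":

```python
# Toy {lemma: {tag: [spellings]}} table; a real one would be built from AGID.
FORMS = {
    "formula": {"NN": ["formula"], "NNS": ["formulas", "formulae"]},
    "be": {"VB": ["be"], "VBD": ["was", "were"], "VBZ": ["is"]},
}


def get_all_inflections(lemma, tag=None):
    """With a tag: list of spellings. Without: {tag: [spellings]} for the lemma."""
    by_tag = FORMS.get(lemma, {})
    if tag is None:
        return {t: list(spellings) for t, spellings in by_tag.items()}
    return list(by_tag.get(tag, []))


def get_inflection(lemma, tag):
    """Always a single string (the first listed spelling), or None if unknown."""
    spellings = get_all_inflections(lemma, tag)
    return spellings[0] if spellings else None


print(get_all_inflections("formula", "NNS"))  # ['formulas', 'formulae']
print(get_inflection("be", "VBD"))            # was
print(get_inflection("cat", "NNS"))           # None
```

Keeping the single-string function as a thin wrapper over the list-returning one means both share the same lookup and cannot disagree.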

chrisjbryant · 2019-04-07T17:10:33Z

chrisjbryant
Apr 7, 2019

Nice - definitely looking forward to the changes you mention in 5!

Regarding 1, it was mainly a comment that the inflection in spacy only works from the base form and you can't uninflect/reinflect already inflected forms. It would be nice to be able to do:

> import pyinflect
> doc = nlp('I am eating.')
> doc[2]._.inflect('VBN')
'eaten'

But this would require a more complicated data structure to link between different forms. The main advantage would be we'd be able to bypass the lemmatiser, but I similarly can't see an easy way to do it without enumerating all the possibilities in the file.
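
A rough sketch of what that linked structure might look like, assuming every form is enumerated in a `{lemma: {tag: [forms]}}` table. The table contents and the `reinflect` helper are hypothetical:

```python
from collections import defaultdict

# Toy forward table; a real one would enumerate every lemma in the data file.
FORMS = {
    "eat": {"VB": ["eat"], "VBP": ["eat"], "VBZ": ["eats"],
            "VBD": ["ate"], "VBG": ["eating"], "VBN": ["eaten"]},
}

# Reverse index: surface form -> set of lemmas that can produce it.
LEMMAS_OF = defaultdict(set)
for lemma, by_tag in FORMS.items():
    for forms in by_tag.values():
        for form in forms:
            LEMMAS_OF[form].add(lemma)


def reinflect(word, tag):
    """Go from any known form to the requested tag, without a lemmatiser."""
    results = []
    for lemma in sorted(LEMMAS_OF.get(word, ())):
        for form in FORMS[lemma].get(tag, []):
            if form not in results:
                results.append(form)
    return results


print(reinflect("eating", "VBN"))  # ['eaten']
print(reinflect("ate", "VBZ"))     # ['eats']
```

A list is returned because one surface form can belong to several lemmas, so the caller still has to choose among candidates.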

bjascob · 2019-04-07T18:20:18Z

bjascob
Apr 7, 2019

I'm curious if you have specific cases where the Spacy lemmatizer is creating issues.

Creating a dictionary mapping from every inflection back to its base form at load time is fairly trivial. Unfortunately, in the AGID there are about 2,300 inflections that map back to more than one base word, so it's not simple to use that mapping. For example, the inflection wises (as in he wises up, maybe):

wis,V,wissed,<>,wissing,wises
wise,V,wised,<>,wising,wises

I think the above is an example where the AGID has invalid data since wis is not a valid word that I'm aware of. Regardless, this is the data we have to work with unless I switch over to the NIH lexicon.
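
For illustration, here is a small sketch of building that reverse mapping and listing the ambiguous inflections. It assumes comma-separated rows shaped like the two quoted above and treats `<>` as a placeholder to skip; the `eat` row is made up for contrast and the function name is hypothetical:

```python
from collections import defaultdict

# The two rows quoted above, plus one unambiguous row added for contrast
# (the "eat" row is illustrative, not copied from the AGID).
ROWS = """\
wis,V,wissed,<>,wissing,wises
wise,V,wised,<>,wising,wises
eat,V,ate,eaten,eating,eats
"""


def build_reverse_index(text):
    """Map each inflected form to the set of base words that list it."""
    reverse = defaultdict(set)
    for row in text.splitlines():
        if not row.strip():
            continue
        base, _pos, *forms = row.split(",")
        for form in forms:
            if form != "<>":  # assumed placeholder, not a real form
                reverse[form].add(base)
    return reverse


REVERSE = build_reverse_index(ROWS)
AMBIGUOUS = {form: sorted(bases) for form, bases in REVERSE.items() if len(bases) > 1}
print(AMBIGUOUS)  # {'wises': ['wis', 'wise']}
```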

If there are cases where the Spacy lemmatizer isn't working correctly, it might be possible to create some logic that would overcome the issue using this reverse mapping.

bjascob · 2019-04-08T14:36:10Z

bjascob
Apr 8, 2019

I uploaded a new version with a fair number of changes, including adding two methods getInflection and getAllInflections that can be imported directly. I posted this version on PyPI so you can now do a standard pip install. If there are any comments on that code, consider posting them directly to that project's issues list and we can close this thread.
