Add lexical constraints for POS tags (like a tag dictionary) #331
Replies: 7 comments
-
Have to say I disagree with your assessment here. If I were annotating these examples manually, I would definitely tag them VB. I'm actually pleasantly surprised that spaCy is overruling the surface form and getting these examples right. The POS tag can be understood as a pre-terminal in a grammar. Would you rather have a production rule
If you want to force a tagging of these based on their surface form, you can. You can introduce rules into the specials.json file, or you can do a linear scan and replace the tag yourself. But getting the opposite policy after the fact would be difficult.
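A minimal sketch of the linear-scan option, in plain Python rather than against spaCy's API. The `(word, tag)` pairs, the `FORM_TAGS` dictionary, and the `force_surface_tags` helper are hypothetical names, and the tagger output is assumed from the examples in this thread:

```python
# Hypothetical tag dictionary: surface form -> the tag to force.
FORM_TAGS = {
    "kept": "VBD",
    "swimming": "VBG",
}


def force_surface_tags(tagged, form_tags=FORM_TAGS):
    """Scan (word, tag) pairs once and override the tag of any listed form."""
    return [(word, form_tags.get(word.lower(), tag)) for word, tag in tagged]


# Assumed tagger output for "He didn't kept my secret."
tagged = [
    ("He", "PRP"), ("did", "VBD"), ("n't", "RB"),
    ("kept", "VB"), ("my", "PRP$"), ("secret", "NN"), (".", "."),
]

for word, tag in force_surface_tags(tagged):
    print(word, tag)
```

Words missing from the dictionary keep whatever tag the model assigned, so the override only touches forms you have listed explicitly.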
-
That's a good point, but if I'm just searching for bare infinitive verbs, it's a bit messy to find other forms mixed in with them. I'm sure I can filter them out somehow, though. I can also understand why you'd want this behaviour. I doubt you designed spaCy with ungrammatical sentences in mind!
-
Well, there's actually been a fair bit of thought about how to handle ungrammatical input. What we want is for the larger grammatical structure to be correct, even if the surface forms are unexpected. Can you just use the lemmas? That seems like it will do what you want.
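One possible reading of the lemma suggestion, sketched in plain Python: keep a VB-tagged token only when its surface form equals its lemma. The `(text, lemma, tag)` triples and the `bare_infinitives` helper are hypothetical, and the sketch assumes lemmas are lower-cased base forms:

```python
def bare_infinitives(tokens):
    """Return VB-tagged words whose surface form matches their lemma."""
    return [
        text
        for text, lemma, tag in tokens
        if tag == "VB" and text.lower() == lemma
    ]


# Assumed (text, lemma, tag) triples, including the examples from this thread.
tokens = [
    ("kept", "keep", "VB"),      # inflected form tagged VB by context
    ("swimming", "swim", "VB"),  # inflected form tagged VB by context
    ("swim", "swim", "VB"),      # genuine bare infinitive
    ("went", "go", "VBD"),
]

print(bare_infinitives(tokens))
```

This leaves the tags themselves alone and filters at query time, so the contextual VB reading stays available for parsing.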
-
Yes, I'd agree with that, but currently spaCy's ability to overwrite POS tags is too powerful. Another, wackier example:
I've been discussing this with a colleague, and we think there should at least be some way to restrict the set of possible POS labels so that completely outlandish ones are excluded.
-
I think there was a bug here that has since been fixed; that example now gets tagged correctly. Relabelling this to
-
@honnibal Do you have a general idea of a possible solution in mind? As far as I understand tag dictionaries, they could be used to force a constant tag for each word form unconditionally, which is not optimal. Knowing that ‘nocturia’ is not an interjection is not enough. It would be best to influence the underlying statistical model, so that it abandons (or significantly downweights) this reading and considers its second-best prediction.
It would be best to let this happen automatically: define a list of tags considered “closed-class” and let the underlying statistical model use these as constraints, so that it is unable (or very unlikely) to output a particular label for a particular observation if that combination was not frequent enough in training (for instance, tagging random words as prepositions). I don't know whether the underlying neural network is able to do that. There was an impressive approach that put constraints inside a CRF to limit the possible tags to those attested by a morphological dictionary (https://cse.iitk.ac.in/users/cs671/2013/hw3/waszczuk-12coling_CRF-w-domainspecific-constraints-for-morpho-tagging.pdf), so I think this is possible in theory, although it could require too much work in practice.
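A rough sketch of the closed-class idea, assuming the tagger exposes per-tag scores for each word. It filters scores after the fact instead of constraining the model itself, so it is neither spaCy's behaviour nor the CRF method from the linked paper. `CLOSED_CLASS`, `ATTESTED`, and `constrained_tag` are hypothetical, and the choice of closed-class tags is purely illustrative:

```python
# Hypothetical closed-class tags and the word forms attested for each.
CLOSED_CLASS = {"IN", "UH", "DT"}
ATTESTED = {
    "IN": {"of", "in", "on"},
    "UH": {"oh", "hmm"},
    "DT": {"the", "a"},
}


def constrained_tag(word, scores, closed_class=CLOSED_CLASS, attested=ATTESTED):
    """Pick the best-scoring tag, skipping closed-class tags the word lacks."""
    allowed = {
        tag: score
        for tag, score in scores.items()
        if tag not in closed_class or word.lower() in attested.get(tag, set())
    }
    # Fall back to the unconstrained scores if every tag was ruled out.
    candidates = allowed or scores
    return max(candidates, key=candidates.get)


# Assumed scores in which the model prefers the interjection reading.
scores = {"UH": 0.6, "NN": 0.3, "JJ": 0.1}
print(constrained_tag("nocturia", scores))
```

With these assumed scores, the UH reading is dropped because ‘nocturia’ is not an attested interjection, and the second-best prediction is used instead.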
-
Hmm, the more I think about it, the fewer tags seem closed enough to me for this solution, and interjection is definitely not one :( Perhaps even such seemingly obvious candidates as personal pronouns should be considered partly open, because of netspeak/Twitter and frequent foreign inclusions.
-
Hi,
I've noticed quite a few cases where the PTB tag on a verb is misleading, albeit in ungrammatical sentences:
For example:
"He didn't kept my secret." - "kept" gets tagged VB.
"Actually I can't swimming at all." - "swimming" gets tagged VB.
Now, while it's true you would expect a VB in those contexts, it seems worrying that any form of a verb can potentially be tagged as the base VB form if that's what is licensed by context. Indeed, this undermines the whole concept of a base form. Instead, I would much rather "kept" be tagged VBD (or even VBN) and "swimming" be tagged VBG, even if these are unexpected in the context.
Perhaps you can restrict each form to a specific label somehow?