Exploiting available linguistic resources for the Italian language #3801
gtoffoli started this conversation in Language Support
I currently have no resources of my own to contribute to the spaCy project, but I think it could be useful to point out some language resources available for Italian. The most interesting one I am aware of is the free morphological lexicon morph-it, which was compiled in part from a large annotated corpus covering many years of the national daily newspaper Repubblica. I had the opportunity to use both the lexicon and the corpus a few years ago, while learning to train some NLTK POS taggers and chunkers.
morph-it contains about 500,000 word forms, annotated with POS tags and other features, while the current Italian lemmatizer in spaCy contains about 333,000 word forms.
A classic example of an ambiguous Italian sentence, containing several ambiguous words, is "La vecchia porta la sbarra" (The old woman carries the bar / The old door bars it). I therefore looked up the word form "porta" and found only one (improbable) entry in the spaCy lemmatizer map, versus 5 entries (4 lemmata) in morph-it. References:
https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it
https://github.com/giodegas/morphit-lemmatizer/tree/master/master
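To make the "porta" comparison concrete, here is a minimal sketch of how one might index a morph-it-style lexicon by word form and list the candidate lemmata for a form. It assumes a tab-separated layout of form, lemma and features per line; that layout, the function names `load_lexicon` and `lemmas_for`, and the four sample lines are my assumptions for illustration only. The sample lines are not copied from morph-it and do not reproduce its 5 entries for "porta".

```python
import io
from collections import defaultdict

# Illustrative sample only, NOT real morph-it data.
# Assumed layout per line: form<TAB>lemma<TAB>features
SAMPLE = (
    "porta\tporta\tNOUN-F:s\n"
    "porta\tportare\tVER:ind+pres+3+s\n"
    "sbarra\tsbarra\tNOUN-F:s\n"
    "sbarra\tsbarrare\tVER:ind+pres+3+s\n"
)


def load_lexicon(lines):
    """Build a dict mapping each lowercased form to its (lemma, features) pairs."""
    index = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        form, lemma, features = parts
        index[form.lower()].append((lemma, features))
    return index


def lemmas_for(index, form):
    """Return the sorted set of distinct lemmata recorded for a word form."""
    return sorted({lemma for lemma, _ in index.get(form.lower(), [])})


lexicon = load_lexicon(io.StringIO(SAMPLE))
for word in ("porta", "sbarra"):
    print(word, lemmas_for(lexicon, word))
```

With the real lexicon file, one would pass an open file handle (with the appropriate encoding) to `load_lexicon` instead of the in-memory sample. Keeping every (lemma, features) pair per form, rather than a single lemma, is what lets a tagger choose between the noun and verb readings in context.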
The corpus annotation is good, although not perfect; it was done partly by hand and partly automatically. Years ago the corpus was not open, but I obtained access without difficulty by explaining that I needed it to train some algorithms. Moreover, I think the corpus was annotated using an approach similar to the one used in the "WaCky - The Web-As-Corpus" multilingual project, which you probably already know; the products of that project are open. References:
https://wacky.sslmit.unibo.it/doku.php