Exploiting available linguistic resources for the Italian language #3801

gtoffoli · 2019-05-31T16:13:08Z

Currently I have no resources to contribute to the spaCy project, but I think it could be useful to point out some language resources available for Italian. The most interesting one I am aware of is the free morphological lexicon morph-it, which was compiled in part from a large annotated corpus covering many years of the national daily newspaper Repubblica. I used both the lexicon and the corpus a few years ago while learning to train some NLTK POS taggers and chunkers.

morph-it contains about 500,000 word forms, annotated with POS tags and other features, while the current Italian lemmatizer of spaCy contains about 333,000 word forms.
A classic example of an ambiguous Italian sentence, containing several ambiguous words, is "La vecchia porta la sbarra" (The old woman carries the bar / The old door bars it). I looked up the word form "porta" and found only one (improbable) entry in the spaCy lemmatizer map, versus 5 entries (4 lemmata) in morph-it. References:
https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it
https://github.com/giodegas/morphit-lemmatizer/tree/master/master
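The kind of lookup described above can be sketched in a few lines of Python. This is only a rough illustration, assuming the lexicon is a tab-separated text file with one reading per line in the order form, lemma, features. The sample rows, the feature tags, and the helper names `load_lexicon` and `lemmas_for` are made up for the example and are not taken from morph-it or from spaCy.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only, in the assumed form<TAB>lemma<TAB>features layout.
# They are NOT copied from the actual morph-it file.
SAMPLE = (
    "porta\tporta\tNOUN-F:s\n"
    "porta\tportare\tVER:ind+pres+3+s\n"
    "porta\tportare\tVER:impr+pres+2+s\n"
    "sbarra\tsbarra\tNOUN-F:s\n"
    "sbarra\tsbarrare\tVER:ind+pres+3+s\n"
)


def load_lexicon(lines):
    """Map each word form to a list of (lemma, features) readings."""
    lexicon = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        form, lemma, features = row
        lexicon[form].append((lemma, features))
    return lexicon


def lemmas_for(lexicon, form):
    """Return the distinct candidate lemmas for a word form, sorted."""
    return sorted({lemma for lemma, _ in lexicon.get(form, [])})


# With a real file, pass an open file handle instead of the StringIO sample.
lexicon = load_lexicon(io.StringIO(SAMPLE))
print(lemmas_for(lexicon, "porta"))
```

Keeping every reading per form, rather than a single form-to-lemma entry, is what lets an ambiguous form such as "porta" surface more than one candidate lemma.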

The corpus annotation is good, although not perfect; it was done partly manually and partly automatically. Years ago the corpus wasn't open, but I got access to it without difficulty by explaining that I needed it to train some algorithms. Moreover, I think the corpus was annotated using an approach similar to the one used in the multilingual "WaCky - The Web-As-Corpus" project, which you probably already know; the products of this project are open. References:
https://wacky.sslmit.unibo.it/doku.php

honnibal (Maintainer) · 2019-06-16T14:12:25Z

Thanks!