MITADS - new corpora to import #117

eziolotta · 2021-01-03T11:10:47Z

We could include new corpora, works and texts in the MITADS dataset to increase the size of its vocabulary.

As of 2021, works by authors who died in 1950 are out of copyright (70 years after the author's death).

We could collect the works of these Italian writers who died in 1950:

Giovanni Paneroni, Italian writer
Gaetano Pitta, Italian writer and journalist
Tullio Giordana, Italian writer, journalist and lawyer
Rafael Sabatini, Italian writer
Carlo Morandi, Italian historian and writer
Francesco Jovine, Italian writer, journalist and essayist
Cesare Pavese, Italian writer, poet and translator
Giovanni Bertinetti, Italian writer
Trilussa, Italian poet, writer and journalist
Umberto Notari, Italian journalist, writer and publisher
Gastone Razzaguta, Italian writer, painter and art critic

https://it.wikisource.org/wiki/Categoria:Morti_nel_1950

Mte90 · 2021-01-03T13:21:39Z

Thanks, I was thinking of opening a ticket for those :-D

I was also thinking of adding the president's new speech.

We need to check Trilussa: as a poet he wrote a lot in Roman dialect, which is not suitable for our needs.
In any case, we should look for content that reads like conversation or is written in the first person, so journalism is perfect.

For Wikisource, the file to add these works to is: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/MITADS/assets/wikisource_books.txt

The Italian Wikimedia community is discussing moving this new material onto the website, so we have to wait a bit.

Mte90 · 2021-01-03T13:43:08Z

I also checked Project Gutenberg; we can add these new books:

34983
49231
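
To check that these two IDs are usable before adding them, something like the following could fetch each book and cut away the Project Gutenberg header and license footer. This is only a rough sketch: the URL pattern, the marker regexes, and the helper names (`gutenberg_text_url`, `strip_gutenberg_boilerplate`, `fetch_book`) are assumptions made for illustration and are not taken from the existing MITADS importer.

```python
import re
import urllib.request

# Assumption: plain-text UTF-8 editions are served under this pattern.
GUTENBERG_TXT_URL = "https://www.gutenberg.org/ebooks/{book_id}.txt.utf-8"

# Assumption: the book body sits between these marker lines.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def gutenberg_text_url(book_id: int) -> str:
    """Build the (assumed) plain-text download URL for a Gutenberg book ID."""
    if book_id <= 0:
        raise ValueError("book_id must be a positive integer")
    return GUTENBERG_TXT_URL.format(book_id=book_id)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the START and END marker lines, if present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def fetch_book(book_id: int, timeout: float = 30.0) -> str:
    """Download one book and return its body without the Gutenberg wrapper."""
    with urllib.request.urlopen(gutenberg_text_url(book_id), timeout=timeout) as resp:
        raw = resp.read().decode("utf-8", errors="replace")
    return strip_gutenberg_boilerplate(raw)
```

Calling `fetch_book(34983)` and `fetch_book(49231)` would then give the raw text to inspect by hand before deciding whether each book fits the dataset.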

eziolotta · 2021-01-03T20:13:50Z

In 2020, Facebook released the cleaned Common Crawl data that they used to train the XLM-R model:
http://data.statmt.org/cc-100/

Italian dataset - 7.8G
http://data.statmt.org/cc-100/it.txt.xz
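
Since the archive is 7.8G compressed, it is probably better to stream it than to unpack it first. A minimal sketch, assuming the file has already been downloaded locally and that it is plain UTF-8 text with one paragraph per line and blank lines between documents; the function name `iter_lines` and the `min_chars` / `max_lines` parameters are hypothetical, not part of MITADS:

```python
import lzma
from typing import Iterator, Optional


def iter_lines(
    path: str, min_chars: int = 20, max_lines: Optional[int] = None
) -> Iterator[str]:
    """Stream non-trivial lines from an .xz text file without unpacking it to disk."""
    yielded = 0
    # lzma.open in text mode decompresses incrementally while iterating.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if max_lines is not None and yielded >= max_lines:
                return
            line = line.strip()
            # Skip blank separator lines and very short fragments.
            if len(line) < min_chars:
                continue
            yield line
            yielded += 1
```

For a first look, `for line in iter_lines("it.txt.xz", max_lines=1000): print(line)` would be enough to judge how clean the Italian text is before running the usual MITADS cleaning on it.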

Mte90 added the dataset, enhancement, and good first issue labels on Jan 3, 2021
