This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

MITADS - new corpora to import #117

Open
eziolotta opened this issue Jan 3, 2021 · 3 comments
Labels
dataset, enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@eziolotta
Contributor

eziolotta commented Jan 3, 2021

We could include new corpora, works and texts in the MITADS dataset to increase the size of its vocabulary.

In 2021, all works written by people who died in 1950 are released from copyright (protection lasts 70 years after the author's death).
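
As a quick sanity check on that date, here is a minimal sketch of the arithmetic. It assumes a plain life-plus-70-years term with no special extensions; `public_domain_year` is a hypothetical helper name, not something from the MITADS code.

```python
def public_domain_year(death_year: int, term_years: int = 70) -> int:
    """Return the first calendar year in which an author's works are free.

    Assumption: protection runs until the end of the term_years-th year
    after the author's death, so the works become free on 1 January of
    the year after that.
    """
    return death_year + term_years + 1


if __name__ == "__main__":
    print(public_domain_year(1950))
```

Under that assumption, authors who died in 1950 become usable in 2021, and the same check can be rerun each January for the next cohort.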

We could collect the works of these Italian writers who died in 1950:

  • Giovanni Paneroni, Italian writer
  • Gaetano Pitta, Italian writer and journalist
  • Tullio Giordana, Italian writer, journalist and lawyer
  • Rafael Sabatini, Italian writer
  • Carlo Morandi, Italian historian and writer
  • Francesco Jovine, Italian writer, journalist and essayist
  • Cesare Pavese, Italian writer, poet and translator
  • Giovanni Bertinetti, Italian writer
  • Trilussa, Italian poet, writer and journalist
  • Umberto Notari, Italian journalist, writer and publisher
  • Gastone Razzaguta, Italian writer, painter and art critic

https://it.wikisource.org/wiki/Categoria:Morti_nel_1950
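
To see what that category actually contains, something like the sketch below could list its members through the standard MediaWiki API (`action=query` with `list=categorymembers`). The helper names and the User-Agent string are made up for illustration, and the category may well list author pages rather than individual works, so the output would still need manual review.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Optional

API_URL = "https://it.wikisource.org/w/api.php"


def category_query_url(category: str, cont: Optional[str] = None) -> str:
    """Build a MediaWiki API URL that lists the members of a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    if cont:
        # Continuation token returned by the previous page of results.
        params["cmcontinue"] = cont
    return API_URL + "?" + urllib.parse.urlencode(params)


def member_titles(response: dict) -> List[str]:
    """Extract page titles from a decoded categorymembers response."""
    members = response.get("query", {}).get("categorymembers", [])
    return [m["title"] for m in members]


def fetch_category(category: str) -> List[str]:
    """Fetch all member titles, following continuation tokens (needs network)."""
    titles: List[str] = []
    cont = None
    while True:
        request = urllib.request.Request(
            category_query_url(category, cont),
            # Hypothetical User-Agent; use one that identifies your script.
            headers={"User-Agent": "mitads-corpus-check/0.1 (example)"},
        )
        with urllib.request.urlopen(request) as resp:
            data = json.load(resp)
        titles.extend(member_titles(data))
        cont = data.get("continue", {}).get("cmcontinue")
        if not cont:
            return titles
```

Usage would be `fetch_category("Categoria:Morti_nel_1950")`; the returned titles are only candidates to check by hand.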

@Mte90 added the dataset, enhancement and good first issue labels on Jan 3, 2021
@Mte90
Member

Mte90 commented Jan 3, 2021

Thanks, I was thinking of opening a ticket for those :-D

I was also thinking of adding the president's new speech.

About Trilussa, we need to check: as a poet he wrote a lot in Roman dialect, which is not suitable for our needs.
Anyway, we should look for content that reads like conversation or is written in the first person, so journalistic writing is perfect.

For Wikisource, the file to add these to is: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/MITADS/assets/wikisource_books.txt

The Italian Wikimedia community is discussing moving this new material onto the website, so we have to wait a bit.

@Mte90
Member

Mte90 commented Jan 3, 2021

I also checked Project Gutenberg; we can add these new books:

  • 34983
  • 49231
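
For anyone picking this up, here is a rough sketch of how those IDs could be turned into download URLs and cleaned text. The URL pattern and the `*** START OF ...` / `*** END OF ...` marker lines are assumptions based on how Project Gutenberg plain-text files usually look, and the function names are hypothetical; the existing MITADS importer may already handle this differently.

```python
import re

# Assumed marker lines wrapping the book body in Gutenberg plain-text files.
_START = re.compile(r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)
_END = re.compile(r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)


def gutenberg_text_url(book_id: int) -> str:
    """Build the (assumed) UTF-8 plain-text URL for a Gutenberg ebook ID."""
    return f"https://www.gutenberg.org/ebooks/{book_id}.txt.utf-8"


def strip_boilerplate(text: str) -> str:
    """Keep only the text between the START and END marker lines.

    If a marker is missing, fall back to the start or end of the file.
    """
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


if __name__ == "__main__":
    for book_id in (34983, 49231):
        print(gutenberg_text_url(book_id))
```

Stripping the license header and footer matters here because they are in English and would otherwise leak into the Italian corpus.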

@eziolotta
Contributor Author

In 2020 Facebook released the cleaned Common Crawl data that they used to train the XLM-R model:
http://data.statmt.org/cc-100/

Italian dataset - 7.8G
http://data.statmt.org/cc-100/it.txt.xz
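
Since the archive is large, it is probably best read as a stream rather than decompressed to disk first. A minimal sketch with Python's standard `lzma` module follows; it assumes the file layout is one paragraph per line with a blank line between documents (worth verifying against the CC-100 page), and `iter_documents` / `open_corpus` are hypothetical names.

```python
import io
import lzma
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into documents, treating blank lines as separators.

    Assumption: one paragraph per line, documents separated by an empty line.
    """
    doc: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            doc.append(line)
        elif doc:
            yield doc
            doc = []
    if doc:
        # Last document when the file does not end with a blank line.
        yield doc


def open_corpus(path: str):
    """Open an .xz corpus as a text stream without decompressing it to disk."""
    return lzma.open(path, mode="rt", encoding="utf-8")


if __name__ == "__main__":
    # Tiny in-memory stand-in for it.txt.xz, just to show the flow.
    raw = lzma.compress("Prima riga.\nSeconda riga.\n\nAltro documento.\n".encode("utf-8"))
    with lzma.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as stream:
        for document in iter_documents(stream):
            print(document)
```

With the real file, `open_corpus("it.txt.xz")` would replace the in-memory stream. Web-crawled text is much noisier than books, so it would still need the same filtering as the other MITADS sources.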
