This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

LIST OF AUDIO+TEXT DATASETS #114

Open
nefastosaturo opened this issue Dec 17, 2020 · 10 comments

@nefastosaturo
Collaborator

nefastosaturo commented Dec 17, 2020

LIST OF ALL ITALIAN DATASETS FOUND

From issue #90 I'm collecting here all the datasets that have been discovered.
Some of them are plug-and-play for DeepSpeech; others need to be built from scratch (splitting the audio up by sentences).

Feel free to pick up one that has not been done yet and check it out.

NOTE

If one of these datasets needs a deeper analysis, please do not start a discussion here; open a new issue instead and I will update this table with the issue reference.

DATASETS

| dataset | hrs | TODOs | note |
| --- | --- | --- | --- |
| MLS | 279.43h | | HOT!!!! |
| VoxForge #111 | 20h | replace the URL in the DeepSpeech import_voxforge.py script; fix the import sys error | |
| MAILABS | 127h40m | | |
| Evalita2009 | 5h | | |
| MSPKA | 3h | | |
| SIWIS | 4.5h | | |
| SUGAR | 1.5h | | sentences are not useful |
| VociParlateWikipedia #34 | ? | sync audio with its page revision | |
| EMOVO | ~12m | align filename codes with their sentences | interesting for emotions (disgust, happy..) |
| ZIta | <1hr | | transcriptions do not follow recordings (e.g. Lett_Z_Sp1_zero.wav) |
| LIM_Veneti | <1hr | | no audio files? |
| split-MDb | ~46m | parse & clean the .wrd files | based on CLIPS |
| tg60 | 1h30m | split the long audio files | maybe among the info files there are some timings that could be useful for splitting up? |
| PraTiD | 1h12m | split the long audio files | from CLIPS; maybe among the info files there are some timings that could be useful for splitting up? |
| ParlatoCinematografico | ? | split the long audio files | .lab files with speaker timings |
| PerugiaCorpusPEC | ? | | a login is needed. License? |
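Several rows above share the same TODO: long recordings need to be split into sentence-sized clips. Where no timing files exist, a simple energy-based silence detector can at least produce candidate cut points. A minimal sketch in plain Python on raw sample values (the frame size, threshold, and silence length are made-up defaults that would need tuning per corpus):

```python
def split_on_silence(samples, rate, frame_ms=30, threshold=0.01, min_silence_frames=10):
    """Return (start_sample, end_sample) pairs for voiced regions.

    samples: mono audio as floats in [-1, 1]; rate: sample rate in Hz.
    A frame is 'voiced' when its RMS energy exceeds `threshold`; a run of
    `min_silence_frames` quiet frames closes the current segment.
    """
    n = max(1, int(rate * frame_ms / 1000))          # samples per frame
    frames = [samples[i:i + n] for i in range(0, len(samples), n)]
    voiced = [(sum(x * x for x in f) / len(f)) ** 0.5 > threshold for f in frames]

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i                            # open a new segment
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:        # enough silence: close it
                segments.append((start * n, (i - silence + 1) * n))
                start, silence = None, 0
    if start is not None:                            # audio ended while voiced
        segments.append((start * n, len(samples)))
    return segments

# Synthetic demo: 0.5 s of 'speech', 0.5 s of silence, 0.5 s of 'speech' at 16 kHz.
demo = [0.1] * 8000 + [0.0] * 8000 + [0.1] * 8000
print(split_on_silence(demo, 16000))  # two segments
```

Whether plain energy thresholding is good enough varies per dataset; noisy corpora would need a proper VAD instead.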
@nefastosaturo added the dataset and help wanted (Extra attention is needed) labels on Dec 17, 2020
@nshmyrev

nshmyrev commented Dec 17, 2020

MLS from Facebook includes Italian:

http://openslr.org/94/

279.43 hours

@nefastosaturo
Collaborator Author

@nshmyrev WOW, thank you for this Christmas present!!

@eziolotta
Contributor

eziolotta commented Dec 19, 2020

I wrote a script to import the MLS files into the MITADS-Speech datasets.

https://dl.fbaipublicfiles.com/mls/mls_italian.tar.gz (14.3G zip)

It converts the .flac audio files to 16 kHz WAV and runs some checks.

In sample tests, the audio is of good quality and the transcripts are clean.
All clips are between 10 and 20 seconds long (as specified in the paper).

If my script works correctly, all clips <= 15 seconds (and successfully resampled)
total 159.23h.

The textual corpus the speech dataset is based on includes old works, for example:
the works of Giovanni Francesco Straparola (1400),
the Divina Commedia (and other works) by Dante Alighieri (1300),
the works of Luigi Pirandello (1900).

In some sentences we find obsolete forms and terms that a person using speech technologies today is unlikely to pronounce.
If we need to filter clips by author/work, this information is present in the flac audio files.
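For reference, the filtering pass described above can be sketched as follows. The function names and the (path, duration) bookkeeping are illustrative, not the actual MITADS-Speech import script; the ffmpeg flags are the standard ones for 16 kHz mono output:

```python
def resample_cmd(flac_path, wav_path, rate=16000):
    """Build an ffmpeg command converting one .flac clip to 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-i", flac_path, "-ar", str(rate), "-ac", "1", wav_path]

def filter_by_duration(clips, max_seconds=15.0):
    """clips: iterable of (path, duration_in_seconds) pairs.

    Keep clips no longer than max_seconds and report their total in hours.
    """
    kept = [(path, dur) for path, dur in clips if dur <= max_seconds]
    total_hours = sum(dur for _, dur in kept) / 3600.0
    return kept, total_hours
```

Running each `resample_cmd(...)` through `subprocess.run` for the kept clips would reproduce the conversion pass.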

@Mte90
Member

Mte90 commented Dec 19, 2020

I think we should avoid these kinds of ancient works (except Pirandello), as we did in MITADS itself, for example.

@eziolotta eziolotta mentioned this issue Jan 11, 2021
@eziolotta
Contributor

eziolotta commented Jan 17, 2021

https://arxiv.org/pdf/2101.00390.pdf

VoxPopuli: the largest open unlabelled speech dataset, totalling 100K hours in 23 languages from the European Parliament.
It also contains 1.8K hours of transcribed speech in 16 languages.

They will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license.

@eziolotta
Contributor

eziolotta commented Jan 17, 2021

Europarl-ST, a multilingual corpus for the speech translation of parliamentary debates.

64.18h of Italian audio clips in total.
From approximate calculations, about 30% of these (20h?) have an Italian transcript.
Most clips are between 1 and 2 minutes long.

https://arxiv.org/pdf/1911.03167.pdf

https://www.mllp.upv.es/europarl-st/v1.1.tar.gz (20G)

@eziolotta
Contributor

eziolotta commented Feb 27, 2021

New speech dataset: Multilingual TEDx, on OpenSLR
https://www.openslr.org/100/

The corpus comprises audio recordings and transcripts from TEDx talks in 8 languages.
Italian: about 123 hours

Licence: Creative Commons Attribution-NonCommercial-NoDerivs 4.0
https://www.ted.com/about/our-organization/our-policies-terms/ted-talks-usage-policy

UPDATE:
Audio clips are long, ranging from 4-5 minutes up to 25 minutes.
This does not make them directly usable for training with DeepSpeech.

@eziolotta
Contributor

eziolotta commented Feb 27, 2021

For the Multilingual TEDx dataset:
the segments.txt file lists the audio segments of each clip; they are text alignments with audio timestamps.
Using this file, the audio clips could be segmented easily, but I think we cannot redistribute the dataset due to the license.

Unfortunately DeepSpeech doesn't have a powerful toolkit to split audio with an aligner or VAD, like Kaldi and others do.
Could we try this: https://espnet.github.io/espnet/apis/espnet_bin.html#asr-align-py ?
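Assuming segments.txt follows the usual Kaldi-style layout (`<utt-id> <recording-id> <start> <end>`, times in seconds — an assumption to verify against the actual mTEDx files), the cutting step could be sketched as:

```python
def parse_segments(text):
    """Parse Kaldi-style segment lines into (utt_id, rec_id, start, end) tuples."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt, rec, start, end = parts
        segments.append((utt, rec, float(start), float(end)))
    return segments

def cut_cmd(src_wav, dst_wav, start, end):
    """ffmpeg command extracting one segment as 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-i", src_wav,
            "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-ar", "16000", "-ac", "1", dst_wav]
```

One `cut_cmd` invocation per parsed segment, run through `subprocess.run`, would produce the sentence-sized clips (for local use only, given the NoDerivs license).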

@DanBmh

DanBmh commented Apr 6, 2021

You can import the mTEDx dataset with corcua:
https://gitlab.com/Jaco-Assistant/corcua

Splitting the files will take some time (~2 days), but if you extend the script with parallelization, it should run much faster :)
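One way that parallelization could look, assuming the per-file work shells out to an external tool like ffmpeg (so threads are enough; `split_file` here is a hypothetical stand-in for corcua's real per-file step):

```python
from concurrent.futures import ThreadPoolExecutor

def split_file(path):
    # Hypothetical stand-in for the real per-file splitting step
    # (e.g. launching an ffmpeg subprocess per segment).
    return (path, "done")

def split_all(paths, workers=8):
    # Threads suffice when the heavy lifting runs in external subprocesses;
    # results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(split_file, paths))
```

If the per-file work were CPU-bound pure Python instead, a `multiprocessing.Pool` would be the better fit.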

@nefastosaturo
Collaborator Author

> You can import the mTEDx dataset with corcua:
> https://gitlab.com/Jaco-Assistant/corcua
>
> Splitting the files will take some time (~2 days), but if you extend the script with parallelization, it should run much faster :)

Dear DanBmh, thank you so much! I will update the table above.
