LIST OF AUDIO+TEXT DATASETS #114
MLS from Facebook has Italian: 279.43 hours
@nshmyrev WOW, thank you for this Christmas present!!
I wrote a script to import the MLS files (https://dl.fbaipublicfiles.com/mls/mls_italian.tar.gz, 14.3 GB archive) into the MITADS-Speech datasets, converting the .flac audio files to 16 kHz WAV and running some checks. In sample tests, the audio is of good quality and the transcripts are clean. If my script works correctly, all clips are <= 15 seconds (and successfully resampled). The textual corpus the speech dataset is based on includes ancient works like this one: in some cases the sentences contain obsolete forms and terms that a person using speech technologies today is unlikely to pronounce.
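The conversion and length check described above could be sketched roughly like this (a hypothetical helper, not the actual import script; it assumes `ffmpeg` is on the PATH and that 16 kHz mono WAV is the target format):

```python
import subprocess
import wave
from pathlib import Path

MAX_SECONDS = 15
TARGET_RATE = 16000

def clip_seconds(n_frames: int, sample_rate: int) -> float:
    """Duration of a clip in seconds from its frame count and sample rate."""
    return n_frames / sample_rate

def convert_and_check(flac_path: Path, wav_path: Path) -> bool:
    """Convert one .flac to 16 kHz mono .wav with ffmpeg, then verify
    the resulting clip is no longer than MAX_SECONDS."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(flac_path),
         "-ar", str(TARGET_RATE), "-ac", "1", str(wav_path)],
        check=True, capture_output=True)
    with wave.open(str(wav_path), "rb") as w:
        return clip_seconds(w.getnframes(), w.getframerate()) <= MAX_SECONDS
```

Clips failing the check would be dropped or re-segmented before training.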
I think that we should avoid this kind of ancient work (except Pirandello), as we did in MITADS itself, for example.
https://arxiv.org/pdf/2101.00390.pdf VoxPopuli: the largest open unlabelled speech dataset, totaling 100K hours in 23 languages from the European Parliament. They will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license.
Europarl-ST, a multilingual corpus for speech translation of parliamentary debates, with a total of 64.18 hours of Italian audio clips.
New speech dataset: Multilingual TEDx by OpenSLR. The corpus comprises audio recordings and transcripts from TEDx talks in 8 languages. License: Creative Commons Attribution-NonCommercial-NoDerivs 4.0. UPDATE:
For the Multilingual TEDx dataset: unfortunately DeepSpeech doesn't have a powerful toolkit to split audio with an aligner or VAD, like Kaldi and others do.
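In the absence of a proper aligner, a crude energy-based VAD can at least split long recordings on silence. This is only a sketch with made-up thresholds (frame length, energy cutoff), not the tooling used in the thread:

```python
from typing import List, Tuple

def split_on_silence(samples: List[float], frame_len: int = 320,
                     threshold: float = 0.01,
                     min_silence_frames: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of speech segments.

    A frame counts as silent if its mean absolute amplitude is below
    `threshold`; a run of `min_silence_frames` silent frames closes the
    current segment at the point where the silence began."""
    segments = []
    seg_start = None
    silent_run = 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy >= threshold:
            if seg_start is None:
                seg_start = i          # speech begins
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # close the segment where the silence run started
                segments.append((seg_start, i - (min_silence_frames - 1) * frame_len))
                seg_start = None
                silent_run = 0
    if seg_start is not None:
        segments.append((seg_start, len(samples)))
    return segments
```

A real pipeline would likely use a trained VAD (e.g. WebRTC's) instead of a fixed amplitude threshold, which is fragile under background noise.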
You can import the mTEDx dataset with corcua. Splitting the files will take some time (~2 days), but if you extend the script with parallelization, it should run much faster :)
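The parallelization suggestion could look like the sketch below: fan the per-file splitting work out over a process pool. `split_file` here is a placeholder standing in for the real per-file work (e.g. corcua's segment extraction), not corcua's actual API:

```python
from multiprocessing import Pool

def split_file(path: str) -> str:
    """Placeholder for the per-file splitting work; a real version
    would read the audio, cut it into segments, and write them out.
    Here it just returns the path it was given."""
    return path

def split_all(paths, workers: int = 4):
    """Process files in parallel instead of one after another."""
    with Pool(processes=workers) as pool:
        return pool.map(split_file, paths)
```

Since each file is independent, the speedup should be close to linear in the number of workers until disk I/O becomes the bottleneck.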
Dear DanBmh, thank you so much! I will update the table above.
LIST OF ALL ITALIAN DATASETS FOUND
From issue #90 I'm putting here all the datasets that have been discovered.
Some of them are plug-and-play for DeepSpeech; others instead need to be built from scratch (splitting the audio up by sentences).
Feel free to pick up one that has not been done yet and check it out.
NOTE
If one of these datasets needs a deeper analysis, please do not start a discussion here but open a new issue, and I will update this table with the issue reference.
DATASETS