This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

Voxforge bad samples, help for cleaning up #111

Open
12 tasks done
nefastosaturo opened this issue Dec 11, 2020 · 3 comments
Labels
help wanted Extra attention is needed

Comments

@nefastosaturo
Collaborator

nefastosaturo commented Dec 11, 2020

EDIT:

Audio analysis flagged some bad speakers, but all the other speakers still need a manual check.

If you want to help, please:

  1. choose a speaker from here: http://www.voxforge.org/it/Downloads (ideally one who has recorded many minutes)
  2. download that speaker's archive from here: http://www.repository.voxforge1.org/downloads/it/Trunk/Audio/Main/16kHz_16bit/
  3. listen to the recordings and tell us whether they are valid; if not, say which segments are invalid or whether everything from that speaker must be discarded

A valid recording must contain understandable speech, even if the volume is very low.
For example, Vistaus-20080718-mrm is not a valid one.

DONE!

I found some bad samples in this dataset, so I searched for audio files with an average RMS below 0.025 and found these speakers, which need to be checked:

  • anonymous-20080504-qvg - NO
  • anonymous-20080723-ouv - NO
  • anonymous-20080725-dey - NO
  • anonymous-20110605-kpd
  • anonymous-20170303-mwy
  • dario-20110426-yhj
  • Karm-20131225-irq
  • nannioz-20091103-qfc - ok
  • nannioz-20091103-raj - ok
  • nannioz-20091103-vkr - ok
  • nannioz-20091103-zhz - ok
  • Stefano-20150131-pus - ok

There is also one speaker who is not Italian, and I'll remove it:

Vistaus-20080718-mrm

So I'm asking whether you can choose two speakers, listen to their recordings, and report anything that is VERY wrong (e.g. we can keep very low-volume but understandable recordings).

You'll find all the recordings here: http://www.repository.voxforge1.org/downloads/it/Trunk/Audio/Main/16kHz_16bit/

A CSV containing all the samples with their RMS is attached:
voxforge_bad_samples.zip
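
For anyone who wants to reproduce the RMS screening on a downloaded archive, here is a minimal sketch. It is not the script used for the attached CSV. It assumes the files are 16-bit PCM WAV and that the 0.025 threshold refers to RMS on a normalized [-1.0, 1.0] amplitude scale; the function names `wav_rms` and `is_suspect` are made up for illustration.

```python
import array
import math
import sys
import wave

# Threshold quoted in this issue; the normalized [-1.0, 1.0] scale is an assumption.
RMS_THRESHOLD = 0.025


def wav_rms(path):
    """Return the RMS level of a 16-bit PCM WAV file, normalized to [0.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError(f"{path}: expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0  # an empty file counts as silence
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0


def is_suspect(path, threshold=RMS_THRESHOLD):
    """Flag a recording for manual listening when its RMS is below the threshold."""
    return wav_rms(path) < threshold
```

A low RMS only marks a file as worth listening to: several of the flagged speakers turned out to be quiet but understandable, so the final decision still needs a human check.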

@nefastosaturo nefastosaturo added the help wanted Extra attention is needed label Dec 11, 2020
@dag7dev
Member

dag7dev commented Dec 13, 2020

Checked nannioz and Stefano: I'm updating the issue and listing the results below.

nannioz-20091103-qfc - ok
nannioz-20091103-raj - ok
nannioz-20091103-vkr - ok
nannioz-20091103-zhz - ok
Stefano-20150131-pus - ok

@nefastosaturo
Collaborator Author

OK, so unless we find other strange samples, we are done here. I'm leaving the issue open for future checks.

@eziolotta
Contributor

anonymous-20080725-dey - NO - EMPTY AUDIO
anonymous-20110605-kpd - OK - low-volume but understandable audio
anonymous-20170303-mwy - OK - low-volume but understandable audio
dario-20110426-yhj - OK
Karm-20131225-irq - OK

3 participants