
Other considerations for what makes a good TTS dataset? #26

Open

wanshun123 opened this issue Mar 9, 2019 · 1 comment

Comments

@wanshun123

I have done a lot of training on different self-made datasets (typically around 3 hours of audio across a few thousand .wav files, all at 22050 Hz) using Tacotron, starting each run from a pretrained LJSpeech model with the same hyperparameters and training to a similar number of steps. I am very confused why for some datasets the output audio ends up very clear for many samples (sometimes even indistinguishable from the actual person speaking), while for other datasets the synthesised audio always has choppy aberrations. In all my datasets there is no beginning/ending silence, the transcriptions are all correct, and the datasets have fairly similar phoneme distributions (and similar character-length graphs) according to analyze.py in this repo (thanks for making that, by the way).
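
For reference, this is the kind of edge-silence check I mean. A minimal sketch, not taken from analyze.py: librosa, the `wavs/` layout, the 30 dB threshold, and the 50 ms tolerance are all my own assumptions.

```python
# Minimal sketch: flag clips that still have leading/trailing silence.
# librosa, the wavs/ directory, TOP_DB and the 50 ms tolerance are
# assumptions, not part of this repo's tooling.
import glob
import librosa

TOP_DB = 30  # energy (dB below peak) treated as silence; assumed value

for path in sorted(glob.glob("wavs/*.wav")):
    y, sr = librosa.load(path, sr=22050)
    # librosa.effects.trim returns the trimmed signal plus the
    # [start, end] sample indices of the non-silent region.
    _, (start, end) = librosa.effects.trim(y, top_db=TOP_DB)
    lead = start / sr
    trail = (len(y) - end) / sr
    if lead > 0.05 or trail > 0.05:  # more than ~50 ms of edge silence
        print(f"{path}: {lead:.2f}s leading, {trail:.2f}s trailing silence")
```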

To take an example from publicly available datasets: at https://keithito.github.io/audio-samples/ one can hear that the model trained on the Nancy Corpus sounds significantly less robotic and clearer than the model trained on LJ Speech. The samples at https://syang1993.github.io/gst-tacotron/ come from a Tacotron model trained on Blizzard 2013, and their quality is extremely good compared to any samples I've heard from a Tacotron model trained on LJ Speech, even though the Blizzard 2013 dataset used there is smaller than LJ Speech. Why might this be?

Any comments appreciated.

@el-tocino

This echoes some of what you've noticed:
https://www.reddit.com/r/MachineLearning/comments/a90u3t/d_what_makes_a_good_texttospeech_dataset/ (see comment by erogol)
Also of note: using a neural vocoder such as WaveGlow or WaveRNN can help smooth things out as well.
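
If you want to try that, here is a rough sketch using NVIDIA's published torch.hub entry points for WaveGlow; the entry-point name and the `infer()` call follow NVIDIA's DeepLearningExamples repo, so treat them as assumptions if you are on a different Tacotron fork.

```python
# Rough sketch: replace Griffin-Lim reconstruction with a pretrained
# WaveGlow vocoder. Entry-point names and infer() follow NVIDIA's
# DeepLearningExamples torchhub; assumptions on other forks.
import torch

waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow).eval()

# `mel` is a (1, 80, T) mel spectrogram produced by your Tacotron model.
with torch.no_grad():
    audio = waveglow.infer(mel)  # (1, num_samples) waveform at 22050 Hz
```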
