Need best practice for dataset about 10 hours, one speaker #665

ILG2021 · 2024-12-24T00:24:47Z

Checks

This template is only for question, not feature requests or bug reports.
I have thoroughly reviewed the project documentation and read the related paper(s).
I have searched for existing issues, including closed ones, no similar questions.
I confirm that I am using English to submit this report in order to facilitate communication.

Question details

Hello community, thanks for your wonderful work on this project. I have run into some problems with fine-tuning and need your help.

Recently I have tried fine-tuning for languages such as Thai, Lao, German and so on with small-scale datasets of about 10 hours from one speaker. My total training steps are 1200k, and the fine-tune is based on https://huggingface.co/SWivid/F5-TTS
I have found many pronunciation problems in the resulting models. I have also tried some models shared on Hugging Face, and their WER is very low. So I want to know:

1. Do I need more data to get a low-WER model, and roughly how much? My languages differ from the base model's (neither Chinese nor English). I have only about 10 hours from my target speaker; if I add some open-source datasets, will that help lower the WER?
2. Any hyperparameter suggestions for a small-scale dataset of about 10 hours?
3. How should the sentences be split: by word, by character, or by syllable?

SWivid · 2024-12-24T09:18:56Z

1. Yes, including more data (e.g. open-source datasets) will help.
2. That is up to experiment.
3. Not splitting is fine unless the recordings are minutes long; otherwise, splitting into segments of roughly 30 s at stop punctuation is fine.
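
A minimal sketch of the "30 s split by stop punctuation" idea, assuming you already have time-aligned transcript segments for each long recording (for example from a forced aligner). The `Segment` class, `split_at_stops`, and the punctuation set are hypothetical names chosen for illustration; they are not part of F5-TTS.

```python
from dataclasses import dataclass

# Assumed set of sentence-final marks; extend it for your target language.
STOP_PUNCTUATION = (".", "!", "?", "。", "！", "？")


@dataclass
class Segment:
    text: str
    start: float  # seconds
    end: float    # seconds


def group_sentences(segments):
    """Merge consecutive segments until one ends with stop punctuation."""
    sentences, current = [], []
    for seg in segments:
        current.append(seg)
        if seg.text.rstrip().endswith(STOP_PUNCTUATION):
            sentences.append(current)
            current = []
    if current:  # trailing text without a final stop mark
        sentences.append(current)
    return sentences


def split_at_stops(segments, max_seconds=30.0):
    """Greedily pack whole sentences into clips of at most max_seconds.

    A single sentence longer than max_seconds is kept as its own clip
    rather than being cut mid-sentence.
    """
    clips, current = [], []
    for sentence in group_sentences(segments):
        candidate = current + sentence
        duration = candidate[-1].end - candidate[0].start
        if current and duration > max_seconds:
            clips.append(current)
            current = list(sentence)
        else:
            current = candidate
    if current:
        clips.append(current)
    return [(c[0].start, c[-1].end, " ".join(s.text for s in c)) for c in clips]
```

Each returned tuple is `(start, end, text)`, which can then be used to cut the audio and write the matching transcript line.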

ILG2021 · 2024-12-24T09:38:42Z

Thanks for the reply. For 3, I meant the tokenizer, not the label.

SWivid · 2024-12-24T09:53:40Z

> Thanks for the reply. For 3, I meant the tokenizer, not the label.

Yep, you mean the tokenizer for text input?

You could try char first, which is the simplest approach.
If the pronunciation of the language is very hard for the model to learn, then try syllables (which needs, say, grapheme-to-phoneme conversion), or syllables with BPE.
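
As a rough illustration of the "char first" option, the sketch below builds a character-level vocabulary from the training transcripts and encodes text as one ID per Unicode code point. The function names and the `<unk>` entry are assumptions made for this example; this is not the F5-TTS tokenizer code, and the project's own vocabulary file format may differ.

```python
def build_char_vocab(transcripts):
    """Map every character seen in the transcripts to an integer ID.

    ID 0 is reserved for a hypothetical "<unk>" token that stands in for
    characters not present in the training transcripts.
    """
    chars = sorted({ch for text in transcripts for ch in text})
    vocab = {"<unk>": 0}
    for index, ch in enumerate(chars, start=1):
        vocab[ch] = index
    return vocab


def encode(text, vocab):
    """Convert text to a list of IDs, one per Unicode code point."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]


if __name__ == "__main__":
    vocab = build_char_vocab(["ab", "ba c"])
    print(vocab)
    print(encode("cab z", vocab))
```

Note that for scripts such as Thai, vowel signs and tone marks are separate code points, so each gets its own ID here. A syllable or phoneme tokenizer would replace `encode` with a grapheme-to-phoneme step before the ID lookup.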

ILG2021 added the "question" label (Further information is requested) on Dec 24, 2024