Need best practice for dataset about 10 hours, one speaker #665

Open
4 tasks done
ILG2021 opened this issue Dec 24, 2024 · 3 comments
Labels
question Further information is requested

Comments

ILG2021 commented Dec 24, 2024

Checks

  • This template is only for question, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched for existing issues, including closed ones, no similar questions.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Question details

Hello community, thanks for your wonderful work on this project. I have run into some problems with fine-tuning and need your help.

Recently I have tried fine-tuning for languages such as Thai, Lao, and German with small-scale datasets of about 10 hours from one speaker. I trained for 1200k steps in total, fine-tuning from the base model at https://huggingface.co/SWivid/F5-TTS
The resulting models have many pronunciation problems. I have also tried some models shared on Hugging Face, and their WER is very low. So I want to know:

  1. Do I need more data to get a low-WER model, and if so, roughly how many hours? My languages differ from those of the base model (they are not Chinese or English). I have only about 10 hours of my target speaker; would adding some open-source datasets help lower the WER?
  2. Any hyperparameter suggestions for a small-scale dataset of about 10 hours?
  3. How should the sentences be split: by word, by character, or by syllable?
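
For context, the WER comparison mentioned above can be scored with a small helper like the one below. This is only a minimal sketch, not the evaluation used by the project: it assumes ASR transcripts of the synthesized audio are already available as plain strings, and the function names are illustrative. Since Thai and Lao scripts do not separate words with spaces, a character error rate (CER) variant is included as well.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    # Word error rate: tokens are whitespace-separated words.
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    # Character error rate: spaces are ignored (an assumption of this sketch).
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

Averaging `wer` or `cer` over the same held-out sentences for each model would give a like-for-like comparison.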
ILG2021 added the question label Dec 24, 2024
Owner

SWivid commented Dec 24, 2024

  1. Yes, including more data (e.g. open-source datasets) will help.
  2. That is up to experimentation.
  3. Not splitting is fine unless the recordings are minutes long; otherwise, splitting into roughly 30 s segments at stop punctuation is fine.
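
The splitting advice in point 3 could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's preprocessing: it assumes timestamped transcript segments `(start, end, text)` are already available (for example from an aligner), and the punctuation set, the function name, and the space used to join text are all hypothetical choices that would need adjusting per language.

```python
# Sentence-ending marks to cut at (assumed set; extend for the target language).
STOP_PUNCTUATION = (".", "!", "?", "。", "！", "？")


def split_by_stop_punctuation(segments, max_seconds=30.0):
    """Group (start, end, text) segments into chunks of about max_seconds,
    cutting only where a segment ends with stop punctuation."""
    # 1) Merge consecutive segments into full sentences.
    sentences, buf = [], []
    for start, end, text in segments:
        buf.append((start, end, text))
        if text.rstrip().endswith(STOP_PUNCTUATION):
            sentences.append((buf[0][0], buf[-1][1],
                              " ".join(t for _, _, t in buf)))
            buf = []
    if buf:  # trailing text without closing punctuation
        sentences.append((buf[0][0], buf[-1][1],
                          " ".join(t for _, _, t in buf)))

    # 2) Greedily pack sentences into chunks no longer than max_seconds.
    #    A single sentence longer than max_seconds stays as its own chunk.
    chunks = []
    for start, end, text in sentences:
        if chunks and end - chunks[-1][0] <= max_seconds:
            prev_start, _, prev_text = chunks[-1]
            chunks[-1] = (prev_start, end, prev_text + " " + text)
        else:
            chunks.append((start, end, text))
    return chunks
```

The returned `(start, end)` pairs can then be used to cut the audio with whatever tool is already in the pipeline.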

Author

ILG2021 commented Dec 24, 2024

Thanks for the reply. For question 3, I meant the tokenizer, not the labels.

Owner

SWivid commented Dec 24, 2024

> Thanks for the reply. For question 3, I meant the tokenizer, not the labels.

Yep, you mean the tokenizer for text input?

You could try char first, which is the simplest approach.
If the pronunciation of the language is very hard for the model to learn, then try syllables (which would need, say, grapheme-to-phoneme conversion), or syllables with BPE.
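
As a rough illustration of the "char first" option, a character-level vocabulary can be built directly from the training transcripts. This is a generic sketch with hypothetical names (`UNK`, `build_char_vocab`, `encode`), not the project's own tokenizer code. Note that iterating over a Python string yields Unicode code points, so Thai or Lao vowel signs and tone marks become separate tokens here.

```python
UNK = "<unk>"  # hypothetical placeholder for characters unseen in training


def build_char_vocab(transcripts):
    """Map every distinct character in the transcripts to an integer id."""
    chars = sorted({ch for text in transcripts for ch in text})
    return {tok: idx for idx, tok in enumerate([UNK] + chars)}


def encode(text, vocab):
    """Convert text to a list of ids, falling back to UNK for unknown characters."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


vocab = build_char_vocab(["ab", "ba c"])
ids = encode("cab z", vocab)  # "z" is not in the vocabulary, so it maps to UNK
```

If pronunciation stays poor with characters, the same two functions could be fed syllable or phoneme tokens produced by a separate grapheme-to-phoneme step instead of raw characters.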
