How to fine tune Vits En TTS model for a different language ? #5922

Ca-ressemble-a-du-fake · 2023-02-04T02:35:12Z

Hi,

I need VITS (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_lj_vits) to speak French, so I need to fine-tune it on my own dataset.

I read the FastPitch fine-tuning tutorial (https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb), which clearly describes the fine-tuning process.

But I don't know where to start for fine-tuning VITS (or even training it from scratch) on my French dataset.

Any guide or tutorial would help me a lot!

redoctopus · 2023-02-06T21:29:12Z

Collaborator

@treacker can you provide some advice for training/fine-tuning VITS?

2 replies

lumpidu Feb 8, 2023

While you are at it, could you also document multi-speaker VITS?

treacker Feb 21, 2023

> When you are already at it .... could you also document multi-speaker VITS ?

Multi-speaker VITS is not ready yet.

Rumeysakeskin · 2023-02-21T07:12:55Z

@Ca-ressemble-a-du-fake,
I worked with NeMo FastPitch on my Turkish dataset. I shared my project and all the steps, from setup to inference, in my repo. I hope this helps you!

1 reply

Ca-ressemble-a-du-fake Feb 23, 2023
Author

Thanks a lot @Rumeysakeskin, the detailed "Text preprocessing" and "Data preparation" sections of your repo are very helpful!

treacker · 2023-02-21T07:28:17Z

> Hi,
>
> I need Vits to speak French (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_lj_vits). So I need to fine tune it with my own dataset.

I haven't looked into fine-tuning, but I think the process will be much the same. This checkpoint was trained with the IPA tokenizer, so TextEmbedding will likely still work, but check that your French phoneme set does not go beyond the NeMo IPA set.

2 replies

Ca-ressemble-a-du-fake Feb 23, 2023
Author

Thank you @treacker for your reply. Where is the NeMo IPA set defined? I could not find it in the source code.

treacker Feb 23, 2023

Check that the correct names are being passed to the g2p module:

NeMo/nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py

Line 50 in 3b0a814

LATIN_CHARS_ALL = f"{LATIN_ALPHABET_BASIC}{ACCENTED_CHARS}"
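The coverage check suggested earlier in the thread can be sketched roughly as follows. This is only an illustration: the constant names mirror the line quoted above, but the values here are placeholders and are not copied from NeMo, and `unsupported_symbols` is a hypothetical helper, not a NeMo function. For a real check, take the symbol set from the tokenizer_utils.py file referenced above (or from the IPA tokenizer you actually train with).

```python
import unicodedata
from typing import Iterable, List, Set

# Placeholder values for illustration only (assumption): the real constants
# live in NeMo's tokenizer_utils.py and may contain different characters.
LATIN_ALPHABET_BASIC = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
ACCENTED_CHARS = "ÀÂÇÉÈÊËÎÏÔÙÛÜàâçéèêëîïôùûü"
LATIN_CHARS_ALL = f"{LATIN_ALPHABET_BASIC}{ACCENTED_CHARS}"


def unsupported_symbols(texts: Iterable[str], supported: Set[str]) -> List[str]:
    """Return the sorted symbols that occur in `texts` but not in `supported`."""
    found = set()
    for text in texts:
        # NFC normalization makes "e" + combining accent equal to precomposed "é".
        for symbol in unicodedata.normalize("NFC", text):
            if not symbol.isspace() and symbol not in supported:
                found.add(symbol)
    return sorted(found)


if __name__ == "__main__":
    transcripts = ["Ça ressemble à du fake", "cœur"]
    # With the placeholder set above, only the ligature "œ" is reported.
    print(unsupported_symbols(transcripts, set(LATIN_CHARS_ALL)))
```

The same function can be run on phonemized transcripts by passing the IPA symbol set as `supported`. Note that it compares single code points, so a nasal vowel written as a base vowel plus a combining tilde is checked as two separate symbols.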

Asrithapenmetsa · 2024-11-18T12:02:36Z

Hi,

I am currently fine-tuning a VITS model on my own voice, but I am getting the following error. I don't know exactly what causes it, but I think it is a padding issue. Even though I have made some changes to the code, the error keeps occurring.

input_values = [item["input_values"] for item in batch]
KeyError: 'input_values'
0%| | 0/40 [00:00<?, ?it/s]

The error is raised at trainer.train(). My audio samples are padded as shown below, but the error reports that input_values is missing:

0%| | 0/40 [00:00<?, ?it/s]
Sample 12: {'input_values': tensor([0., 0., 0., ..., 0., 0., 0.]), 'labels': tensor([ 0, 44, 0, 156, 0, 135, 0, 53, 0, 61, 0, 16, 0, 69, 0, 158, 0, 123, 0, 16, 0, 48, 0, 156, 0, 135, 0, 54, 0, 16, 0, 138, 0, 64, 0, 16, 0, 61, 0, 62, 0, 156, 0, 57, 0, 158, 0, 123, 0, 51, 0, 68, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}
Sample 10: {'input_values': tensor([0., 0., 0., ..., 0., 0., 0.]), 'labels': tensor([ 0, 62, 0, 156, 0, 43, 0, 102, 0, 55, 0, 16, 0, 48, 0, 54, 0, 156, 0, 43, 0, 102, 0, 68, 0, 16, 0, 65, 0, 86, 0, 56, 0, 16, 0, 52, 0, 135, 0, 123, 0, 16, 0, 50, 0, 157, 0, 72, 0, 64, 0, 102, 0, 112, 0, 16, 0, 48, 0, 156, 0, 138, 0, 56, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}
Missing 'input_values' in sample: {'labels': tensor([ 0, 44, 0, 156, 0, 135, 0, 53, 0, 61, 0, 16, 0, 69, 0, 158, 0, 123, 0, 16, 0, 48, 0, 156, 0, 135, 0, 54, 0, 16, 0, 138, 0, 64, 0, 16, 0, 61, 0, 62, 0, 156, 0, 57, 0, 158, 0, 123, 0, 51, 0, 68, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}
Missing 'input_values' in sample: {'labels': tensor([ 0, 62, 0, 156, 0, 43, 0, 102, 0, 55, 0, 16, 0, 48, 0, 54, 0, 156, 0, 43, 0, 102, 0, 68, 0, 16, 0, 65, 0, 86, 0, 56, 0, 16, 0, 52, 0, 135, 0, 123, 0, 16, 0, 50, 0, 157, 0, 72, 0, 64, 0, 102, 0, 112, 0, 16, 0, 48, 0, 156, 0, 138, 0, 56, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

It would help a lot if someone could guide me on this.
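In the log above, the same label tensors appear once with input_values and once without, which suggests the key is dropped somewhere between the dataset and the collate step rather than by padding. One way to narrow this down is to validate each batch before indexing it. The sketch below is a debugging aid only: `report_missing_keys` and `checked_collate` are hypothetical helpers, and the key names are taken from the log. If the trainer is the Hugging Face `Trainer` (the post does not say), one possible cause is its `remove_unused_columns` setting in `TrainingArguments`, which drops dataset columns that the model's forward method does not accept; that is an assumption to verify, not a confirmed diagnosis.

```python
from typing import Any, Dict, Iterable, List, Sequence

REQUIRED_KEYS = ("input_values", "labels")  # key names taken from the log above


def report_missing_keys(
    batch: Sequence[Dict[str, Any]],
    required: Iterable[str] = REQUIRED_KEYS,
) -> List[str]:
    """Return one message per sample that lacks at least one required key."""
    required = tuple(required)
    problems = []
    for index, sample in enumerate(batch):
        missing = [key for key in required if key not in sample]
        if missing:
            problems.append(f"sample {index}: missing {missing}, has {sorted(sample)}")
    return problems


def checked_collate(batch: Sequence[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group values per key, failing early with a readable error message."""
    problems = report_missing_keys(batch)
    if problems:
        raise KeyError("; ".join(problems))
    # Padding and stacking into tensors would happen here in a real collator.
    return {key: [sample[key] for sample in batch] for key in REQUIRED_KEYS}
```

Calling `report_missing_keys` on a few raw dataset items and then again inside the collate function shows at which stage input_values disappears.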

0 replies

