Adding NVIDIA TTS models: FastPitch and HiFiGAN (#323)
nv-kkudrynski authored May 22, 2023
1 parent 0c7a13d commit 12d8df1
Showing 4 changed files with 312 additions and 0 deletions.
Binary file added images/fastpitch_model.png
Binary file added images/hifigan_model.png
159 changes: 159 additions & 0 deletions nvidia_deeplearningexamples_fastpitch.md
---
layout: hub_detail
background-class: hub-background
body-class: hub
title: FastPitch 2
summary: The FastPitch model for generating mel spectrograms from text
category: researchers
image: nvidia_logo.png
author: NVIDIA
tags: [audio]
github-link: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
github-id: NVIDIA/DeepLearningExamples
featured_image_1: fastpitch_model.png
featured_image_2: no-image
accelerator: cuda
---


### Model Description

This notebook demonstrates a PyTorch implementation of the FastPitch model described in the [FastPitch](https://arxiv.org/abs/2006.06873) paper.
The FastPitch model generates mel-spectrograms and predicts a pitch contour from raw input text. As of version 1.1, it no longer needs a pre-trained aligning model to bootstrap from. To get an audio waveform, we need a second model that produces it from the generated mel-spectrogram; in this notebook we use the HiFi-GAN model for that second step.

The FastPitch model is based on the [FastSpeech](https://arxiv.org/abs/1905.09263) model. The main differences between FastPitch and FastSpeech are as follows:
* no dependence on an external aligner (Transformer TTS, Tacotron 2); as of version 1.1, FastPitch aligns audio to transcriptions by itself, as in [One TTS Alignment To Rule Them All](https://arxiv.org/abs/2108.10447),
* FastPitch explicitly learns to predict the pitch contour,
* pitch conditioning removes harsh-sounding artifacts and provides faster convergence,
* no need to distill mel-spectrograms with a teacher model,
* the capability to train a multi-speaker model.


#### Model architecture

![FastPitch Architecture](https://raw.githubusercontent.com/NVIDIA/DeepLearningExamples/master/PyTorch/SpeechSynthesis/FastPitch/img/fastpitch_model.png)

### Example
In the example below:

- pretrained FastPitch and HiFiGAN models are loaded from torch.hub
- given a tensor representation of the input text ("Say this smoothly, to prove you are not a robot."), FastPitch generates a mel spectrogram
- HiFiGAN generates sound from the mel spectrogram
- the output sound is saved in an 'audio.wav' file

To run the example you need some extra Python packages installed. These are needed for preprocessing of text and audio, as well as for display and input/output handling. Finally, for better performance of the FastPitch model, we download the CMU pronunciation dictionary.
```bash
apt-get update
apt-get install -y libsndfile1 wget
pip install numpy scipy librosa unidecode inflect matplotlib==3.6.3
wget https://raw.githubusercontent.com/NVIDIA/NeMo/263a30be71e859cee330e5925332009da3e5efbc/scripts/tts_dataset_files/heteronyms-052722 -qO heteronyms
wget https://raw.githubusercontent.com/NVIDIA/NeMo/263a30be71e859cee330e5925332009da3e5efbc/scripts/tts_dataset_files/cmudict-0.7b_nv22.08 -qO cmudict-0.7b
```

```python
import torch
import matplotlib.pyplot as plt
from IPython.display import Audio
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f'Using {device} for inference')
```

Download and set up the FastPitch generator model.
```python
fastpitch, generator_train_setup = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_fastpitch')
```

Download and set up the vocoder and denoiser models.
```python
hifigan, vocoder_train_setup, denoiser = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_hifigan')
```

Verify that the generator and vocoder models agree on input parameters.
```python
CHECKPOINT_SPECIFIC_ARGS = [
    'sampling_rate', 'hop_length', 'win_length', 'p_arpabet', 'text_cleaners',
    'symbol_set', 'max_wav_value', 'prepend_space_to_text',
    'append_space_to_text']

for k in CHECKPOINT_SPECIFIC_ARGS:
    v1 = generator_train_setup.get(k, None)
    v2 = vocoder_train_setup.get(k, None)

    assert v1 is None or v2 is None or v1 == v2, \
        f'{k} mismatch in spectrogram generator and vocoder'
```

Put all models on the available device.
```python
fastpitch.to(device)
hifigan.to(device)
denoiser.to(device)
```
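
The hub checkpoints are typically returned ready for inference, but switching explicitly to evaluation mode is a cheap safeguard; whether the entrypoints already call `eval()` is an assumption this sketch does not rely on.
```python
# Force inference behavior for any dropout/normalization layers.
fastpitch.eval()
hifigan.eval()
denoiser.eval()
```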

Load the text processor.
```python
tp = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_textprocessing_utils', cmudict_path="cmudict-0.7b", heteronyms_path="heteronyms")
```

Set the text to be synthesized, prepare the input, and set additional generation parameters.
```python
text = "Say this smoothly, to prove you are not a robot."
```

```python
batches = tp.prepare_input_sequence([text], batch_size=1)
```
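
Judging by its `batch_size` argument, `prepare_input_sequence` accepts several sentences at once and chunks them into batches; a sketch with illustrative sentences:
```python
# Sketch: prepare several sentences in one call (the texts are illustrative).
texts = ["The forecast calls for light rain in the evening.",
         "Turn left at the next intersection."]
multi_batches = tp.prepare_input_sequence(texts, batch_size=2)
```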

```python
gen_kw = {'pace': 1.0,
          'speaker': 0,
          'pitch_tgt': None,
          'pitch_transform': None}
denoising_strength = 0.005
```
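
`pitch_transform` lets you reshape the predicted pitch contour before vocoding. Below is a sketch modeled on the `pitch_transform_*` helpers in the FastPitch repository; treat the exact callable signature as an assumption and verify it against your checkout.
```python
# Hypothetical transform: shift the whole contour up by ~50 Hz. The callable
# receives the normalized pitch, its lengths, and the training mean/std, and
# must return a contour in the same normalized units.
def shift_pitch_up(pitch, pitch_lens, mean, std):
    return pitch + 50.0 / std

gen_kw_shifted = dict(gen_kw, pitch_transform=shift_pitch_up)
```
Passing `gen_kw_shifted` instead of `gen_kw` in the loop below would raise the voice without changing timing.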

```python
for batch in batches:
    with torch.no_grad():
        mel, mel_lens, *_ = fastpitch(batch['text'].to(device), **gen_kw)
        audios = hifigan(mel).float()
        audios = denoiser(audios.squeeze(1), denoising_strength)
        audios = audios.squeeze(1) * vocoder_train_setup['max_wav_value']
```

Plot the intermediate spectrogram.
```python
plt.figure(figsize=(10,12))
res_mel = mel[0].detach().cpu().numpy()
plt.imshow(res_mel, origin='lower')
plt.xlabel('time')
plt.ylabel('frequency')
_ = plt.title('Spectrogram')
```

Synthesize audio.
```python
audio_numpy = audios[0].cpu().numpy()
Audio(audio_numpy, rate=22050)
```

Write the audio to a WAV file.
```python
from scipy.io.wavfile import write
write("audio.wav", vocoder_train_setup['sampling_rate'], audio_numpy)
```
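
Note that `audio_numpy` holds float samples already scaled into the int16 range by `max_wav_value`, while `scipy.io.wavfile.write` interprets float arrays as lying in [-1, 1]. Casting to int16 keeps the WAV header and the sample range consistent; a minimal sketch:
```python
import numpy as np

# audio_numpy was scaled by max_wav_value, so cast to 16-bit PCM rather than
# writing floats that scipy would interpret as a [-1, 1] signal.
write("audio_int16.wav", vocoder_train_setup['sampling_rate'],
      audio_numpy.astype(np.int16))
```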

### Details
For detailed information on model input and output, training recipes, inference, and performance, visit [github](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/HiFiGAN) and/or [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/fastpitch_pyt).

### References

- [FastPitch paper](https://arxiv.org/abs/2006.06873)
- [FastPitch on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/fastpitch_pyt)
- [HiFi-GAN on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/hifigan_pyt)
- [FastPitch and HiFi-GAN on github](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/HiFiGAN)
153 changes: 153 additions & 0 deletions nvidia_deeplearningexamples_hifigan.md
---
layout: hub_detail
background-class: hub-background
body-class: hub
title: HiFi GAN
summary: The HiFi GAN model for generating waveforms from mel spectrograms
category: researchers
image: nvidia_logo.png
author: NVIDIA
tags: [audio]
github-link: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/HiFiGAN
github-id: NVIDIA/DeepLearningExamples
featured_image_1: hifigan_model.png
featured_image_2: no-image
accelerator: cuda
---


### Model Description
This notebook demonstrates a PyTorch implementation of the HiFi-GAN model described in the paper [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646).
HiFi-GAN is a spectrogram inversion model that synthesizes speech waveforms from mel-spectrograms. It follows the generative adversarial network (GAN) paradigm and is composed of a generator and a discriminator. After training, the generator is used for synthesis and the discriminator is discarded.

Our implementation is based on the one [published by the authors of the paper](https://github.com/jik876/hifi-gan). We modify the original hyperparameters and provide an alternative training recipe, which enables training on larger batches and faster convergence. HiFi-GAN is trained on the publicly available [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/). The <a href="audio/">samples</a> demonstrate speech synthesized with our publicly available FastPitch and HiFi-GAN checkpoints.

#### Model architecture

![HiFiGAN Architecture](https://raw.githubusercontent.com/NVIDIA/DeepLearningExamples/master/PyTorch/SpeechSynthesis/HiFiGAN/img/hifigan_model.png)

### Example
In the example below:

- pretrained FastPitch and HiFiGAN models are loaded from torch.hub
- given a tensor representation of the input text ("Say this smoothly, to prove you are not a robot."), FastPitch generates a mel spectrogram
- HiFiGAN generates sound from the mel spectrogram
- the output sound is saved in an 'audio.wav' file

To run the example you need some extra Python packages installed. These are needed for preprocessing of text and audio, as well as for display and input/output handling. Finally, for better performance of the FastPitch model, we download the CMU pronunciation dictionary.
```bash
pip install numpy scipy librosa unidecode inflect matplotlib==3.6.3
apt-get update
apt-get install -y libsndfile1 wget
wget https://raw.githubusercontent.com/NVIDIA/NeMo/263a30be71e859cee330e5925332009da3e5efbc/scripts/tts_dataset_files/heteronyms-052722 -qO heteronyms
wget https://raw.githubusercontent.com/NVIDIA/NeMo/263a30be71e859cee330e5925332009da3e5efbc/scripts/tts_dataset_files/cmudict-0.7b_nv22.08 -qO cmudict-0.7b
```

```python
import torch
import matplotlib.pyplot as plt
from IPython.display import Audio
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f'Using {device} for inference')
```

Download and set up the FastPitch generator model.
```python
fastpitch, generator_train_setup = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_fastpitch')
```

Download and set up the vocoder and denoiser models.
```python
hifigan, vocoder_train_setup, denoiser = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_hifigan')
```

Verify that the generator and vocoder models agree on input parameters.
```python
CHECKPOINT_SPECIFIC_ARGS = [
    'sampling_rate', 'hop_length', 'win_length', 'p_arpabet', 'text_cleaners',
    'symbol_set', 'max_wav_value', 'prepend_space_to_text',
    'append_space_to_text']

for k in CHECKPOINT_SPECIFIC_ARGS:
    v1 = generator_train_setup.get(k, None)
    v2 = vocoder_train_setup.get(k, None)

    assert v1 is None or v2 is None or v1 == v2, \
        f'{k} mismatch in spectrogram generator and vocoder'
```

Put all models on the available device.
```python
fastpitch.to(device)
hifigan.to(device)
denoiser.to(device)
```
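
HiFi-GAN can also be exercised on its own: it maps a batch of mel spectrograms shaped `(batch, n_mel_bands, frames)` to waveforms. A sketch with random input follows; the 80-band layout matches the LJSpeech-style checkpoints used here but should be treated as an assumption.
```python
# Sanity-check the vocoder in isolation with a random 80-band mel.
with torch.no_grad():
    dummy_mel = torch.randn(1, 80, 200, device=device)
    dummy_audio = hifigan(dummy_mel)
print(dummy_audio.shape)  # (batch, 1, samples)
```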

Load the text processor.
```python
tp = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_textprocessing_utils', cmudict_path="cmudict-0.7b", heteronyms_path="heteronyms")
```

Set the text to be synthesized, prepare the input, and set additional generation parameters.
```python
text = "Say this smoothly, to prove you are not a robot."
```

```python
batches = tp.prepare_input_sequence([text], batch_size=1)
```

```python
gen_kw = {'pace': 1.0,
          'speaker': 0,
          'pitch_tgt': None,
          'pitch_transform': None}
denoising_strength = 0.005
```

```python
for batch in batches:
    with torch.no_grad():
        mel, mel_lens, *_ = fastpitch(batch['text'].to(device), **gen_kw)
        audios = hifigan(mel).float()
        audios = denoiser(audios.squeeze(1), denoising_strength)
        audios = audios.squeeze(1) * vocoder_train_setup['max_wav_value']
```
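
`denoising_strength` scales a noise profile that the denoiser subtracts from the raw vocoder output (a common design for such denoisers; treat the specifics as an assumption). A sketch for auditioning several strengths on the last generated mel; the values are illustrative:
```python
# Compare denoiser strengths on the same raw output. Larger values remove
# more vocoder noise but can dull the speech.
with torch.no_grad():
    raw = hifigan(mel).float()
    variants = {s: denoiser(raw.squeeze(1), s) for s in (0.0, 0.005, 0.02)}
```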

Plot the intermediate spectrogram.
```python
plt.figure(figsize=(10,12))
res_mel = mel[0].detach().cpu().numpy()
plt.imshow(res_mel, origin='lower')
plt.xlabel('time')
plt.ylabel('frequency')
_ = plt.title('Spectrogram')
```

Synthesize audio.
```python
audio_numpy = audios[0].cpu().numpy()
Audio(audio_numpy, rate=22050)
```

Write the audio to a WAV file.
```python
from scipy.io.wavfile import write
write("audio.wav", vocoder_train_setup['sampling_rate'], audio_numpy)
```

### Details
For detailed information on model input and output, training recipes, inference, and performance, visit [github](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/HiFiGAN) and/or [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/hifigan_pyt).

### References

- [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
- [Original implementation](https://github.com/jik876/hifi-gan)
- [FastPitch on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/fastpitch_pyt)
- [HiFi-GAN on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/resources/hifigan_pyt)
- [FastPitch and HiFi-GAN on github](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/HiFiGAN)
