Blank Audio #669

Open · Labels: bug (Something isn't working)

Zenger opened this issue Dec 26, 2024 · 4 comments
Zenger commented Dec 26, 2024

Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

Any attempt at generating audio results in blank audio of exactly 00:00:01 sec long.

My current test environment setup is:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.5 LTS
Release:        22.04
Codename:       jammy
nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1660 Ti (UUID: GPU-6c3a2f0f-e6a7-d109-a2d1-9dadb163c88e)
python3 --version
Python 3.10.12
pip3 list | grep torch
ema-pytorch                   0.7.7
torch                         2.5.1+cu121
torchaudio                    2.5.1+cu121
torchdiffeq                   0.2.5
pip3 list | grep cuda
nvidia-cuda-cupti-cu11        11.8.87
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu11        11.8.89
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu11      11.8.89
nvidia-cuda-runtime-cu12      12.1.105

Steps to Reproduce

python3 -m venv .venv
source .venv/bin/activate
pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121 

# I also tried 
# pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .
f5-tts_infer-cli --model "F5-TTS" --ref_audio ../in.wav --ref_text "You were the one responsible for sinking of the Royal Fleet. He wants me to bring you before him. Alive. And who am I speaking with? I need a guard sword. Any ideas?" --gen_text "Where is my brother? Answer me!"
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.517 seconds.
Prefix dict has been built successfully.
Word segmentation module jieba initialized.

Download Vocos from huggingface charactr/vocos-mel-24khz
Using F5-TTS...

vocab :  /home/zenger/Documents/F5-TTS/src/f5_tts/infer/examples/vocab.txt
token :  custom
model :  /home/zenger/.cache/huggingface/hub/models--SWivid--F5-TTS/snapshots/4dcc16f297f2ff98a17b3726b16f5de5a5e45672/F5TTS_Base/model_1200000.safetensors

Voice: main
ref_audio  ../in.wav
Converting audio...
Using custom reference text...

ref_text   You were the one responsible for sinking of the Royal Fleet. He wants me to bring you before him. Alive. And who am I speaking with? I need a guard sword. Any ideas?.
ref_audio_ /tmp/tmpist0frdt.wav


No voice tag found, using main.
Voice: main
gen_text 0 Where is my brother? Answer me!


Generating audio in 1 batches...
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.93s/it]
tests/infer_cli_basic.wav

No visible issues that I can see

✔️ Expected Behavior

Expected TTS audio.

❌ Actual Behavior

I attempted the forced-fp32 fix from #356, changing:

def initialize_asr_pipeline(device: str = device, dtype=None):
    if dtype is None:
        dtype = (
            torch.float16
            if "cuda" in device
            and torch.cuda.get_device_properties(device).major >= 6
            and not torch.cuda.get_device_name().endswith("[ZLUDA]")
            else torch.float32
        )
    dtype = torch.float32  # <---- Forced fp32 here
    global asr_pipe
    asr_pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3-turbo",
        torch_dtype=dtype,
        device=device,
    )


def load_checkpoint(model, ckpt_path, device: str, dtype=None, use_ema=True):
    if dtype is None:
        dtype = (
            torch.float16
            if "cuda" in device
            and torch.cuda.get_device_properties(device).major >= 6
            and not torch.cuda.get_device_name().endswith("[ZLUDA]")
            else torch.float32
        )
    dtype = torch.float32  # <---- Forced fp32 here
    model = model.to(dtype)

Without any success. Running it from the CLI prints exactly the same output as shown in Steps to Reproduce above, ending with tests/infer_cli_basic.wav and no visible issues that I can see.

Running with Gradio likewise shows no visible issues, yet produces no audio:

f5-tts_infer-gradio --port 7777 --host 0.0.0.0
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.487 seconds.
Prefix dict has been built successfully.
Word segmentation module jieba initialized.

Download Vocos from huggingface charactr/vocos-mel-24khz

vocab :  /home/zenger/Documents/F5-TTS/src/f5_tts/infer/examples/vocab.txt
token :  custom
model :  /home/zenger/.cache/huggingface/hub/models--SWivid--F5-TTS/snapshots/4dcc16f297f2ff98a17b3726b16f5de5a5e45672/F5TTS_Base/model_1200000.safetensors

/home/zenger/Documents/F5-TTS/.venv/lib/python3.10/site-packages/gradio/components/chatbot.py:242: UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
  warnings.warn(
Starting app...
* Running on local URL:  http://0.0.0.0:7777

To create a public link, set `share=True` in `launch()`.

ref_text   You were the one responsible for sinking of the Royal Fleet. He wants me to bring you before him. Alive. And who am I speaking with? I need a guard sword. Any ideas?.
gen_text 0 Where is my brother?


/home/zenger/Documents/F5-TTS/.venv/lib/python3.10/site-packages/gradio/processing_utils.py:738: UserWarning: Trying to convert audio automatically from float32 to 16-bit int format.
  warnings.warn(warning.format(data.dtype))
  

Unless I'm also supposed to change Gradio's processing_utils.py, I would think that warning shouldn't matter.

There was another issue where I'd get "Unknown encoder 'pcm_s4le'", but running ffmpeg -i input.wav -c:a pcm_s16le -ar 16000 output.wav on my sample seemed to get rid of that problem.
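As a sanity check after that conversion, the resulting file's format can be verified with Python's stdlib wave module (a sketch; the synthetic demo file below stands in for the real output.wav):

```python
import struct
import wave

def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds)."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return (w.getnchannels(), w.getsampwidth(), rate, frames / rate)

# Demo: synthesize a 1-second mono 16-bit 16 kHz file standing in for output.wav.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # pcm_s16le -> 2 bytes per sample
    w.setframerate(16000)
    w.writeframes(struct.pack("<16000h", *([0] * 16000)))

print(describe_wav("demo.wav"))  # (1, 2, 16000, 1.0)
```

If the tuple doesn't come back as mono/2-byte/16000 Hz, the ffmpeg conversion didn't stick.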

I'd be more than happy to supply more details if needed.

Thank you

Zenger added the bug label Dec 26, 2024

@micedevai commented:

Blank Audio Output with F5-TTS Model

When attempting to generate audio using the F5-TTS model, the resulting audio output is always a blank file with a duration of exactly 00:00:01 seconds. Despite following the installation and usage instructions, including troubleshooting steps (such as forcing fp32 precision), no audio is generated, and no visible errors are encountered.

The issue occurs both when using the command-line interface (f5-tts_infer-cli) and when running the model through a Gradio interface. Below are the system details and steps to reproduce the issue.


Environment Details:

  • Operating System: Ubuntu 22.04.5 LTS
  • GPU: NVIDIA GeForce GTX 1660 Ti
  • Python Version: Python 3.10.12
  • CUDA Version: 12.1
  • Torch and Torchaudio Versions:
    • torch==2.5.1+cu121
    • torchaudio==2.5.1+cu121

Steps to Reproduce:

  1. Set up a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
  2. Install the required dependencies:

    pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
  3. Clone the repository and install the package:

    git clone https://github.com/SWivid/F5-TTS.git
    cd F5-TTS
    pip install -e .
  4. Attempt to generate audio using the following command:

    f5-tts_infer-cli --model "F5-TTS" --ref_audio ../in.wav --ref_text "You were the one responsible for sinking of the Royal Fleet..." --gen_text "Where is my brother? Answer me!"
  5. Run Gradio interface to attempt audio generation:

    f5-tts_infer-gradio --port 7777 --host 0.0.0.0

Despite these steps, the output audio file (tests/infer_cli_basic.wav or through Gradio) is blank, and has a duration of 00:00:01.


Expected Behavior:

  • The TTS system should generate and output speech audio from the provided reference audio and text.

Actual Behavior:

  • No audio is generated. The file output is a 1-second-long blank audio file, and there are no error messages or indications of failure in the console or logs.

Troubleshooting Steps Taken:

  • Forced fp32 Precision: As suggested in #356 ("Error in inference. Audio output with no content, all silence"), the code was modified to force fp32 precision, but this did not resolve the issue.

    dtype = torch.float32  # Forced fp32 precision
  • FFmpeg Issues: I encountered an issue related to the pcm_s4le codec, but I resolved it by converting the input WAV file using:

    ffmpeg -i input.wav -c:a pcm_s16le -ar 16000 output.wav
  • No Visible Errors: Both the command-line output and Gradio interface display no errors. The logs indicate that the model is loading and processing the input correctly, but no valid audio is produced.
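For context on why forcing fp32 is the usual workaround attempted here: the default dtype selection in the patched functions picks fp16 on any CUDA device with compute capability major version >= 6, which includes the GTX 1660 Ti (Turing), a card frequently reported to produce broken fp16 output. A pure-Python replica of that decision, runnable without a GPU (a sketch mirroring the condition in the patched code above; function and argument names are mine, not F5-TTS's):

```python
def pick_dtype(device_name, cc_major, device="cuda"):
    """Replicate the default dtype choice from the patched functions, GPU-free."""
    use_fp16 = (
        "cuda" in device
        and cc_major >= 6
        and not device_name.endswith("[ZLUDA]")
    )
    return "float16" if use_fp16 else "float32"

# The GTX 1660 Ti reports compute capability major version 7,
# so the unpatched default would select float16 on this card.
print(pick_dtype("NVIDIA GeForce GTX 1660 Ti", 7))  # float16
print(pick_dtype("Some GPU [ZLUDA]", 7))            # float32
```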


Additional Information:

  • Gradio Warning: The following warning appeared when running Gradio:

    /home/zenger/Documents/F5-TTS/.venv/lib/python3.10/site-packages/gradio/processing_utils.py:738: UserWarning: Trying to convert audio automatically from float32 to 16-bit int format.

    This might be related to the issue, but I believe the problem is not with Gradio but rather with the audio generation process.


Suggested Next Steps:

  • Check Model Output: Ensure that the model's output is being correctly processed before being converted into an audio file. Investigate the tensor values or any potential empty outputs.

  • Model Validation: Test the model on a simpler case or with a different input audio/text combination to rule out potential issues with specific inputs.

  • Audio Saving Mechanism: Investigate the code responsible for saving the generated audio. It might be prematurely closing the audio file or encountering a silent signal.
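The first suggestion ("Check Model Output") could be sketched as a guard right before the audio is written to disk; generated_samples and the threshold are hypothetical names for illustration, not identifiers from the F5-TTS codebase:

```python
def looks_silent(samples, threshold=1e-4):
    """True if every sample's magnitude is below threshold, i.e. effectively blank."""
    return all(abs(s) < threshold for s in samples)

# Hypothetical usage before saving: warn instead of silently writing a blank file.
generated_samples = [0.0] * 24000  # stand-in for the model's output waveform
if looks_silent(generated_samples):
    print("warning: generated waveform is silent; check model output upstream")
```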


If any further details are required, or if you'd like specific logs or configurations, I'd be happy to supply them.

SWivid (Owner) commented Dec 26, 2024

ref_audio_ /tmp/tmpist0frdt.wav

try ctrl+left_click on /tmp/tmpist0frdt.wav and check if it's a normal audio file (e.g. how many seconds long is this sample, and does it contain proper content that matches the ref_text?)

Zenger (Author) commented Dec 27, 2024

The tmp file does have audio, although slightly shorter (00:00:08.35 vs 00:00:08.46), but it does play and sounds like the original in.wav

ffmpeg -i tmpist0frdt.wav -hide_banner
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from '.\tmpist0frdt.wav':
  Duration: 00:00:08.35, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s

and the original in.wav for reference

ffmpeg -i in.wav -hide_banner
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from '.\in.wav':
Metadata:
  encoder         : Lavf59.16.100
Duration: 00:00:08.46, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
At least one output file must be specified

SWivid (Owner) commented Dec 27, 2024

emmm, so the input is actually fine

how is the loading result, e.g. can torchaudio properly load the audio?

audio, sr = torchaudio.load(ref_audio)

could try printing out audio to see if it has proper content inside or is just null

if null, maybe it's a version conflict between ffmpeg/sox-io/another audio backend and torchaudio (or reinstalling the backend might help)
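That check can also be done without going through torchaudio's backend at all, which helps separate "bad file" from "bad backend": read the temp file with the stdlib and look at its peak amplitude; a near-zero peak would mean the audio is effectively null. A sketch, with a synthetic file standing in for /tmp/tmpist0frdt.wav:

```python
import struct
import wave

def peak_amplitude(path):
    """Peak absolute sample of a 16-bit PCM WAV, normalized to [0.0, 1.0]."""
    with wave.open(path, "rb") as w:
        data = w.readframes(w.getnframes())
    samples = struct.unpack(f"<{len(data) // 2}h", data)
    return max((abs(s) for s in samples), default=0) / 32768.0

# Demo: a quarter-amplitude square wave; a blank file would report ~0.0 instead.
with wave.open("probe.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    tone = [8192 if i % 2 == 0 else -8192 for i in range(16000)]
    w.writeframes(struct.pack("<16000h", *tone))

print(peak_amplitude("probe.wav"))  # 0.25
```

If the stdlib sees real samples but torchaudio.load returns zeros, that points at the backend conflict described above.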
