
[Bug] Tags < and > are removed during inference when text_frontend=True #743

Open
youngercloud opened this issue Dec 17, 2024 · 2 comments

@youngercloud

Bug Report

Description

When setting text_frontend=True (or leaving it as the default), the < and > tags are removed from the text during inference.

For example:

INFO synthesis text 这也strong太strong离谱了吧!

To Reproduce

from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# Initialize the CosyVoice2 model
cosyvoice = CosyVoice2(
    'pretrained_models/CosyVoice2-0.5B',
    load_jit=True,
    load_onnx=False,
    load_trt=False
)

# Load the prompt audio and resample it to 16 kHz
audio_file_path = 'audio/48k.wav'
prompt_speech_16k = load_wav(audio_file_path, 16000)

# Cross-lingual inference; the <strong> tags are intended as fine-grained control markers
for i, j in enumerate(cosyvoice.inference_cross_lingual(
    '这也<strong>太</strong>离谱了吧!',
    prompt_speech_16k,
    stream=False
)):
    torchaudio.save(
        'fine_grained_control_{}.wav'.format(i),
        j['tts_speech'],
        cosyvoice.sample_rate
    )

Expected Behavior

The inference should preserve the < and > tags as part of the input text.

这也<strong>太</strong>离谱了吧!

Actual Behavior

The tags < and > are removed, resulting in:

这也strong太strong离谱了吧
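
To narrow this down, one can run the frontend normalization by itself and inspect its output. This is a minimal sketch, continuing from the repro script above, and it assumes the frontend is exposed as cosyvoice.frontend with a text_normalize(text, split=True) method as in cosyvoice/cli/frontend.py; adjust to your version.

# Sketch: inspect what the text frontend returns, independent of synthesis.
# If the printed segments already lack '<' and '>', the tags are stripped during
# text normalization rather than later in the pipeline.
text = '这也<strong>太</strong>离谱了吧!'
for segment in cosyvoice.frontend.text_normalize(text, split=True):
    print(segment)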

Environment


  • Operating System: WSL2
  • Python Version: Python 3.10 (conda)
@darkacorn

Interesting .. maybe it's just a display thing, since it does work (in cross-lingual mode, that is).

@youngercloud
Author

@darkacorn I reinstalled CosyVoice2 step by step on a new Ubuntu-based machine and got the log below.

2024-12-20 15:17:49,499 INFO synthesis text 这也<strong>太</strong>。

However, during actual inference the model consumes the post-processed string, which is incorrect, and outputs the wrong audio.

The reason for reinstalling was to check whether WeTextProcessing works properly; it handles the sentence above without problems.

Anyway, I will set text_frontend to False as a workaround and leave this issue open for a while to see if anyone else runs into the same problem.
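
For reference, here is a minimal sketch of that workaround, continuing from the repro script above. It assumes inference_cross_lingual accepts a text_frontend keyword argument, as in recent versions of the repo.

# Workaround sketch: skip text normalization so the raw tags reach the model.
for i, j in enumerate(cosyvoice.inference_cross_lingual(
    '这也<strong>太</strong>离谱了吧!',
    prompt_speech_16k,
    stream=False,
    text_frontend=False  # assumed keyword; disables the text frontend
)):
    torchaudio.save(
        'fine_grained_control_{}.wav'.format(i),
        j['tts_speech'],
        cosyvoice.sample_rate
    )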
