A simple WebUI that lets you run inference on VITS TTS models.
Also comes with API support to interact with other processes.
- VITS Text-to-Speech
- GPU Acceleration
- Support for Multiple Models
- Automatic Language Recognition & Processing
- Customize Parameters
- Batch Processing for Long Text
- Paths are no longer hardcoded in the config. The project is now portable.
- Model paths are loaded automatically. No more manually editing the entries in the config.
- Prioritize PyTorch with Nvidia GPU support (built on CUDA 11.8). Edit `requirements.txt` if using other CUDA versions.
- Should no longer throw issues when installing `fasttext`, at least on Windows.
- Cleaned up a few entries of the config.
- Removed everything Docker-related.
- By default, only VITS models are supported. You will need to edit `config.py` and some other scripts to use VITS2, etc.

Some original features might be missing.
Open the console at the target location, then run the following:

```
git clone https://github.com/HaomingXR/vits-webui
```
- Create a virtual environment using the Python installed on your system (tested on 3.10.10)

```
python -m venv venv
venv\scripts\activate
```
- Download the self-contained Python runtime, Windows Embeddable Package
- Open the `python3<version>._pth` file with a text editor
- Uncomment the `import site` line (see the example after this list)
- Then, download and run get-pip.py to install `pip`
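For reference, the `._pth` file inside the embeddable package is only a few lines long. The exact file name and contents depend on the Python version; for a 3.10 runtime it typically looks like this once `import site` is uncommented:

```
python310.zip
.

# Uncomment to run site.main() automatically
import site
```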
Edit `requirements.txt` if using other CUDA versions, or if not using an Nvidia GPU.

```
pip install -r requirements.txt
```
Run the following command to start the service:

```
python app.py
```

On Windows, you can also run `webui.bat` to directly launch the service. Edit the file to point to the Python runtime.
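If you need a reference point, a launcher like `webui.bat` could be as simple as the following sketch (the runtime folder name is a hypothetical example; point it at wherever you extracted the embeddable package, or at your venv's Python):

```bat
@echo off
rem Hypothetical path to the embeddable Python runtime; adjust to your setup
set PYTHON=python-3.10.10-embed-amd64\python.exe

%PYTHON% app.py
pause
```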
- You may find various VITS models online, usually on HuggingFace Spaces
- Download the VITS model files (including both `.pth` and `.json` files)
- Place both the model and config into their own folder, then place the folder inside the `models` directory (see the example layout after this list)
- On launch, the system should automatically detect the models
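For example, a `models` directory holding two models might look like this (the folder and file names are arbitrary; only the one-folder-per-model layout and the `.pth`/`.json` pairing matter):

```
models/
├── my-vits-model/
│   ├── model.pth
│   └── config.json
└── another-model/
    ├── G_latest.pth
    └── config.json
```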
The file `config.py` contains a few default options. After launching the service for the first time, it will generate a `config.yaml` in the directory. All future launches will load this config instead.
The Admin Backend allows loading and unloading models, with login authentication. For added security, you can simply disable the backend in `config.yaml`:

```yaml
'IS_ADMIN_ENABLED': !!bool 'false'
```
When enabled, it will automatically generate a pair of username and password in `config.yaml`.
You can enable this setting so that API usage requires a key to connect:

```yaml
'API_KEY_ENABLED': !!bool 'false'
```
When enabled, it will automatically generate a random key in `config.yaml`.
You can edit this setting to set the local server port for the API:

```yaml
'PORT': !!int '8888'
```
- Return the dictionary mapping of speaker IDs to speakers

```
GET http://127.0.0.1:8888/voice/speakers
```
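A minimal sketch of querying this endpoint from Python (assuming the default port and a locally running server; the exact shape of the returned JSON may vary):

```python
import requests

# Ask the server which speakers are available
resp = requests.get("http://127.0.0.1:8888/voice/speakers")
resp.raise_for_status()

# The endpoint returns a mapping of speaker IDs to speaker names
print(resp.json())
```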
- Return the synthesized audio for the given text (default parameters are used when not specified; see the parameter table and the example below)

```
GET http://127.0.0.1:8888/voice/vits?text=prompt
```
VITS
| Parameter | Required | Default Value | Type | Description |
| --- | --- | --- | --- | --- |
| text | true | | str | Text to speak |
| id | false | From `config.yaml` | int | Speaker ID |
| format | false | From `config.yaml` | str | wav / ogg / mp3 / flac |
| lang | false | From `config.yaml` | str | The language of the text to be synthesized |
| length | false | From `config.yaml` | float | The length of the synthesized speech; the larger the value, the slower the speech |
| noise | false | From `config.yaml` | float | The randomness of the synthesis |
| noisew | false | From `config.yaml` | float | The length of phoneme pronunciation |
| segment_size | false | From `config.yaml` | int | Divide the text into segments based on punctuation marks |
| streaming | false | false | bool | Stream the synthesized speech for a faster initial response |
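A short sketch of calling the synthesis endpoint from Python and saving the result to a file (the parameter values are only examples; any parameter you omit falls back to the defaults in `config.yaml`):

```python
import requests

params = {
    "text": "Hello world",  # required: text to speak
    "id": 0,                # optional: speaker ID
    "format": "wav",        # optional: wav / ogg / mp3 / flac
    "length": 1.0,          # optional: larger values slow the speech down
}

# Request the synthesized audio and write the returned bytes to disk
resp = requests.get("http://127.0.0.1:8888/voice/vits", params=params)
resp.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(resp.content)
```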
Check the original repos for more info:
- vits: https://github.com/jaywalnut310/vits
- MoeGoe: https://github.com/CjangCjengh/MoeGoe
- vits-uma-genshin-honkai: https://huggingface.co/spaces/zomehwh/vits-uma-genshin-honkai
- vits-models: https://huggingface.co/spaces/zomehwh/vits-models