We're releasing Ultravox 0.4.1 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox Realtime APIs, v0.4.1 is the new default.
We'd love to hear feedback on your experience with Ultravox, along with feature suggestions.
What's New
v0.4.1 improves on v0.4 in the following ways:
- We've upgraded the Whisper encoder from Whisper-medium to Whisper-large-v3-turbo. This has led to quality improvements (see the table below).
- We're adding six new languages: Chinese, Dutch, Hindi, Swedish, Turkish, and Ukrainian. That brings the total supported languages to 15 (see table below).
- We've increased the amount of English training data.
15 Languages Supported
Language | ISO Code |
---|---|
Arabic | ar |
Chinese | zh |
Dutch | nl |
English | en |
French | fr |
German | de |
Hindi | hi |
Italian | it |
Japanese | ja |
Portuguese | pt |
Russian | ru |
Spanish | es |
Swedish | sv |
Turkish | tr |
Ukrainian | uk |
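For client code that needs to check a language before sending audio, the table above can be captured as a small lookup. This is only a sketch: the names `SUPPORTED_LANGUAGES`, `NEW_IN_0_4_1`, and `is_supported` are hypothetical and not part of any Ultravox API.

```python
# ISO codes and language names copied from the table above.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "zh": "Chinese",
    "nl": "Dutch",
    "en": "English",
    "fr": "French",
    "de": "German",
    "hi": "Hindi",
    "it": "Italian",
    "ja": "Japanese",
    "pt": "Portuguese",
    "ru": "Russian",
    "es": "Spanish",
    "sv": "Swedish",
    "tr": "Turkish",
    "uk": "Ukrainian",
}

# The six languages these release notes list as new in v0.4.1.
NEW_IN_0_4_1 = {"zh", "nl", "hi", "sv", "tr", "uk"}


def is_supported(code: str) -> bool:
    """Return True if the ISO code appears in the supported-language table."""
    return code.strip().lower() in SUPPORTED_LANGUAGES


if __name__ == "__main__":
    print(len(SUPPORTED_LANGUAGES), "languages in the table")
    print("Swedish supported:", is_supported("sv"))
```

A code missing from this table is not necessarily unusable; the table only reflects the languages listed as supported in this release.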
Evals
Our primary method of evaluation is speech translation, measured by BLEU, as a proxy for general instruction-following capability (higher is better). Catalan (ca) is an example of model performance for a language not included in training.
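To give a rough sense of what a BLEU score measures, here is a simplified sketch in plain Python. It is not the evaluation code behind the numbers in this release: it assumes whitespace tokenization, a single reference, and no smoothing, and it scores one sentence at a time, whereas reported BLEU is usually computed over a whole test set with a standard tool. The function name `simple_bleu` is hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU on a 0-100 scale (illustrative only)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matches / total))

    # Brevity penalty: hypotheses shorter than the reference are penalized.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(simple_bleu("the cat sat on the mat today", "the cat sat on the mat"))
```

In a speech-translation eval, the hypothesis would be the model's text output for an audio clip and the reference a human translation of that clip.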
Ultravox 70B
Language pair | Ultravox 0.4 70B | Ultravox 0.4.1 70B |
---|---|---|
en_ar | 14.97 | 19.64 |
en_de | 30.30 | 32.47 |
es_en | 39.55 | 40.76 |
ru_en | 44.16 | 45.07 |
en_ca | 35.02 | 37.58 |
zh_en | 12.16 | 17.98 |
Ultravox 8B
Language pair | Ultravox 0.4 8B | Ultravox 0.4.1 8B |
---|---|---|
en_ar | 11.17 | 12.28 |
en_de | 25.47 | 27.13 |
es_en | 37.11 | 39.16 |
ru_en | 38.96 | 39.65 |
en_ca | 27.46 | 29.94 |
zh_en | 10.08 | 14.55 |
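The per-pair improvement is easier to compare once the two columns are turned into deltas. The sketch below copies the scores from the tables above and computes the absolute and relative BLEU gain for each pair; the names `BLEU_70B`, `BLEU_8B`, and `gains` are hypothetical helpers for illustration.

```python
# BLEU scores copied from the tables above: pair -> (v0.4, v0.4.1).
BLEU_70B = {
    "en_ar": (14.97, 19.64),
    "en_de": (30.30, 32.47),
    "es_en": (39.55, 40.76),
    "ru_en": (44.16, 45.07),
    "en_ca": (35.02, 37.58),
    "zh_en": (12.16, 17.98),
}
BLEU_8B = {
    "en_ar": (11.17, 12.28),
    "en_de": (25.47, 27.13),
    "es_en": (37.11, 39.16),
    "ru_en": (38.96, 39.65),
    "en_ca": (27.46, 29.94),
    "zh_en": (10.08, 14.55),
}


def gains(scores):
    """Return {pair: (absolute_gain, relative_gain_percent)} for one table."""
    return {
        pair: (round(new - old, 2), round(100 * (new - old) / old, 1))
        for pair, (old, new) in scores.items()
    }


if __name__ == "__main__":
    for name, table in (("70B", BLEU_70B), ("8B", BLEU_8B)):
        print(f"Ultravox {name}")
        for pair, (absolute, relative) in gains(table).items():
            print(f"  {pair}: +{absolute:.2f} BLEU ({relative:+.1f}%)")
```

Every pair improves for both model sizes, and zh_en shows the largest absolute gain in each table (5.82 points for 70B, 4.47 for 8B).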
Training
This version of Ultravox continues to use a frozen Llama 3.1 pre-trained core (for both 8B and 70B), but we've significantly increased the size of the data and the overall training time. The speech adapter was trained on >10k hours of multilingual speech data. The training time on 8xH100s is about 24 hours for the 8B model and 3 days for the 70B model.
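As a back-of-the-envelope check, the wall-clock times above can be converted into GPU-hours. This sketch assumes a single 8-GPU node for both runs and treats "about 24 hours" and "3 days" as exactly 24 and 72 hours, so the results are rough estimates rather than measured figures. The names used here are hypothetical.

```python
GPUS = 8  # "8xH100s" in the text; a single 8-GPU node is assumed

# Approximate wall-clock hours from the text ("3 days" taken as 72 hours).
WALL_CLOCK_HOURS = {"8B": 24, "70B": 3 * 24}


def gpu_hours(wall_clock_hours: float, gpus: int = GPUS) -> float:
    """Estimate GPU-hours as wall-clock time multiplied by the GPU count."""
    return wall_clock_hours * gpus


if __name__ == "__main__":
    for model, hours in WALL_CLOCK_HOURS.items():
        print(f"{model}: ~{gpu_hours(hours)} H100 GPU-hours")
```

Under these assumptions the 8B run comes to roughly 192 H100 GPU-hours and the 70B run to roughly 576.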
What's Changed
- Bugfix: push_to_hub to use correct model to test by @farzadab in #98
- Integrating OAI evals post training by @farzadab in #85
- Make sure do_eval works without do_train by @farzadab in #100
- Add AutoProcessor registration by @petersalas in #102
- Support num epochs in config by @liPatrick in #90
- Assert dataset length when using epochs by @liPatrick in #104
- Add chunking to ds_tool by @liPatrick in #97
- max_duration for Mosaic jobs by @farzadab in #112
- Not uploading text_config when text_model_id is present by @farzadab in #108
- [70B-Part1] Prefetch weights separately by @farzadab in #106
- [Bugfix] Dot in output_dir causes evals to fail by @farzadab in #115
- Update oaieval dependency by @farzadab in #114
- Bugfix for path replace by @farzadab in #116
- [70B-Part2] Improved save model (that can work with FSDP) by @farzadab in #107
- [70B-Part3] FSDP Training by @farzadab in #109
- [70B-Part4] Config and init_empty_weights by @farzadab in #117
- Update README: use cases for Ultravox training by @farzadab in #118
- Create test for config_base.py by @farzadab in #119
- Using fixie-ai version of peoples_speech by @farzadab in #125
- Dataset Tool to add Timestamps by @farzadab in #121
New Contributors
- @petersalas made their first contribution in #102
Full Changelog: v0.4...v0.4.1