Release v0.4.1 · fixie-ai/ultravox

We're releasing Ultravox 0.4.1 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox Realtime APIs, v0.4.1 is the new default.

We'd love to hear feedback on your experience with Ultravox, along with feature suggestions.

What's New

v0.4.1 improves upon 0.4 in the following ways:

We've upgraded the Whisper encoder from Whisper-medium to Whisper-large-v3-turbo. This has led to quality improvements (see the Evals tables below).
We're adding six new languages: Chinese, Dutch, Hindi, Swedish, Turkish, and Ukrainian. That brings the total supported languages to 15 (see table below).
We've increased the amount of English training data.

15 Languages Supported

Language	ISO Code
Arabic	ar
Chinese	zh
Dutch	nl
English	en
French	fr
German	de
Hindi	hi
Italian	it
Japanese	ja
Portuguese	pt
Russian	ru
Spanish	es
Swedish	sv
Turkish	tr
Ukrainian	uk
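
For callers who want to check a language code before sending audio, the table above can be expressed as a simple lookup. This is a minimal sketch, not part of the Ultravox codebase: the names `SUPPORTED_LANGUAGES`, `NEW_IN_0_4_1`, and `is_supported` are hypothetical and chosen only for illustration.

```python
# ISO 639-1 code -> language name, copied from the table above.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "zh": "Chinese", "nl": "Dutch", "en": "English",
    "fr": "French", "de": "German", "hi": "Hindi", "it": "Italian",
    "ja": "Japanese", "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "sv": "Swedish", "tr": "Turkish", "uk": "Ukrainian",
}

# The six languages listed as new in v0.4.1.
NEW_IN_0_4_1 = {"zh", "nl", "hi", "sv", "tr", "uk"}


def is_supported(iso_code: str) -> bool:
    """Return True if the ISO 639-1 code appears in the table (hypothetical helper)."""
    return iso_code.strip().lower() in SUPPORTED_LANGUAGES


# Languages that were already listed before this release.
carried_over = set(SUPPORTED_LANGUAGES) - NEW_IN_0_4_1
print(len(SUPPORTED_LANGUAGES), len(carried_over))
```

Other languages may still work to some degree (the evals below include one language not seen in training), so a check like this is best treated as a hint rather than a hard gate.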

Evals

Our primary method of evaluation is speech translation, measured by BLEU, as a proxy for general instruction-following capability (higher is better). ca (Catalan) is an example of model performance on a language not included in training.
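
For readers unfamiliar with the metric, here is a simplified sketch of corpus-level BLEU: the geometric mean of modified 1- to 4-gram precisions, multiplied by a brevity penalty. It is not the evaluation code behind the numbers below. It assumes whitespace tokenization, one reference per hypothesis, and no smoothing, so its scores will differ from those of standard tools. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (illustrative sketch only)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(f"{score:.2f}")
```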

Ultravox 70B

Language pair	Ultravox 0.4 70B	Ultravox 0.4.1 70B
en_ar	14.97	19.64
en_de	30.30	32.47
es_en	39.55	40.76
ru_en	44.16	45.07
en_ca	35.02	37.58
zh_en	12.16	17.98

Ultravox 8B

Language pair	Ultravox 0.4 8B	Ultravox 0.4.1 8B
en_ar	11.17	12.28
en_de	25.47	27.13
es_en	37.11	39.16
ru_en	38.96	39.65
en_ca	27.46	29.94
zh_en	10.08	14.55
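
To make the per-pair gains easier to read, the short script below computes the absolute BLEU change from 0.4 to 0.4.1 using the figures in the two tables above. The scores are copied verbatim from the tables. The names `SCORES` and `bleu_gains` are hypothetical.

```python
# (v0.4 BLEU, v0.4.1 BLEU) per language pair, copied from the tables above.
SCORES = {
    "70B": {
        "en_ar": (14.97, 19.64), "en_de": (30.30, 32.47), "es_en": (39.55, 40.76),
        "ru_en": (44.16, 45.07), "en_ca": (35.02, 37.58), "zh_en": (12.16, 17.98),
    },
    "8B": {
        "en_ar": (11.17, 12.28), "en_de": (25.47, 27.13), "es_en": (37.11, 39.16),
        "ru_en": (38.96, 39.65), "en_ca": (27.46, 29.94), "zh_en": (10.08, 14.55),
    },
}


def bleu_gains(model_size):
    """Return {pair: absolute BLEU gain}, rounded to two decimals."""
    return {
        pair: round(new - old, 2)
        for pair, (old, new) in SCORES[model_size].items()
    }


for size in SCORES:
    gains = bleu_gains(size)
    best = max(gains, key=gains.get)
    print(f"{size}: largest gain on {best} (+{gains[best]:.2f} BLEU)")
    for pair, gain in gains.items():
        print(f"  {pair}: +{gain:.2f}")
```

Every pair improves for both model sizes. The largest gain in each case is on zh_en, which is consistent with Chinese being one of the newly added languages.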

Training

This version of Ultravox continues to use a frozen Llama 3.1 pre-trained core (for both 8B and 70B), but we've significantly increased the amount of training data and the overall training time. The speech adapter was trained on more than 10k hours of multilingual speech data. Training on 8xH100s takes about 24 hours for the 8B model and about 3 days for the 70B model.
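
As a rough guide for budgeting, the training times above can be converted to GPU-hours. This is back-of-the-envelope arithmetic only. It assumes all eight GPUs are in use for the full wall-clock duration and treats the approximate figures from the text ("about 24 hours", "3 days") as exact.

```python
GPUS = 8  # "8xH100s" in the text above

# Approximate wall-clock training time in hours, from the text above.
WALL_CLOCK_HOURS = {"8B": 24, "70B": 3 * 24}


def gpu_hours(wall_clock_hours, gpus=GPUS):
    """Total GPU-hours, assuming every GPU is busy for the whole run."""
    return wall_clock_hours * gpus


for model, hours in WALL_CLOCK_HOURS.items():
    print(f"{model}: roughly {gpu_hours(hours)} H100 GPU-hours")
```

Under these assumptions the 8B run comes to roughly 192 H100 GPU-hours and the 70B run to roughly 576.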

What's Changed

Bugfix: push_to_hub to use correct model to test by @farzadab in #98
Integrating OAI evals post training by @farzadab in #85
Make sure do_eval works without do_train by @farzadab in #100
Add AutoProcessor registration by @petersalas in #102
Support num epochs in config by @liPatrick in #90
Assert dataset length when using epochs by @liPatrick in #104
Add chunking to ds_tool by @liPatrick in #97
max_duration for Mosaic jobs by @farzadab in #112
Not uploading text_config when text_model_id is present by @farzadab in #108
[70B-Part1] Prefetch weights separately by @farzadab in #106
[Bugfix] Dot in output_dir causes evals to fail by @farzadab in #115
Update oaieval dependency by @farzadab in #114
Bugfix for path replace by @farzadab in #116
[70B-Part2] Improved save model (that can work with FSDP) by @farzadab in #107
[70B-Part3] FSDP Training by @farzadab in #109
[70B-Part4] Config and init_empty_weights by @farzadab in #117
Update README: use cases for Ultravox training by @farzadab in #118
Create test for config_base.py by @farzadab in #119
Using fixie-ai version of peoples_speech by @farzadab in #125
Dataset Tool to add Timestamps by @farzadab in #121

New Contributors

@petersalas made their first contribution in #102

Full Changelog: v0.4...v0.4.1
