Ultravox 0.3 #78
zkoch announced in Announcements
Hey everyone,
We're officially making Ultravox 0.3 available today. The weights have been pushed to Hugging Face (along with updated datasets for training), and the model training code has been updated as well. We're also opening up early preview access to our Ultravox APIs through our managed service. For more information on that, see https://fixie-ai.github.io/ultradox/
v0.3 demonstrates substantially improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (higher is better).
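For readers unfamiliar with the metric, here is a minimal sketch of how a BLEU score can be computed from model translations and reference translations. It is a simplified illustration, not our evaluation harness: it assumes whitespace tokenization, one reference per hypothesis, and no smoothing, and the function names (`ngrams`, `corpus_bleu`) are made up for this example. Standard tools handle tokenization and smoothing more carefully, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: whitespace tokenization, one reference per hypothesis,
    no smoothing (any n-gram order with zero matches gives a score of 0).
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize outputs that are shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

In a zero-shot speech translation setup, `hypotheses` would be the text the model produces when given audio plus a translation instruction, and `references` the human translations.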
This version of Ultravox uses a frozen Llama 3.1 8B pre-trained core. The speech adapter was trained on 2.5k hours of speech from both LibriSpeech and CommonVoice. The training time on 8xH100s is roughly 80 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.2 was trained on ~1.5k hours of audio.
In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we're augmenting the ASR datasets with synthetic data in the form of generated continuations. The second is that we've migrated to a knowledge distillation approach for calculating loss. Together, these changes result in much better speech-to-text alignment in the adapter. You can learn more in their respective papers.
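To give a rough feel for what a knowledge distillation loss looks like, here is a small pure-Python sketch. It is an illustration under stated assumptions rather than the training code: it assumes the "teacher" logits come from the frozen LLM reading the text transcript and the "student" logits come from the same LLM reading the adapter's audio embeddings, and that the loss is the KL divergence between the two next-token distributions. The names `softmax`, `kd_loss`, and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert a list of logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions.

    Both arguments are lists of per-position logit lists over the same
    vocabulary. Assumption: teacher = text input, student = audio input.
    """
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # teacher distribution
        q = softmax(s_row, temperature)  # student distribution
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss / len(teacher_logits)


# Toy example: two token positions, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.5, 2.0]]
print(kd_loss(teacher, student))
```

The loss is zero when the student matches the teacher's distribution exactly and grows as they diverge, which is why minimizing it pushes audio inputs to behave like their text equivalents.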
The key benefit of better adapter alignment is that it makes it easier to customize Ultravox to particular needs and use cases: the adapter can extend any pre-trained LLM (including fine-tuned versions) with speech capabilities while retaining core capabilities across modalities. If this is something that interests you, please get in touch.
We'd love feedback on the model, so please let us know what works well and what doesn't. To make testing easier, we built a new Gradio demo. To launch it, run
just gradio
inside the Ultravox folder.