
Gradio demo for real-time conversations with WebRTC #150

Merged · 9 commits · Dec 12, 2024

Conversation

freddyaboulton
Contributor

This PR adds a Gradio demo for real-time conversations with the latest Ultravox model. The demo leverages the WebRTC custom component for low-latency audio streaming, both locally and on remote servers such as EC2 and Hugging Face Spaces.

You can see the demo running here

ultravox-demo.mp4

Key Features:

  • WebRTC for low-latency streaming no matter where the demo runs
  • Automatic voice detection: once a pause is detected, the audio is passed to a custom Python function to generate a response
  • Intuitive UI: the entire conversation is displayed in a chatbot UI
  • Inference example: shows how to run inference for the model (Add basic Jupyter notebook for inference #11)
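The "reply on pause" behavior above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the demo's actual code: buffer audio frames, track RMS energy, and declare a pause once enough consecutive quiet frames arrive. The frame size, threshold, and function names are all illustrative assumptions.

```python
# Illustrative pause detection via RMS energy; thresholds are made up.
import math

def rms(frame):
    """Root-mean-square energy of one audio frame (a list of float samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_pause(frames, energy_threshold=0.01, min_silent_frames=3):
    """Return the index of the frame where a pause (a run of quiet frames)
    completes, or None if the user is still speaking."""
    silent = 0
    for i, frame in enumerate(frames):
        silent = silent + 1 if rms(frame) < energy_threshold else 0
        if silent >= min_silent_frames:
            return i
    return None
```

In a real streaming demo the callback that generates a response would fire as soon as `detect_pause` returns an index; production VADs (e.g. model-based ones) are more robust than a plain energy threshold.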

Improvements (need help from community/model authors!):

  • It's unclear (to me at least) how to stream outputs from the transformers pipeline. Streaming the output makes the demo feel faster by reducing the time until text first appears. Changing the Gradio demo to stream the text is trivial once I know how to stream it from the pipeline.
  • It's unclear how to properly handle multi-turn audio conversations with this model, so I'm using Whisper to transcribe the user input and store it in the chat history for the next turn. This adds latency, so I'd like to get rid of it if possible.
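To make the streaming point concrete, here is a minimal stand-in sketch of the pattern a Gradio generator function would use: yield the accumulated partial text after each token instead of waiting for the full generation. The token source here is a fake generator, not the actual transformers pipeline API.

```python
# Stand-in for a model's token stream; a real demo would consume tokens
# as the model generates them.
def fake_token_stream(text):
    for token in text.split():
        yield token

def stream_response(token_iter):
    """Accumulate tokens and yield the partial text after each one,
    which is what a Gradio generator function would return per UI update."""
    partial = ""
    for token in token_iter:
        partial = (partial + " " + token).strip()
        yield partial

updates = list(stream_response(fake_token_stream("hello there world")))
# each element is one UI refresh, so text appears as soon as tokens arrive
```

The total generation time is unchanged; what improves is time-to-first-text, which is what makes a voice demo feel responsive.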

@eschmidbauer

@freddyaboulton just curious: why are you using Whisper to transcribe?
Ultravox is capable of taking audio as input; see ultravox/tools/gradio_demo.py for an example.

@freddyaboulton
Contributor Author

Hi @eschmidbauer! The audio is passed directly to Ultravox in my demo, but I used Whisper to pass the previous audio prompts as text in the turns parameter. I was going by this code, and it seems the turns can only be text. In the demo you linked, it seems that only the current audio message is taken into account and previous audio messages are not used to generate a response. Is that correct? Please correct me if I'm wrong; I'll be happy to modify the demo to follow best practices!
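The workaround described here can be sketched as follows. This is a hedged illustration of the pattern, assuming a text-only turns format; `transcribe` is a placeholder for an ASR call (e.g. Whisper), not a real API.

```python
# Sketch of keeping chat history as text when only the turns parameter
# accepts text. `transcribe` is a hypothetical stand-in for an ASR model.
def transcribe(audio):
    # Placeholder: a real implementation would run speech-to-text here.
    return audio["text"]

def build_turns(history, new_user_audio, assistant_reply=None):
    """Return a new text-only turns list with the transcribed user turn
    (and optional assistant turn) appended; the input history is not mutated."""
    turns = list(history)
    turns.append({"role": "user", "content": transcribe(new_user_audio)})
    if assistant_reply is not None:
        turns.append({"role": "assistant", "content": assistant_reply})
    return turns
```

The transcription step is exactly the extra latency discussed above; if the model accepted past audio turns directly, `transcribe` could be dropped.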

@freddyaboulton
Contributor Author

BTW the issue with reload mode loading models twice has since been fixed!

@eschmidbauer

eschmidbauer commented Nov 22, 2024

@freddyaboulton
Do you still need to transcribe and add it to the turn when conversation_mode=True?

https://github.com/fixie-ai/ultravox/blob/main/ultravox/tools/gradio_helper.py#L15C13-L15C36

@farzadab
Contributor

farzadab commented Nov 26, 2024

Thanks @freddyaboulton!

Two quick replies to your questions:

It's unclear (to me at least) how to stream outputs from the transformers pipeline

AFAIU the pipeline abstraction is not designed for streaming use cases. I might be wrong, though; I'll take a look. In the meantime, you can look at infer_tool, which supports both batched and streaming modes: https://github.com/fixie-ai/ultravox/blob/main/ultravox/tools/infer_tool.py#L83

We've had other issues with pipeline as well, for example with batched processing, so the current pipeline implementation is lacking in many ways.
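The batched-vs-streaming distinction mentioned here can be sketched as a pair of call styles on one inference object: a blocking call that returns the final string, and a generator that yields tokens as they are produced. The class and method names below are illustrative and do not mirror infer_tool's actual signatures.

```python
# Toy illustration of dual-mode inference; not the repo's real interface.
class TinyInference:
    def __init__(self, reply_tokens):
        self.reply_tokens = reply_tokens

    def infer(self, prompt):
        """Batched style: block until generation finishes, return full text."""
        return " ".join(self.reply_tokens)

    def infer_stream(self, prompt):
        """Streaming style: yield each token as soon as it is 'generated'."""
        for tok in self.reply_tokens:
            yield tok
```

A UI built on the streaming call can render partial text immediately, while a batched caller (e.g. an eval script) just joins the same tokens at the end.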

It's unclear how to properly handle multi-turn audio conversations with this model.

Yes, the current pipeline implementation doesn't support this, and it's long overdue. I'll spend some time on this as soon as I can.

@zqhuang211
Contributor

@freddyaboulton Thanks for submitting this PR—it looks great! Regarding your two questions about streaming output and multi-turn conversation, both are supported in the Gradio demo implemented in ultravox/tools/gradio_demo.py (which requires start/stop recording for each user audio input). I was wondering if it might be possible to adapt some of the ideas from that demo into your Gradio demo?

@freddyaboulton
Contributor Author

Hi @zqhuang211 ! Yes that is a great plan - will update my demo this week :)

@freddyaboulton
Contributor Author

freddyaboulton commented Dec 7, 2024

Hi @zqhuang211 , @farzadab, @eschmidbauer - I have updated the demo to use the infer_tool. You can run it with poetry run python ultravox/tools/gradio_demo.py --voice_mode=True

@freddyaboulton
Contributor Author

2024-12-06.16-15-56.mp4

@zqhuang211
Contributor

@zqhuang211 left a comment:


This looks good. Thank you!

I will make some minor changes from my end.

@zqhuang211
Contributor

@freddyaboulton there are some minor formatting issues. Can you run "just check" and "just format" to make sure it passes the tests? I'll merge it afterwards. Thanks!

@freddyaboulton
Contributor Author

Should be fixed, @zqhuang211. Thanks!

@farzadab farzadab merged commit 87acdb8 into fixie-ai:main Dec 12, 2024
1 check passed