Image + Audio + Text input using Llama 3.2 [DO NOT MERGE] #127

farzadab · 2024-10-01T18:20:18Z

This PR is not in a state to be merged, but it shows how Llama 3.2 can be used to combine image, text, and audio inputs together and get the correct response.

Take a look at llama32_script.py to see how this is done:

- The Llama 3.2 11B Vision Instruct model is loaded.
- Weights from ultravox-v0_4 (trained on Llama 3.1 8B) are loaded without modification.
- The input consists of an image and audio.
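
To illustrate the audio side of this, here is a minimal sketch of how projected audio frames can be spliced into a text embedding sequence at placeholder positions, producing the kind of sequence that would be passed as inputs_embeds. This is not the code in llama32_script.py: the function name, the placeholder id, and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical id marking where audio frames go; a real tokenizer defines its own.
AUDIO_PLACEHOLDER_ID = -1

Vector = List[float]


def splice_audio_embeddings(
    token_ids: Sequence[int],
    text_table: Dict[int, Vector],
    audio_frames: Sequence[Vector],
) -> List[Vector]:
    """Build one embedding per position, filling placeholder slots with audio frames.

    Text tokens are looked up in `text_table`; each placeholder position takes
    the next audio frame, in order.
    """
    n_slots = sum(1 for t in token_ids if t == AUDIO_PLACEHOLDER_ID)
    if n_slots != len(audio_frames):
        raise ValueError(
            f"{n_slots} placeholder slots but {len(audio_frames)} audio frames"
        )
    frames = iter(audio_frames)
    return [
        list(next(frames)) if t == AUDIO_PLACEHOLDER_ID else list(text_table[t])
        for t in token_ids
    ]
```

In this toy form, a prompt with two placeholder positions and two audio frames yields a four-position sequence whose middle two entries are the audio frames. The image is not part of this sequence at all; it is supplied to the model separately.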

Note: before using the script, a few lines in the transformers library need to be manually commented out to allow for this approach:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mllama/modeling_mllama.py#L2152-L2155
These lines don't allow you to specify inputs_embeds when vision input is present. Hopefully we can upstream this change in the future.

kadirnar · 2024-12-04T11:37:03Z

@farzadab, are there any updates on this PR?

farzadab · 2024-12-04T17:48:36Z

@kadirnar this was simply a proof of concept. Unfortunately, combining vision into Ultravox is not part of our roadmap.

What is your use-case?

kadirnar · 2024-12-04T21:52:32Z

> @kadirnar this was simply a proof of concept. Unfortunately, combining vision into Ultravox is not part of our roadmap.
>
> What is your use-case?

I want to create a model like mini-omni2 (linked below). However, the gpt-omni repository doesn't share its training code, and its vocoder isn't good. I'd like to use Ultravox instead, but the model must support image input. Is this possible?

https://github.com/gpt-omni/mini-omni2

farzadab · 2024-12-04T22:22:48Z

The model in this PR can only output text (not speech), but on the input side, yes, it does allow a combination of all three modalities (image, audio, and text) without further training.

Ultravox itself combines audio and text, and Llama 3.2 combines text and image. Since both do this without touching the base LLM, their changes seem to be compatible (as this PR shows).
Keep in mind that I've only tested the combined model on a handful of samples, so the evaluation is not rigorous and the model might fall short in some situations.
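
One sanity check that follows from this reasoning: an audio adapter trained against one base LLM can only be reused with another if the adapter's output width matches the new model's hidden size. The sketch below expresses that check. The class names, the function name, and the value 4096 are assumptions for illustration and should be verified against the actual model configs rather than taken from here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterSpec:
    """Hypothetical description of an audio adapter (projector)."""
    output_dim: int  # width of the vectors the adapter emits


@dataclass(frozen=True)
class TextModelSpec:
    """Hypothetical description of the text decoder receiving the embeddings."""
    hidden_size: int  # width of the token embeddings the decoder expects


def adapter_fits(adapter: AdapterSpec, text_model: TextModelSpec) -> bool:
    """Return True when adapter outputs can stand in for token embeddings."""
    return adapter.output_dim == text_model.hidden_size


# Assumed value: check the real configs before relying on it.
ASSUMED_HIDDEN_SIZE = 4096

adapter = AdapterSpec(output_dim=ASSUMED_HIDDEN_SIZE)
decoder = TextModelSpec(hidden_size=ASSUMED_HIDDEN_SIZE)
compatible = adapter_fits(adapter, decoder)
```

Matching widths are necessary but not sufficient: the adapter was also trained against a specific embedding space, so output quality still needs to be checked on real samples.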

farzadab added 3 commits September 30, 2024 14:58

partial 3.2 support

f5b7715

fix

8957b85

working script of combining audio + image + text

50094bb
