Jitesh Jain*, Zhengyuan Yang, Humphrey Shi†, Jianfeng Gao†, Jianwei Yang†
*Work done during an internship at Microsoft Research, Redmond †Equal Advising
[Project Page] | [arXiv] | [Model Checkpoints] | [Video] | [BibTeX]
This repo contains the code for our paper OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation.
We propose distilling target visual information from a set of target encoders into the intermediate representations of the LLM. During training, we adopt a predictive embedding optimization approach at selected LLM layers, minimizing the embedding losses alongside the next-token prediction (NTP) objective, resulting in a vision-centric approach to training the Multimodal Large Language Model.
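In schematic form, the combined objective can be written as below; the set of selected layers, the per-layer weights, and the exact form of the embedding loss are shown here only as placeholders, and the precise formulation is given in the paper:

```math
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{NTP}} \;+\; \sum_{\ell \in \mathcal{S}} \lambda_{\ell}\, \mathcal{L}_{\text{emb}}\big(g_{\ell}(h_{\ell}),\, z_{\text{target}}\big)
```

where $h_{\ell}$ is the hidden state at LLM layer $\ell$, $g_{\ell}$ is the embedding predictor attached to that layer, and $z_{\text{target}}$ is the feature produced by the corresponding target encoder.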
- [December 14, 2024]: Our demo is now available on HuggingFace Spaces. Thanks to the HF team for their support with the ZeroGPU grant! 🤗
- [December 12, 2024]: 🚀 Project Page, ArXiv Preprint and GitHub Repo are public! We also open-source the model checkpoints and probes on huggingface hub! 🎁
Note: We trained all our models on AMD MI300X GPUs. However, this repo provides instructions for NVIDIA GPUs, given their wider usage.
- Clone this repository.

  ```bash
  git lfs install
  git clone https://github.com/SHI-Labs/OLA-VLM
  cd OLA-VLM
  ```

- Set up the conda environment with the base dependencies.

  ```bash
  conda create -n ola_vlm -y
  conda activate ola_vlm
  pip install -e .
  pip install flash-attn --no-build-isolation
  pip install scikit-learn icecream datasets pytorch-fid lpips opencv-python-headless
  pip install setuptools==61.0.0
  pip install -e lmms-eval/
  pip install huggingface_hub==0.24.7
  pip install transformers==4.41.1
  ```
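After installation, a quick sanity check (illustrative only, not part of the official setup) can confirm that the pinned packages and flash-attn import correctly:

```python
# Illustrative environment check -- not part of the official setup instructions.
import torch
import transformers
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # expected: 4.41.1
print("flash-attn:", flash_attn.__version__)
```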
You can use the Gradio interface to interact with OLA-VLM locally. The demo also supports visualizing the representations from the selected intermediate LLM layers (the embedding-loss positions).
```bash
# install demo-specific libraries
pip install -e .["demo"]

# start the demo
CUDA_VISIBLE_DEVICES=0 python demo.py --model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b --PT-model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b
```
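If you prefer to fetch a checkpoint ahead of time, you can download it with `huggingface_hub` and point `--model-path` at the local directory; the directory name below is just an example:

```python
# Optional pre-download of a checkpoint; the local_dir path is an arbitrary example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b",
    local_dir="checkpoints/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b",
)
```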
Note: We provide a guide to integrating the embedding losses from OLA-VLM into any custom MLLM in Custom_MLLM.md.
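For intuition, the sketch below shows one hypothetical way an auxiliary embedding loss could be added to a custom MLLM's training step; `mllm`, `target_encoder`, and `emb_head` are placeholder names, the pooling and cosine loss are simplifications, and the actual recipe (loss design, layer selection, embedding predictors) is described in Custom_MLLM.md and the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of an auxiliary embedding loss on one intermediate LLM layer.
# `mllm`, `target_encoder`, and `emb_head` are placeholders, not OLA-VLM module names.
def training_step(mllm, target_encoder, emb_head, batch, layer_idx=18, emb_weight=0.5):
    outputs = mllm(
        pixel_values=batch["pixel_values"],
        input_ids=batch["input_ids"],
        labels=batch["labels"],
        output_hidden_states=True,
    )
    ntp_loss = outputs.loss  # standard next-token prediction objective

    # Predict the target embedding from the selected intermediate layer.
    hidden = outputs.hidden_states[layer_idx]      # (batch, seq_len, d_model)
    pred_emb = emb_head(hidden.mean(dim=1))        # (batch, d_target), mean-pooled for simplicity

    with torch.no_grad():
        target_emb = target_encoder(batch["pixel_values"])  # frozen target-encoder features

    emb_loss = 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()
    return ntp_loss + emb_weight * emb_loss
```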
- Please see Training.md for training commands and dataset preparation.
- We train all our models on 16 AMD MI300X GPUs (192GB each).
Please see Evaluation.md for evaluation commands.
Please see Probing.md for probing commands.
| Method | Training Stages | LLM | Base Encoder | CV-Bench | MMStar | RWQA | OK-VQA | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| OLA-VLM | PT + IFT | Phi3-4k-mini | CLIP-ViT-L | 62.5 | 36.0 | 58.0 | 56.4 | ckpt |
| OLA-VLM | PT + IFT | Phi3-4k-mini | CLIP-ConvNeXT-XXL | 63.9 | 38.4 | 58.4 | 56.5 | ckpt |
| OLA-VLM | PT + IFT | Llama3-8b | CLIP-ViT-L | 61.4 | 39.5 | 57.9 | 56.6 | ckpt |
| OLA-VLM | PT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 61.5 | 38.5 | 55.0 | 59.0 | ckpt |
| OLA-VLM | PT + VPT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 64.6 | 40.6 | 62.9 | 61.1 | ckpt |
If you find OLA-VLM useful in your research, please consider starring ⭐ us on GitHub and citing 📚 our work!
```bibtex
@article{jain2024ola_vlm,
    title={{OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
    author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
    journal={arXiv},
    year={2024}
}
```
We thank the authors of LLaVA-1.5, OneFormer, Depth-Anything v2, and unCLIP-SD for open-sourcing their codebase and checkpoints. We are grateful to the authors of cambrian and MMStar for releasing their code for CV-Bench and MMStar evaluation, respectively.