- [12/06] The models and evaluation code are now available; manuscript v2 will be posted on arXiv in two days.
- [11/06] The initial version of the manuscript was uploaded to arXiv.
conda create -n infmllm python=3.9
conda activate infmllm
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
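After installation, a quick check like the one below (a minimal sanity check of our own, not part of the official setup) should report the pinned package versions and a visible GPU.

```python
# Optional sanity check: confirm the pinned PyTorch build sees the GPU.
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 2.1.0
print("torchvision:", torchvision.__version__)  # expected: 0.16.0
print("CUDA available:", torch.cuda.is_available())  # should be True on a CUDA 12.1 machine
```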
Both the multitask and instruction-tuned models are now available on Hugging Face!
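If you prefer to fetch the weights ahead of time rather than at first use, a snippet like the following works for any Hugging Face repository (this assumes `huggingface_hub` is installed, which ships with `transformers`; the repository id shown is a placeholder, so substitute the actual id from the project's Hugging Face page).

```python
# Optional: pre-download the released weights from the Hugging Face Hub.
# The repo_id below is a placeholder; use the repository id linked from this README.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/<InfMLLM-model-repo>")
print("Model weights downloaded to:", local_dir)
```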
We evaluated the InfMLLM-7B multitask model on five VQA (Visual Question Answering) datasets and three visual grounding datasets. The instruction-tuned InfMLLM-7B-Chat model was assessed on four VQA datasets and six multi-modal benchmarks. For detailed evaluation procedures, please refer to Evaluation.
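As background for the VQA numbers, the sketch below shows the commonly used soft-accuracy formula from benchmarks such as VQAv2, where an answer gets full credit if at least 3 of the 10 human annotators gave it. This is only an illustration; the repository's own evaluation scripts are the authoritative reference and may apply additional answer normalization.

```python
# Illustration of the standard VQA soft-accuracy metric (not the repo's exact eval code).
def vqa_accuracy(prediction: str, gt_answers: list[str]) -> float:
    """Credit is proportional to how many of the 10 annotators gave the answer, capped at 1."""
    prediction = prediction.strip().lower()
    matches = sum(ans.strip().lower() == prediction for ans in gt_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators answered "blue" -> accuracy ~0.67
print(vqa_accuracy("blue", ["blue", "blue", "navy"] + ["dark blue"] * 7))
```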
Trying InfMLLM-7B-Chat is straightforward: we provide a demo script that runs on the example image below.
CUDA_VISIBLE_DEVICES=0 python demo.py
The generated conversation is shown below.
@misc{zhou2023infmllm,
      title={InfMLLM: A Unified Framework for Visual-Language Tasks},
      author={Qiang Zhou and Zhibin Wang and Wei Chu and Yinghui Xu and Hao Li and Yuan Qi},
      year={2023},
      eprint={2311.06791},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!