- `./${LLM}_${ViT}/` contains configs that align with the LLaVA-InternLM settings (i.e., using LoRA / QLoRA).
- `./official/` contains configs that align with the official LLaVA settings.
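For orientation, the layout of this config directory looks roughly like the sketch below (the concrete `${LLM}_${ViT}` folder name is only an illustrative example):

```
.
├── ${LLM}_${ViT}/   # e.g. internlm2_chat_7b_clip_vit_large_p14_336 — LoRA / QLoRA configs
└── official/        # configs reproducing the official LLaVA-v1.5 recipe
```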
XTuner primarily promotes the LLM-QLoRA / ViT-LoRA LLaVA architecture, and the evaluation results on various datasets are as follows:
Model | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | MMMU Dev | MathVista MiniTest | HallusionBench aAcc | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-7B (XTuner) | 67.7 | 69.2 | 61.0 | 59.7 | 28.4 | 1716 | 66.4 | 32.2 | 33.7 | 24.2 | 46.2 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-v1.5-13B (XTuner) | 68.8 | 69.5 | 64.7 | 63.1 | 32.9 | 1766 | 67.9 | 35.9 | 35.2 | 26.2 | 46.9 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-InternLM-7B (XTuner) | 69.0 | 68.5 | 66.7 | 63.8 | 37.3 | 1637 | 65.7 | 32.4 | 36.9 | 26.3 | 49.1 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-InternLM2-7B (XTuner) | 73.3 | 74.6 | 71.7 | 72.0 | 42.5 | 1700 | 71.2 | 35.9 | 40.1 | 25.5 | 46.8 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-InternLM2-20B (XTuner) | 75.1 | 73.5 | 73.7 | 72.8 | 46.3 | 1868 | 70.2 | 37.2 | 39.4 | 24.6 | 47.7 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
When aligned completely with the official training settings, the results are as follows:
Model | Framework | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | Configs |
---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-7B | Official | 65.2 | 63.0 | 57.3 | 57.4 | 25.2 | 1775 | 65.6 | 32.7 | - |
LLaVA-v1.5-7B | XTuner | 68.6 | 68.0 | 61.5 | 61.4 | 26.5 | 1786 | 65.8 | 31.4 | Pretrain / Fine-tune |
For data preparation, please refer to the docs.
The training of LLaVA consists of two steps: alignment module (i.e., MLP) pretraining and instruction-following fine-tuning.
Note: this guide takes 8-GPU training of LLaVA-InternLM2-7B as an example. If GPU resources or memory are insufficient, you can reduce the batch size appropriately to decrease memory consumption. The pretrained projector is saved to, and re-loaded from, `./work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth` by default.
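If you do need to lower the memory footprint, one possible workflow (a sketch, assuming the `batch_size` and `accumulative_counts` fields commonly found in XTuner configs) is to copy the config locally and edit it before training:

```bash
# Copy the built-in config into the current directory for editing
# (xtuner copy-cfg saves it with a *_copy.py suffix).
xtuner copy-cfg llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain .
# In the copied config, reduce `batch_size`; to keep the effective batch size
# unchanged, raise `accumulative_counts` by the same factor.
```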
- Alignment module pretraining (saved by default in `./work_dirs/`)

  ```bash
  NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2
  ```
- Instruction following fine-tuning (saved by default in `./work_dirs/`)

  ```bash
  NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
  ```
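After both stages finish, the checkpoints sit under `./work_dirs/<config_name>/`. The exact iteration number in `iter_*.pth` depends on your dataset size and batch size; the paths below simply reuse the numbers quoted elsewhere in this guide:

```bash
# Illustrative paths only — check your own work_dirs for the actual iter_*.pth names.
ls ./work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth
ls ./work_dirs/llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune/iter_5198.pth
```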
After training, we will obtain a set of weights (i.e., `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them:
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune ./iter_5198.pth ./iter_5198_hf
```
At this point, we have obtained the relevant model (the LLM or the corresponding LoRA adapter).
Afterwards, if you want to merge the LoRA into the LLM or CLIP-ViT, please use the following commands:

```bash
# LLM
xtuner convert merge $LLM $LLM_ADAPTER $SAVE_PATH
# CLIP
xtuner convert merge $CLIP $CLIP_ADAPTER $SAVE_PATH --is-clip
```
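A hypothetical concrete invocation, assuming the conversion in the previous step wrote the LoRA adapters under `./iter_5198_hf/` (the `llm_adapter` / `visual_encoder_adapter` sub-paths and the output directories below are placeholders; substitute whatever your conversion actually produced):

```bash
# Merge the LLM LoRA (adapter path is a placeholder based on the assumption above)
xtuner convert merge internlm/internlm2-chat-7b ./iter_5198_hf/llm_adapter ./merged_llm
# Merge the CLIP-ViT LoRA (adapter path is likewise a placeholder)
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_5198_hf/visual_encoder_adapter ./merged_visual_encoder --is-clip
```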
You can download the released LLaVA-InternLM2-7B model from 🤗 HuggingFace or 🤖 ModelScope and perform image-text question answering with the following command:
```bash
xtuner chat internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --image $IMAGE_PATH
```
Here, `--llava` is the converted weight from the above step (in our example, `./iter_5198_hf`).
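For example, to chat with your own fine-tuned weights rather than the released model, point `--llava` at the converted directory (the image path below is a placeholder):

```bash
xtuner chat internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./iter_5198_hf \
  --prompt-template internlm2_chat \
  --image ./example.jpg  # placeholder image path
```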
XTuner's LLaVA models can be evaluated using VLMEvalKit.
For convenience, XTuner also integrates the MMBench evaluation.
Users can download the MMBench datasets with:
```bash
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
After that, the evaluation can be run with:
```bash
xtuner mmbench internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --data-path $DATA_PATH \
  --work-dir $RESULT_PATH
```
Here, `$DATA_PATH` refers to one of the datasets downloaded above, such as `MMBench_DEV_EN.tsv`.
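A concrete invocation on the English dev split might look like this (the work directory is a placeholder):

```bash
xtuner mmbench internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --data-path ./MMBench_DEV_EN.tsv \
  --work-dir ./mmbench_dev_en_results  # placeholder output directory
```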
After the evaluation is completed, the results of the dev sets are printed directly; for the test sets, you need to submit `mmbench_result.xlsx` to the official MMBench evaluation service to obtain the final accuracy results.
To evaluate your model on RefCOCO, you first need to download the evaluation data files from the provided link. Then, you can evaluate your model with the following command:
```bash
xtuner eval_refcoco $LLM \
  --visual-encoder $VISUAL_ENCODER \
  --llava $LLAVA_PATH \
  --prompt-template $PROMPT_TEMPLATE \
  --data-path $DATA_PATH \
  --work-dir $RESULT_PATH
```
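A hypothetical concrete invocation, reusing the converted weights and prompt template from the chat example above (`$DATA_PATH` still points at the downloaded RefCOCO data file; the work directory is a placeholder):

```bash
xtuner eval_refcoco internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./iter_5198_hf \
  --prompt-template internlm2_chat \
  --data-path $DATA_PATH \
  --work-dir ./refcoco_results  # placeholder output directory
```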