- `./${LLM}_${ViT}/` contains configs that align with the LLaVA-InternLM settings (i.e., using LoRA / QLoRA).
- `./official/` contains configs that align with the official LLaVA settings.
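For orientation, the layout of this config directory looks roughly like the sketch below (the concrete `${LLM}_${ViT}` folder name is only an illustrative example):

```
.
├── ${LLM}_${ViT}/   # e.g. internlm2_chat_7b_clip_vit_large_p14_336 — LoRA / QLoRA configs
└── official/        # configs reproducing the official LLaVA-v1.5 recipe
```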
XTuner primarily promotes the LLM-QLoRA / ViT-LoRA LLaVA architecture, and the evaluation results on various datasets are as follows:
Model | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | MMMU Dev | MathVista MiniTest | HallusionBench aAcc | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-7B (XTuner) | 67.7 | 69.2 | 61.0 | 59.7 | 28.4 | 1716 | 66.4 | 32.2 | 33.7 | 24.2 | 46.2 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-v1.5-13B (XTuner) | 68.8 | 69.5 | 64.7 | 63.1 | 32.9 | 1766 | 67.9 | 35.9 | 35.2 | 26.2 | 46.9 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-InternLM-7B (XTuner) | 69.0 | 68.5 | 66.7 | 63.8 | 37.3 | 1637 | 65.7 | 32.4 | 36.9 | 26.3 | 49.1 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-InternLM2-7B (XTuner) | 73.3 | 74.6 | 71.7 | 72.0 | 42.5 | 1700 | 71.2 | 35.9 | 40.1 | 25.5 | 46.8 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
LLaVA-InternLM2-20B (XTuner) | 75.1 | 73.5 | 73.7 | 72.8 | 46.3 | 1868 | 70.2 | 37.2 | 39.4 | 24.6 | 47.7 | Pretrain / Fine-tune | 🤗 HuggingFace / 🤖 ModelScope | 🤗 HuggingFace / 🤖 ModelScope |
When aligned completely with the official training settings, the results are as follows:
Model | Framework | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | Configs |
---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-7B | Official | 65.2 | 63.0 | 57.3 | 57.4 | 25.2 | 1775 | 65.6 | 32.7 | - |
LLaVA-v1.5-7B | XTuner | 68.6 | 68.0 | 61.5 | 61.4 | 26.5 | 1786 | 65.8 | 31.4 | Pretrain / Fine-tune |
For data preparation, please refer to the docs.
The training of LLaVA consists of two steps: alignment module (i.e., MLP) pretraining and instruction-following fine-tuning.
Note: this guide takes 8-GPU training of LLaVA-InternLM2-7B as an example. If GPU resources or memory are insufficient, you can reduce the batch size appropriately to decrease memory consumption. The pretrained projector is saved to, and re-loaded from, `./work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth` by default.
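If you do need to lower the memory footprint, one possible workflow (a sketch, assuming the `batch_size` and `accumulative_counts` fields commonly found in XTuner configs) is to copy the config locally and edit it before training:

```bash
# Copy the built-in config into the current directory for editing
# (xtuner copy-cfg saves it with a *_copy.py suffix).
xtuner copy-cfg llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain .
# In the copied config, reduce `batch_size`; to keep the effective batch size
# unchanged, raise `accumulative_counts` by the same factor.
```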
- Alignment module pretraining (saved by default in `./work_dirs/`)

  ```bash
  NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2
  ```
- Instruction following fine-tuning (saved by default in `./work_dirs/`)

  ```bash
  NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
  ```
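After both stages finish, the checkpoints sit under `./work_dirs/<config_name>/`. The exact iteration number in `iter_*.pth` depends on your dataset size and batch size; the paths below simply reuse the numbers quoted elsewhere in this guide:

```bash
# Illustrative paths only — check your own work_dirs for the actual iter_*.pth names.
ls ./work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth
ls ./work_dirs/llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune/iter_5198.pth
```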
After training, we will obtain a set of weights (i.e., `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them:
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune ./iter_5198.pth ./iter_5198_hf
```
At this point, we have obtained the relevant model (the LLM or the corresponding LoRA adapter).
Afterwards, if you want to merge the LoRA into the LLM or CLIP-ViT, please use the following commands:

```bash
# LLM
xtuner convert merge $LLM $LLM_ADAPTER $SAVE_PATH
# CLIP
xtuner convert merge $CLIP $CLIP_ADAPTER $SAVE_PATH --is-clip
```
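A hypothetical concrete invocation, assuming the conversion in the previous step wrote the LoRA adapters under `./iter_5198_hf/` (the `llm_adapter` / `visual_encoder_adapter` sub-paths and the output directories below are placeholders; substitute whatever your conversion actually produced):

```bash
# Merge the LLM LoRA (adapter path is a placeholder based on the assumption above)
xtuner convert merge internlm/internlm2-chat-7b ./iter_5198_hf/llm_adapter ./merged_llm
# Merge the CLIP-ViT LoRA (adapter path is likewise a placeholder)
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_5198_hf/visual_encoder_adapter ./merged_visual_encoder --is-clip
```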
You can download the released LLaVA-InternLM2-7B model from 🤗 HuggingFace or 🤖 ModelScope and perform image-text question answering with the following command:
```bash
xtuner chat internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --image $IMAGE_PATH
```
Here, `--llava` is the converted weight from the above step (in our example, `./iter_5198_hf`).
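For example, to chat with your own fine-tuned weights rather than the released model, point `--llava` at the converted directory (the image path below is a placeholder):

```bash
xtuner chat internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./iter_5198_hf \
  --prompt-template internlm2_chat \
  --image ./example.jpg  # placeholder image path
```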
XTuner's LLaVA models can be evaluated using VLMEvalKit.
For convenience, XTuner also integrates the MMBench evaluation.
Users can download the MMBench datasets with:
```bash
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
After that, the evaluation can be run with:
```bash
xtuner mmbench internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --data-path $DATA_PATH \
  --work-dir $RESULT_PATH
```
Here, `$DATA_PATH` refers to one of the datasets downloaded above, such as `MMBench_DEV_EN.tsv`.
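A concrete invocation on the English dev split might look like this (the work directory is a placeholder):

```bash
xtuner mmbench internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava xtuner/llava-internlm2-7b \
  --prompt-template internlm2_chat \
  --data-path ./MMBench_DEV_EN.tsv \
  --work-dir ./mmbench_dev_en_results  # placeholder output directory
```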
After the evaluation is completed, the results of the dev sets are printed directly; for the test sets, you need to submit `mmbench_result.xlsx` to the official MMBench evaluation service to obtain the final accuracy results.
To evaluate your model on RefCOCO, you first need to download the evaluation data files from the provided link. Then, you can evaluate your model with the following command:
```bash
xtuner eval_refcoco $LLM \
  --visual-encoder $VISUAL_ENCODER \
  --llava $LLAVA_PATH \
  --prompt-template $PROMPT_TEMPLATE \
  --data-path $DATA_PATH \
  --work-dir $RESULT_PATH
```
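A hypothetical concrete invocation, reusing the converted weights and prompt template from the chat example above (`$DATA_PATH` still points at the downloaded RefCOCO data file; the work directory is a placeholder):

```bash
xtuner eval_refcoco internlm/internlm2-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./iter_5198_hf \
  --prompt-template internlm2_chat \
  --data-path $DATA_PATH \
  --work-dir ./refcoco_results  # placeholder output directory
```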