By default, we train our OLA-VLM models using two stages:
- Pre-Training (PT) with the next-token prediction objective and embedding losses. We set the parameters belonging to the MLP Projector, embedding predictors, and the special task tokens as learnable during the PT stage.
- Instruction Fine-tuning (IFT) with only the next-token prediction objective. We set the parameters belonging to the MLP Projector and the LLM as learnable during the IFT stage (see the sketch below).
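As a minimal sketch of the stage-wise trainability described above (the attribute names mm_projector, emb_predictors, special_task_tokens, and llm are hypothetical stand-ins, not the exact OLA-VLM module names):

```python
# Sketch only: freeze everything, then unfreeze the groups listed for the stage.
# Module names are hypothetical stand-ins for the corresponding OLA-VLM components.
import torch.nn as nn

def set_trainable(model: nn.Module, stage: str) -> None:
    for p in model.parameters():
        p.requires_grad = False

    if stage == "pt":
        # PT: MLP projector, embedding predictors, and special task tokens are learnable.
        groups = [model.mm_projector, model.emb_predictors, model.special_task_tokens]
    elif stage == "ift":
        # IFT: MLP projector and the LLM are learnable.
        groups = [model.mm_projector, model.llm]
    else:
        raise ValueError(f"unknown stage: {stage}")

    for module in groups:
        for p in module.parameters():
            p.requires_grad = True
```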
We use the LLaVA-558K dataset during the PT stage.
cd datasets
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
cd LLaVA-Pretrain && unzip images.zip && cd ../..
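To sanity-check the download (the annotation filename below is the one shipped in the LLaVA-Pretrain repo; adjust the paths if your copy differs):

```python
# Quick sanity check of the LLaVA-Pretrain download; run from the repo root.
import json
from pathlib import Path

root = Path("datasets/LLaVA-Pretrain")

# Annotation file shipped with LLaVA-Pretrain.
records = json.loads((root / "blip_laion_cc_sbu_558k.json").read_text())
print(f"{len(records)} pre-training samples")  # roughly 558K expected

# Each record holds a relative image path and a short caption conversation.
print(records[0]["image"], records[0]["conversations"][-1]["value"][:80])

# Count the extracted images, wherever images.zip placed them under the dataset root.
print(sum(1 for _ in root.rglob("*.jpg")), "jpg files extracted")
```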
- We train for one epoch with a total batch size of 256 per iteration.
- You can change the base LLM/encoder in scripts/train/pretrain.sh
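Assuming the PT stage is launched the same way as the other stages:
bash scripts/train/pretrain.sh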
We use the LLaVA-665K dataset during the IFT stage.
cd datasets
wget https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/llava_v1_5_mix665k.json
# COCO
mkdir coco && cd coco && wget http://images.cocodataset.org/zips/train2017.zip && unzip train2017.zip && cd ..
# GQA
mkdir gqa && cd gqa && wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip && unzip images.zip && cd ..
# OCR-VQA (place loadDataset.py from the official OCR-VQA release inside ocr_vqa/ first)
cd ocr_vqa && python loadDataset.py && cd ..
# TextVQA
mkdir text_vqa && cd text_vqa && wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip train_val_images.zip && cd ..
# VG
mkdir vg && cd vg && wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip && wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip && unzip images.zip && unzip images2.zip && cd ..
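Once the downloads finish, the datasets directory should contain the folders created above plus llava_v1_5_mix665k.json; a small check (folder names simply mirror the commands above) could look like:

```python
# Rough check of the IFT data layout; run from the repo root after the downloads finish.
from pathlib import Path

root = Path("datasets")
assert (root / "llava_v1_5_mix665k.json").exists()

for name in ["coco", "gqa", "ocr_vqa", "text_vqa", "vg"]:
    folder = root / name
    if not folder.is_dir():
        print(f"missing: {folder}")
        continue
    # Count images across the common extensions used by these sources.
    count = sum(1 for ext in ("*.jpg", "*.png", "*.gif") for _ in folder.rglob(ext))
    print(f"{name}: {count} images")
```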
- We train for one epoch with a total batch size of 128 per iteration.
- You can change the base LLM/encoder in scripts/train/finetune.sh
bash scripts/train/finetune.sh
We train the whole model with the next-token prediction objective on the ALLaVA-Caption data after the PT stage and before the IFT stage; we term this stage Visual Pre-Training (VPT).
Follow the instructions given here and put the ALLaVA-Caption data under the datasets directory.
- We train for one epoch with a total batch size of 128 per iteration.
- You can change the base LLM/encoder in scripts/train/vpt.sh and scripts/train/vpt_ift.sh.
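Once the ALLaVA-Caption files are under datasets, a quick placeholder check (the glob below assumes the JSON filenames from the ALLaVA release contain "ALLaVA-Caption"; adjust it to your actual file names):

```python
# Placeholder check for the ALLaVA-Caption annotations under datasets/;
# adjust the glob to the actual file names of your ALLaVA-Caption download.
import json
from pathlib import Path

for path in sorted(Path("datasets").rglob("*ALLaVA-Caption*.json")):
    records = json.loads(path.read_text())
    print(f"{path}: {len(records)} caption samples")
```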
# VPT Stage
bash scripts/train/vpt.sh
# IFT Stage
bash scripts/train/vpt_ift.sh