To install the dependencies, run:
```bash
pip install -r requirements.txt
```
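A quick way to confirm the environment is ready before launching any of the commands below is a short check like the following (a hypothetical snippet, assuming the standard PyTorch dependency of BLIP-style code; it is not part of the repo):

```python
# Hypothetical post-install check: confirm PyTorch from requirements.txt imports cleanly
# and report whether CUDA is visible (needed for the distributed commands below).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```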
- Download the Flickr30k dataset from the original website, and set 'image_root' in configs/retrieval_flickr.yaml accordingly (a small config check is sketched after this list).
- To evaluate the finetuned BLIP model on Flickr30k, run:
```bash
python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
  --config ./configs/retrieval_flickr.yaml \
  --output_dir output/retrieval_flickr \
  --evaluate
```
- To finetune the pre-trained checkpoint, first set 'pretrained' in configs/retrieval_flickr.yaml to "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
```bash
python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
  --config ./configs/retrieval_flickr.yaml \
  --output_dir output/retrieval_flickr
```
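Before launching either command, it can help to confirm that the config points at real paths. The following is a minimal sketch (not part of the repo) that assumes the 'image_root' and 'pretrained' keys mentioned above:

```python
# Minimal sketch: sanity-check configs/retrieval_flickr.yaml before training or evaluation.
# Assumes the 'image_root' and 'pretrained' keys described above; adjust if your config differs.
import os
import yaml  # install with `pip install pyyaml` if it is not already in requirements.txt

with open("configs/retrieval_flickr.yaml") as f:
    cfg = yaml.safe_load(f)

print("image_root:", cfg.get("image_root"))
print("image_root exists:", os.path.isdir(cfg.get("image_root") or ""))
print("pretrained:", cfg.get("pretrained"))  # checkpoint URL or local path
```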
- Prepare training json files, where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}. A short sketch for generating such a file appears after the pre-training command below.
- In configs/pretrain.yaml, set 'train_file' to the paths of the json files.
- To pre-train the model, run:
```bash
python pretrain.py --config ./configs/pretrain.yaml --output_dir output/Pretrain
```
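The expected annotation format is simple to generate with a few lines of Python. Below is a hedged sketch with made-up file names and captions, only to illustrate the {'image': ..., 'caption': ...} structure described above:

```python
# Sketch: write a training json in the expected format, i.e. a list of
# {'image': path_of_image, 'caption': text_of_image} dictionaries.
# The file names and captions here are placeholders, not real data.
import json

samples = [
    {"image": "/path/to/images/0001.jpg", "caption": "a dog running on the beach"},
    {"image": "/path/to/images/0002.jpg", "caption": "two people riding bicycles"},
]

with open("train_example.json", "w") as f:
    json.dump(samples, f)
# Then list "train_example.json" (and any other such files) under 'train_file' in configs/pretrain.yaml.
```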
The implementation here relies solely on the BLIP code from Salesforce, along with ALBEF, Hugging Face Transformers, and timm. We thank the original authors for open-sourcing their code.