GroundVLP: A simple yet effective zero-shot method that harnesses the visual grounding ability of existing models trained on image-text pairs and pure object detection data
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin
AAAI 2024 (arXiv 2312.15043)
- First, install PyTorch ≥ 1.8 from pytorch.org, together with torchvision, and check that the PyTorch version matches the one required by Detectron2.
- Detic requires Detectron2. Follow the Detectron2 installation instructions to install it.
- Install requirements:
pip install -r requirements.txt
Example commands for setting up the environment:
# create a new environment
conda create --name groundvlp python=3.8
conda activate groundvlp
git clone https://github.com/om-ai-lab/GroundVLP.git
cd GroundVLP
# install pytorch
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
# install detectron2
python -m pip install detectron2 -f \
https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
# install requirements
pip install -r requirements.txt
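After running the commands above, it can help to confirm that the installed versions meet the stated requirement (PyTorch ≥ 1.8, plus Detectron2). The following is a minimal sketch using only the Python standard library. The helper names `parse_version`, `version_at_least`, and `report` are hypothetical and not part of this repository, and the check only compares version numbers; it does not verify that the Detectron2 wheel was built for your exact PyTorch/CUDA combination.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    # Drop a local build tag such as "+cu111" before parsing.
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def version_at_least(installed, minimum):
    """Return True if the installed version string is >= the minimum tuple."""
    return parse_version(installed) >= minimum


def report(packages=(("torch", (1, 8)), ("torchvision", None), ("detectron2", None))):
    """Return one status line per package; a minimum of None means 'any version'."""
    lines = []
    for name, minimum in packages:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            lines.append(f"{name}: not installed")
            continue
        ok = minimum is None or version_at_least(installed, minimum)
        lines.append(f"{name}: {installed} ({'ok' if ok else 'too old'})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Run it inside the activated `groundvlp` environment; any line reporting "not installed" or "too old" points at the step to repeat.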
Download the following checkpoints and place them in checkpoints/:
Download the data from the following link and unzip it into data/:
Download the COCO images and unzip them into images/train2014:
The final folder tree should look like this:
GroundVLP
├── checkpoints
│   ├── ALBEF.pth
│   └── Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
├── data
│   ├── refcoco_val_info.json
│   ├── ...
│   └── refcocog_val_info.json
├── images
│   └── train2014
│       └── COCO_train2014_xxx.jpg
...
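Before running the evaluation, the layout above can be checked with a short script. This is a minimal sketch: the function name `missing_paths` is hypothetical, and it tests only the paths that are spelled out in the tree (the entries elided with `...` are not guessed).

```python
from pathlib import Path

# Paths copied from the folder tree, relative to the repository root.
EXPECTED_FILES = [
    "checkpoints/ALBEF.pth",
    "checkpoints/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth",
    "data/refcoco_val_info.json",
    "data/refcocog_val_info.json",
]
EXPECTED_DIRS = ["images/train2014"]


def missing_paths(root="."):
    """Return the expected files and directories that do not exist under root."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (root / p).is_dir()]
    return missing


if __name__ == "__main__":
    problems = missing_paths()
    if problems:
        print("Missing:")
        for path in problems:
            print("  " + path)
    else:
        print("All expected paths are present.")
```

Run it from the GroundVLP directory; an empty result means the listed checkpoints, data files, and image folder are in place.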
Run this command to evaluate GroundVLP on REC datasets using the ground-truth category:
python eval_rec.py \
--image_folder="./images/train2014" \
--eval_data="refcoco_val,refcoco_testA,refcoco_testB,refcoco+_val,refcoco+_testA,refcoco+_testB,refcocog_val,refcocog_test" \
--model_id="ALBEF" \
--use_gt_category
Currently, the released code supports only the ALBEF and TCL models. We will continue to update it to support more models.
To get results using the predicted category instead, first extract the agent of each query and map it to a COCO label:
python utils/map_to_coco_label.py
Then run this command:
python eval_rec.py \
--image_folder="./images/train2014" \
--eval_data="refcoco_val,refcoco_testA,refcoco_testB,refcoco+_val,refcoco+_testA,refcoco+_testB,refcocog_val,refcocog_test" \
--model_id="ALBEF"
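The two evaluation commands differ only in the `--use_gt_category` flag, and `--eval_data` takes a comma-separated list of splits. As a convenience, the command line can be assembled programmatically, for example to evaluate a subset of splits. The sketch below is an illustration, not part of the repository: `build_eval_command` and `REC_SPLITS` are hypothetical names, and only the flags and the `ALBEF` model id shown in the commands above are used.

```python
import shlex

# Split names exactly as they appear in the --eval_data argument above.
REC_SPLITS = [
    "refcoco_val", "refcoco_testA", "refcoco_testB",
    "refcoco+_val", "refcoco+_testA", "refcoco+_testB",
    "refcocog_val", "refcocog_test",
]


def build_eval_command(splits, model_id="ALBEF",
                       image_folder="./images/train2014",
                       use_gt_category=False):
    """Build the argument list for eval_rec.py from the documented flags."""
    cmd = [
        "python", "eval_rec.py",
        f"--image_folder={image_folder}",
        f"--eval_data={','.join(splits)}",
        f"--model_id={model_id}",
    ]
    if use_gt_category:
        # Omit this flag to evaluate with the predicted category instead.
        cmd.append("--use_gt_category")
    return cmd


if __name__ == "__main__":
    # Print the command for the three RefCOCO splits with ground-truth categories.
    print(shlex.join(build_eval_command(REC_SPLITS[:3], use_gt_category=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. When `use_gt_category` is left as `False`, run `python utils/map_to_coco_label.py` first, as described above.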
Run this command to evaluate GroundVLP on a single image-query pair:
python demo.py \
--image_path="./docs/demo.jpg" \
--query="boy with white hair"
If everything is set up correctly, the output image appears at output/demo.jpg.
If you find this project useful for your research, please use the following BibTeX entry.
@article{shen2023groundvlp,
title={GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection},
author={Shen, Haozhan and Zhao, Tiancheng and Zhu, Mingwei and Yin, Jianwei},
journal={arXiv preprint arXiv:2312.15043},
year={2023}
}