Official repository of the paper "Is CLIP the main roadblock for fine-grained open-world perception?".
This repository contains the code to train and evaluate CLIP on the object crops of the FG-OVD training sets and benchmarks. The checkpoints directory stores the parameters obtained from these trainings. To use these pre-trained CLIP projections, which enhance CLIP's fine-grained understanding, without repeating the training process, please refer to Load weights.
- 🔥 09/2024: "Is CLIP the main roadblock for fine-grained open-world perception?" won the Best Paper Award at CBMI 2024!
conda create --name clip python=3.9 -y
conda activate clip
git clone --recursive https://github.com/lorebianchi98/FG-CLIP.git
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
cd CLIP
python setup.py install
cd ..
pip install -r requirements.txt
NOTE: This project uses a custom version of CLIP because it allows us to extract all tokens from the visual and textual encoders, not just the CLS token. If your goal is to extract only the CLS token (as done in the standard usage of this repo), you can install the official version of CLIP from the official CLIP repository.
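For reference, the standard CLS-only usage with the official CLIP package looks roughly like the sketch below (the image path and prompts are placeholders; ViT-B/16 produces 512-dimensional embeddings):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a red car", "a photo of a blue car"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # (1, 512) pooled CLS embedding
    text_features = model.encode_text(text)     # (2, 512) pooled embeddings

# cosine similarities between the image and each caption
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarities = image_features @ text_features.T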
To accelerate the training process, we utilize pre-extracted CLIP features within this repository. Please adhere to the following guidelines for feature extraction.
To set up the required data for feature extraction, follow these steps:
- Create a folder named "coco" using the following command:
  mkdir coco
- Download the 2014 images and annotations from the official COCO website and place the downloaded files inside the newly created "coco" folder.
- Run the following commands to start the COCO feature extraction:
  cd features
  python extract.py --gpu GPU_NUMBER --batch_size BATCH_SIZE --model MODEL
  # Example extraction command:
  # python extract.py --gpu 0 --batch_size 16 --model ViT-B/16
- Download the COCO Karpathy splits, then run the following command to move the pre-extracted features into the Karpathy splits:
  python create_karpathy_splits.py --target_splits KARPATHY_SPLITS_DIR --src_features_dir COCO_FEATURES_DIR --out_dir OUT_DIR
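If it helps to see what this step does, here is a minimal sketch of the reorganization, assuming the standard Karpathy dataset_coco.json layout and .pt feature files named after the images (both are assumptions; create_karpathy_splits.py is the reference implementation):

import json
import os
import shutil

# Hypothetical paths: adjust to your setup
karpathy_json = "karpathy_splits/dataset_coco.json"
src_features_dir = "coco_features"
out_dir = "coco_features_karpathy"

with open(karpathy_json) as f:
    images = json.load(f)["images"]

for img in images:
    split = img["split"]  # "train", "val", "test" or "restval"
    feat_name = os.path.splitext(img["filename"])[0] + ".pt"  # assumed feature file naming
    src = os.path.join(src_features_dir, feat_name)
    dst_dir = os.path.join(out_dir, split)
    os.makedirs(dst_dir, exist_ok=True)
    if os.path.exists(src):
        shutil.copy(src, os.path.join(dst_dir, feat_name))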
- To pre-extract features from the FG-OVD Benchmarks, download them from the official FG-OVD repository.
- Run the following commands:
  cd fg-ovd_feature_extraction
  # scale_factor: multiplier applied to the bounding box coordinates; the higher the value, the more context included in the crop. Default = 1.0
  # model: CLIP configuration to use. Default = ViT-B/16
  python extraction.py --dataset_dir FG-OVD_DIR --coco_path COCO_PATH --out_dir OUT_DIR --batch_size BATCH_SIZE --scale_factor SCALE_FACTOR --model MODEL
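To give an intuition of what scale_factor does, a box can be enlarged around its center before cropping; the following is only a sketch of the assumed behavior (the script above is the reference implementation, and the file name and box below are hypothetical):

from PIL import Image

def scaled_crop(image, box, scale_factor=1.0):
    """Enlarge a COCO-style [x, y, w, h] box around its center and crop it.

    Assumed behavior: scale_factor > 1 keeps more context around the object,
    scale_factor == 1 crops the box as-is.
    """
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * scale_factor, h * scale_factor
    left = max(0, cx - new_w / 2)
    top = max(0, cy - new_h / 2)
    right = min(image.width, cx + new_w / 2)
    bottom = min(image.height, cy + new_h / 2)
    return image.crop((left, top, right, bottom))

# Example usage:
# crop = scaled_crop(Image.open("COCO_val2014_000000000042.jpg"), [30, 40, 100, 80], scale_factor=1.5)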
Run the following commands:
cd fgovd_evaluation/
python main.py
This will create outputs in the FG-OVD format; to evaluate them, use the script from the original FG-OVD repository, or refer to the get_ranks function inside this script.
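For reference, the FG-OVD protocol ranks each object's positive caption against its hard negatives by similarity. Below is a minimal sketch of such a rank computation, assuming a simple list where the first score belongs to the positive caption (an illustration only, not the actual output format of main.py):

import torch

def get_rank(scores):
    """Rank of the positive caption among its hard negatives (1 = best).

    Assumes scores[0] is the similarity of the positive caption and
    scores[1:] are the similarities of the hard negatives.
    """
    scores = torch.as_tensor(scores)
    return int((scores > scores[0]).sum().item()) + 1

# Example: positive score 0.31, negatives 0.28 and 0.35 -> rank 2
print(get_rank([0.31, 0.28, 0.35]))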
To train a model, run the following command:
CUDA_VISIBLE_DEVICES=GPU python train.py --train_config configs/train/TRAINING_CONFIG --model_config configs/model/MODEL_CONFIG
The following command creates a JSON file with the results on both COCO and the FG-OVD dataset for each possible configuration:
CUDA_VISIBLE_DEVICES=GPU PYTHONPATH=. python plots/test.py --results_file OUT
To load a configuration with a set of saved weights, edit the corresponding yaml file in configs/model and set the field initial_weights to the path of your checkpoint.
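For example (the checkpoint path below is a placeholder):

initial_weights: checkpoints/YOUR_CHECKPOINT.pth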
Then, in your Python script, run the following code:
import torch

from src.model import CrossAttentionModule, MLPs

model = MLPs.from_config(MODEL_PATH)  # or CrossAttentionModule.from_config(MODEL_PATH)

# usage example with a batch size of 2
image_embeddings = torch.rand(2, 512)
text_embeddings = torch.rand(2, 512)
similarities = model(image_embeddings, text_embeddings)

# if you are using MLPs, you can also extract the repurposed image and text embeddings
repurposed_image_embeddings, repurposed_text_embeddings = model(image_embeddings, text_embeddings, ret_embeds=True)
If you found this code useful, please cite the following paper:
@misc{bianchi2024clip,
title={Is CLIP the main roadblock for fine-grained open-world perception?},
author={Lorenzo Bianchi and Fabio Carrara and Nicola Messina and Fabrizio Falchi},
year={2024},
eprint={2404.03539},
archivePrefix={arXiv},
primaryClass={cs.CV}
}