Official repository of the paper "Is CLIP the main roadblock for fine-grained open-world perception?".
This repository contains the code to train and evaluate CLIP on the object crops of the FG-OVD training sets and benchmarks. The checkpoints directory stores the parameters obtained from these trainings. To use these pre-trained CLIP projections, which enhance CLIP's fine-grained understanding, without repeating the training process, please refer to Load weights.
- 🔥 09/2024: "Is CLIP the main roadblock for fine-grained open-world perception?" won the Best Paper Award at CBMI 2024!
conda create --name clip python=3.9 -y
conda activate clip
git clone --recursive https://github.com/lorebianchi98/FG-CLIP.git
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
cd CLIP
python setup.py install
cd ..
pip install -r requirements.txt
NOTE: This project uses a custom version of CLIP because it allows us to extract all tokens from the visual and textual encoders, not just the CLS token. If your goal is to extract only the CLS token (as done in the standard usage of this repo), you can install the official version of CLIP from the official CLIP repository.
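For reference, the standard CLS-only usage with the official CLIP package looks roughly like the sketch below (the image path and prompts are placeholders; ViT-B/16 produces 512-dimensional embeddings):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a red car", "a photo of a blue car"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # (1, 512) pooled CLS embedding
    text_features = model.encode_text(text)     # (2, 512) pooled embeddings

# cosine similarities between the image and each caption
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarities = image_features @ text_features.T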
To accelerate the training process, we utilize pre-extracted CLIP features within this repository. Please adhere to the following guidelines for feature extraction.
To set up the required data for feature extraction, follow these steps:
- Create a folder named "coco" using the following command:
  mkdir coco
- Download the 2014 images and annotations from the official COCO website and place the downloaded files inside the newly created "coco" folder.
- Run the following commands to start the COCO feature extraction:
  cd features
  python extract.py --gpu GPU_NUMBER --batch_size BATCH_SIZE --model MODEL
  # Example extraction command:
  # python extract.py --gpu 0 --batch_size 16 --model ViT-B/16
- Download the COCO Karpathy splits, then run the following command to move the pre-extracted features into the Karpathy splits:
  python create_karpathy_splits.py --target_splits KARPATHY_SPLITS_DIR --src_features_dir COCO_FEATURES_DIR --out_dir OUT_DIR
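If it helps to see what this step does, here is a minimal sketch of the reorganization, assuming the standard Karpathy dataset_coco.json layout and .pt feature files named after the images (both are assumptions; create_karpathy_splits.py is the reference implementation):

import json
import os
import shutil

# Hypothetical paths: adjust to your setup
karpathy_json = "karpathy_splits/dataset_coco.json"
src_features_dir = "coco_features"
out_dir = "coco_features_karpathy"

with open(karpathy_json) as f:
    images = json.load(f)["images"]

for img in images:
    split = img["split"]  # "train", "val", "test" or "restval"
    feat_name = os.path.splitext(img["filename"])[0] + ".pt"  # assumed feature file naming
    src = os.path.join(src_features_dir, feat_name)
    dst_dir = os.path.join(out_dir, split)
    os.makedirs(dst_dir, exist_ok=True)
    if os.path.exists(src):
        shutil.copy(src, os.path.join(dst_dir, feat_name))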
- To pre-extract features from the FG-OVD Benchmarks, download them from the official FG-OVD repository.
- Run the following commands:
  cd fg-ovd_feature_extraction
  # scale_factor: multiplier applied to the bounding box coordinates; the higher the value, the more context included in the crop. Default = 1.0
  # model: CLIP configuration to use. Default = ViT-B/16
  python extraction.py --dataset_dir FG-OVD_DIR --coco_path COCO_PATH --out_dir OUT_DIR --batch_size BATCH_SIZE --scale_factor SCALE_FACTOR --model MODEL
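To give an intuition of what scale_factor does, a box can be enlarged around its center before cropping; the following is only a sketch of the assumed behavior (the script above is the reference implementation, and the file name and box below are hypothetical):

from PIL import Image

def scaled_crop(image, box, scale_factor=1.0):
    """Enlarge a COCO-style [x, y, w, h] box around its center and crop it.

    Assumed behavior: scale_factor > 1 keeps more context around the object,
    scale_factor == 1 crops the box as-is.
    """
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * scale_factor, h * scale_factor
    left = max(0, cx - new_w / 2)
    top = max(0, cy - new_h / 2)
    right = min(image.width, cx + new_w / 2)
    bottom = min(image.height, cy + new_h / 2)
    return image.crop((left, top, right, bottom))

# Example usage:
# crop = scaled_crop(Image.open("COCO_val2014_000000000042.jpg"), [30, 40, 100, 80], scale_factor=1.5)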
Run the following commands:
cd fgovd_evaluation/
python main.py
This will create outputs in the FG-OVD format; to evaluate them, use the script from the original FG-OVD repository, or refer to the get_ranks function inside this script.
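For reference, the FG-OVD protocol ranks each object's positive caption against its hard negatives by similarity. Below is a minimal sketch of such a rank computation, assuming a simple list where the first score belongs to the positive caption (an illustration only, not the actual output format of main.py):

import torch

def get_rank(scores):
    """Rank of the positive caption among its hard negatives (1 = best).

    Assumes scores[0] is the similarity of the positive caption and
    scores[1:] are the similarities of the hard negatives.
    """
    scores = torch.as_tensor(scores)
    return int((scores > scores[0]).sum().item()) + 1

# Example: positive score 0.31, negatives 0.28 and 0.35 -> rank 2
print(get_rank([0.31, 0.28, 0.35]))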
To train a model, run the following command:
CUDA_VISIBLE_DEVICES=GPU python train.py --train_config configs/train/TRAINING_CONFIG --model_config configs/model/MODEL_CONFIG
The following command creates a JSON file with the results on both COCO and the FG-OVD dataset for each possible configuration:
CUDA_VISIBLE_DEVICES=GPU PYTHONPATH=. python plots/test.py --results_file OUT
To load a configuration with a set of saved weights, edit the corresponding yaml file in configs/model and set the field initial_weights to the path of your checkpoint.
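For example (the checkpoint path below is a placeholder):

initial_weights: checkpoints/YOUR_CHECKPOINT.pth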
Then, in your Python script, run the following code:
import torch

from src.model import CrossAttentionModule, MLPs

model = MLPs.from_config(MODEL_PATH)  # or CrossAttentionModule.from_config(MODEL_PATH)

# usage example with a batch size of 2
image_embeddings = torch.rand(2, 512)
text_embeddings = torch.rand(2, 512)
similarities = model(image_embeddings, text_embeddings)

# if you are using MLPs, you can also extract the repurposed image and text embeddings
repurposed_image_embeddings, repurposed_text_embeddings = model(image_embeddings, text_embeddings, ret_embeds=True)
If you found this code useful, please cite the following paper:
@misc{bianchi2024clip,
title={Is CLIP the main roadblock for fine-grained open-world perception?},
author={Lorenzo Bianchi and Fabio Carrara and Nicola Messina and Fabrizio Falchi},
year={2024},
eprint={2404.03539},
archivePrefix={arXiv},
primaryClass={cs.CV}
}