1 KAUST, 2 University of Oxford
🔥 Stay tuned for updates, and don't forget to star this repo for the latest on SynthCLIP! 🔥
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLMs), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images.
First, let's set up the Conda environment to get you up and running:
conda create -n synthclip python=3.10 -y
conda activate synthclip
pip install https://github.com/vllm-project/vllm/releases/download/v0.3.0/vllm-0.3.0+cu118-cp310-cp310-manylinux1_x86_64.whl
pip uninstall torch -y
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip uninstall xformers -y
pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
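Once the install finishes, it can be worth confirming that the pinned versions actually ended up in the environment, since a later `pip install` may silently replace one of them. The script below is a minimal sketch using only the standard library. The `PINNED` table simply mirrors the commands above, and `find_mismatches` is a helper name introduced here for illustration, not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the install commands above (assumption: requirements.txt
# does not override any of them).
PINNED = {
    "vllm": "0.3.0",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "torchaudio": "2.1.2",
    "xformers": "0.0.23.post1",
}


def find_mismatches(pins):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            mismatches[name] = None  # package is not installed at all
            continue
        # Drop local build tags such as "+cu118" before comparing.
        if installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    if not problems:
        print("All pinned packages match.")
    for name, installed in problems.items():
        print(f"{name}: expected {PINNED[name]}, found {installed or 'nothing installed'}")
```

Run it inside the activated `synthclip` environment. Any package it reports is a candidate for reinstalling with the matching command above.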
Our project is organized into three main folders, each dedicated to a specific stage in the SynthCLIP pipeline. Inside each folder, you'll find a detailed README.md file that provides instructions on how to run the code for that stage.
- TextGen: This folder contains all the necessary code to generate synthetic text data. Begin here to start the pipeline process.
- ImageGen: After generating the text, move on to this folder. It uses the synthetic text data to generate corresponding synthetic images.
- Training: The final stage of the pipeline. Once you have your synthetic text-image pairs, this folder contains the code to train the SynthCLIP model.
To use SynthCLIP successfully, follow the pipeline in this order:
- Generate Text ➡️ Start with the TextGen folder.
- Generate Images ➡️ Proceed to ImageGen with your synthetic text.
- Train the Model ➡️ Finally, use the Training folder to train SynthCLIP with your synthetic text-image pairs.
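Before starting, a quick check that your checkout contains all three stage folders and their README.md files can save a confusing failure later. The snippet below is a small sketch that relies only on the folder names listed above. `missing_stage_readmes` is a hypothetical helper written for this example, and it does not run any stage itself.

```python
from pathlib import Path

# Pipeline stages in the order they should be run.
STAGES = ["TextGen", "ImageGen", "Training"]


def missing_stage_readmes(repo_root):
    """Return the stages (in pipeline order) whose README.md is missing under repo_root."""
    root = Path(repo_root)
    return [stage for stage in STAGES if not (root / stage / "README.md").is_file()]


if __name__ == "__main__":
    missing = missing_stage_readmes(".")
    if missing:
        print("Missing README.md for:", ", ".join(missing))
    else:
        print("Run the stages in this order:", " -> ".join(STAGES))
```

Run it from the repository root. An empty result means every stage has its instructions in place.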
Our dataset, SynthCI-30M, which contains 30M image-caption pairs, is hosted on Hugging Face. To download it with the Hugging Face CLI, first make sure that huggingface-cli is installed by running:
pip install -U "huggingface_hub[cli]"
The dataset can then be downloaded with:
huggingface-cli download hammh0a/SynthCLIP --repo-type dataset
Alternatively, the dataset can be loaded in Python with the Hugging Face datasets library:
from datasets import load_dataset
dataset = load_dataset('hammh0a/SynthCLIP')
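If you would rather fetch the raw dataset files from Python than from the shell, `huggingface_hub.snapshot_download` is the programmatic counterpart of the CLI download. The sketch below wraps it in a small function. The `download_synthci` name and the `./SynthCI-30M` target directory are assumptions made for this example, and the download itself is large, so it only runs when the script is executed directly.

```python
REPO_ID = "hammh0a/SynthCLIP"  # dataset repository named in this README


def download_synthci(local_dir="./SynthCI-30M"):
    """Download the dataset repository into local_dir and return the local path.

    local_dir is a placeholder default chosen for this sketch; point it at a
    disk with enough free space for 30M image-caption pairs.
    """
    # Imported lazily so that importing this module does not require huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)


if __name__ == "__main__":
    print("Dataset files stored in:", download_synthci())
```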
Jumpstart your experiments with our pre-trained models:
- ViT-B/16 Trained on SynthCI-10M ➡️ Download
- ViT-B/16 Trained on SynthCI-20M ➡️ Download
- ViT-B/16 Trained on SynthCI-30M ➡️ Download
- ViT-B/16 Trained on CC12M ➡️ Download
You can load a pre-trained model with the code below:
from models import CLIP_VITB16
import torch
# load model instance
model = torch.nn.DataParallel(CLIP_VITB16())
# load checkpoint
checkpoint_path = "./checkpoint_best.pt"
checkpoint = torch.load(checkpoint_path, map_location="cpu")
load_status = model.load_state_dict(checkpoint["state_dict"])
print(load_status)
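The snippet above wraps the model in `torch.nn.DataParallel`, which suggests the checkpoint keys carry the `module.` prefix that wrapper adds. If that assumption holds and you want to load the weights into a plain, unwrapped `CLIP_VITB16()` (for example on a CPU-only machine), one option is to strip the prefix first. `strip_module_prefix` below is a helper written for this example, not a function shipped with the repository.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key.

    Keys that do not start with the prefix are kept unchanged.
    """
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Intended usage (assumes `checkpoint` was loaded as shown above):
#   model = CLIP_VITB16()
#   model.load_state_dict(strip_module_prefix(checkpoint["state_dict"]))
```

If the keys turn out not to have the prefix, the function returns them unchanged, so applying it is harmless either way.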
If you find SynthCLIP useful in your research, please consider citing:
@misc{hammoud2024synthclip,
title={SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?},
author={Hasan Abed Al Kader Hammoud and Hani Itani and Fabio Pizzati and Philip Torr and Adel Bibi and Bernard Ghanem},
year={2024},
eprint={2402.01832},
archivePrefix={arXiv},
primaryClass={cs.CV}
}