ShareLock: Ultra-Lightweight CLIP-like Vision-Language Model


This repository provides the official implementation of ShareLock, an ultra-lightweight CLIP-like vision-language model, introduced in the paper:
"Do Better Language Models Have Crisper Vision?"
Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano

📄 Read the Paper on arXiv
🤗 Model Checkpoints on Hugging Face
🌐 More Information on our Project Page


🧪 Overview

Workflow Diagram

ShareLock is a straightforward and efficient approach to building vision-language models. By leveraging frozen features from strong unimodal vision and language models, it achieves competitive multimodal performance with minimal computational resources. Key highlights include:

  • Data Efficiency: Trained on just 563k image-caption pairs, ShareLock achieves 51% zero-shot accuracy on ImageNet.
  • Cost Efficiency: Training requires only 1 GPU hour (10 hours including feature precomputation).
  • Competitive Results: Outperforms existing models in low-data regimes while maintaining scalability.

🚀 Features

  • Ultra-Lightweight: Minimal training time with competitive results.
  • Pretrained Backbone: Leverages strong, frozen unimodal features.
  • Low Resource Requirement: Trainable with only one GPU in hours.
  • Zero-Shot Capabilities: Effective on ImageNet and beyond.
  • CLIP-like VLM: Compatible with common refinement techniques (e.g., prompt tuning, LLM-generated class descriptions)

🛠️ Installation

  1. Clone the Repository:

    git clone https://github.com/JonaRuthardt/ShareLock.git
    cd ShareLock
  2. Set up a Python Environment:

    python -m venv env
    source env/bin/activate  # On Windows: env\Scripts\activate
  3. Install Dependencies:

    pip install -r requirements.txt

📦 Usage

  1. Download Datasets: Training and validating the model requires paired image-caption data. Popular small-scale datasets include CC3M, CC12M, and YFCC15M, all of which can be downloaded in webdataset format using the img2dataset library.

  2. Precompute Features: Use pretrained models to extract vision and text embeddings:

    python precompute_image_features_hf.py # Hugging Face datasets for classification tasks (evaluation of the final model)
    python precompute_image_features_hf.py # image-caption datasets in webdataset format (training and validation)
    python precompute_language_features.py # JSON file containing the caption for each UID in the image dataset

    The dataset and backbone model can be configured in each script via command-line arguments. The precomputed features are stored via the FeatureUtils library. A JSON file mapping image UIDs to their corresponding captions is assumed to be present; it is read and processed by the precompute_language_features.py script (see the sketch after this list for the assumed format).

  3. Train the Projection Network: The image and text features are aligned by running:

    python train.py

    Settings and hyperparameters can be specified or changed in configs/default_config.yaml.

  4. Evaluate Model on VLM Tasks: The ShareLock class implements the encode_text and encode_image functions, which can be used for inference on downstream vision-language tasks.
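
The caption file expected by precompute_language_features.py is, as described above, a JSON dictionary keyed by image UID. The sketch below shows one way such a file might be produced; the UIDs, captions, and the file name captions.json are illustrative placeholders, not values prescribed by the repository.

    import json

    # Hypothetical example: map each image UID in the dataset to its caption.
    # Use the UIDs produced by your webdataset download (e.g., via img2dataset);
    # the values shown here are placeholders.
    captions = {
        "000000001": "a dog running across a grassy field",
        "000000002": "a red bicycle leaning against a brick wall",
    }

    # File name and location are assumptions; point precompute_language_features.py
    # at wherever the file is stored.
    with open("captions.json", "w") as f:
        json.dump(captions, f)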


📂 Pretrained Model Checkpoints

We provide pretrained checkpoints for ShareLock on Hugging Face for easy integration and experimentation:

  • ShareLock (CC3M-trained): Hugging Face
  • ShareLock (CC12M-trained): Hugging Face

You can load these models directly using the ShareLock class:

from sharelock.models.model import ShareLock

# config holds the model/training configuration (see configs/default_config.yaml)
model = ShareLock.load_from_checkpoint("path/to/checkpoint.ckpt", config=config)

Alternatively, the --checkpoint flag can be passed to the train.py file.
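
As a usage example, the following sketch performs zero-shot classification with the model loaded above. It relies only on the encode_image and encode_text functions mentioned in the usage section; the prompt template, the image preprocessing, and the tensor shapes are illustrative assumptions rather than the repository's documented API.

    import torch

    model.eval()

    # Placeholder label set and a common CLIP-style prompt template (assumptions).
    class_names = ["dog", "cat", "car"]
    prompts = [f"a photo of a {name}" for name in class_names]

    # Placeholder image batch; real inputs depend on the vision backbone's preprocessing.
    images = torch.randn(4, 3, 224, 224)

    with torch.no_grad():
        text_features = model.encode_text(prompts)    # assumed shape: (num_classes, dim)
        image_features = model.encode_image(images)   # assumed shape: (batch, dim)

        # Zero-shot prediction via cosine similarity between image and class embeddings.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        predictions = (image_features @ text_features.T).argmax(dim=-1)  # indices into class_names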


📊 Results

Our reported results were obtained with the CLIP-Benchmark codebase. A subset of the classification results is presented in the following table:

Zero-shot classification on ImageNet variants:

Model      Dataset  IN-1k   IN-R    IN-A
CLIP       CC3M     16.0%   17.6%    3.6%
LiT        CC3M     44.1%   62.7%   45.6%
ShareLock  CC3M     52.1%   64.1%   50.9%
CLIP       CC12M    41.6%   52.6%    3.6%
LiT        CC12M    56.2%   70.3%   52.8%
ShareLock  CC12M    59.1%   68.8%   53.4%

For a comprehensive evaluation of ShareLock across a wider range of vision-language tasks, please refer to our paper.


📜 Citation

If you use this work, please cite:

@article{ruthardt2024sharelock,
  title={Do Better Language Models Have Crisper Vision?},
  author={Jona Ruthardt and Gertjan J. Burghouts and Serge Belongie and Yuki M. Asano},
  journal={arXiv preprint arXiv:2410.07173},
  year={2024}
}

📧 Contact

For any questions or collaborations, contact Jona Ruthardt.


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
