This repository provides the official implementation of ShareLock, an ultra-lightweight CLIP-like vision-language model, introduced in the paper:
"Do Better Language Models Have Crisper Vision?"
Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano
📄 Read the Paper on arXiv
🤗 Model Checkpoints on Hugging Face
🌐 More Information on our Project Page
ShareLock is a straightforward and efficient approach to building vision-language models. By leveraging frozen features from strong unimodal vision and language models, it achieves competitive multimodal performance with minimal computational resources (a conceptual sketch follows the highlights below). Key highlights include:
- Data Efficiency: Trained on just 563k image-caption pairs, ShareLock achieves 51% zero-shot accuracy on ImageNet.
- Cost Efficiency: Training requires only 1 GPU hour (10 hours including feature precomputation).
- Competitive Results: Outperforms existing models in low-data regimes while maintaining scalability.
- Ultra-Lightweight: Minimal training time with competitive results.
- Pretrained Backbone: Leverages strong, frozen unimodal features.
- Low Resource Requirement: Trainable with only one GPU in hours.
- Zero-Shot Capabilities: Effective on ImageNet and beyond.
- CLIP-like VLM: common refinement techniques (e.g., prompt tuning, LLM-based descriptions) apply directly.
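To make the frozen-feature idea concrete, below is a minimal, illustrative sketch of the general recipe: a lightweight projection head trained with a contrastive objective on top of frozen, precomputed embeddings. This is not the actual ShareLock implementation; the real architecture, loss, and hyperparameters are defined in this repository (see `train.py` and the model code) and in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative only: a small projection head mapping frozen language-model
# features into the (frozen) image-feature space. Dimensions are placeholders.
class ProjectionHead(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over a batch of paired (image, caption) embeddings.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Because the unimodal backbones stay frozen and their features are precomputed once, only a small head like this is optimized during training, which is what keeps the training cost to roughly one GPU hour.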
- Clone the Repository:
```bash
git clone https://github.com/JonaRuthardt/ShareLock.git
cd ShareLock
```
- Set up a Python Environment:
```bash
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
```
- Install Dependencies:
```bash
pip install -r requirements.txt
```
- Download Datasets: Training and validating the model requires paired image-caption data. Popular small-scale datasets include CC3M, CC12M, and YFCC15M, which can be downloaded in webdataset format using the img2dataset library.
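For illustration, a dataset such as CC3M can be fetched with img2dataset's Python API roughly as follows; the metadata file name, column names, and output folder below (`cc3m.tsv`, `url`, `caption`, `data/cc3m`) are placeholders that depend on how you obtained the dataset's URL list.

```python
from img2dataset import download

# Illustrative sketch: download an image-caption dataset as webdataset shards.
# File names, column names, and sizes below are placeholders to adapt.
download(
    url_list="cc3m.tsv",         # metadata file with image URLs and captions
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",  # shard format consumed by the precompute scripts
    output_folder="data/cc3m",
    processes_count=16,
    thread_count=64,
    image_size=256,
)
```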
- Precompute Features: Use pretrained models to extract vision and text embeddings:
```bash
python precompute_image_features_hf.py   # Hugging Face datasets for classification tasks (testing of the final model)
python precompute_image_features_hf.py   # image-caption datasets in webdataset format (training and validation)
python precompute_language_features.py   # JSON file containing the caption for each uid in the image dataset
```
The dataset and backbone model to use can be configured via command-line arguments in the respective files. The precomputed features are stored via the FeatureUtils library. A JSON file with image uids as keys and the corresponding captions as values is assumed to be present; it is read and processed by the `precompute_language_features.py` script.
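As an illustration of the assumed captions file, the snippet below writes a JSON mapping from image uids to captions; the uids and the file name `captions.json` are hypothetical and must match whatever identifiers your precomputed image features use.

```python
import json

# Hypothetical captions file: image uids as keys, captions as values.
# The uid format must match the identifiers used by the image dataset.
captions = {
    "000000001": "a dog running across a grassy field",
    "000000002": "a red bicycle leaning against a brick wall",
}

with open("captions.json", "w") as f:
    json.dump(captions, f, indent=2)
```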
- Train the Projection Network: The image and text features are aligned by running:
```bash
python train.py
```
Settings and hyperparameters can be specified or changed in `configs/default_config.yaml`.
- Evaluate the Model on VLM Tasks: The `ShareLock` class implements the `encode_text` and `encode_image` functions, which can be used for inference on downstream vision-language modeling tasks.
We provide pretrained checkpoints for ShareLock on Hugging Face for easy integration and experimentation. You can load these models directly using the `ShareLock` class:
```python
from sharelock.models.model import ShareLock

model = ShareLock.load_from_checkpoint("path/to/checkpoint.ckpt", config=config)
```
Alternatively, the `--checkpoint` flag can be passed to the `train.py` script.
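With a loaded checkpoint, zero-shot classification can be sketched roughly as below. This assumes `model` is the ShareLock instance loaded above, that `encode_text` accepts a list of prompts and `encode_image` a prepared image batch, and that both return embedding tensors; check the actual signatures, preprocessing, and prompt templates used in the repository and in CLIP-Benchmark before relying on this.

```python
import torch
import torch.nn.functional as F

# Rough zero-shot classification sketch (assumptions noted above).
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    text_emb = F.normalize(model.encode_text(prompts), dim=-1)    # (num_classes, d)
    # `images`: a batch of images prepared as the model expects (placeholder)
    image_emb = F.normalize(model.encode_image(images), dim=-1)   # (batch, d)

similarities = image_emb @ text_emb.t()    # cosine similarity to each class prompt
predictions = similarities.argmax(dim=-1)  # predicted class index per image
```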
Our reported results were obtained via the CLIP-Benchmark codebase. A subset of classification results is presented in the following table:
Zero-shot classification on ImageNet variants:
| Model | Dataset | IN-1k | IN-R | IN-A |
|---|---|---|---|---|
| CLIP | CC3M | 16.0% | 17.6% | 3.6% |
| LiT | CC3M | 44.1% | 62.7% | 45.6% |
| ShareLock | CC3M | 52.1% | 64.1% | 50.9% |
| CLIP | CC12M | 41.6% | 52.6% | 3.6% |
| LiT | CC12M | 56.2% | 70.3% | 52.8% |
| ShareLock | CC12M | 59.1% | 68.8% | 53.4% |
For a comprehensive and detailed evaluation of ShareLock across various vision-language modeling tasks, see our paper.
If you use this work, please cite:
@article{ruthardt2024sharelock,
title={Do Better Language Models Have Crisper Vision?},
author={Jona Ruthardt and Gertjan J. Burghouts and Serge Belongie and Yuki M. Asano},
journal={arXiv preprint arXiv:2410.07173},
year={2024}
}
For any questions or collaborations, contact Jona Ruthardt.
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.