ShareLock: Ultra-Lightweight CLIP-like Vision-Language Model


This repository provides the official implementation of ShareLock, an ultra-lightweight CLIP-like vision-language model, introduced in the paper:
"Do Better Language Models Have Crisper Vision?"
Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano

📄 Read the Paper on arXiv
🤗 Model Checkpoints on Hugging Face
🌐 More Information on our Project Page


🧪 Overview

Workflow Diagram

ShareLock is a straightforward and efficient approach to building vision-language models. By leveraging frozen features from strong unimodal vision and language models, it achieves competitive multimodal performance with minimal computational resources. Key highlights include:

  • Data Efficiency: Trained on just 563k image-caption pairs, ShareLock achieves 51% zero-shot accuracy on ImageNet.
  • Cost Efficiency: Training requires only 1 GPU hour (10 hours including feature precomputation).
  • Competitive Results: Outperforms existing models in low-data regimes while maintaining scalability.

🚀 Features

  • Ultra-Lightweight: Minimal training time with competitive results.
  • Pretrained Backbone: Leverages strong, frozen unimodal features.
  • Low Resource Requirement: Trainable with only one GPU in hours.
  • Zero-Shot Capabilities: Effective on ImageNet and beyond.
  • CLIP-like VLM: Compatible with common refinement techniques (e.g., prompt tuning, LLM-generated class descriptions)

🛠️ Installation

  1. Clone the Repository:

    git clone https://github.com/JonaRuthardt/ShareLock.git
    cd ShareLock
  2. Set up a Python Environment:

    python -m venv env
    source env/bin/activate  # On Windows: env\Scripts\activate
  3. Install Dependencies:

    pip install -r requirements.txt

📦 Usage

  1. Download Datasets: Training and validating the model requires paired image-caption data. Popular small-scale datasets include CC3M, CC12M, and YFCC15M, all of which can be downloaded in webdataset format using the img2dataset library.

  2. Precompute Features: Use pretrained models to extract vision and text embeddings:

    python precompute_image_features_hf.py # Hugging Face datasets for classification tasks (evaluation of the final model)
    python precompute_image_features_hf.py # image-caption datasets in webdataset format (training and validation)
    python precompute_language_features.py # JSON file containing the caption for each UID in the image dataset

    The dataset and backbone model can be configured in each script via command-line arguments. The precomputed features are stored via the FeatureUtils library. A JSON file mapping image UIDs to their corresponding captions is assumed to be present; it is read and processed by the precompute_language_features.py script (see the sketch after this list for the assumed format).

  3. Train the Projection Network: The image and text features are aligned by running:

    python train.py

    Settings and hyperparameters can be specified or changed in configs/default_config.yaml.

  4. Evaluate Model on VLM Tasks: The ShareLock class implements the encode_text and encode_image functions, which can be used for inference on downstream vision-language tasks.
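
The caption file expected by precompute_language_features.py is, as described above, a JSON dictionary keyed by image UID. The sketch below shows one way such a file might be produced; the UIDs, captions, and the file name captions.json are illustrative placeholders, not values prescribed by the repository.

    import json

    # Hypothetical example: map each image UID in the dataset to its caption.
    # Use the UIDs produced by your webdataset download (e.g., via img2dataset);
    # the values shown here are placeholders.
    captions = {
        "000000001": "a dog running across a grassy field",
        "000000002": "a red bicycle leaning against a brick wall",
    }

    # File name and location are assumptions; point precompute_language_features.py
    # at wherever the file is stored.
    with open("captions.json", "w") as f:
        json.dump(captions, f)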


📂 Pretrained Model Checkpoints

We provide pretrained checkpoints for ShareLock on Hugging Face for easy integration and experimentation:

  • ShareLock (CC3M-trained): Hugging Face
  • ShareLock (CC12M-trained): Hugging Face

You can load these models directly using the ShareLock class:

from sharelock.models.model import ShareLock

# config holds the model/training configuration (see configs/default_config.yaml)
model = ShareLock.load_from_checkpoint("path/to/checkpoint.ckpt", config=config)

Alternatively, the --checkpoint flag can be passed to the train.py file.
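
As a usage example, the following sketch performs zero-shot classification with the model loaded above. It relies only on the encode_image and encode_text functions mentioned in the usage section; the prompt template, the image preprocessing, and the tensor shapes are illustrative assumptions rather than the repository's documented API.

    import torch

    model.eval()

    # Placeholder label set and a common CLIP-style prompt template (assumptions).
    class_names = ["dog", "cat", "car"]
    prompts = [f"a photo of a {name}" for name in class_names]

    # Placeholder image batch; real inputs depend on the vision backbone's preprocessing.
    images = torch.randn(4, 3, 224, 224)

    with torch.no_grad():
        text_features = model.encode_text(prompts)    # assumed shape: (num_classes, dim)
        image_features = model.encode_image(images)   # assumed shape: (batch, dim)

        # Zero-shot prediction via cosine similarity between image and class embeddings.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        predictions = (image_features @ text_features.T).argmax(dim=-1)  # indices into class_names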


📊 Results

Our reported results were obtained with the CLIP-Benchmark codebase. A subset of the classification results is presented in the following table:

Zero-shot classification on ImageNet variants:

Model      Dataset  IN-1k   IN-R    IN-A
CLIP       CC3M     16.0%   17.6%    3.6%
LiT        CC3M     44.1%   62.7%   45.6%
ShareLock  CC3M     52.1%   64.1%   50.9%
CLIP       CC12M    41.6%   52.6%    3.6%
LiT        CC12M    56.2%   70.3%   52.8%
ShareLock  CC12M    59.1%   68.8%   53.4%

For a comprehensive evaluation of ShareLock across a wider range of vision-language tasks, please refer to our paper.


📜 Citation

If you use this work, please cite:

@article{ruthardt2024sharelock,
  title={Do Better Language Models Have Crisper Vision?},
  author={Jona Ruthardt and Gertjan J. Burghouts and Serge Belongie and Yuki M. Asano},
  journal={arXiv preprint arXiv:2410.07173},
  year={2024}
}

📧 Contact

For any questions or collaborations, contact Jona Ruthardt.


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
