# Vision Language Models

## 1. VLM Usage

Vision Language Models (VLMs) process image inputs alongside text to enable tasks like image captioning, visual question answering, and multimodal reasoning.

A typical VLM architecture consists of an image encoder to extract visual features, a projection layer to align visual and textual representations, and a language model to process or generate text. This allows the model to establish connections between visual elements and language concepts.
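To make this layout concrete, here is a minimal sketch, assuming the transformers library and the SmolVLM-Instruct checkpoint used later in this module, that prints the model's top-level submodules. The attribute and class names shown are specific to the Idefics3-based SmolVLM; other VLMs expose similar components under different names.

```python
from transformers import AutoModelForVision2Seq

# Load the demo model for this module (an Idefics3-based VLM).
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")

# The top-level submodules mirror the three-part layout described above.
for name, module in model.model.named_children():
    print(f"{name} -> {type(module).__name__}")

# Expected output (names are architecture-specific):
#   vision_model -> Idefics3VisionTransformer  (image encoder)
#   connector    -> Idefics3Connector          (projection layer)
#   text_model   -> LlamaModel                 (language model)
```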

VLMs can be used in different configurations depending on the use case. Base models handle general vision-language tasks, while chat-optimized variants support conversational interactions. Some models include additional components for grounding predictions in visual evidence or specializing in specific tasks like object detection.

For more on the technical details and usage of VLMs, refer to the VLM Usage page.
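As a quick illustration of basic inference, the sketch below loads the model and runs image description through the processor's chat template. `example.jpg` is a placeholder for any local image, and the generation settings are illustrative rather than the course's exact configuration.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

# Chat-style prompt interleaving an image placeholder with a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder: any local image file
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

# The language model generates text conditioned on the projected image features.
generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```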

## 2. VLM Fine-Tuning

Fine-tuning a VLM involves adapting a pre-trained model to perform specific tasks or to operate effectively on a particular dataset. The process can follow methodologies such as supervised fine-tuning, preference optimization, or a hybrid approach that combines both, as introduced in Modules 1 and 2.

While the core tools and techniques remain similar to those used for LLMs, fine-tuning VLMs requires extra attention to how image data is represented and prepared, so that the model integrates visual and textual inputs effectively. Because the demo model, SmolVLM, is significantly larger than the language model used in the previous module, efficient fine-tuning methods are essential: techniques like quantization and PEFT make the process more accessible and cost-effective, allowing more users to experiment with the model, as sketched below.
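Here is a minimal QLoRA-style sketch of that setup, assuming the bitsandbytes and peft libraries are installed; the LoRA rank and target modules are illustrative choices, not the course's exact settings.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceTB/SmolVLM-Instruct"

# 4-bit quantization keeps the memory footprint of the frozen base weights low.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, quantization_config=bnb_config
)

# LoRA adapters add a small set of trainable weights on top of the quantized
# base model; only these adapters are updated during fine-tuning.
lora_config = LoraConfig(
    r=8,                                  # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```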

For detailed guidance on fine-tuning VLMs, visit the VLM Fine-Tuning page.

## Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| VLM Usage | Learn how to load and use a pre-trained VLM for various tasks | 🐢 Process an image<br>🐕 Process multiple images with batch handling<br>🦁 Process a full video | Notebook | Open In Colab |
| VLM Fine-Tuning | Learn how to fine-tune a pre-trained VLM for task-specific datasets | 🐢 Use a basic dataset for fine-tuning<br>🐕 Try a new dataset<br>🦁 Experiment with alternative fine-tuning methods | Notebook | Open In Colab |
