# Transformer-based Image Captioning


## Abstract

In this project, we apply the Transformer architecture to the image captioning task. We combine a Vision Transformer (ViT) encoder with a standard Transformer decoder to generate captions for input images. Our model is trained from scratch on the Flickr8k dataset, demonstrating that the ViT encoder can effectively capture visual features without the need for large-scale pre-training.

## Introduction

The Transformer architecture has shown remarkable performance on a wide range of natural language processing tasks. In this project, we explore its application to image captioning by combining a Vision Transformer (ViT) encoder, augmented with Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), with a standard Transformer decoder.

Unlike the original ViT, which relies on pre-training on large-scale datasets, we train our model from scratch on the Flickr8k dataset.
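
As a rough, hedged sketch of the Shifted Patch Tokenization idea (after Lee et al., 2021, cited below), the PyTorch module below concatenates four diagonally shifted copies of the image with the original before patchifying. The class name, dimensions, and defaults are illustrative and are not taken from this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenization(nn.Module):
    """Illustrative SPT sketch: concatenate diagonally shifted copies of the
    image with the original before patch embedding (after Lee et al., 2021)."""

    def __init__(self, patch_size=16, in_chans=3, embed_dim=512):
        super().__init__()
        self.patch_size = patch_size
        # 1 original + 4 diagonally shifted copies -> 5x the input channels
        self.proj = nn.Conv2d(in_chans * 5, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, C, H, W)
        s = self.patch_size // 2
        pad = F.pad(x, (s, s, s, s))             # pad once, crop four diagonal views
        h, w = x.shape[-2:]
        shifts = [pad[..., :h, :w],
                  pad[..., :h, 2 * s:],
                  pad[..., 2 * s:, :w],
                  pad[..., 2 * s:, 2 * s:]]
        x = torch.cat([x] + shifts, dim=1)       # (B, 5C, H, W)
        x = self.proj(x)                         # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, D) patch tokens
```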

## Method

### Encoder: Vision Transformer (ViT)

The ViT encoder processes input images by splitting them into patches and linearly embedding these patches. The resulting patch embeddings are fed into a Transformer encoder, which uses multi-head self-attention and feed-forward neural networks to capture visual features.
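
For concreteness, here is a minimal, hedged sketch of such an encoder in PyTorch, using a strided convolution for patch embedding, learned positional embeddings, and the built-in `nn.TransformerEncoder`. Names and hyperparameters are placeholders rather than the exact ones used in this code base:

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Minimal ViT-style encoder sketch: patchify, add positional
    embeddings, then run a stack of Transformer encoder layers."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=512, depth=6, num_heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # non-overlapping patches via a strided convolution
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                   # (B, 3, H, W)
        x = self.patch_embed(images)             # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D)
        x = x + self.pos_embed
        return self.encoder(x)                   # (B, N, D) visual memory
```

In this project, the plain patch embedding would presumably be swapped for the SPT tokenizer sketched in the Introduction, and the standard self-attention for its LSA variant.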

### Decoder: Standard Transformer

The Transformer decoder generates output captions autoregressively, token by token. It takes the encoded visual features and previously generated tokens as input and uses multi-head self-attention, encoder-decoder attention, and feed-forward neural networks to generate the next token.
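
A minimal sketch of such a decoder, assuming teacher-forced training with a causal mask and PyTorch's built-in `nn.TransformerDecoder`; the interface and sizes are illustrative:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal decoder sketch: embeds previously generated tokens, attends to
    the encoder's visual features, and predicts the next token per position."""

    def __init__(self, vocab_size, embed_dim=512, depth=6,
                 num_heads=8, max_len=40):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, memory):           # tokens: (B, T), memory: (B, N, D)
        T = tokens.size(1)
        x = self.token_embed(tokens) + self.pos_embed[:, :T]
        # causal mask so position t only attends to positions <= t
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tokens.device), diagonal=1)
        x = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(x)                   # (B, T, vocab_size) logits
```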

## Dataset

We train and evaluate our model on the Flickr8k dataset, which consists of 8,000 images, each paired with 5 human-annotated captions.
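
For illustration, here is a hedged sketch of how image–caption pairs could be loaded. The directory layout, caption-file format (`image.jpg,caption text` per line), special tokens, and vocabulary object are assumptions, not a description of this repository's actual data pipeline:

```python
import os

import torch
from PIL import Image
from torch.utils.data import Dataset

class Flickr8kCaptions(Dataset):
    """Illustrative loader: each sample is (image tensor, token-id caption).
    Assumes a captions file with lines of the form 'image.jpg,caption text'."""

    def __init__(self, image_dir, captions_file, vocab, transform):
        self.image_dir = image_dir
        self.transform = transform
        self.vocab = vocab                       # maps word -> integer id
        self.samples = []                        # (filename, caption) pairs
        with open(captions_file) as f:
            for line in f:
                name, caption = line.strip().split(",", 1)
                self.samples.append((name, caption))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, caption = self.samples[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        ids = ([self.vocab["<sos>"]]
               + [self.vocab.get(w, self.vocab["<unk>"]) for w in caption.lower().split()]
               + [self.vocab["<eos>"]])
        return self.transform(image), torch.tensor(ids)
```

Variable-length captions would still need padding in a custom `collate_fn` before batching.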

## How to Use

To install the dependencies, run:

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

To train the model, run the following command:

```bash
python training.py --epochs 100 --height 224 --width 224 --patch_size 16
```
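
Once trained, captions can be generated autoregressively. The sketch below shows greedy decoding with the hypothetical encoder/decoder modules sketched above; the vocabulary and special-token names are assumptions:

```python
import torch

@torch.no_grad()
def greedy_caption(encoder, decoder, image, vocab, max_len=40):
    """Illustrative greedy decoding: repeatedly feed the tokens generated so
    far and pick the highest-probability next token until <eos> or max_len."""
    inv_vocab = {i: w for w, i in vocab.items()}
    memory = encoder(image.unsqueeze(0))             # (1, N, D) visual features
    tokens = torch.tensor([[vocab["<sos>"]]])        # start-of-sequence token
    for _ in range(max_len):
        logits = decoder(tokens, memory)             # (1, T, vocab_size)
        next_id = logits[0, -1].argmax().item()
        if next_id == vocab["<eos>"]:
            break
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
    return " ".join(inv_vocab[i] for i in tokens[0, 1:].tolist())
```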

## Citation

The SPT and LSA components follow the technique proposed in:

```bibtex
@article{lee2021vision,
  title={Vision Transformer for Small-Size Datasets},
  author={Lee, Seung Hoon and Lee, Seunghyun and Song, Byung Cheol},
  journal={arXiv preprint arXiv:2112.13492},
  year={2021}
}
```