# Transformer-based Image Captioning


## Abstract

In this project, we apply the Transformer architecture to the image captioning task. We combine a Vision Transformer (ViT) encoder with a standard Transformer decoder to generate captions for input images. Our model is trained from scratch on the Flickr8k dataset, demonstrating that the ViT encoder can effectively capture visual features without the need for large-scale pre-training.

## Introduction

The Transformer architecture has shown remarkable performance on a wide range of natural language processing tasks. In this project, we explore its application to image captioning by combining a Vision Transformer (ViT) encoder, augmented with Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), with a standard Transformer decoder.

Unlike the original ViT, which relies on pre-training on large-scale datasets, we train our model from scratch on the Flickr8k dataset.
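
As a rough, hedged sketch of the Shifted Patch Tokenization idea (after Lee et al., 2021, cited below), the PyTorch module below concatenates four diagonally shifted copies of the image with the original before patchifying. The class name, dimensions, and defaults are illustrative and are not taken from this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenization(nn.Module):
    """Illustrative SPT sketch: concatenate diagonally shifted copies of the
    image with the original before patch embedding (after Lee et al., 2021)."""

    def __init__(self, patch_size=16, in_chans=3, embed_dim=512):
        super().__init__()
        self.patch_size = patch_size
        # 1 original + 4 diagonally shifted copies -> 5x the input channels
        self.proj = nn.Conv2d(in_chans * 5, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, C, H, W)
        s = self.patch_size // 2
        pad = F.pad(x, (s, s, s, s))             # pad once, crop four diagonal views
        h, w = x.shape[-2:]
        shifts = [pad[..., :h, :w],
                  pad[..., :h, 2 * s:],
                  pad[..., 2 * s:, :w],
                  pad[..., 2 * s:, 2 * s:]]
        x = torch.cat([x] + shifts, dim=1)       # (B, 5C, H, W)
        x = self.proj(x)                         # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, D) patch tokens
```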

## Method

### Encoder: Vision Transformer (ViT)

The ViT encoder processes input images by splitting them into patches and linearly embedding these patches. The resulting patch embeddings are fed into a Transformer encoder, which uses multi-head self-attention and feed-forward neural networks to capture visual features.
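
For concreteness, here is a minimal, hedged sketch of such an encoder in PyTorch, using a strided convolution for patch embedding, learned positional embeddings, and the built-in `nn.TransformerEncoder`. Names and hyperparameters are placeholders rather than the exact ones used in this code base:

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Minimal ViT-style encoder sketch: patchify, add positional
    embeddings, then run a stack of Transformer encoder layers."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=512, depth=6, num_heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # non-overlapping patches via a strided convolution
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                   # (B, 3, H, W)
        x = self.patch_embed(images)             # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D)
        x = x + self.pos_embed
        return self.encoder(x)                   # (B, N, D) visual memory
```

In this project, the plain patch embedding would presumably be swapped for the SPT tokenizer sketched in the Introduction, and the standard self-attention for its LSA variant.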

### Decoder: Standard Transformer

The Transformer decoder generates output captions autoregressively, token by token. It takes the encoded visual features and previously generated tokens as input and uses multi-head self-attention, encoder-decoder attention, and feed-forward neural networks to generate the next token.
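
A minimal sketch of such a decoder, assuming teacher-forced training with a causal mask and PyTorch's built-in `nn.TransformerDecoder`; the interface and sizes are illustrative:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal decoder sketch: embeds previously generated tokens, attends to
    the encoder's visual features, and predicts the next token per position."""

    def __init__(self, vocab_size, embed_dim=512, depth=6,
                 num_heads=8, max_len=40):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, memory):           # tokens: (B, T), memory: (B, N, D)
        T = tokens.size(1)
        x = self.token_embed(tokens) + self.pos_embed[:, :T]
        # causal mask so position t only attends to positions <= t
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tokens.device), diagonal=1)
        x = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(x)                   # (B, T, vocab_size) logits
```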

## Dataset

We train and evaluate our model on the Flickr8k dataset, which consists of 8,000 images, each paired with 5 human-annotated captions.
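
For illustration, here is a hedged sketch of how image–caption pairs could be loaded. The directory layout, caption-file format (`image.jpg,caption text` per line), special tokens, and vocabulary object are assumptions, not a description of this repository's actual data pipeline:

```python
import os

import torch
from PIL import Image
from torch.utils.data import Dataset

class Flickr8kCaptions(Dataset):
    """Illustrative loader: each sample is (image tensor, token-id caption).
    Assumes a captions file with lines of the form 'image.jpg,caption text'."""

    def __init__(self, image_dir, captions_file, vocab, transform):
        self.image_dir = image_dir
        self.transform = transform
        self.vocab = vocab                       # maps word -> integer id
        self.samples = []                        # (filename, caption) pairs
        with open(captions_file) as f:
            for line in f:
                name, caption = line.strip().split(",", 1)
                self.samples.append((name, caption))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, caption = self.samples[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        ids = ([self.vocab["<sos>"]]
               + [self.vocab.get(w, self.vocab["<unk>"]) for w in caption.lower().split()]
               + [self.vocab["<eos>"]])
        return self.transform(image), torch.tensor(ids)
```

Variable-length captions would still need padding in a custom `collate_fn` before batching.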

## How to Use

To install the dependencies, run:

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

To train the model, run the following command:

```bash
python training.py --epochs 100 --height 224 --width 224 --patch_size 16
```
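
Once trained, captions can be generated autoregressively. The sketch below shows greedy decoding with the hypothetical encoder/decoder modules sketched above; the vocabulary and special-token names are assumptions:

```python
import torch

@torch.no_grad()
def greedy_caption(encoder, decoder, image, vocab, max_len=40):
    """Illustrative greedy decoding: repeatedly feed the tokens generated so
    far and pick the highest-probability next token until <eos> or max_len."""
    inv_vocab = {i: w for w, i in vocab.items()}
    memory = encoder(image.unsqueeze(0))             # (1, N, D) visual features
    tokens = torch.tensor([[vocab["<sos>"]]])        # start-of-sequence token
    for _ in range(max_len):
        logits = decoder(tokens, memory)             # (1, T, vocab_size)
        next_id = logits[0, -1].argmax().item()
        if next_id == vocab["<eos>"]:
            break
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
    return " ".join(inv_vocab[i] for i in tokens[0, 1:].tolist())
```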

## Citation

The SPT and LSA components follow the technique proposed in:

```bibtex
@article{lee2021vision,
  title={Vision Transformer for Small-Size Datasets},
  author={Lee, Seung Hoon and Lee, Seunghyun and Song, Byung Cheol},
  journal={arXiv preprint arXiv:2112.13492},
  year={2021}
}
```