
Automatic Image Captioning

A neural network architecture that automatically generates captions for images. The architecture and hyperparameters are inspired by the work of Vinyals et al. [1] and Xu et al. [2].

This is my submission for the image captioning project in the Udacity Computer Vision Nanodegree.

My nanodegree certificate: https://confirm.udacity.com/SYAMKDHY

Requirements

The model was developed in a cloud-hosted JupyterLab environment, with some custom packages provided by Udacity. The requirements.txt file can help you get started reproducing the results. The trained model weights are included in the models/ folder. The weights are stored using Git LFS, which needs to be installed before checking out the repository.

Network Architecture

Figure: overview of the network architecture

Image Captioning Model

The model is based on an encoder-decoder architecture. The encoder is a ResNet-50 pre-trained on the ImageNet dataset. Its final layer is connected to an embedding layer, whose output serves as the initial input to the LSTM-based RNN decoder.

For the full model architecture, please refer to model.py.
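To make the encoder-decoder pairing concrete, here is a minimal PyTorch sketch of the two halves. The class names, layer sizes, and the teacher-forcing detail are illustrative assumptions, not the exact contents of model.py.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """ResNet-50 feature extractor followed by a learned embedding layer."""

    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Drop the final classification layer; keep the pooled 2048-d features.
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.resnet(images).flatten(1)  # (batch, 2048)
        return self.embed(features)                # (batch, embed_size)

class DecoderRNN(nn.Module):
    """LSTM decoder that consumes the image embedding as its first input."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: prepend the image features, drop the final token.
        embeddings = self.embed(captions[:, :-1])                   # (batch, T-1, embed)
        inputs = torch.cat([features.unsqueeze(1), embeddings], 1)  # (batch, T, embed)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                                     # (batch, T, vocab)
```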

Training

I kept the pre-trained ResNet weights frozen during training. The embedding layer and the decoder were trained from scratch on the COCO dataset.

For details on hyperparameter choices, please refer to 2_Training.pdf.
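In that setup, freezing the backbone while training the new layers might look like the following sketch, reusing the classes from the architecture sketch above. The vocabulary size and learning rate are placeholders; the actual values are documented in 2_Training.pdf.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical sizes; see 2_Training.pdf for the real hyperparameters.
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=9955)

# Freeze the pre-trained ResNet weights.
for param in encoder.resnet.parameters():
    param.requires_grad = False

# Train the encoder's embedding layer and the full decoder from scratch.
params = list(encoder.embed.parameters()) + list(decoder.parameters())
optimizer = optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()
```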

Inference

Inference is implemented using greedy sampling: at each step of the RNN, the word with the highest softmax probability is selected as the output and, after passing through the embedding layer, is used as the input for the next step. Refer to 3_Inference.pdf for example outputs on the test dataset.
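As an illustration, a greedy sampling loop over the decoder sketched above could look like this. The maximum caption length and the index of the end-of-sentence token are assumptions; note that taking the argmax of the raw scores selects the same word as the argmax of the softmax.

```python
import torch

def greedy_sample(encoder, decoder, image, max_len=20, end_idx=1):
    """Greedy decoding: always pick the word with the highest softmax
    probability. `end_idx` is an assumed index for the <end> token."""
    with torch.no_grad():
        inputs = encoder(image.unsqueeze(0)).unsqueeze(1)  # (1, 1, embed)
        states = None
        caption = []
        for _ in range(max_len):
            hiddens, states = decoder.lstm(inputs, states)
            scores = decoder.fc(hiddens.squeeze(1))        # (1, vocab)
            word = scores.argmax(dim=1)                    # most likely word
            caption.append(word.item())
            if word.item() == end_idx:
                break
            # Feed the predicted word back in through the embedding layer.
            inputs = decoder.embed(word).unsqueeze(1)
    return caption
```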

References

  1. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555, 2015.
  2. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044, 2016.
