
Scene-Description

Automatically describing the content of an image in well-formed English sentences is a very challenging task, but it could have great impact, for instance by helping visually impaired people better understand the content of images on the web. We frame the problem as a captioning task: image-to-sentence generation. This application bridges vision and natural language, requiring a computer vision system to both localize and describe salient regions of an image in natural language. Image captioning generalizes object detection, where the description consists of a single word: given a set of images and prior knowledge about their content, the system must produce the correct semantic description for the entire image.

Image-Captioning model:

Dataset used:

For our task we used MS COCO 2017, which contains 118K training images, each paired with approximately 5 different human-annotated captions.
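The COCO captions ship as one JSON annotation file per split. As a minimal sketch (assuming the official annotations/captions_train2017.json path after downloading the dataset), loading the file and grouping the captions by image could look like this:

```python
import json
from collections import defaultdict

# Path after unzipping the official MS COCO 2017 annotations archive (illustrative).
ANNOTATION_FILE = "annotations/captions_train2017.json"

with open(ANNOTATION_FILE, "r") as f:
    annotations = json.load(f)

# Group the ~5 human-annotated captions under their image id.
captions_per_image = defaultdict(list)
for ann in annotations["annotations"]:
    captions_per_image[ann["image_id"]].append(ann["caption"])

print(len(captions_per_image), "images loaded")
```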

Data Pre-processing:

The preprocessing phase can be split into two main procedures:
1. Creating the captions vocabulary:
First, we added `<start>` and `<end>` tokens to each caption, then we created a vocabulary containing the 5,000 most frequent words across all captions (see the vocabulary sketch after this list).
2. Image preprocessing and feature extraction:
We first resized the images to (224, 224) to be compatible with the VGG-16 input layer, then converted them from RGB to BGR and zero-centered each color channel. We then used a pretrained VGG-16 model to extract features from these pre-processed images and stored them to be passed later to our model (see the feature-extraction sketch below).
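As a rough illustration of the vocabulary step, the sketch below uses the Keras `Tokenizer` (as in the TensorFlow tutorial cited in the references); the caption list and filter settings are illustrative, not our exact configuration:

```python
import tensorflow as tf

# Placeholder captions; in practice this is the full list built from the annotations.
all_captions = ["a dog runs on the beach", "two people riding horses"]

# Wrap each caption with <start> / <end> markers.
marked = ["<start> " + c.strip().lower() + " <end>" for c in all_captions]

# Keep only the 5,000 most frequent words; everything else maps to <unk>.
# Note that '<' and '>' are excluded from the filters so the markers survive.
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=5000, oov_token="<unk>",
    filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer.fit_on_texts(marked)
tokenizer.word_index["<pad>"] = 0
tokenizer.index_word[0] = "<pad>"

# Convert captions to padded integer sequences for training.
seqs = tokenizer.texts_to_sequences(marked)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(seqs, padding="post")
```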

Figures: captions preprocessing pipeline and image preprocessing pipeline.
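A sketch of the image preprocessing and feature-extraction step, assuming the Keras VGG-16 ImageNet weights and illustrative file paths; `preprocess_input` performs the RGB-to-BGR conversion and per-channel zero-centering described above:

```python
import tensorflow as tf

def load_image(path):
    img = tf.io.read_file(path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (224, 224))               # VGG-16 input size
    img = tf.keras.applications.vgg16.preprocess_input(img)
    return img, path

# VGG-16 without its classification head; the last conv block's output is the
# spatial feature map the attention mechanism will attend over.
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(vgg.input, vgg.layers[-1].output)

# Extract and cache features (paths are illustrative).
image_paths = ["images/000000000001.jpg"]
dataset = tf.data.Dataset.from_tensor_slices(image_paths).map(load_image).batch(16)
for batch, paths in dataset:
    feats = feature_extractor(batch)                      # (batch, 7, 7, 512)
    feats = tf.reshape(feats, (feats.shape[0], -1, feats.shape[3]))  # (batch, 49, 512)
    # ...store `feats` to disk (e.g. with np.save), keyed by image path
```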

Model Architecture:

Our image-captioning model follows the architecture proposed in the well-known “Show, Attend and Tell” paper [2]. It is a neural network consisting of a CNN encoder that extracts the most important features of the input image and an RNN decoder that produces the next word of the caption at each time-step, using Bahdanau’s additive attention to focus on different parts of the image when producing each word.
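For illustration, a minimal Bahdanau (additive) attention layer in TensorFlow could be sketched as follows; the layer and variable names here are our own, not taken from the project code:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects image features
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)        # scores each image location

    def call(self, features, hidden):
        # features: (batch, 49, embed_dim) -- spatial image features
        # hidden:   (batch, hidden_dim)    -- decoder state at this time-step
        hidden_with_time = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(score, axis=1)   # one weight per location
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights
```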

Model Training:

All of the model training was done on a local GPU (NVIDIA GTX 1060, 6 GB).
We used the teacher-forcing technique: at each time-step we compare the word the model produced with the correct word in the target caption, compute the loss, and then feed the correct word to the next decoder unit (see the training-step sketch below).
During inference, we instead feed the word that the model produced to the next decoder unit to generate the next word.
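A sketch of one training step with teacher forcing, assuming `encoder`, `decoder`, `tokenizer`, `optimizer`, and `loss_object` objects like those implied by the architecture above; these names are illustrative, not the project's exact API:

```python
import tensorflow as tf

def train_step(img_features, target, encoder, decoder, tokenizer, optimizer, loss_object):
    loss = 0.0
    batch_size = target.shape[0]
    hidden = decoder.reset_state(batch_size=batch_size)   # assumed decoder helper
    # The first input to the decoder is always the <start> token.
    dec_input = tf.expand_dims([tokenizer.word_index["<start>"]] * batch_size, 1)

    with tf.GradientTape() as tape:
        features = encoder(img_features)
        for t in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_object(target[:, t], predictions)
            # Teacher forcing: feed the *ground-truth* word, not the prediction.
            dec_input = tf.expand_dims(target[:, t], 1)

    trainable = encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, trainable)
    optimizer.apply_gradients(zip(grads, trainable))
    return loss / int(target.shape[1])
```

At inference time the loop is the same except that `dec_input` is set to the word the decoder just predicted (e.g. the argmax of `predictions`), repeated until the `<end>` token is produced.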

Figures: decoding during training (teacher forcing) and during inference.

Model Deployment:

We used the Plotly Dash library to deploy our model, and we added a dashboard page that shows the model architecture and the structure of our project.
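A minimal sketch of such a Dash app, assuming a hypothetical `generate_caption` helper that wraps the trained model; the component ids and layout are illustrative:

```python
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([
    html.H2("Scene-Description"),
    dcc.Upload(id="upload-image", children=html.Button("Upload an image")),
    html.Div(id="caption-output"),
])

@app.callback(Output("caption-output", "children"),
              Input("upload-image", "contents"))
def show_caption(contents):
    if contents is None:
        return "Upload an image to get its caption."
    # `generate_caption` is a hypothetical helper that decodes the base64
    # upload and runs the trained captioning model on it.
    return generate_caption(contents)

if __name__ == "__main__":
    app.run(debug=True)
```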

References:

  1. Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. “Show and Tell: A Neural Image Caption Generator.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164, 2015.
  2. Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” arXiv preprint arXiv:1502.03044, 2015.
  3. Simonyan, K. and Zisserman, A. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” ICLR, 2015.
  4. Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” NIPS, 2014.
  5. Karpathy, A. and Fei-Fei, L. “Deep Visual-Semantic Alignments for Generating Image Descriptions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128-3137, 2015.
  6. TensorFlow image-captioning tutorial.

Team Members:
