Visual Question Answering (VQA) is a well-known problem in Machine Learning that combines techniques from Computer Vision and Natural Language Processing. The core idea is to analyze an image and answer a question about it. The first step is to process the inputs: image processing techniques extract visual information, while natural language processing techniques handle the question posed. The VQA system then combines the information obtained from the image with the context of the question to produce a suitable answer. A program with high accuracy therefore needs both of these components to be built well, which is the main challenge in answering questions about images.
In this project, I will build a VQA program using Image Encoders (CNN, ViT, CLIP) for images and Text Encoders (LSTM, RoBERTa, CLIP) for questions. The input and output of the program are as follows:
- Input: A pair of image and question in natural language.
- Output: An answer to the question about the image (Yes/No).
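To make the pipeline concrete, here is a minimal PyTorch sketch of the fusion step: two encoders produce image and question features, which are concatenated and passed through a small classifier that outputs a single Yes/No logit. This is an illustration of the idea only, not the exact code in this repo; the class name, feature dimensions, and classifier sizes are assumptions.

```python
import torch
import torch.nn as nn


class VQAClassifier(nn.Module):
    """Illustrative fusion head: concatenate image and question features
    and predict a single Yes/No logit. The encoders are passed in."""

    def __init__(self, image_encoder, text_encoder, img_dim, txt_dim, hidden_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. CNN / ViT / CLIP image tower
        self.text_encoder = text_encoder     # e.g. LSTM / RoBERTa / CLIP text tower
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),        # one logit: Yes vs. No
        )

    def forward(self, images, questions):
        img_feat = self.image_encoder(images)       # (B, img_dim)
        txt_feat = self.text_encoder(questions)     # (B, txt_dim)
        fused = torch.cat([img_feat, txt_feat], dim=1)
        return self.classifier(fused)               # (B, 1) logits
```

With a single output logit, a binary cross-entropy loss (`nn.BCEWithLogitsLoss`) would be the natural training objective for the Yes/No setting.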
You can download the vqa-coco-dataset here. After that, you should organize the folder structure as follows:
- 📁 Visual_Question_Answering
  - 📁 data
    - 📂 val2014-resised
    - 📄 vaq2.0.TrainImages.txt
    - 📄 vaq2.0.DevImages.txt
    - 📄 vaq2.0.TestImages.txt
  - 📁 CLIP
  - 📁 CNN_LSTM
  - 📁 ViT_RoBERTa
  - 📁 src
  - 📁 sample
  - 🐍 main_CLIP.py
  - 🐍 main_CNN_LSTM.py
  - 🐍 main_ViT_RoBERTa.py
  - 🐍 infer.py
  - 📄 README.md
📁 Data

Some sample data from the VQA dataset, in the form of Yes/No questions.
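For orientation, the sketch below shows one way such a split file could be wrapped in a PyTorch `Dataset`. The tab-separated line format (image filename, question, answer) is an assumption for illustration only; adapt the parsing to the actual contents of the `vaq2.0.*.txt` files.

```python
import os

from PIL import Image
from torch.utils.data import Dataset


class YesNoVQADataset(Dataset):
    """Illustrative loader; the tab-separated line format assumed below is
    NOT guaranteed to match the real vaq2.0.*.txt files."""

    def __init__(self, split_file, image_dir, transform=None):
        self.samples = []
        with open(split_file, encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split("\t")
                if len(parts) == 3:
                    self.samples.append(parts)   # (image_name, question, answer)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_name, question, answer = self.samples[idx]
        image = Image.open(os.path.join(self.image_dir, image_name)).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = 1 if answer.lower() == "yes" else 0   # binary Yes/No target
        return image, question, label
```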
First, clone this repo and organize the data as above.
```bash
git clone https://github.com/dinhquy-nguyen-1704/Visual_Question_Answering.git
cd Visual_Question_Answering
pip install timm
pip install transformers
pip install open_clip_torch
```
If you want to use CNN as the Image Encoder and LSTM as the Text Encoder:

```bash
python main_CNN_LSTM.py --cnn_model_name resnet50
```
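Roughly, this configuration pairs a ResNet-50 feature extractor with an LSTM over the question tokens, both feeding a fusion head like the one sketched earlier. The sketch below is illustrative only; the layer sizes (`embed_dim=300`, `hidden_dim=512`) and the tokenization scheme are assumptions, not necessarily what main_CNN_LSTM.py does.

```python
import timm
import torch.nn as nn


class CNNImageEncoder(nn.Module):
    """ResNet-50 from timm used as a pooled feature extractor (2048-dim)."""

    def __init__(self, model_name="resnet50"):
        super().__init__()
        # num_classes=0 removes the classification head and returns pooled features
        self.backbone = timm.create_model(model_name, pretrained=True, num_classes=0)

    def forward(self, images):                    # images: (B, 3, H, W)
        return self.backbone(images)              # (B, 2048)


class LSTMTextEncoder(nn.Module):
    """Token embedding + single-layer LSTM; the last hidden state summarizes the question."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (B, T)
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(embedded)
        return h_n[-1]                            # (B, hidden_dim)
```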
If you want to use ViT as the Image Encoder and RoBERTa as the Text Encoder:

```bash
python main_ViT_RoBERTa.py --img_feature_extractor_name google/vit-base-patch16-224 --text_tokenizer_name roberta-base
```
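In the same spirit, the ViT and RoBERTa encoders can be built with Hugging Face transformers roughly as follows. Pooling via the first ([CLS]/&lt;s&gt;) token and the overall wiring are assumptions for illustration; main_ViT_RoBERTa.py may differ in detail.

```python
import torch.nn as nn
from transformers import AutoTokenizer, RobertaModel, ViTModel


class ViTImageEncoder(nn.Module):
    def __init__(self, model_name="google/vit-base-patch16-224"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(model_name)

    def forward(self, pixel_values):              # (B, 3, 224, 224)
        outputs = self.vit(pixel_values=pixel_values)
        return outputs.last_hidden_state[:, 0]    # [CLS] token, (B, 768)


class RoBERTaTextEncoder(nn.Module):
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.roberta = RobertaModel.from_pretrained(model_name)

    def forward(self, questions):                 # list of question strings
        # Note: tokenized tensors are created on CPU; move them to the
        # model's device in real training code.
        batch = self.tokenizer(questions, padding=True, truncation=True,
                               return_tensors="pt")
        outputs = self.roberta(**batch)
        return outputs.last_hidden_state[:, 0]    # <s> token, (B, 768)
```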
If you want to use CLIP as both Encoders and an MLP as the Classifier:

```bash
python main_CLIP.py --clip_model_type ViT-B-32 --clip_pretrained laion2b_e16
```
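Here both towers come from the same pretrained CLIP model and only a small head is trained on top of the image and text embeddings. The sketch below assumes frozen CLIP towers, feature concatenation, and a 2-layer MLP; the actual classifier in main_CLIP.py may be wired differently.

```python
import open_clip
import torch
import torch.nn as nn


class CLIPYesNoClassifier(nn.Module):
    """Frozen CLIP image/text towers + trainable MLP head (illustrative)."""

    def __init__(self, model_type="ViT-B-32", pretrained="laion2b_e16", hidden_dim=512):
        super().__init__()
        self.clip, _, self.preprocess = open_clip.create_model_and_transforms(
            model_type, pretrained=pretrained)
        self.tokenizer = open_clip.get_tokenizer(model_type)
        for p in self.clip.parameters():          # keep CLIP frozen
            p.requires_grad = False
        embed_dim = 512                           # feature dim for ViT-B-32; adjust for other model types
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),             # one Yes/No logit
        )

    def forward(self, images, questions):
        with torch.no_grad():
            img_feat = self.clip.encode_image(images).float()
            txt_feat = self.clip.encode_text(self.tokenizer(questions)).float()
        return self.mlp(torch.cat([img_feat, txt_feat], dim=1))
```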
The metric used in this task is accuracy; the results are evaluated on the test set of the dataset.
| Image Encoder | Text Encoder | Accuracy |
|---|---|---|
| CNN | LSTM | 54% |
| ViT | RoBERTa | 63% |
| CLIP | CLIP | 73% |
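Accuracy here is simply the fraction of test questions answered correctly. A minimal evaluation loop under that definition, assuming a model and DataLoader with the interfaces sketched above, could look like this:

```python
import torch


@torch.no_grad()
def evaluate(model, test_loader, device="cpu"):
    """Fraction of Yes/No questions answered correctly (illustrative)."""
    model.eval()
    correct, total = 0, 0
    for images, questions, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images, questions).squeeze(1)      # (B,)
        preds = (torch.sigmoid(logits) > 0.5).long()      # threshold at 0.5
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```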
I have prepared a sample folder that includes some pairs of images and related questions. To quickly test the pretrained CLIP version, download the pretrained model, place it in the Visual_Question_Answering folder, and run:

```bash
python infer.py --img_path "./sample/COCO_val2014_000000262376.jpg" --question "Is this a big building ?"
```
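Under the hood, inference amounts to preprocessing the image, encoding the question, and thresholding the logit. A hedged sketch, reusing the `CLIPYesNoClassifier` sketch from above (the checkpoint filename is a placeholder, and infer.py itself may work differently):

```python
import torch
from PIL import Image

CHECKPOINT = "clip_vqa_best.pt"   # placeholder name for the downloaded weights

model = CLIPYesNoClassifier()                     # sketch class from the CLIP section
model.load_state_dict(torch.load(CHECKPOINT, map_location="cpu"))
model.eval()

image = model.preprocess(
    Image.open("./sample/COCO_val2014_000000262376.jpg")).unsqueeze(0)
with torch.no_grad():
    logit = model(image, ["Is this a big building ?"])
print("Yes" if torch.sigmoid(logit).item() > 0.5 else "No")
```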
- AIO2023
- PyTorch
- HuggingFace