We recommend using Anaconda to install the following packages.
import nltk
nltk.download()
> d punkt
Download the dataset files and pre-trained models. We use the splits produced by Andrej Karpathy. To use full image encoders, download the images from their original sources.
wget http://lsa.pucrs.br/jonatas/seam-data/irv2_precomp.tar.gz
wget http://lsa.pucrs.br/jonatas/seam-data/resnet152_precomp.tar.gz
wget http://lsa.pucrs.br/jonatas/seam-data/vocab.tar.gz
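The archives above have to be unpacked before training. A minimal sketch of the download-and-extract step in Python, using only the standard library. The target directory "data" and the helper names fetch, extract and download_all are illustrative assumptions, not part of this repository; the URLs are copied from the wget commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URLs copied from the wget commands above.
ARCHIVES = [
    "http://lsa.pucrs.br/jonatas/seam-data/irv2_precomp.tar.gz",
    "http://lsa.pucrs.br/jonatas/seam-data/resnet152_precomp.tar.gz",
    "http://lsa.pucrs.br/jonatas/seam-data/vocab.tar.gz",
]


def fetch(url, dest_dir):
    """Download url into dest_dir unless the file is already present."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target


def extract(archive, out_dir):
    """Unpack a .tar.gz archive into out_dir and return its member names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(out_dir)
    return names


def download_all(dest_dir="data"):
    """Fetch and unpack every archive; dest_dir is an assumed location."""
    for url in ARCHIVES:
        extract(fetch(url, dest_dir), dest_dir)
```

Calling download_all() once populates the assumed data directory. Re-running it skips files that were already downloaded but unpacks them again.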
** Models not available yet.
Run train.py
python train.py --data_name resnet152_precomp --logger_name runs/model --text_encoder gru --max_violation --lr_update 10 --learning_rate 1e-4 --resume /models/txt_enc.tar --resume2 models/txt_enc_epoch_600.pth
from vocab import Vocabulary
import evaluation
evaluation.evalrank("$RUN_PATH/model_best.pth.tar", data_path="$DATA_PATH", split="test", fold5=True)
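Note that $RUN_PATH and $DATA_PATH are only expanded when the snippet is passed through a shell; inside a Python file they stay literal strings. A minimal sketch of the same call as a script, assuming both variables are exported in the environment. The helper names resolve_paths and run_evaluation are hypothetical; evalrank and its arguments are taken from the snippet above.

```python
import os


def resolve_paths(env=os.environ):
    """Return (model_path, data_path) built from RUN_PATH and DATA_PATH."""
    missing = [name for name in ("RUN_PATH", "DATA_PATH") if name not in env]
    if missing:
        raise KeyError("environment variables not set: " + ", ".join(missing))
    model_path = os.path.join(env["RUN_PATH"], "model_best.pth.tar")
    return model_path, env["DATA_PATH"]


def run_evaluation():
    # Imported lazily so this file can be loaded outside the repository.
    from vocab import Vocabulary  # noqa: F401  (kept from the snippet above)
    import evaluation

    model_path, data_path = resolve_paths()
    evaluation.evalrank(model_path, data_path=data_path, split="test", fold5=True)
```

Run it from the repository root, for example after exporting RUN_PATH=runs/model and DATA_PATH=data; both values are placeholders.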
To do cross-validation on MSCOCO, pass fold5=True with a model trained using --data_name coco.
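As a rough illustration of what five-fold evaluation involves, the sketch below splits a test set into five equal, contiguous folds, scores each fold independently, and averages the resulting metrics. It assumes this is how fold5=True behaves, which this README does not spell out. score_fold is a hypothetical stand-in for the repository's ranking routine, and five_fold_average is an illustrative name.

```python
from statistics import mean


def five_fold_average(items, score_fold, n_folds=5):
    """Score n_folds contiguous, equal-sized folds and average each metric.

    score_fold is a hypothetical callable that takes one fold and returns a
    dict mapping metric names to values.
    """
    if len(items) % n_folds != 0:
        raise ValueError("number of items must be divisible by n_folds")
    fold_size = len(items) // n_folds
    per_fold = [
        score_fold(items[i * fold_size:(i + 1) * fold_size])
        for i in range(n_folds)
    ]
    # Average every metric over the folds.
    return {key: mean(scores[key] for scores in per_fold) for key in per_fold[0]}
```

Averaging over folds smooths out the variance of any single subset, which is why cross-validated numbers are usually the ones reported for the full MSCOCO test set.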
If you find this code useful, please cite the following papers:
Wehrmann, Jônatas; Armani, Maurício; More, Martin D.; Barros, Rodrigo C. "Fast Self-Attentive Multimodal Retrieval." IEEE Winter Conf. on Applications of Computer Vision (WACV'18).
Faghri, Fartash; Fleet, David J; Kiros, Jamie Ryan; Fidler, Sanja. "VSE++: Improved Visual-Semantic Embeddings." arXiv preprint arXiv:1707.05612.