abstract_multiple-choice.json
[{"team_members": "Damien Teney (University of Adelaide), Lingqiao Liu (University of Adelaide), Anton van den Hengel (University of Adelaide)", "standard": {"overall": 74.37, "perAnswerType": {"other": 68.31, "number": 74.97, "yes/no": 79.74}}, "team_name_order": 1, "submissionRound": 3, "team_name": "ACVT_Adelaide", "ref": "http://arxiv.org/abs/1609.05600", "method": "VQA with graph representations of scene and question, using language pre-parsing and pretrained word embeddings."}, {"team_members": "Kuniaki Saito, Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada", "standard": {"overall": 71.18, "perAnswerType": {"other": 67.93, "number": 56.19, "yes/no": 79.59}}, "team_name_order": 2, "submissionRound": 4, "team_name": "MIL", "ref": "", "method": "For holistic features, we have last fc layer from Resnet-152, fc7 from VGG-19, and features as described in Zhang et al 2016. For region features, we alternate between two: 1) Extract top 10 regions from Deep Proposal, get their softmax probs for 201 classes on ILSVRC detection task using Fast-RCNN and VGG-16 trained for the task, and average them out. 2) Get hundreds of region proposals from selective search, get their fc7 features, PCA them to 256-dim, add 8-dim coordinate info for each region, and apply VLAD coding with one cluster. We applied 1) for yes/no and number, and 2) for others, determining the type of questions by keyword extraction."}, {"team_members": "Jin-Hwa Kim (Seoul National University), Sang-Woo Lee (Seoul National University), Dong-Hyun Kwak (Seoul National University), Min-Oh Heo (Seoul National University), Jeonghee Kim (Naver Labs, Naver Corp.), Jung-Woo Ha (Naver Labs, Naver Corp.), Byoung-Tak Zhang (Seoul National University)", "standard": {"overall": 67.99, "perAnswerType": {"other": 61.99, "number": 52.57, "yes/no": 79.08}}, "team_name_order": 3, "submissionRound": 2, "team_name": "snubi-naverlabs", "ref": "http://goo.gl/ZYQHR0", "method": "A single multimodal residual networks three-block layered without data augmentation. GRUs initialized with Skip-Thought Vectors for question embedding and ResNet-152 for extracting visual feature vectors from abstract images are used. Joint representations are learned by element-wise multiplication, which leads to implicit attentional model without attentional parameters."}, {"team_members": "", "standard": {"overall": 29.15, "perAnswerType": {"other": 1.67, "number": 0.22, "yes/no": 64.9}}, "team_name_order": 4, "submissionRound": 1, "team_name": "vt-all_yes", "ref": "", "method": ""yes" (prior) is picked as the predicted answer for all questions"}, {"team_members": "Peng Zhang (Virginia Tech), Yash Goyal (Virginia Tech), Douglas Summers-Stay (Army Research Laboratory), Dhruv Batra (Virginia Tech), Devi Parikh (Virginia Tech)", "standard": {"overall": 35.25, "perAnswerType": {"other": 1.31, "number": 0.21, "yes/no": 79.14}}, "team_name_order": 5, "submissionRound": 1, "team_name": "vt_arl_binary", "ref": "http://arxiv.org/pdf/1511.05099v4.pdf", "method": "We first identify primary object and secondary object from questions, which tell us which regions should be paid attention on images. And we extract image features based on that. 
Then we verify the visual concepts by encoding the questions via LSTM, combing image features, and feeding into MLP."}, {"team_members": "Yash Goyal (Virginia Tech), Peng Zhang (Virginia Tech), Dhruv Batra (Virginia Tech), Devi Parikh (Virginia Tech)", "standard": {"overall": 69.21, "perAnswerType": {"other": 66.65, "number": 52.9, "yes/no": 77.46}}, "team_name_order": 6, "submissionRound": 1, "team_name": "vt_qLSTM-globalImage", "ref": "http://arxiv.org/abs/1511.05099", "method": "This model uses holistic image features for abstract scenes such as objects occurrence, categories occurrence, instances (for large and small objects), expressions and poses (for humans), and LSTM embedding for questions. Question and image features are point-wise multiplied and passed though a 2-layer MLP to obtain softmax distribution over most frequent 270 answers in the training dataset."}, {"team_members": "Yash Goyal (Virginia Tech), Peng Zhang (Virginia Tech), Dhruv Batra (Virginia Tech), Devi Parikh (Virginia Tech)", "standard": {"overall": 61.41, "perAnswerType": {"other": 49.19, "number": 49.65, "yes/no": 76.9}}, "team_name_order": 7, "submissionRound": 1, "team_name": "vt_qLSTMalone", "ref": "http://arxiv.org/abs/1511.05099", "method": "This model extracts LSTM embedding for questions, passes them though a 2-layer MLP to obtain softmax distribution over most frequent 270 answers in the training dataset."}, {"date": "2018-07-28"}]
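For reference, a minimal Python sketch of how this leaderboard file can be consumed: it loads the JSON array, drops the trailing {"date": ...} metadata record, and prints the submissions sorted by overall accuracy with per-answer-type breakdowns. The field names follow the data above; the file path and the script itself are illustrative assumptions, not part of the repository.

import json

# Path is assumed; point it at your local copy of the leaderboard file.
with open("abstract_multiple-choice.json") as f:
    records = json.load(f)

# The last record ({"date": ...}) is metadata, not a submission; keep only
# records that carry a "standard" accuracy block.
entries = [r for r in records if "standard" in r]

# Sort by overall accuracy (descending) and print a compact summary line per team.
for e in sorted(entries, key=lambda r: r["standard"]["overall"], reverse=True):
    per_type = e["standard"]["perAnswerType"]
    print(f'{e["team_name"]:22s} overall={e["standard"]["overall"]:5.2f} '
          f'yes/no={per_type["yes/no"]:5.2f} number={per_type["number"]:5.2f} '
          f'other={per_type["other"]:5.2f}')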