Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
This repository comprises leaderboards, dataset lists, and paper lists for Video-Language Understanding. It provides supplementary information for our survey paper Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives, published at ACL 2024 (Findings). If you find any errors, please don't hesitate to open an issue or pull request.
If you are interested in our survey, please cite it as:
@article{nguyen2024video,
  title={Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives},
  author={Nguyen, Thong and Bin, Yi and Xiao, Junbin and Qu, Leigang and Li, Yicong and Wu, Jay Zhangjie and Nguyen, Cong-Duy and Ng, See-Kiong and Tuan, Luu Anh},
  journal={arXiv preprint arXiv:2406.05615},
  year={2024}
}
Methods | Model architecture | Video | Text | R@1 | R@5 | R@10 |
---|---|---|---|---|---|---|
[VSE-LSTM-Neurips2014] | Pre-TF | ConvNet/OxfordNet | GloVe-LSTM | 3.8 | 12.7 | 17.1 |
[C+LSTM+SA-FC7-arXiv2016] | Pre-TF | VGG | GloVe-LSTM | 4.2 | 12.9 | 19.9 |
[EITanque-arXiv2016] | Pre-TF | VGG | word2vec-LSTM | 4.7 | 16.6 | 24.1 |
[SA-G+SA-FC7-arXiv2016] | Pre-TF | VGG | GloVe | 3.1 | 9.0 | 13.4 |
[CT-SAN-CVPR2017] | Pre-TF | RN | word2vec-LSTM | 4.4 | 16.6 | 22.3 |
[JSFusion-ECCV2018] | Pre-TF | RN | GloVe-LSTM | 10.2 | 31.2 | 43.2 |
[All-in-one-arXiv2022] | Shared TF | Linear | BT | 37.9 | 68.1 | 77.1 |
[VLM-ACL2021] | Shared TF | S3D | BT | 28.1 | 55.5 | 67.4 |
[DeCEMBERT-NAACL2021] | Shared TF | RN | BT | 17.5 | 44.3 | 58.6 |
[ActBERT-arXiv2020] | Stacked TF | Faster-RCNN | BT | 16.3 | 42.8 | 56.9 |
[VIOLET-CVPR2023] | Stacked TF | VS-TF | BT | 37.2 | 64.8 | 75.8 |
[VindLU-CVPR2023] | Stacked TF | ViT | BT | 48.8 | 72.4 | 82.2 |
[HERO-EMNLP2020] | Stacked TF | RN+SlowFast | BT | 16.8 | 43.4 | 57.7 |
[MV-GPT-arXiv2022] | Stacked TF | ViViT | BT | 37.3 | 65.5 | 75.1 |
[CLIP2TV-ICLR2023] | Dual TF | ViT | CLIP-text | 32.4 | 58.2 | 68.6 |
[CLIP-ViP-ICLR2023] | Dual TF | ViT | CLIP-text | 49.6 | 74.5 | 84.8 |
[CLIP4Clip-arXiv2021] | Dual TF | ViT | CLIP-text | 44.5 | 71.4 | 81.6 |
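The R@1/R@5/R@10 columns above follow the standard text-to-video retrieval protocol: each text query ranks all candidate videos by similarity, and R@K is the percentage of queries whose ground-truth video appears in the top K results. The snippet below is an illustrative sketch (not code from any of the listed methods) that computes these metrics from a precomputed query-by-video similarity matrix, assuming query i's ground-truth video sits in column i:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sim[i, j] = similarity between text query i and video j.
    Assumes a square matrix whose ground truth for query i is column i."""
    # Rank of the ground-truth video = number of videos scored strictly higher.
    gt_scores = np.diag(sim)[:, None]        # (N, 1)
    ranks = (sim > gt_scores).sum(axis=1)    # 0 means ranked first
    return {f"R@{k}": float((ranks < k).mean() * 100) for k in ks}

# Toy usage: 4 queries x 4 videos.
sim = np.array([[0.9, 0.1, 0.2, 0.3],
                [0.2, 0.4, 0.8, 0.1],   # ground truth (col 1) only ranked 2nd
                [0.1, 0.2, 0.7, 0.3],
                [0.3, 0.1, 0.2, 0.6]])
print(recall_at_k(sim))  # {'R@1': 75.0, 'R@5': 100.0, 'R@10': 100.0}
```
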
Methods | Architecture | Video | BLEU-4 | METEOR | CIDEr |
---|---|---|---|---|---|
[TA-ICCV2015] | Pre-TF | 3D-CNN | 36.5 | 25.7 | / |
[h-RNN-CVPR2016] | Pre-TF | VGG | 36.8 | 25.9 | / |
[MFATT-arXiv2016] | Pre-TF | RN+C3D | 39.1 | 26.7 | / |
[CAT-TM-arXiv2016] | Pre-TF | RN+C3D | 36.6 | 25.6 | / |
[NFS-TM-arXiv2016] | Pre-TF | RN+C3D | 37.0 | 25.9 | / |
[Fuse-TM-arXiv2016] | Pre-TF | RN+C3D | 37.5 | 25.9 | / |
[MARN-CVPR2019] | Pre-TF | RN | / | / | 46.8 |
[Res-ATT-WWW2019] | Pre-TF | RN | 37.0 | 26.9 | 40.7 |
[DenseLSTM-ACMM2019] | Pre-TF | VGG | 38.1 | 27.2 | 42.8 |
[VIOLET-CVPR2023] | Stacked TF | VS-TF | / | / | 58.0 |
[LAVENDER-arXiv2022] | Stacked TF | VS-TF | / | / | 57.4 |
[VLAB-arXiv2023] | Stacked TF | EVA-G | 54.6 | 33.4 | 74.9 |
[UniVL-arXiv2020] | Stacked TF | S3D | 41.8 | 28.9 | 50.0 |
[MV-GPT-arXiv2022] | Stacked TF | ViViT | 48.9 | 38.7 | 60.0 |
[CLIP-DCD-PRCV2022] | Stacked TF | ViT | 48.2 | 30.9 | 64.8 |
[DeCEMBERT-NAACL2021] | Stacked TF | RN | 45.2 | 29.7 | 52.3 |
[mPLUG-2-ICML2023] | Stacked TF | ViT | 57.8 | 34.9 | 80.3 |
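BLEU-4, METEOR, and CIDEr in the captioning table above are n-gram-based caption metrics; published numbers are typically produced with the standard COCO caption evaluation toolkit. Purely as a rough illustration (assuming NLTK is installed; the smoothing choice is arbitrary and not part of any official protocol), BLEU-4 can be sketched as follows:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# references: per video, a list of tokenized ground-truth captions.
# hypotheses: per video, one tokenized generated caption.
references = [
    [["a", "man", "is", "playing", "a", "guitar"],
     ["someone", "plays", "the", "guitar"]],
]
hypotheses = [["a", "man", "plays", "a", "guitar"]]

# BLEU-4: uniform weights over 1- to 4-gram precisions.
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4 * 100:.1f}")
```
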
Methods | Architecture | Video | Text | MSRVTT | MSVD |
---|---|---|---|---|---|
[E-MN-ACMM2017] | Pre-TF | VGG + C3D | GloVe-LSTM | 30.4 | 26.7 |
[QueST-AAAI2020] | Pre-TF | RN + C3D | GloVe-LSTM | 40.0 | / |
[HME-CVPR2019] | Pre-TF | RN/VGG + C3D | GloVe-GRU | 34.6 | 36.1 |
[HGA-AAAI2020] | Pre-TF | RN/VGG + C3D | GloVe-GRU | 33.0 | 33.7 |
[ST-VQA-IJCV2019] | Pre-TF | RN+C3D | GloVe-LSTM | 35.5 | 34.7 |
[PGAT-ACMMM2021] | Pre-TF | Faster-RCNN | GloVe-LSTM | 38.1 | 39.0 |
[HCRN-CVPR2020] | Pre-TF | RN | GloVe-LSTM | 38.6 | 41.2 |
[All-in-one-arXiv2022] | Shared TF | Linear | BT | 44.3 | 47.9 |
[LAVENDER-arXiv2022] | Stacked TF | VS-TF | BT | 45.0 | 56.6 |
[DeCEMBERT-NAACL2021] | Stacked TF | RN | BT | 37.4 | / |
[VindLU-CVPR2023] | Stacked TF | ViT | BT | 44.6 | / |
[VIOLET-CVPR2023] | Stacked TF | VS-TF | BT | 44.5 | 54.7 |
[ClipBERT-CVPR2021] | Stacked TF | RN | BT | 37.4 | / |
[VGT-ECCV2022] | Dual TF | Faster-RCNN | BT | 39.7 | / |
[CoVGT-TPAMI2023] | Dual TF | Faster-RCNN | BT | 40.0 | / |
[Video-ChatGPT-arXiv2023] | LLM-Augmented | ViT | Vicuna | 49.3 | 64.9 |
[LLaMA-Vid-arXiv2023] | LLM-Augmented | EVA-G | Vicuna | 58.9 | 70.0 |
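The MSRVTT and MSVD columns above report open-ended VideoQA accuracy, i.e., the percentage of questions whose predicted answer matches the ground truth; note that LLM-augmented models such as Video-ChatGPT are usually scored with a GPT-assisted matching protocol rather than strict string matching. Below is a minimal exact-match sketch only (the normalization choices are our assumptions, not an official protocol):

```python
def qa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Top-1 accuracy: a prediction counts as correct only if it exactly
    matches the ground-truth answer after simple lowercasing/stripping."""
    assert len(predictions) == len(answers)
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Toy usage.
preds = ["guitar", "two", "dog"]
golds = ["guitar", "three", "dog"]
print(f"Accuracy: {qa_accuracy(preds, golds):.1f}")  # 66.7
```
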
Dataset | Links | Video source | Annotation | Tasks | #Videos/#Scenes |
---|---|---|---|---|---|
MSVD | [Paper], [Dataset] | YouTube videos | Manual | TVR, VC, VideoQA | 1.9K |
MSRVTT | [Paper], [Dataset] | Web videos | Manual | TVR, VC, VideoQA | 7.2K |
ActivityNet | [Paper], [Dataset] | YouTube videos | Manual | AL, TVR, VC, VMR | 5.8K |
FIBER | [Paper], [Dataset] | [VaTeX] | Manual | VC, VideoQA | 28K |
WildQA | [Paper], [Dataset] | YouTube videos | Manual | VideoQA | 0.4K |
NExT-QA | [Paper], [Dataset] | [VidOR] | Manual | VideoQA | 5.4K |
CausalVid-QA | [Paper], [Dataset] | [Kinetics-700] | Manual | VideoQA | 26K |
HowTo100M | [Paper], [Dataset] | YouTube videos | Auto | PT | 1.2M |
HD-VILA-100M | [Paper], [Dataset] | YouTube videos | Auto | PT | 3.3M |
YT-Temporal-180M | [Paper], [Dataset] | YouTube videos | Auto | PT | 6M |
TGIF-QA | [Paper], [Dataset] | Animated GIFs | Manual | VideoQA | 71K |
TGIF-QA-R | [Paper], [Dataset] | [TGIF-QA] | Manual, Auto | VideoQA | 71K |
DiDeMo | [Paper], [Dataset] | [YFCC100M] | Manual | TVR | 11K |
YouCook2 | [Paper], [Dataset] | YouTube videos | Manual | TVR, VC | 2K |
HMDB-51 | [Paper], [Dataset] | Web videos | Manual | TVR, AR | 6.8K |
Kinetics-400 | [Paper], [Dataset] | YouTube videos | Manual | AR | 306K |
Kinetics-600 | [Paper], [Dataset] | [Kinetics-400] | Manual | AR, VG | 480K |
Kinetics-700 | [Paper], [Dataset] | [Kinetics-600] | Manual | AR | 650K |
VaTeX | [Paper], [Dataset] | [Kinetics-600] | Manual | TVR, VC | 41K |
TVR | [Paper], [Dataset] | [TVQA] | Manual | VMR | 22K |
How2R | [Paper], [Dataset] | [HowTo100M] | Manual | VMR | 22K |
How2QA | [Paper], [Dataset] | [HowTo100M] | Manual | VideoQA | 22K |
YouTube Highlights | [Paper], [Dataset] | YouTube videos | Manual | VMR | 0.6K |
TACoS | [Paper], [Dataset] | [MPII Composites] | Manual | VMR | 0.1K |
QVHighlights | [Paper], [Dataset] | YouTube vlogs | Manual | VMR | 10K |
TVSum | [Paper], [Dataset] | YouTube videos | Manual | VMR | 50 |
ViTT | [Paper], [Dataset] | [YouTube-8M] | Manual | VMR | 5.8K |
VidChapters-7M | [Paper], [Dataset] | [YT-Temporal-180M] | Auto | VC, VMR | 817K |
VideoCC3M | [Paper], [Dataset] | Web videos | Auto | PT | 6.3M |
WebVid-10M | [Paper], [Dataset] | Web videos | Auto | PT | 10.7M |
COIN | [Paper], [Dataset] | YouTube videos | Manual | AS | 12K |
CrossTask | [Paper], [Dataset] | YouTube videos | Manual | AR | 4.7K |
Alivol-10M | [Paper] | E-commerce videos | Auto | PT | 10M |
LSMDC | [Paper], [Dataset] | Movies | Manual | TVR | 72 |
EK-100 | [Paper], [Dataset] | Manual | Manual | AR, AL | 7K |
SSV1 | [Paper], [Dataset] | Manual | Manual | AR | 108K |
SSV2 | [Paper], [Dataset] | Manual | Manual | AR | 221K |
Moments in Time | [Paper], [Dataset] | Web videos | Manual | AR | 1M |
InternVid | [Paper], [Dataset] | YouTube videos | Auto | PT | 7.1M |
How2 | [Paper], [Dataset] | YouTube videos | Auto | VC | 13.2K |
WTS70M | [Paper] | YouTube videos | Auto | PT | 70M |
Charades | [Paper], [Dataset] | Manual | Manual | AR, VMR, VideoQA | 10K |
- Survey: Transformer based video-language pre-training (arXiv 2021) [Paper]
- Self-supervised learning for videos: A survey (ACM Computing Surveys 2022) [Paper] [Code]
- Video question answering: Datasets, algorithms and challenges (EMNLP 2022) [Paper] [Code]
- Deep learning for video-text retrieval: a review (IJMIR 2023) [Paper]
- A review of deep learning for video captioning (arXiv 2023) [Paper]
- Video question answering: a survey of models and datasets (Mobile Networks and Applications 2021) [Paper]

- Video question answering via attribute-augmented attention network learning (SIGIR 2017) [Paper] [Code]
- Convolutional Two-Stream Network Fusion for Video Action Recognition (CVPR 2016) [Paper] [Code]
- Tensor-train recurrent neural networks for video classification (arXiv 2017) [Paper] [Code]
- Two-stream RNN/CNN for action recognition in 3D videos (IROS 2017) [Paper] [Code]
- ConvNet architecture search for spatiotemporal feature learning (arXiv 2017) [Paper] [Code]
- A joint sequence fusion model for video question answering and retrieval (ECCV 2018) [Paper]
- Learning language-visual embedding for movie understanding with natural-language (arXiv 2016) [Paper]
- Unifying visual-semantic embeddings with multimodal neural language models (NeurIPS 2014) [Paper]
- Temporal tessellation for video annotation and summarization (arXiv 2016) [Paper] [Code]
- End-to-end concept word detection for video captioning, retrieval, and question answering (CVPR 2017) [Paper]
- Video captioning with multi-faceted attention (arXiv 2016) [Paper]
- Describing videos by exploiting temporal structure (ICCV 2015) [Paper] [Code]
- Video paragraph captioning using hierarchical recurrent neural networks (CVPR 2016) [Paper]
- Localizing moments in video with natural language (ICCV 2017) [Paper]
- Hierarchical boundary-aware neural encoder for video captioning (CVPR 2017) [Paper]
- TALL: Temporal activity localization via language query (ICCV 2017) [Paper] [Code]
- Leveraging video descriptions to learn video question answering (AAAI 2017) [Paper]

- VATT: Transformers for multimodal self-supervised learning from raw video, audio and text (NeurIPS 2021) [Paper] [Code]
- LAVENDER: Unifying video-language understanding as masked language modeling (arXiv 2022) [Paper] [Code]
- All in one: Exploring unified video-language pretraining (CVPR 2023) [Paper] [Code]
- An empirical study of end-to-end video-language transformers with masked visual modeling (CVPR 2023) [Paper] [Code]
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling (arXiv 2021) [Paper] [Code]
- VindLU: A recipe for effective video-and-language pretraining (CVPR 2023) [Paper] [Code]
- Less is more: ClipBERT for video-and-language learning via sparse sampling (CVPR 2021) [Paper] [Code]

- HERO: Hierarchical encoder for video+language omni-representation pretraining (EMNLP 2020) [Paper] [Code]
- End-to-end generative pretraining for multimodal video captioning (arXiv 2022) [Paper]
- VLAB: Enhancing video language pre-training by feature adapting and blending (arXiv 2023) [Paper]
- UniVL: A unified video and language pre-training model for multimodal understanding and generation (arXiv 2020) [Paper] [Code]
- CLIP meets video captioning: Concept-aware representation learning does matter (PRCV 2022) [Paper] [Code]
- mPLUG-2: A modularized multimodal foundation model across text, image and video (ICML 2023) [Paper] [Code]

- CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment (ICLR 2023) [Paper] [Code]
- CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning (arXiv 2021) [Paper] [Code]
- Video graph transformer for video question answering (ECCV 2022) [Paper] [Code]
- Contrastive video question answering via video graph transformer (TPAMI 2023) [Paper] [Code]
- Frozen in time: A joint video and image encoder for end-to-end retrieval (ICCV 2021) [Paper] [Code]
- A CLIP-Hitchhiker’s guide to long video retrieval (arXiv 2022) [Paper] [Code]
- ECLIPSE: Efficient long-range video retrieval using sight and sound (ECCV 2022) [Paper] [Code]
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding (EMNLP 2021) [Paper] [Code]

- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding (EMNLP 2023) [Paper] [Code]
- VideoChat: Chat-centric video understanding (arXiv 2023) [Paper] [Code]
- VideoLLM: Modeling video sequence with large language models (arXiv 2023) [Paper] [Code]
- LLaMA-VID: An image is worth 2 tokens in large language models (arXiv 2023) [Paper] [Code]
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models (arXiv 2023) [Paper]

- CLIP2TV: Align, match and distill for video-text retrieval (arXiv 2021) [Paper]
- Understanding Chinese video and language via contrastive multimodal pre-training (arXiv 2021) [Paper]
- DeCEMBERT: Learning from noisy instructional videos via dense captions and entropy minimization (NAACL 2021) [Paper] [Code]
- VideoBERT: A joint model for video and language representation learning (ICCV 2019) [Paper] [Code]
- Learning video representations using contrastive bidirectional transformer (arXiv 2019) [Paper]
- MERLOT: Multimodal neural script knowledge models (NeurIPS 2021) [Paper] [Code]
- Revealing single frame bias for video-and-language learning (arXiv 2022) [Paper] [Code]
- ActBERT: Learning Global-Local Video-Text Representations (CVPR 2020) [Paper] [Code]

- Multilevel language and vision integration for text-to-clip retrieval (AAAI 2019) [Paper] [Code]
- ST-Adapter: Parameter-efficient image-to-video transfer learning (NeurIPS 2022) [Paper] [Code]
- Zero-shot video question answering via frozen bidirectional language models (NeurIPS 2022) [Paper] [Code]
- Attentive Moment Retrieval in Videos (SIGIR 2018) [Paper]
- To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression (arXiv 2018) [Paper]
- Cross-Modal Adapter for Text-Video Retrieval (arXiv 2022) [Paper] [Code]
- AIM: Adapting Image Models for Efficient Video Action Recognition (ICLR 2023) [Paper] [Code]
- Prompting Visual-Language Models for Efficient Video Understanding (ECCV 2022) [Paper] [Code]
- Multi-modal Circulant Fusion for Video-to-Language and Backward (IJCAI 2018) [Paper]
- Long-term temporal convolutions for action recognition (arXiv 2016) [Paper] [Code]
- READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling (AAAI 2024) [Paper] [Code]

- Advancing high-resolution video-language representation with large-scale video transcriptions (CVPR 2022) [Paper] [Code]
- HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips (ICCV 2019) [Paper] [Code]
- FIBER: Fill-in-the-blanks as a challenging video understanding evaluation framework (ACL 2022) [Paper] [Code]
- NExT-QA: Next phase of question-answering to explaining temporal actions (CVPR 2021) [Paper] [Code]
- The "Something Something" Video Database for Learning and Evaluating Visual Common Sense (arXiv 2017) [Paper] [Code]
- Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100 (IJCV 2020) [Paper] [Code]
- From representation to reasoning: Towards both evidence and commonsense reasoning for video question answering (CVPR 2022) [Paper] [Code]
- Grounding Action Descriptions in Videos (TACL 2013) [Paper]
- Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments (CVPR 2018) [Paper] [Code]
- Multimodal Pretraining for Dense Video Captioning (AACL-IJCNLP 2020) [Paper] [Code]
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (NeurIPS 2021) [Paper] [Code]
- VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (ICCV 2019) [Paper] [Code]

- SVFormer: Semi-supervised video transformer for action recognition (CVPR 2023) [Paper] [Code]
- Semi-supervised video paragraph grounding with contrastive encoder (CVPR 2022) [Paper]
- Learning temporal action proposals with fewer labels (arXiv 2019) [Paper]
- Self-supervised learning for semi-supervised temporal action proposal (CVPR 2021) [Paper] [Code]
- Semi-Supervised Action Recognition with Temporal Contrastive Learning (CVPR 2021) [Paper] [Code]
- Learning Action Proposals With Fewer Labels (arXiv 2019) [Paper]

- Collecting Highly Parallel Data for Paraphrase Evaluation (ACL 2011) [Paper] [Code]
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016) [Paper] [Code]
- TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering (CVPR 2017) [Paper] [Code]
- Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction (arXiv 2018) [Paper] [Code]
- HMDB: A large video database for human motion recognition (ICCV 2011) [Paper] [Code]
- The Kinetics Human Action Video Dataset (arXiv 2017) [Paper] [Code]
- TVSum: Summarizing Web Videos Using Titles (CVPR 2015) [Paper] [Code]
- TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020) [Paper] [Code]
- COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis (CVPR 2019) [Paper] [Code]
- Cross-task weakly supervised learning from instructional videos (CVPR 2019) [Paper] [Code]
- Moments in Time Dataset: one million videos for event understanding (CVPR 2019) [Paper] [Code]

- Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning (ECCV 2024) [Paper] [Code]
- Progressive Graph Attention Network for Video Question Answering (ACMMM 2021) [Paper] [Code]
- The StreetLearn Environment and Dataset (arXiv 2019) [Paper] [Code]
- VidChapters-7M: Video Chapters at Scale (NeurIPS 2023) [Paper] [Code]
- Learning Audio-Video Modalities from Image Captions (arXiv 2022) [Paper]
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation (arXiv 2022) [Paper] [Code]
- How2: A Large-scale Dataset for Multimodal Language Understanding (NeurIPS 2018) [Paper] [Code]
- Learning Video Representations from Textual Web Supervision (arXiv 2020) [Paper]