# Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

This repository contains leaderboards and lists of datasets and papers for video-language understanding. It provides supplementary information for our survey paper Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives, published at ACL 2024 (Findings). If you find any errors, please don't hesitate to open an issue or pull request.

If you are interested in our survey, please cite it as:

```bibtex
@article{nguyen2024video,
  title={Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives},
  author={Nguyen, Thong and Bin, Yi and Xiao, Junbin and Qu, Leigang and Li, Yicong and Wu, Jay Zhangjie and Nguyen, Cong-Duy and Ng, See-Kiong and Tuan, Luu Anh},
  journal={arXiv preprint arXiv:2406.05615},
  year={2024}
}
```

## Resources

### Leaderboards

#### Text-video retrieval

| Methods | Model architecture | Video encoder | Text encoder | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|
| [VSE-LSTM-NeurIPS2014] | Pre-TF | ConvNet/OxfordNet | GloVe-LSTM | 3.8 | 12.7 | 17.1 |
| [C+LSTM+SA-FC7-arXiv2016] | Pre-TF | VGG | GloVe-LSTM | 4.2 | 12.9 | 19.9 |
| [EITanque-arXiv2016] | Pre-TF | VGG | word2vec-LSTM | 4.7 | 16.6 | 24.1 |
| [SA-G+SA-FC7-arXiv2016] | Pre-TF | VGG | GloVe | 3.1 | 9.0 | 13.4 |
| [CT-SAN-CVPR2017] | Pre-TF | RN | word2vec-LSTM | 4.4 | 16.6 | 22.3 |
| [JSFusion-ECCV2018] | Pre-TF | RN | GloVe-LSTM | 10.2 | 31.2 | 43.2 |
| [All-in-one-arXiv2022] | Shared TF | Linear | BT | 37.9 | 68.1 | 77.1 |
| [VLM-ACL2021] | Shared TF | S3D | BT | 28.1 | 55.5 | 67.4 |
| [DeCEMBERT-NAACL2021] | Shared TF | RN | BT | 17.5 | 44.3 | 58.6 |
| [ActBERT-arXiv2020] | Stacked TF | Faster-RCNN | BT | 16.3 | 42.8 | 56.9 |
| [VIOLET-CVPR2023] | Stacked TF | VS-TF | BT | 37.2 | 64.8 | 75.8 |
| [VindLU-CVPR2023] | Stacked TF | ViT | BT | 48.8 | 72.4 | 82.2 |
| [HERO-EMNLP2020] | Stacked TF | RN+SlowFast | BT | 16.8 | 43.4 | 57.7 |
| [MV-GPT-arXiv2022] | Stacked TF | ViViT | BT | 37.3 | 65.5 | 75.1 |
| [CLIP2TV-ICLR2023] | Dual TF | ViT | CLIP-text | 32.4 | 58.2 | 68.6 |
| [CLIP-ViP-ICLR2023] | Dual TF | ViT | CLIP-text | 49.6 | 74.5 | 84.8 |
| [CLIP4Clip-arXiv2021] | Dual TF | ViT | CLIP-text | 44.5 | 71.4 | 81.6 |
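
For readers new to the task, R@K (recall at rank K) is the percentage of text queries whose ground-truth video appears among the top-K retrieved candidates, so higher is better. As a reference for how these numbers are computed, here is a minimal sketch assuming a precomputed query-by-video similarity matrix with ground-truth pairs on the diagonal; all names are illustrative.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K from a similarity matrix.

    sim[i, j] is the model's score for video j given text query i;
    query i's ground-truth video is assumed to sit at index i.
    """
    order = np.argsort(-sim, axis=1)        # candidate videos, best first
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)  # 0 means ranked first
    return {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}

# Toy usage: 3 queries x 3 videos, ground truth on the diagonal.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.3, 0.4, 0.2]])
print(recall_at_k(sim))  # R@1 ~ 66.7: query 2 ranks its own video last
```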

#### Video captioning

| Methods | Architecture | Video encoder | BLEU-4 | METEOR | CIDEr |
|---|---|---|---|---|---|
| [TA-ICCV2015] | Pre-TF | 3D-CNN | 36.5 | 25.7 | / |
| [h-RNN-CVPR2016] | Pre-TF | VGG | 36.8 | 25.9 | / |
| [MFATT-arXiv2016] | Pre-TF | RN+C3D | 39.1 | 26.7 | / |
| [CAT-TM-arXiv2016] | Pre-TF | RN+C3D | 36.6 | 25.6 | / |
| [NFS-TM-arXiv2016] | Pre-TF | RN+C3D | 37.0 | 25.9 | / |
| [Fuse-TM-arXiv2016] | Pre-TF | RN+C3D | 37.5 | 25.9 | / |
| [MARN-CVPR2019] | Pre-TF | RN | / | / | 46.8 |
| [Res-ATT-WWW2019] | Pre-TF | RN | 37.0 | 26.9 | 40.7 |
| [DenseLSTM-ACMM2019] | Pre-TF | VGG | 38.1 | 27.2 | 42.8 |
| [VIOLET-CVPR2023] | Stacked TF | VS-TF | / | / | 58.0 |
| [LAVENDER-arXiv2023] | Stacked TF | VS-TF | / | / | 57.4 |
| [VLAB-arXiv2023] | Stacked TF | EVA-G | 54.6 | 33.4 | 74.9 |
| [UniVL-arXiv2020] | Stacked TF | S3D | 41.8 | 28.9 | 50.0 |
| [MV-GPT-arXiv2022] | Stacked TF | ViViT | 48.9 | 38.7 | 60.0 |
| [CLIP-DCD-PRCV2022] | Stacked TF | ViT | 48.2 | 30.9 | 64.8 |
| [DeCEMBERT-NAACL2021] | Stacked TF | RN | 45.2 | 29.7 | 52.3 |
| [mPLUG-2-ICML2023] | Stacked TF | ViT | 57.8 | 34.9 | 80.3 |
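
BLEU-4, METEOR, and CIDEr score a generated caption against human reference captions via n-gram overlap; published numbers are usually produced with the COCO caption evaluation toolkit (pycocoevalcap). As a rough illustration of BLEU-4 only, here is a minimal sketch using NLTK; its tokenization and smoothing will not exactly reproduce the toolkit's numbers.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

references = [  # per video: a list of tokenized reference captions
    [["a", "man", "is", "slicing", "a", "tomato"],
     ["someone", "cuts", "a", "tomato"]],
]
hypotheses = [  # per video: one tokenized generated caption
    ["a", "man", "slices", "a", "tomato"],
]

# BLEU-4 = geometric mean of 1- to 4-gram precisions (uniform weights),
# times a brevity penalty; smoothing avoids zeroing out on short texts.
bleu4 = corpus_bleu(
    references, hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.3f}")
```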

#### Video question answering

Accuracy (%) on the MSRVTT-QA and MSVD-QA benchmarks:

| Methods | Architecture | Video encoder | Text encoder | MSRVTT-QA | MSVD-QA |
|---|---|---|---|---|---|
| [E-MN-ACMM2017] | Pre-TF | VGG + C3D | GloVe-LSTM | 30.4 | 26.7 |
| [QueST-AAAI2020] | Pre-TF | RN + C3D | GloVe-LSTM | 40.0 | / |
| [HME-CVPR2019] | Pre-TF | RN/VGG + C3D | GloVe-GRU | 34.6 | 36.1 |
| [HGA-AAAI2020] | Pre-TF | RN/VGG + C3D | GloVe-GRU | 33.0 | 33.7 |
| [ST-VQA-IJCV2019] | Pre-TF | RN+C3D | GloVe-LSTM | 35.5 | 34.7 |
| [PGAT-ACMMM2021] | Pre-TF | Faster-RCNN | GloVe-LSTM | 38.1 | 39.0 |
| [HCRN-CVPR2020] | Pre-TF | RN | GloVe-LSTM | 38.6 | 41.2 |
| [All-in-one-arXiv2022] | Shared TF | Linear | BT | 44.3 | 47.9 |
| [LAVENDER-arXiv2022] | Stacked TF | VS-TF | BT | 45.0 | 56.6 |
| [DeCEMBERT-NAACL2021] | Stacked TF | RN | BT | 37.4 | / |
| [VindLU-CVPR2023] | Stacked TF | ViT | BT | 44.6 | / |
| [VIOLET-CVPR2023] | Stacked TF | VS-TF | BT | 44.5 | 54.7 |
| [ClipBERT-CVPR2021] | Stacked TF | CLIP-text | BT | 37.4 | / |
| [VGT-ECCV2022] | Dual TF | Faster-RCNN | BT | 39.7 | / |
| [CoVGT-TPAMI2023] | Dual TF | Faster-RCNN | BT | 40.0 | / |
| [Video-ChatGPT-arXiv2023] | LLM-Augmented | ViT | Vicuna | 49.3 | 64.9 |
| [LLaMA-Vid-arXiv2023] | LLM-Augmented | EVA-G | Vicuna | 58.9 | 70.0 |

### Datasets

| Dataset | Links | Video source | Annotation | Tasks | #Videos/#Scenes |
|---|---|---|---|---|---|
| MSVD | [Paper], [Dataset] | YouTube videos | Manual | TVR, VC, VideoQA | 1.9K |
| MSRVTT | [Paper], [Dataset] | Web videos | Manual | TVR, VC, VideoQA | 7.2K |
| ActivityNet | [Paper], [Dataset] | YouTube videos | Manual | AL, TVR, VC, VMR | 5.8K |
| FIBER | [Paper], [Dataset] | [VaTeX] | Manual | VC, VideoQA | 28K |
| WildQA | [Paper], [Dataset] | YouTube videos | Manual | VideoQA | 0.4K |
| NExT-QA | [Paper], [Dataset] | [VidOR] | Manual | VideoQA | 5.4K |
| CausalVid-QA | [Paper], [Dataset] | [Kinetics-700] | Manual | VideoQA | 26K |
| HowTo100M | [Paper], [Dataset] | YouTube videos | Auto | PT | 1.2M |
| HD-VILA-100M | [Paper], [Dataset] | YouTube videos | Auto | PT | 3.3M |
| YT-Temporal-180M | [Paper], [Dataset] | YouTube videos | Auto | PT | 6M |
| TGIF-QA | [Paper], [Dataset] | Animated GIFs | Manual | VideoQA | 71K |
| TGIF-QA-R | [Paper], [Dataset] | [TGIF-QA] | Manual, Auto | VideoQA | 71K |
| DiDeMo | [Paper], [Dataset] | [YFCC100M] | Manual | TVR | 11K |
| YouCook2 | [Paper], [Dataset] | YouTube videos | Manual | TVR, VC | 2K |
| HMDB-51 | [Paper], [Dataset] | Web videos | Manual | TVR, AR | 6.8K |
| Kinetics-400 | [Paper], [Dataset] | YouTube videos | Manual | AR | 306K |
| Kinetics-600 | [Paper], [Dataset] | [Kinetics-400] | Manual | AR, VG | 480K |
| Kinetics-700 | [Paper], [Dataset] | [Kinetics-600] | Manual | AR | 650K |
| VaTeX | [Paper], [Dataset] | [Kinetics-600] | Manual | TVR, VC | 41K |
| TVR | [Paper], [Dataset] | [TVQA] | Manual | VMR | 22K |
| How2R | [Paper], [Dataset] | [HowTo100M] | Manual | VMR | 22K |
| How2QA | [Paper], [Dataset] | [HowTo100M] | Manual | VideoQA | 22K |
| YouTube Highlights | [Paper], [Dataset] | YouTube videos | Manual | VMR | 0.6K |
| TACoS | [Paper], [Dataset] | [MPII Composites] | Manual | VMR | 0.1K |
| QVHighlights | [Paper], [Dataset] | YouTube vlogs | Manual | VMR | 10K |
| TVSum | [Paper], [Dataset] | YouTube videos | Manual | VMR | 50 |
| ViTT | [Paper], [Dataset] | [YouTube-8M] | Manual | VMR | 5.8K |
| VidChapters-7M | [Paper], [Dataset] | [YT-Temporal-180M] | Auto | VC, VMR | 817K |
| VideoCC3M | [Paper], [Dataset] | Web videos | Auto | PT | 6.3M |
| WebVid-10M | [Paper], [Dataset] | Web videos | Auto | PT | 10.7M |
| COIN | [Paper], [Dataset] | YouTube videos | Manual | AS | 12K |
| CrossTask | [Paper], [Dataset] | YouTube videos | Manual | AR | 4.7K |
| Alivol-10M | [Paper] | E-commerce videos | Auto | PT | 10M |
| LSMDC | [Paper], [Dataset] | Movies | Manual | TVR | 72 |
| EK-100 | [Paper], [Dataset] | Self-recorded | Manual | AR, AL | 7K |
| SSV1 | [Paper], [Dataset] | Self-recorded | Manual | AR | 108K |
| SSV2 | [Paper], [Dataset] | Self-recorded | Manual | AR | 221K |
| Moments in Time | [Paper], [Dataset] | Web videos | Manual | AR | 1M |
| InternVid | [Paper], [Dataset] | YouTube videos | Auto | PT | 7.1M |
| How2 | [Paper], [Dataset] | YouTube videos | Auto | VC | 13.2K |
| WTS70M | [Paper] | YouTube videos | Auto | PT | 70M |
| Charades | [Paper], [Dataset] | Self-recorded | Manual | AR, VMR, VideoQA | 10K |

Task abbreviations: TVR = text-video retrieval, VC = video captioning, VideoQA = video question answering, VMR = video moment retrieval, AR = action recognition, AL = action localization, AS = action segmentation, VG = video generation, PT = pre-training.

### Paper list

#### Survey

  1. Survey: Transformer based video-language pre-training arXiv 2021 [Paper]
  2. Self-supervised learning for videos: A survey ACM Computing Surveys 2022 [Paper] [Code]
  3. Video question answering: Datasets, algorithms and challenges EMNLP 2022 [Paper] [Code]
  4. Deep learning for video-text retrieval: a review IJMIR 2023 [Paper]
  5. A review of deep learning for video captioning arXiv 2023 [Paper]
  6. Video question answering: a survey of models and datasets Mobile Networks and Applications 2021 [Paper]

#### Model architecture perspective

**Pre-transformer**
  1. Video question answering via attribute-augmented attention network learning SIGIR 2017 [Paper] [Code]
  2. Convolutional Two-Stream Network Fusion for Video Action Recognition CVPR 2016 [Paper] [Code]
  3. Tensor-train recurrent neural networks for video classification arXiv 2017 [Paper] [Code]
  4. Two-stream rnn/cnn for action recognition in 3d videos IROS 2017 [Paper] [Code]
  5. Convnet architecture search for spatiotemporal feature learning arXiv 2017 [Paper] [Code]
  6. A joint sequence fusion model for video question answering and retrieval ECCV 2018 [Paper]
  7. Learning language-visual embedding for movie understanding with natural-language arXiv 2016 [Paper]
  8. Unifying visual-semantic embeddings with multimodal neural language models NeurIPS 2014 [Paper]
  9. Temporal tessellation for video annotation and summarization arXiv 2016 [Paper] [Code]
  10. End-to-end concept word detection for video captioning, retrieval, and question answering CVPR 2017 [Paper]
  11. Video captioning with multi-faceted attention arXiv 2016 [Paper]
  12. Describing videos by exploiting temporal structure ICCV 2015 [Paper] [Code]
  13. Video paragraph captioning using hierarchical recurrent neural networks CVPR 2016 [Paper]
  14. Localizing moments in video with natural language ICCV 2017 [Paper]
  15. Hierarchical boundary-aware neural encoder for video captioning CVPR 2017 [Paper]
  16. Tall: Temporal activity localization via language query ICCV 2017 [Paper] [Code]
  17. Leveraging video descriptions to learn video question answering AAAI 2017 [Paper]
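
The pre-transformer methods above typically pair a CNN video encoder (VGG, ResNet, or C3D features) with a GloVe/word2vec + LSTM (or GRU) text encoder and match the two in a joint embedding space, as in VSE-LSTM. The following is a minimal sketch of that recipe, not any specific paper's model; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Mean-pooled CNN frame features + LSTM sentence encoder,
    compared by cosine similarity in a shared space."""

    def __init__(self, frame_dim=4096, word_dim=300, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(frame_dim, joint_dim)
        self.lstm = nn.LSTM(word_dim, joint_dim, batch_first=True)

    def forward(self, frame_feats, word_embs):
        # frame_feats: (B, T, frame_dim) precomputed CNN features
        # word_embs:   (B, L, word_dim) GloVe/word2vec vectors
        v = self.video_proj(frame_feats.mean(dim=1))  # temporal mean pool
        _, (h, _) = self.lstm(word_embs)
        t = h[-1]                                     # final hidden state
        return F.cosine_similarity(v, t, dim=-1)      # (B,) match scores

model = JointEmbedding()
scores = model(torch.randn(2, 20, 4096), torch.randn(2, 12, 300))
```
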
**Shared Transformer**
  1. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text NeurIPS 2021 [Paper] [Code]
  2. Lavender: Unifying video-language understanding as masked language modeling arXiv 2022 [Paper] [Code]
  3. All in one: Exploring unified video-language pretraining CVPR 2023 [Paper] [Code]
  4. An empirical study of end-to-end video-language transformers with masked visual modeling CVPR 2023 [Paper] [Code]
  5. VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling arXiv 2021 [Paper] [Code]
  6. Vindlu: A recipe for effective video-and-language pretraining CVPR 2023 [Paper] [Code]
  7. Less is more: Clipbert for video-and-language learning via sparse sampling CVPR 2021 [Paper] [Code]
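
Shared (single-stream) transformers such as All-in-one concatenate video and text tokens and run them through one set of transformer weights, so the same self-attention models both intra- and cross-modal interactions. A minimal sketch of the pattern, with illustrative shapes:

```python
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=4,
)

video_tokens = torch.randn(2, 49, d_model)  # e.g. embedded video patches
text_tokens = torch.randn(2, 16, d_model)   # e.g. embedded word pieces

# One joint sequence: every token attends to every other token,
# regardless of modality, through the same shared weights.
joint = torch.cat([video_tokens, text_tokens], dim=1)
fused = encoder(joint)                       # (2, 65, d_model)
```
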
**Stacked Transformer**
  1. HERO: Hierarchical encoder for video+ language omni-representation pretraining EMNLP 2020 [Paper] [Code]
  2. End-to-end generative pretraining for multimodal video captioning arXiv 2022 [Paper]
  3. VLAB: Enhancing video language pre-training by feature adapting and blending arXiv 2023 [Paper]
  4. UniVL: A unified video and language pre-training model for multimodal understanding and generation arXiv 2020 [Paper] [Code]
  5. CLIP meets video captioning: Concept-aware representation learning does matter PRCV 2022 [Paper] [Code]
  6. mPLUG-2: A modularized multimodal foundation model across text, image and video ICML 2023 [Paper] [Code]
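
Stacked transformers such as HERO first encode each modality with its own transformer, then fuse the resulting token streams with a cross-modal module stacked on top. A minimal sketch of the fusion step using cross-attention; shapes are illustrative:

```python
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

video_tokens = torch.randn(2, 49, d_model)  # output of a video encoder
text_tokens = torch.randn(2, 16, d_model)   # output of a text encoder

# Text tokens query the video tokens, pulling in visual context.
fused, _ = cross_attn(query=text_tokens, key=video_tokens,
                      value=video_tokens)   # (2, 16, d_model)
```
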
**Dual Transformer**
  1. CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment ICLR 2023 [Paper] [Code]
  2. CLIP4Clip: An empirical study of clip for end to end video clip retrieval and captioning arXiv 2021 [Paper] [Code]
  3. Video graph transformer for video question answering ECCV 2022 [Paper] [Code]
  4. Contrastive video question answering via video graph transformer TPAMI 2023 [Paper] [Code]
  5. Frozen in time: A joint video and image encoder for end-to-end retrieval ICCV 2021 [Paper] [Code]
  6. A CLIP-Hitchhiker’s guide to long video retrieval arXiv 2022 [Paper] [Code]
  7. ECLIPSE: Efficient long-range video retrieval using sight and sound ECCV 2022 [Paper] [Code]
  8. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding EMNLP 2021 [Paper] [Code]
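
Dual transformers such as CLIP4Clip keep the video and text encoders fully separate and compare their outputs only through a lightweight similarity, so candidate embeddings can be precomputed and retrieval over large galleries stays cheap. A minimal sketch of the mean-pooling variant; names are illustrative:

```python
import torch
import torch.nn.functional as F

def dual_encoder_score(frame_feats, text_feat):
    """CLIP4Clip-style mean-pooling score: average per-frame embeddings,
    then take the cosine similarity with the sentence embedding.

    frame_feats: (num_frames, d) frame embeddings; text_feat: (d,).
    """
    video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    return video_feat @ text_feat  # cosine similarity

# Toy usage with random features standing in for encoder outputs.
frames = torch.randn(12, 512)  # 12 sampled frames, 512-d embeddings
text = torch.randn(512)
print(dual_encoder_score(frames, text).item())
```
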
**LLM-augmented**
  1. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding EMNLP 2023 [Paper] [Code]
  2. VideoChat: Chat-centric video understanding arXiv 2023 [Paper] [Code]
  3. VideoLLM: Modeling video sequence with large language models arXiv 2023 [Paper] [Code]
  4. LLaMA-VID: An image is worth 2 tokens in large language models arXiv 2023 [Paper] [Code]
  5. Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models arXiv 2023 [Paper]
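
LLM-augmented models such as Video-ChatGPT and LLaMA-VID connect a visual encoder to a (typically frozen) LLM through a small projector that maps visual features into the LLM's token embedding space; the projected tokens are then prepended to the prompt. A minimal sketch of that interface pattern, with purely illustrative dimensions:

```python
import torch
import torch.nn as nn

visual_dim, llm_dim = 1024, 4096            # illustrative sizes
projector = nn.Linear(visual_dim, llm_dim)  # the trainable "bridge"

frame_feats = torch.randn(1, 32, visual_dim)  # tokens from a video encoder
prompt_embeds = torch.randn(1, 20, llm_dim)   # embedded text prompt

visual_tokens = projector(frame_feats)           # map into LLM token space
inputs_embeds = torch.cat([visual_tokens, prompt_embeds], dim=1)
# inputs_embeds would then be fed to the LLM through its embedding
# interface (e.g. an inputs_embeds argument) instead of token ids.
```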

#### Model training perspective

**Pre-training**
  1. CLIP2TV: Align, match and distill for video-text retrieval arXiv 2021 [Paper]
  2. Understanding Chinese video and language via contrastive multimodal pre-training arXiv 2021 [Paper]
  3. DeCEMBERT: Learning from noisy instructional videos via dense captions and entropy minimization NAACL 2021 [Paper] [Code]
  4. VideoBERT: A joint model for video and language representation learning ICCV 2019 [Paper] [Code]
  5. Learning video representations using contrastive bidirectional transformer arXiv 2019 [Paper]
  6. MERLOT: Multimodal neural script knowledge models NeurIPS 2021 [Paper] [Code]
  7. Revealing single frame bias for video-and-language learning arXiv 2022 [Paper] [Code]
  8. ActBERT: Learning Global-Local Video-Text Representations CVPR 2020 [Paper] [Code]
**Fine-tuning**
  1. Multilevel language and vision integration for text-to-clip retrieval AAAI 2019 [Paper] [Code]
  2. ST-Adapter: Parameter-efficient image-to-video transfer learning NeurIPS 2022 [Paper] [Code]
  3. Zero-shot video question answering via frozen bidirectional language models NeurIPS 2022 [Paper] [Code]
  4. Attentive Moment Retrieval in Videos SIGIR 2018 [Paper]
  5. To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression arXiv 2018 [Paper]
  6. Cross-Modal Adapter for Text-Video Retrieval arXiv 2022 [Paper] [Code]
  7. AIM: Adapting Image Models for Efficient Video Action Recognition ICLR 2023 [Paper] [Code]
  8. Prompting Visual-Language Models for Efficient Video Understanding ECCV 2022 [Paper] [Code]
  9. Multi-modal Circulant Fusion for Video-to-Language and Backward IJCAI 2018 [Paper]
  10. Long-term temporal convolutions for action recognition arXiv 2016 [Paper] [Code]
  11. READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling AAAI 2024 [Paper] [Code]

#### Data perspective

**Manual collection**
  1. Advancing high-resolution video-language representation with large-scale video transcriptions CVPR 2022 [Paper] [Code]
  2. Howto100M: Learning a text-video embedding by watching hundred million narrated video clips ICCV 2019 [Paper] [Code]
  3. FIBER: Fill-in-the-blanks as a challenging video understanding evaluation framework ACL 2022 [Paper] [Code]
  4. NExT-QA: Next phase of question-answering to explaining temporal actions CVPR 2021 [Paper] [Code]
  5. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense arXiv 2017 [Paper] [Code]
  6. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100 IJCV 2020 [Paper] [Code]
  7. From representation to reasoning: Towards both evidence and commonsense reasoning for video question answering CVPR 2022 [Paper] [Code]
  8. Grounding Action Descriptions in Videos TACL 2013 [Paper]
  9. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments CVPR 2018 [Paper] [Code]
  10. Multimodal Pretraining for Dense Video Captioning AACL-IJCNLP 2020 [Paper] [Code]
  11. QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries NeurIPS 2021 [Paper] [Code]
  12. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research ICCV 2019 [Paper] [Code]
**Data augmentation**
  1. SVFormer: Semi-supervised video transformer for action recognition CVPR 2023 [Paper] [Code]
  2. Semi-supervised video paragraph grounding with contrastive encoder CVPR 2022 [Paper]
  3. Learning temporal action proposals with fewer labels arXiv 2019 [Paper]
  4. Self-supervised learning for semi-supervised temporal action proposal CVPR 2021 [Paper] [Code]
  5. Semi-Supervised Action Recognition with Temporal Contrastive Learning CVPR 2021 [Paper] [Code]
**Manual annotation**
  1. Collecting Highly Parallel Data for Paraphrase Evaluation ACL 2011 [Paper] [Code]
  2. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language CVPR 2016 [Paper] [Code]
  3. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering CVPR 2017 [Paper] [Code]
  4. Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction arXiv 2018 [Paper] [Code]
  5. HMDB: A large video database for human motion recognition ICCV 2011 [Paper] [Code]
  6. The Kinetics Human Action Video Dataset arXiv 2017 [Paper] [Code]
  7. TVSum: Summarizing Web Videos Using Titles CVPR 2015 [Paper] [Code]
  8. TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval ECCV 2020 [Paper] [Code]
  9. COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis CVPR 2019 [Paper] [Code]
  10. Cross-task weakly supervised learning from instructional videos CVPR 2019 [Paper] [Code]
  11. Moments in Time Dataset: one million videos for event understanding CVPR 2019 [Paper] [Code]
**Automatic generation**
  1. Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning ECCV 2024 [Paper] [Code]
  2. Progressive Graph Attention Network for Video Question Answering ACMMM 2021 [Paper] [Code]
  3. The StreetLearn Environment and Dataset arXiv 2019 [Paper] [Code]
  4. VidChapters-7M: Video Chapters at Scale NeurIPS 2023 [Paper] [Code]
  5. Learning Audio-Video Modalities from Image Captions arXiv 2022 [Paper]
  6. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation arXiv 2023 [Paper] [Code]
  7. How2: A Large-scale Dataset for Multimodal Language Understanding NeurIPS 2018 [Paper] [Code]
  8. Learning Video Representations from Textual Web Supervision arXiv 2020 [Paper]
