Reading list for multimodal sequence learning
Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2019
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications, arXiv 2019
Deep Multimodal Representation Learning: A Survey, arXiv 2019
Representation Learning: A Review and New Perspectives, TPAMI 2013
Robustness in Multimodal Learning under Train-Test Modality Mismatch
Calibrating Multimodal Learning, ICML 2023
Learning Multimodal Data Augmentation in Feature Space, ICLR 2023
Multimodal Federated Learning via Contrastive Representation Ensemble, ICLR 2023
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning, NeurIPS 2021 [code]
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, ICCV 2021
Multimodal Contrastive Training for Visual Representation Learning, CVPR 2021
Parameter Efficient Multimodal Transformers for Video Representation Learning, ICLR 2021
Viewmaker Networks: Learning Views for Unsupervised Representation Learning, ICLR 2021, [code]
Representation Learning for Sequence Data with Deep Autoencoding Predictive Components, ICLR 2021
Improving Transformation Invariance in Contrastive Representation Learning, ICLR 2021
Active Contrastive Learning of Audio-Visual Video Representations, ICLR 2021
i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning, ICLR 2021
Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections, ICLR 2021
Adaptive Transformers for Learning Multimodal Representations, ACL 2020
Learning Transferable Visual Models From Natural Language Supervision, ICML 2021 [blog] [code]
12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020 [code]
Watching the World Go By: Representation Learning from Unlabeled Videos, arXiv 2020
Contrastive Multiview Coding, ECCV 2020 [code]
Representation Learning with Contrastive Predictive Coding, arXiv 2019 [code]
Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations, EMNLP 2019
Visual Concept-Metaconcept Learning, NeurIPS 2019 [code]
ViCo: Word Embeddings from Visual Co-occurrences, ICCV 2019 [code]
Multi-Task Learning of Hierarchical Vision-Language Representation, CVPR 2019
Learning Factorized Multimodal Representations, ICLR 2019 [code]
Learning Video Representations using Contrastive Bidirectional Transformer, arXiv 2019
OmniNet: A Unified Architecture for Multi-modal Multi-task Learning, arXiv 2019 [code]
Learning Representations by Maximizing Mutual Information Across Views, arXiv 2019 [code]
Do Neural Network Cross-Modal Mappings Really Bridge Modalities?, ACL 2018
Learning Robust Visual-Semantic Embeddings, ICCV 2017
Deep Multimodal Representation Learning from Temporal Data, CVPR 2017
Combining Language and Vision with a Multimodal Skip-gram Model, NAACL 2015
Learning Grounded Meaning Representations with Autoencoders, ACL 2014
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS 2014
Multimodal Learning with Deep Boltzmann Machines, JMLR 2014
DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013
Multimodal Deep Learning, ICML 2011
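Many of the contrastive entries above (Contrastive Predictive Coding, Contrastive Multiview Coding, CrossCLR, CLIP) build on the InfoNCE objective: embeddings of paired samples from two views or modalities are pulled together while in-batch negatives are pushed apart. A minimal NumPy sketch for orientation only; the function names and the temperature value are illustrative, not taken from any of the papers:

```python
import numpy as np

def _cross_entropy(logits):
    """Row-wise cross-entropy where the correct class for row i is column i."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    za, zb: (N, D) arrays from two modalities; row i of za is paired
    with row i of zb, and every other row acts as an in-batch negative.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (N, N) scaled cosine similarities
    # average the a->b and b->a directions, as in CLIP-style training
    return 0.5 * (_cross_entropy(logits) + _cross_entropy(logits.T))
```

With perfectly matched pairs the loss approaches zero; with shuffled pairings it grows toward the wrong-positive penalty, which is the signal the methods above optimize.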
Provable Dynamic Fusion for Low-Quality Multimodal Data, ICML 2023 [code]
Deep Multimodal Sequence Fusion by Regularized Expressive Representation Distillation, TMM 2022 [code]
Pace-adaptive and Noise-resistant Contrastive Learning for Multimodal Feature Fusion, TMM 2023
Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion, EMNLP 2021 [code]
Attention Bottlenecks for Multimodal Fusion, NeurIPS 2021
Contrastive Multimodal Fusion with TupleInfoNCE, ICCV 2021
Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning, ICLR 2021
Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization, ICLR 2021
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering, EMNLP 2020
VolTAGE: Volatility Forecasting via Text Audio Fusion with Graph Convolution Networks for Earnings Calls, EMNLP 2020
Dual Low-Rank Multimodal Fusion, EMNLP Findings 2020
Trusted Multi-View Classification, ICLR 2021 [code]
Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis, ICDM 2020
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies, NeurIPS 2020 [code]
Deep Multimodal Fusion by Channel Exchanging, NeurIPS 2020 [code]
What Makes Training Multi-Modal Classification Networks Hard?, CVPR 2020
DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis, IJCAI 2019 [code]
Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification, IEEE TNNLS 2019 [code]
MFAS: Multimodal Fusion Architecture Search, CVPR 2019
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR 2019 [code]
Dynamic Fusion for Multimodal Data, arXiv 2019
Unifying and merging well-trained deep neural networks for inference stage, IJCAI 2018 [code]
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 [code]
Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018 [code]
Tensor Fusion Network for Multimodal Sentiment Analysis, EMNLP 2017 [code]
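The fusion entries above span a spectrum from full tensor products to their low-rank approximations. Tensor Fusion (EMNLP 2017) appends a constant 1 to each modality vector and takes the outer product across modalities, so the fused tensor contains all unimodal, bimodal, and higher-order interaction terms; Efficient Low-rank Multimodal Fusion (ACL 2018) then approximates this product with modality-specific factors to avoid the exponential blow-up. A toy NumPy sketch of the full tensor product (illustrative, not the authors' code):

```python
import numpy as np

def tensor_fusion(z_list):
    """Tensor Fusion: append a constant 1 to each modality vector, then
    take the outer product across modalities. The appended 1s preserve
    the unimodal and lower-order interaction terms inside the fused tensor.
    """
    fused = np.array([1.0])
    for z in z_list:
        z1 = np.concatenate([z, [1.0]])       # [z; 1]
        fused = np.outer(fused, z1).ravel()   # accumulate the outer product
    return fused
```

For modality dimensions d1, ..., dk the fused vector has (d1+1)(d2+1)...(dk+1) entries, which is exactly why the low-rank and polynomial-pooling variants in this section matter in practice.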
The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation, ICLR 2023 [code]
Post-hoc Concept Bottleneck Models, ICLR 2023 [code]
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks, ICLR 2023 [code]
Identifiability Results for Multimodal Contrastive Learning, ICLR 2023 [code]
MultiViz: Towards Visualizing and Understanding Multimodal Models, ICLR 2023 [code]
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, ICCV 2021 [code]
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!, EMNLP 2020
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, TACL 2021
Blindfold Baselines for Embodied QA, NIPS 2018 Visually-Grounded Interaction and Language Workshop
Analyzing the Behavior of Visual Question Answering Models, EMNLP 2016
PaLI: A Jointly-Scaled Multilingual Language-Image Model, ICLR 2023
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention, ICLR 2023 [code]
Composing Ensembles of Pre-trained Models via Iterative Consensus, ICLR 2023
Multi-stage Pre-training over Simplified Multimodal Pre-training Models, ACL 2021
Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021 [code]
Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 [code]
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, EMNLP 2020 [code]
Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, arXiv 2021
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]
VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arXiv 2019
M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, arXiv 2019 [code]
VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019 [code]
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, arXiv 2021 [code]
Self-supervised Representation Learning with Relative Predictive Coding, ICLR 2021
Exploring Balanced Feature Spaces for Representation Learning, ICLR 2021
There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge, CVPR 2021 [code] [homepage]
Self-Supervised Learning by Cross-Modal Audio-Video Clustering, NeurIPS 2020 [code]
Self-Supervised MultiModal Versatile Networks, NeurIPS 2020 [code]
Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision, NeurIPS 2020 [code]
Self-Supervised Learning from Web Data for Multimodal Retrieval, arXiv 2019
Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces, CVPR 2017
Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems, 2016
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, NeurIPS 2020 [code]
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML 2020 [code]
Improving Contrastive Learning by Visualizing Feature Transformation, ICCV 2021 [code]
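"Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere" offers two diagnostics that apply to most self-supervised methods in this section: alignment (positive pairs land close together) and uniformity (embeddings spread out over the unit sphere). A small NumPy sketch of the two metrics, assuming inputs are already L2-normalized; `alpha` and `t` follow the paper's defaults, the function names are mine:

```python
import numpy as np

def alignment(x, y, alpha=2):
    """Mean distance between positive pairs; lower means better aligned.

    x, y: (N, D) arrays of L2-normalized embeddings, row i paired with row i.
    """
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all distinct pairs;
    lower (more negative) means embeddings are more uniformly spread.
    """
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)  # each unordered pair once
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

A fully collapsed representation scores 0 on uniformity (the worst value), so the two metrics together expose the collapse failure modes that contrastive losses implicitly guard against.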
Grounding Language Models to Images for Multimodal Inputs and Outputs, ICML 2023 [code]
Retrieval-Augmented Multimodal Language Modeling, ICML 2023 [webpage]
Make-A-Video: Text-to-Video Generation without Text-Video Data, ICLR 2023 [website]
Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation, ICLR 2023 [code]
Unified Discrete Diffusion for Simultaneous Vision-Language Generation, ICLR 2023 [code]
MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises, ICLR 2023
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation, CVPR 2023 [code]
Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models, ICLR 2021
Generalized Multimodal ELBO, ICLR 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021
Few-shot Video-to-Video Synthesis, NeurIPS 2019 [code]
Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018 [code1] [code2]
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR 2018
The Multi-Entity Variational Autoencoder, NeurIPS 2017
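"Multimodal Generative Models for Scalable Weakly-Supervised Learning" combines the per-modality Gaussian posteriors of a multimodal VAE with a product of experts: precisions add, the joint mean is the precision-weighted average of the expert means, and missing modalities simply drop out of the product. A hedged NumPy sketch of that combination rule, including the standard-normal prior expert (function name is mine):

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Combine per-modality Gaussian posteriors N(mu_i, var_i) with a
    product of experts. A standard-normal prior expert N(0, I) is always
    included, so the result is well-defined even with no modalities present.
    """
    mus = [np.zeros_like(mus[0])] + list(mus)          # prepend prior mean 0
    logvars = [np.zeros_like(logvars[0])] + list(logvars)  # prior log-var 0
    precisions = [np.exp(-lv) for lv in logvars]       # 1 / var per expert
    total_precision = np.sum(precisions, axis=0)       # precisions add
    joint_var = 1.0 / total_precision
    joint_mu = joint_var * np.sum(
        [p * m for p, m in zip(precisions, mus)], axis=0
    )  # precision-weighted mean
    return joint_mu, np.log(joint_var)
```

Two unit-variance experts with opposite means cancel to a joint mean of 0, and each added expert tightens the joint variance, which is the behavior the MVAE exploits when modalities are observed incrementally.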
Data Poisoning Attacks Against Multimodal Encoders, ICML 2023 [code]
Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models, NeurIPS Workshop on Visually Grounded Interaction and Language 2018
Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning, ACL 2018 [code]
Fooling Vision and Language Models Despite Localization and Attention Mechanism, CVPR 2018
Multimodal Analogical Reasoning over Knowledge Graphs, ICLR 2023 [code]
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, ICLR 2023 [code]
MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation, ACL 2021
Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks, ACL 2021
Dual Graph Convolutional Networks for Aspect-based Sentiment Analysis, ACL 2021
Multi-Label Few-Shot Learning for Aspect Category Detection, ACL 2021
Directed Acyclic Graph Network for Conversational Emotion Recognition, ACL 2021
Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data, ACL 2021
A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis, ACL Findings 2021
HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents, ICLR 2021
Multimodal Motion Prediction with Stacked Transformers, CVPR 2021 [code]
Social NCE: Contrastive Learning of Socially-aware Motion Representations, ICCV 2021 [code]
The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction, ECCV 2020 [code]
An Extensible Multi-modal Multi-task Object Dataset with Materials, ICLR 2023 [download]
A Large-Scale Chinese Multimodal NER Dataset with Speech Clues, ACL 2021
A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks, ACL 2020
CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality, ACL 2020 [code]
CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French, EMNLP 2020 [download]
YouTube-8: Predicting Emotions in User-Generated Videos [download] [webpage]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, arXiv 2021