Reading list for multimodal sequence learning
Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2019
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications, arXiv 2019
Deep Multimodal Representation Learning: A Survey, arXiv 2019
Representation Learning: A Review and New Perspectives, TPAMI 2013
Robustness in Multimodal Learning under Train-Test Modality Mismatch
Calibrating Multimodal Learning, ICML 2023
Learning Multimodal Data Augmentation in Feature Space, ICLR 2023
Multimodal Federated Learning via Contrastive Representation Ensemble, ICLR 2023
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning, NeurIPS 2021 [code]
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, ICCV 2021
Multimodal Contrastive Training for Visual Representation Learning, CVPR 2021
Parameter Efficient Multimodal Transformers for Video Representation Learning, ICLR 2021
Viewmaker Networks: Learning Views for Unsupervised Representation Learning, ICLR 2021, [code]
Representation Learning for Sequence Data with Deep Autoencoding Predictive Components, ICLR 2021
Improving Transformation Invariance in Contrastive Representation Learning, ICLR 2021
Active Contrastive Learning of Audio-Visual Video Representations, ICLR 2021
i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning, ICLR 2021
Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections, ICLR 2021
Adaptive Transformers for Learning Multimodal Representations, ACL 2020
Learning Transferable Visual Models From Natural Language Supervision, ICML 2021 [blog] [code]
12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020 [code]
Watching the World Go By: Representation Learning from Unlabeled Videos, arXiv 2020
Contrastive Multiview Coding, ECCV 2020 [code]
Representation Learning with Contrastive Predictive Coding, arXiv 2019 [code]
Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations, EMNLP 2019
Visual Concept-Metaconcept Learning, NeurIPS 2019 [code]
ViCo: Word Embeddings from Visual Co-occurrences, ICCV 2019 [code]
Multi-Task Learning of Hierarchical Vision-Language Representation, CVPR 2019
Learning Factorized Multimodal Representations, ICLR 2019 [code]
Learning Video Representations using Contrastive Bidirectional Transformer, arXiv 2019
OmniNet: A Unified Architecture for Multi-modal Multi-task Learning, arXiv 2019 [code]
Learning Representations by Maximizing Mutual Information Across Views, arXiv 2019 [code]
Do Neural Network Cross-Modal Mappings Really Bridge Modalities?, ACL 2018
Learning Robust Visual-Semantic Embeddings, ICCV 2017
Deep Multimodal Representation Learning from Temporal Data, CVPR 2017
Combining Language and Vision with a Multimodal Skip-gram Model, NAACL 2015
Learning Grounded Meaning Representations with Autoencoders, ACL 2014
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS 2014
Multimodal Learning with Deep Boltzmann Machines, JMLR 2014
DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013
Multimodal Deep Learning, ICML 2011
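Many of the contrastive entries above (Contrastive Predictive Coding, Contrastive Multiview Coding, CrossCLR, CLIP) build on the InfoNCE objective: embeddings of paired samples from two views or modalities are pulled together while in-batch negatives are pushed apart. A minimal NumPy sketch for orientation only; the function names and the temperature value are illustrative, not taken from any of the papers:

```python
import numpy as np

def _cross_entropy(logits):
    """Row-wise cross-entropy where the correct class for row i is column i."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    za, zb: (N, D) arrays from two modalities; row i of za is paired
    with row i of zb, and every other row acts as an in-batch negative.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (N, N) scaled cosine similarities
    # average the a->b and b->a directions, as in CLIP-style training
    return 0.5 * (_cross_entropy(logits) + _cross_entropy(logits.T))
```

With perfectly matched pairs the loss approaches zero; with shuffled pairings it grows toward the wrong-positive penalty, which is the signal the methods above optimize.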
Provable Dynamic Fusion for Low-Quality Multimodal Data, ICML 2023 [code]
Deep Multimodal Sequence Fusion by Regularized Expressive Representation Distillation, TMM 2022 [code]
Pace-adaptive and Noise-resistant Contrastive Learning for Multimodal Feature Fusion, TMM 2023
Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion, EMNLP 2021 [code]
Attention Bottlenecks for Multimodal Fusion, NeurIPS 2021
Contrastive Multimodal Fusion with TupleInfoNCE, ICCV 2021
Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning, ICLR 2021
Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization, ICLR 2021
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering, EMNLP 2020
VolTAGE: Volatility Forecasting via Text Audio Fusion with Graph Convolution Networks for Earnings Calls, EMNLP 2020
Dual Low-Rank Multimodal Fusion, EMNLP Findings 2020
Trusted Multi-View Classification, ICLR 2021 [code]
Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis, ICDM 2020
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies, NeurIPS 2020 [code]
Deep Multimodal Fusion by Channel Exchanging, NeurIPS 2020 [code]
What Makes Training Multi-Modal Classification Networks Hard?, CVPR 2020
DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis, IJCAI 2019 [code]
Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification, IEEE TNNLS 2019 [code]
MFAS: Multimodal Fusion Architecture Search, CVPR 2019
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR 2019 [code]
Dynamic Fusion for Multimodal Data, arXiv 2019
Unifying and merging well-trained deep neural networks for inference stage, IJCAI 2018 [code]
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 [code]
Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018 [code]
Tensor Fusion Network for Multimodal Sentiment Analysis, EMNLP 2017 [code]
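The fusion entries above span a spectrum from full tensor products to their low-rank approximations. Tensor Fusion (EMNLP 2017) appends a constant 1 to each modality vector and takes the outer product across modalities, so the fused tensor contains all unimodal, bimodal, and higher-order interaction terms; Efficient Low-rank Multimodal Fusion (ACL 2018) then approximates this product with modality-specific factors to avoid the exponential blow-up. A toy NumPy sketch of the full tensor product (illustrative, not the authors' code):

```python
import numpy as np

def tensor_fusion(z_list):
    """Tensor Fusion: append a constant 1 to each modality vector, then
    take the outer product across modalities. The appended 1s preserve
    the unimodal and lower-order interaction terms inside the fused tensor.
    """
    fused = np.array([1.0])
    for z in z_list:
        z1 = np.concatenate([z, [1.0]])       # [z; 1]
        fused = np.outer(fused, z1).ravel()   # accumulate the outer product
    return fused
```

For modality dimensions d1, ..., dk the fused vector has (d1+1)(d2+1)...(dk+1) entries, which is exactly why the low-rank and polynomial-pooling variants in this section matter in practice.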
The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation, ICLR 2023 [code]
Post-hoc Concept Bottleneck Models, ICLR 2023 [code]
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks, ICLR 2023 [code]
Identifiability Results for Multimodal Contrastive Learning, ICLR 2023 [code]
MultiViz: Towards Visualizing and Understanding Multimodal Models, ICLR 2023 [code]
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, ICCV 2021 [code]
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!, EMNLP 2020
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, TACL 2021
Blindfold Baselines for Embodied QA, NIPS 2018 Visually-Grounded Interaction and Language Workshop
Analyzing the Behavior of Visual Question Answering Models, EMNLP 2016
PaLI: A Jointly-Scaled Multilingual Language-Image Model, ICLR 2023
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention, ICLR 2023 [code]
Composing Ensembles of Pre-trained Models via Iterative Consensus, ICLR 2023
Multi-stage Pre-training over Simplified Multimodal Pre-training Models, ACL 2021
Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021 [code]
Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 [code]
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, EMNLP 2020 [code]
Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, arXiv 2021
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]
VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arXiv 2019
M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, arXiv 2019 [code]
VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019 [code]
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, arXiv 2021 [code]
Self-supervised Representation Learning with Relative Predictive Coding, ICLR 2021
Exploring Balanced Feature Spaces for Representation Learning, ICLR 2021
There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge, CVPR 2021 [code] [homepage]
Self-Supervised Learning by Cross-Modal Audio-Video Clustering, NeurIPS 2020 [code]
Self-Supervised MultiModal Versatile Networks, NeurIPS 2020 [code]
Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision, NeurIPS 2020 [code]
Self-Supervised Learning from Web Data for Multimodal Retrieval, arXiv 2019
Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces, CVPR 2017
Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems, 2016
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, NeurIPS 2020 [code]
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML 2020 [code]
Improving Contrastive Learning by Visualizing Feature Transformation, ICCV 2021 [code]
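"Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere" offers two diagnostics that apply to most self-supervised methods in this section: alignment (positive pairs land close together) and uniformity (embeddings spread out over the unit sphere). A small NumPy sketch of the two metrics, assuming inputs are already L2-normalized; `alpha` and `t` follow the paper's defaults, the function names are mine:

```python
import numpy as np

def alignment(x, y, alpha=2):
    """Mean distance between positive pairs; lower means better aligned.

    x, y: (N, D) arrays of L2-normalized embeddings, row i paired with row i.
    """
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all distinct pairs;
    lower (more negative) means embeddings are more uniformly spread.
    """
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)  # each unordered pair once
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

A fully collapsed representation scores 0 on uniformity (the worst value), so the two metrics together expose the collapse failure modes that contrastive losses implicitly guard against.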
Grounding Language Models to Images for Multimodal Inputs and Outputs, ICML 2023 [code]
Retrieval-Augmented Multimodal Language Modeling, ICML 2023 [webpage]
Make-A-Video: Text-to-Video Generation without Text-Video Data, ICLR 2023 [website]
Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation, ICLR 2023 [code]
Unified Discrete Diffusion for Simultaneous Vision-Language Generation, ICLR 2023 [code]
MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises, ICLR 2023
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation, CVPR 2023 [code]
Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models, ICLR 2021
Generalized Multimodal ELBO, ICLR 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021
Few-shot Video-to-Video Synthesis, NeurIPS 2019 [code]
Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018 [code1] [code2]
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR 2018
The Multi-Entity Variational Autoencoder, NeurIPS 2017
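"Multimodal Generative Models for Scalable Weakly-Supervised Learning" combines the per-modality Gaussian posteriors of a multimodal VAE with a product of experts: precisions add, the joint mean is the precision-weighted average of the expert means, and missing modalities simply drop out of the product. A hedged NumPy sketch of that combination rule, including the standard-normal prior expert (function name is mine):

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Combine per-modality Gaussian posteriors N(mu_i, var_i) with a
    product of experts. A standard-normal prior expert N(0, I) is always
    included, so the result is well-defined even with no modalities present.
    """
    mus = [np.zeros_like(mus[0])] + list(mus)          # prepend prior mean 0
    logvars = [np.zeros_like(logvars[0])] + list(logvars)  # prior log-var 0
    precisions = [np.exp(-lv) for lv in logvars]       # 1 / var per expert
    total_precision = np.sum(precisions, axis=0)       # precisions add
    joint_var = 1.0 / total_precision
    joint_mu = joint_var * np.sum(
        [p * m for p, m in zip(precisions, mus)], axis=0
    )  # precision-weighted mean
    return joint_mu, np.log(joint_var)
```

Two unit-variance experts with opposite means cancel to a joint mean of 0, and each added expert tightens the joint variance, which is the behavior the MVAE exploits when modalities are observed incrementally.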
Data Poisoning Attacks Against Multimodal Encoders, ICML 2023 [code]
Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models, NeurIPS Workshop on Visually Grounded Interaction and Language 2018
Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning, ACL 2018 [code]
Fooling Vision and Language Models Despite Localization and Attention Mechanism, CVPR 2018
Multimodal Analogical Reasoning over Knowledge Graphs, ICLR 2023 [code]
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, ICLR 2023 [code]
MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation, ACL 2021
Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks, ACL 2021
Dual Graph Convolutional Networks for Aspect-based Sentiment Analysis, ACL 2021
Multi-Label Few-Shot Learning for Aspect Category Detection, ACL 2021
Directed Acyclic Graph Network for Conversational Emotion Recognition, ACL 2021
Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data, ACL 2021
A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis, ACL Findings 2021
HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents, ICLR 2021
Multimodal Motion Prediction with Stacked Transformers, CVPR 2021 [code]
Social NCE: Contrastive Learning of Socially-aware Motion Representations, ICCV 2021 [code]
The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction, ECCV 2020 [code]
An Extensible Multi-modal Multi-task Object Dataset with Materials, ICLR 2023 [download]
A Large-Scale Chinese Multimodal NER Dataset with Speech Clues, ACL 2021
A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks, ACL 2020
CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality, ACL 2020 [code]
CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French, EMNLP 2020 [download]
YouTube-8: Predicting Emotions in User-Generated Videos [download] [webpage]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, arXiv 2021