Redaimao/awesome-multimodal-sequence-learning
Awesome-Multimodal-Sequence-Learning

A reading list for multimodal sequence learning.

Survey Papers

Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2019

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications, arXiv 2019

Deep Multimodal Representation Learning: A Survey, arXiv 2019

Representation Learning: A Review and New Perspectives, TPAMI 2013

Research Areas

Representation Learning

Robustness in Multimodal Learning under Train-Test Modality Mismatch

Calibrating Multimodal Learning, ICML 2023

Learning Multimodal Data Augmentation in Feature Space, ICLR 2023

Multimodal Federated Learning via Contrastive Representation Ensemble, ICLR 2023

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning, NeurIPS 2021, [code]

CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, ICCV 2021

Multimodal Contrastive Training for Visual Representation Learning, CVPR 2021

Parameter Efficient Multimodal Transformers for Video Representation Learning, ICLR 2021

Viewmaker Networks: Learning Views for Unsupervised Representation Learning, ICLR 2021, [code]

Representation Learning for Sequence Data with Deep Autoencoding Predictive Components, ICLR 2021

Improving Transformation Invariance in Contrastive Representation Learning, ICLR 2021

Active Contrastive Learning of Audio-Visual Video Representations, ICLR 2021

i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning, ICLR 2021

Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections, ICLR 2021

Adaptive Transformers for Learning Multimodal Representations, ACL 2020

Learning Transferable Visual Models From Natural Language Supervision, ICML 2021 [blog] [code]

12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020 [code]

Watching the World Go By: Representation Learning from Unlabeled Videos, arXiv 2020

Contrastive Multiview Coding, ECCV 2020 [code]

Representation Learning with Contrastive Predictive Coding, arXiv 2019 [code]

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations, EMNLP 2019

Visual Concept-Metaconcept Learning, NeurIPS 2019 [code]

ViCo: Word Embeddings from Visual Co-occurrences, ICCV 2019 [code]

Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations, CVPR 2019

Multi-Task Learning of Hierarchical Vision-Language Representation, CVPR 2019

Learning Factorized Multimodal Representations, ICLR 2019 [code]

Learning Video Representations using Contrastive Bidirectional Transformer, arXiv 2019

OmniNet: A Unified Architecture for Multi-modal Multi-task Learning, arXiv 2019 [code]

Learning Representations by Maximizing Mutual Information Across Views, arXiv 2019 [code]

A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks, ICML 2018

Do Neural Network Cross-Modal Mappings Really Bridge Modalities?, ACL 2018

Learning Robust Visual-Semantic Embeddings, ICCV 2017

Deep Multimodal Representation Learning from Temporal Data, CVPR 2017

Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations, COLING 2016

Combining Language and Vision with a Multimodal Skip-gram Model, NAACL 2015

Learning Grounded Meaning Representations with Autoencoders, ACL 2014

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS 2014

Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013

Multimodal Deep Learning, ICML 2011
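Several entries above (e.g., Learning Transferable Visual Models From Natural Language Supervision, Representation Learning with Contrastive Predictive Coding) align modalities with an InfoNCE-style contrastive objective. A minimal NumPy sketch of the symmetric batch loss, with hypothetical embedding shapes:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a positive pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature     # (batch, batch) similarity matrix
    labels = np.arange(len(logits))        # matching pairs lie on the diagonal

    def xent(l):
        # cross-entropy of each row against its diagonal target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairs are pushed apart; the papers differ mainly in how views/modalities and negatives are constructed around this core.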

Multimodal Fusion

Provable Dynamic Fusion for Low-Quality Multimodal Data, ICML 2023, [code]

Deep multimodal sequence fusion by regularized expressive representation distillation, TMM 2022, [code]

Pace-adaptive and Noise-resistant Contrastive Learning for Multimodal Feature Fusion, TMM 2023

Unimodal and Crossmodal Refinement Network for Multimodal Sequence Fusion, EMNLP 2021, [code]

Attention Bottlenecks for Multimodal Fusion, NeurIPS 2021

Contrastive Multimodal Fusion with TupleInfoNCE, ICCV 2021

Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning, ICLR 2021

Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization, ICLR 2021

MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering, EMNLP 2020

VolTAGE: Volatility Forecasting via Text Audio Fusion with Graph Convolution Networks for Earnings Calls, EMNLP 2020

Dual Low-Rank Multimodal Fusion, EMNLP Findings 2020

Trusted Multi-View Classification, ICLR 2021 [code]

Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis, ICDM 2020

Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies, NeurIPS 2020 [code]

Deep Multimodal Fusion by Channel Exchanging, NeurIPS 2020 [code]

What Makes Training Multi-Modal Classification Networks Hard?, CVPR 2020

DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis, IJCAI 2019 [code]

Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019

XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification, IEEE TNNLS 2019 [code]

MFAS: Multimodal Fusion Architecture Search, CVPR 2019

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR 2019 [code]

Dynamic Fusion for Multimodal Data, arXiv 2019

Unifying and merging well-trained deep neural networks for inference stage, IJCAI 2018 [code]

Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 [code]

Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018 [code]

Tensor Fusion Network for Multimodal Sentiment Analysis, EMNLP 2017 [code]

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework, AAAI 2015
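The Tensor Fusion Network above (and its low-rank follow-up) builds a joint representation from the outer product of per-modality features, each augmented with a constant 1 so that unimodal terms survive alongside cross-modal products. A minimal bimodal sketch with made-up feature sizes:

```python
import numpy as np

def tensor_fusion(z_text, z_audio):
    """Bimodal tensor fusion: outer product of 1-augmented feature vectors.

    Appending a constant 1 to each modality means the resulting tensor
    contains the unimodal features themselves as well as all pairwise
    cross-modal interaction terms.
    """
    zt = np.append(z_text, 1.0)      # (dt + 1,)
    za = np.append(z_audio, 1.0)     # (da + 1,)
    return np.outer(zt, za).ravel()  # flattened (dt+1)*(da+1) fusion vector

# hypothetical feature dimensions
fused = tensor_fusion(np.random.randn(128), np.random.randn(32))
assert fused.shape == (129 * 33,)
```

The fused dimension grows multiplicatively with the number of modalities, which is exactly what Efficient Low-rank Multimodal Fusion avoids by factorizing the fusion weights instead of materializing the tensor.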

Analysis of Multimodal Models

The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation, ICLR 2023 [code]

Post-hoc Concept Bottleneck Models, ICLR 2023, [code]

CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks, ICLR 2023, [code]

Identifiability Results for Multimodal Contrastive Learning, ICLR 2023 [code]

MultiViz: Towards Visualizing and Understanding Multimodal Models, ICLR 2023 [code]

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, ICCV 2021, [code]

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!, EMNLP 2020

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, TACL 2021

Blindfold Baselines for Embodied QA, NIPS 2018 Visually-Grounded Interaction and Language Workshop

Analyzing the Behavior of Visual Question Answering Models, EMNLP 2016

Multimodal Pretraining

PaLI: A Jointly-Scaled Multilingual Language-Image Model, ICLR 2023

HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention, ICLR 2023 [code]

Composing Ensembles of Pre-trained Models via Iterative Consensus

Multi-stage Pre-training over Simplified Multimodal Pre-training Models, ACL 2021

Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021 [code]

Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 [code]

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, EMNLP 2020 [code]

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, arXiv 2021

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arXiv 2019

M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019

VL-BERT: Pre-training of Generic Visual-Linguistic Representations, arXiv 2019 [code]

VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019 [code]

Self-supervised Learning

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, arXiv 2021, [code]

Self-supervised Representation Learning with Relative Predictive Coding, ICLR 2021

Exploring Balanced Feature Spaces for Representation Learning, ICLR 2021

There Is More Than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking With Sound by Distilling Multimodal Knowledge, CVPR 2021, [code], [homepage]

Self-Supervised Learning by Cross-Modal Audio-Video Clustering, NeurIPS 2020 [code]

Self-Supervised MultiModal Versatile Networks, NeurIPS 2020 [code]

Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision, NeurIPS 2020 [code]

Self-Supervised Learning from Web Data for Multimodal Retrieval, arXiv 2019

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces, CVPR 2017

Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems, 2016

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, NeurIPS 2020, [code]

Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML 2020, [code]

Improving Contrastive Learning by Visualizing Feature Transformation, ICCV 2021, [code]

Generative Multimodal Models

Grounding Language Models to Images for Multimodal Inputs and Outputs, ICML 2023 [code]

Retrieval-Augmented Multimodal Language Modeling, ICML 2023 [webpage]

Make-A-Video: Text-to-Video Generation without Text-Video Data, ICLR 2023 [website]

Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation, ICLR 2023 [code]

Unified Discrete Diffusion for Simultaneous Vision-Language Generation, ICLR 2023 [code]

MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises, ICLR 2023

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation, CVPR 2023 [code]

Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models, ICLR 2021

Generalized Multimodal ELBO, ICLR 2021

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021

Few-shot Video-to-Video Synthesis, NeurIPS 2019 [code]

Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018 [code1] [code2]

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR 2018

The Multi-Entity Variational Autoencoder, NeurIPS 2017
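The multimodal VAE entries above (Multimodal Generative Models for Scalable Weakly-Supervised Learning, Generalized Multimodal ELBO, MMVAE+) optimize variants of a multimodal evidence lower bound. As a rough common denominator, for modalities $x_{1:M}$ with a shared latent $z$:

```latex
\mathcal{L}(x_{1:M}) \;=\;
\mathbb{E}_{q_\phi(z \mid x_{1:M})}\!\left[\sum_{m=1}^{M} \log p_\theta(x_m \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x_{1:M}) \,\big\|\, p(z)\right)
```

The papers differ mainly in how the joint posterior $q_\phi(z \mid x_{1:M})$ is composed from unimodal encoders (e.g., product vs. mixture of experts) and in how missing modalities are handled at training and inference time.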

Multimodal Adversarial Attacks

Data Poisoning Attacks Against Multimodal Encoders, ICML 2023 [code]

Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models, NeurIPS Workshop on Visually Grounded Interaction and Language 2018

Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning, ACL 2018 [code]

Fooling Vision and Language Models Despite Localization and Attention Mechanism, CVPR 2018

Multimodal Reasoning

Multimodal Analogical Reasoning over Knowledge Graphs, ICLR 2023 [code]

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, ICLR 2023 [code]

Research Tasks

Sentiment and Emotion Analysis

MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation, ACL 2021

Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks, ACL 2021

Dual Graph Convolutional Networks for Aspect-based Sentiment Analysis, ACL 2021

Multi-Label Few-Shot Learning for Aspect Category Detection, ACL 2021

Directed Acyclic Graph Network for Conversational Emotion Recognition, ACL 2021

CTFN: Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion Network, ACL 2021

Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data, ACL 2021

A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis, ACL Findings 2021

Trajectory and Motion Forecasting

HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents, ICLR 2021

Multimodal Motion Prediction with Stacked Transformers, CVPR 2021 [code]

Social NCE: Contrastive Learning of Socially-aware Motion Representations, ICCV 2021, [code]

The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction, ECCV 2020, [code]

Datasets

An Extensible Multi-modal Multi-task Object Dataset with Materials, ICLR 2023 [download]

A Large-Scale Chinese Multimodal NER Dataset with Speech Clues, ACL 2021

A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks, ACL 2020

CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality, ACL 2020, [code]

CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French, EMNLP 2020 [download]

YouTube-8: Predicting Emotions in User-Generated Videos, [download], [webpage]

LAION: LAION-400M

Tutorials and blogs

Deep Learning 2021 course - NYU

Lilian Weng's blog

Self-supervised learning paper list
