A comprehensive survey on Multimodal Models in 3D
- Classification
- Detection
- Segmentation
- Tracking
- Localization
- Retrieval
- Scene Understanding
- Editing and Manipulation
- Generation
- Grounding
- Captioning
- Pose Estimation
- Question Answering
- Pretraining
- Matching
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
ClipFace: Text-guided Editing of Textured 3D Morphable Models | | | | 2023 |
CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout | | | | 2023 |
Volumetric Disentanglement for 3D Scene Manipulation | | | | 2022 |
Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions | | | | 2023 |
LADIS: Language Disentanglement for 3D Shape Editing | | | | 2022 |
Local 3D Editing via 3D Distillation of CLIP Knowledge | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
3D Multi-Object Tracking Using Graph Neural Networks with Cross-Edge Modality Attention | | | | 2022 |
LATTE: LAnguage Trajectory TransformEr | | | | 2022 |
3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking | | | | 2023 |
EagerMOT: 3D Multi-Object Tracking via Sensor Fusion | | | | 2021 |
MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Complementary Pseudo Multimodal Feature for Point Cloud Anomaly Detection | | | | 2023 |
EasyNet: An Easy Network for 3D Industrial Anomaly Detection | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding | | | | 2022 |
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding | | | | 2022 |
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance | | | | 2023 |
NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | | | | 2023 |
Multi-View Transformer for 3D Visual Grounding | | | | 2022 |
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection | | | | 2022 |
3D VR Sketch Guided 3D Shape Prototyping and Exploration | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
AGG-Net: Attention Guided Gated-convolutional Network for Depth Image Completion | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
TeSTNeRF: Text-Driven 3D Style Transfer via Cross-Modal Learning | | | | 2023 |
TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition | | | | 2022 |
HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks | | | | 2023 |
CLIP3Dstyler: Language Guided 3D Arbitrary Neural Style Transfer | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Towards Label-free Scene Understanding by Vision Foundation Models | | | | 2023 |
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP | | | | 2023 |
Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction | | | | 2023 |
Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding | | | | 2023 |
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding | | | | 2023 |
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models | | | | 2022 |
OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation | | | | 2023 |
TextDeformer: Geometry Manipulation using Text Guidance | | | | 2023 |
Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Democratising 2D Sketch to 3D Shape Retrieval Through Pivoting | | | | 2023 |
RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval | | | | 2023 |
TextANIMAR: Text-based 3D Animal Fine-Grained Retrieval | | | | 2023 |
SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval | | | | 2023 |
OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data | | | | 2023 |
Towards 3D VR-Sketch to 3D Shape Retrieval | | | | 2022 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Multimodal Brain Disease Classification with Functional Interaction Learning from Single fMRI Volume | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions | | | | 2023 |
UnLoc: A Universal Localization Method for Autonomous Vehicles using LiDAR, Radar and/or Camera Input | | | | 2023 |
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Towards Zero-Shot Scale-Aware Monocular Depth Estimation | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
ImageBind-LLM: Multi-modality Instruction Tuning | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
LiCamGait: Gait Recognition in the Wild by Using LiDAR and Camera Multi-modal Visual Sensors | | | | 2022 |
LATFormer: Locality-Aware Point-View Fusion Transformer for 3D Shape Recognition | | | | 2023 |
Cross-Modal Learning with 3D Deformable Attention for Action Recognition | | | | 2023 |
FER-former: Multi-modal Transformer for Facial Expression Recognition | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation | | | | 2023 |
Zero-1-to-3: Zero-shot One Image to 3D Object | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Style-aware Augmented Virtuality Embeddings (SAVE) | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding | | | | 2023 |
Title | arXiv | Github | WebSite | Pub. & Date |
---|---|---|---|---|
Scalable 3D Captioning with Pretrained Models | | | | 2023 |