Paper List for Robotics & Embodied AI - Tianxing Chen
- Manipulation
  - Imitation Learning
  - Diffusion Policy & Diffuser
  - Humanoid
  - Dexterous Manipulation
- LLM for Embodied AI
  - LLM Agent
- Foundation Models for Embodied AI
  - Affordance
  - Correspondence
  - Tracking & Estimation
  - Generative Models
- Reinforcement Learning
- Motion Generation
- Robot Hardware
- Dataset & Benchmark
- Diffusion Model for Planning, Policy, and RL
- 3D-based Manipulation
- 2D-based Manipulation
- LLM for Robotics
- LLM Agent (Planning)
- Generative Model for Embodied AI
- Visual Feature: Correspondence, Affordance
- Detection & Segmentation
- Pose Estimation and Tracking
- Humanoid
- Dataset & Benchmark
- Hardware
- 2D to 3D Generation
- Gaussian Splatting
- Robotics for Medical
- Companies
- [arXiv] Diffusion Models for Reinforcement Learning: A Survey, arXiv
- [ICLR 23 (Top 5% Notable)] Is Conditional Generative Modeling all you need for Decision-Making?, website
- [RSS 23] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, website
- [ICML 22 (Long Talk)] Planning with Diffusion for Flexible Behavior Synthesis, website
- [ICML 23 Oral] AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners, website
- [CVPR 24] SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution, website
- [arXiv] Learning a Diffusion Model Policy From Reward via Q-Score Matching, arXiv
- [CoRL 23] ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Prediction for Robotic Manipulation, website
- [CVPR 23] Affordance Diffusion: Synthesizing Hand-Object Interactions, website
- [arXiv] DiffuserLite: Towards Real-time Diffusion Planning, arXiv
- [arXiv] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, website
- [arXiv] 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations, website
- [arXiv] SafeDiffuser: Safe Planning with Diffusion Probabilistic Models, arXiv
- [CVPR 24] Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation, arXiv
- [arXiv 24] Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning, arXiv
- [arXiv 24] Surgical Robot Transformer: Imitation Learning for Surgical Tasks, website
- [CoRL 24] GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy, website
- [RSS 24] RVT-2: Learning Precise Manipulation from Few Examples, website
- [arXiv 23] D3 Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation, website
- [arXiv 24] UniDoorManip: Learning Universal Door Manipulation Policy Over Large-scale and Diverse Door Manipulation Environments, website
- [CoRL 23 (Oral)] GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields, website
- [ECCV 24] ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation, website
- [IROS 24] RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective, website
- GraspNet website:
- [TRO 23] AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains, arXiv
- [arXiv 24] ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter, website
- [arXiv 24] GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping, website
- [CVPR 22 Oral] Ditto: Building Digital Twins of Articulated Objects from Interaction, website
- [ICRA 24] RGBManip: Monocular Image-based Robotic Manipulation through Active Object Pose Estimation, website
- [NIPS 23] MoVie: Visual Model-Based Policy Adaptation for View Generalization, website
- [arXiv 24] OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics, website
- [CoRL 23] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, website
- [arXiv 23] ChatGPT for Robotics: Design Principles and Model Abilities, arXiv
- [arXiv 24] Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation, arXiv
- [PMLR 23] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, website
- [NIPS 23] Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning, website
- [arXiv 24] Generative Image as Action Models, website
- [arXiv 24] Genie: Generative Interactive Environments, website
- [arXiv 23] D3 Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation, website
- [CoRL 20] Transporter Networks: Rearranging the Visual World for Robotic Manipulation, website
- [ICLR 24] SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation, website
- [ICRA 24] UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence, website
- [CoRL 18] Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, PDF
- [arXiv 24] Theia: Distilling Diverse Vision Foundation Models for Robot Learning, website, Github repo
- [CoRL 22] Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation, website
- [arXiv 24] Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation, arXiv
- [arXiv 24] PreAfford: Universal Affordance-Based Pre-Grasping for Diverse Objects and Environments, arXiv
- [ICLR 22] VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects, website
- [ICLR 23] DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Object Manipulation, arXiv
- [CVPR 22] Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, website
- [ICCV 23] AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose, website
- [ECCV 24] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Github repo
- [arXiv 24] Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything, Github repo
- [ICCV 23] DEVA: Tracking Anything with Decoupled Video Segmentation, website
- [ECCV 22] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, website
- [ICCV 23] VLPart: Going Denser with Open-Vocabulary Part Segmentation, website
- LangSAM, Github repo (combining Grounding DINO and SAM)
- [CVPR 24 (Highlight)] FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects, website
- [CVPR 23 (Highlight)] GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts, website
- [arXiv 23] GAMMA: Generalizable Articulation Modeling and Manipulation for Articulated Objects, website
- [arXiv 24] ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics, website
- [ICCV 23] AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose, website
- [CVPR 23] BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects, website
- [arXiv 24] WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild, website
- [arXiv 24] HumanPlus: Humanoid Shadowing and Imitation from Humans, website
- [arXiv 24] Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks, website, zhihu
- [arXiv 24] GRUtopia: Dream General Robots in a City at Scale, Github Repo
- [ICLR 24] AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents, website
- [arXiv 24] RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios, Github repo
- [arXiv 24] BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark, website
- [arXiv 24] Evaluating Real-World Robot Manipulation Policies in Simulation, website
- [arXiv 23] Objaverse-XL: A Universe of 10M+ 3D Objects, website
- [arXiv 24] DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation, website
- [arXiv 24] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image, website
- [SIGGRAPH 24] 2DGS: 2D Gaussian Splatting for Geometrically Accurate Radiance Fields, website
- [arXiv 24] Surgical Robot Transformer: Imitation Learning for Surgical Tasks, website
- Where2Act: From Pixels to Actions for Articulated 3D Objects
- PreAfford: Universal Affordance-Based Pre-Grasping for Diverse Objects and Environments
- Decision Transformer: Reinforcement Learning via Sequence Modeling
- Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis
- AO-Grasp: Articulated Object Grasp Generation
- Human-to-Robot Imitation in the Wild
- RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
- SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation (ICML 24), https://sam-embodied.github.io/
- PerAct, Act3D
- Probing the 3D Awareness of Visual Foundation Models: https://arxiv.org/pdf/2404.08636
- ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
- CLIP: Zero-shot Jack of All Trades, website; CLIP GradCAM: CLIP_GradCAM_Visualization
- Articulated Object Manipulation with Coarse-to-fine Affordance for Mitigating the Effect of Point Cloud Noise: https://arxiv.org/pdf/2402.18699
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
- PDDLGym: Gym Environments from PDDL Problems: https://arxiv.org/abs/2002.06432
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents
- VisionLLM: https://arxiv.org/abs/2305.11175
- Ferret: Refer and Ground Anything Anywhere at Any Granularity: https://github.com/apple/ml-ferret
- LangSplat
- Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity
- SparseDFF
- ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics
- Stabilizing Transformers for Reinforcement Learning
  - Summary: Proposes Gated Transformer-XL (GTrXL), a modified Transformer architecture that addresses the optimization difficulties standard Transformers face in reinforcement learning. By reordering layer normalization and adding a gating mechanism in place of the residual connections, GTrXL outperforms LSTM baselines in partially observable environments. A minimal sketch of the gating layer follows this entry.
  - Link
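Below is a minimal sketch (not the authors' code) of the GRU-style gating layer that GTrXL places around each Transformer sublayer; the class name, dimensions, and the value of the positive bias initialization are assumptions.

```python
import torch
import torch.nn as nn

class GRUGate(nn.Module):
    """Gated skip connection combining a sublayer's input x and output y."""
    def __init__(self, d_model: int, bias_init: float = 2.0):
        super().__init__()
        self.w_r = nn.Linear(d_model, d_model, bias=False)
        self.u_r = nn.Linear(d_model, d_model, bias=False)
        self.w_z = nn.Linear(d_model, d_model, bias=False)
        self.u_z = nn.Linear(d_model, d_model, bias=False)
        self.w_g = nn.Linear(d_model, d_model, bias=False)
        self.u_g = nn.Linear(d_model, d_model, bias=False)
        # Positive bias on the update gate so the layer starts out close to an
        # identity skip connection, which the paper argues stabilizes early training.
        self.bias_z = nn.Parameter(torch.full((d_model,), bias_init))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.w_r(y) + self.u_r(x))                 # reset gate
        z = torch.sigmoid(self.w_z(y) + self.u_z(x) - self.bias_z)   # update gate
        h = torch.tanh(self.w_g(y) + self.u_g(r * x))                # candidate state
        return (1.0 - z) * x + z * h

gate = GRUGate(d_model=64)
out = gate(torch.randn(8, 64), torch.randn(8, 64))  # used in place of "x + sublayer(x)"
```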
- CoBERL: Contrastive BERT for Reinforcement Learning
  - Summary: Introduces CoBERL, which combines a contrastive loss with a Transformer architecture, using bidirectional masked prediction and contrastive learning to improve data efficiency and performance in reinforcement learning. A generic contrastive-loss sketch follows this entry.
  - Link
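As an illustration of the contrastive ingredient only, here is a generic InfoNCE-style loss between two encodings of the same timesteps (for example, a masked-Transformer view and a recurrent view). It is a stand-in for intuition under those assumptions, not CoBERL's exact RELIC objective, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) encodings of the same inputs from two branches."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau               # pairwise cosine similarities
    labels = torch.arange(z1.shape[0])       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(16, 64), torch.randn(16, 64))
```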
- Adaptive Transformers in RL
  - Summary: Explores Transformer models with adaptive attention spans in reinforcement learning and finds that this approach improves performance in environments that require long-term dependencies. A sketch of the span-masking idea follows this entry.
  - Link
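A minimal sketch of the soft span mask behind adaptive attention spans (Sukhbaatar et al.): each head learns a span, and attention weights beyond it are ramped down to zero and renormalized. The class name, the parameterization of the span as a learned fraction, and the hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    def __init__(self, max_span: int = 512, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        self.frac = nn.Parameter(torch.tensor(0.5))  # learned fraction of max_span

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (..., T) softmax attention weights over the T most recent
        # positions, with index 0 assumed to be the most recent one.
        T = attn.shape[-1]
        span = self.frac.clamp(0.0, 1.0) * self.max_span
        dist = torch.arange(T, dtype=attn.dtype, device=attn.device)
        mask = torch.clamp((span + self.ramp - dist) / self.ramp, 0.0, 1.0)
        attn = attn * mask                            # zero out weights past the span
        return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

masker = AdaptiveSpanMask()
weights = torch.softmax(torch.randn(4, 128), dim=-1)
trimmed = masker(weights)   # an L1 penalty on the learned span encourages short spans
```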
- Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
  - Summary: Proposes Actor-Learner Distillation (ALD), which distills knowledge from a large learner model into a small actor model to improve the sample efficiency of Transformers in reinforcement learning. A distillation-loss sketch follows this entry.
  - Link
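A minimal sketch of the distillation step under simple assumptions: a small actor policy is trained to match the action distribution of a large learner policy on a batch of observations. The network sizes, the KL direction, and all names are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(learner_logits: torch.Tensor,
                      actor_logits: torch.Tensor) -> torch.Tensor:
    """KL(learner || actor), averaged over the batch; learner is frozen."""
    learner_logp = F.log_softmax(learner_logits.detach(), dim=-1)
    actor_logp = F.log_softmax(actor_logits, dim=-1)
    return F.kl_div(actor_logp, learner_logp, log_target=True, reduction="batchmean")

obs = torch.randn(32, 64)                                               # batch of observations
learner = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 6))  # large policy
actor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 6))      # small policy
loss = distillation_loss(learner(obs), actor(obs))
loss.backward()                                                         # gradients reach only the actor
```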
- Deep Transformer Q-Networks for Partially Observable Reinforcement Learning
  - Summary: Introduces Deep Transformer Q-Networks (DTQN), a reinforcement learning architecture that uses Transformer self-attention over the observation history to handle partially observable tasks, demonstrating effectiveness across several challenging environments. An architecture sketch follows this entry.
  - Link
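A minimal sketch of the general idea: causal self-attention over a short observation history, with Q-values predicted at every position so each timestep contributes a learning signal. The class name, hyper-parameters, and layer choices are assumptions, not a faithful reimplementation.

```python
import torch
import torch.nn as nn

class TransformerQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int,
                 d_model: int = 64, n_layers: int = 2, context_len: int = 50):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(context_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_history: torch.Tensor) -> torch.Tensor:
        # obs_history: (batch, T, obs_dim) with T <= context_len
        T = obs_history.shape[1]
        x = self.embed(obs_history) + self.pos[:T]
        # Causal mask so each timestep only attends to itself and the past.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=mask)
        return self.q_head(h)            # (batch, T, n_actions): Q-values per step

q_net = TransformerQNet(obs_dim=8, n_actions=4)
q_values = q_net(torch.randn(2, 10, 8))  # act greedily from q_values[:, -1]
```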
- CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer
  - Summary: CtrlFormer is a Transformer architecture that improves the sample efficiency of visual control tasks by learning transferable state representations, with particular emphasis on cross-task transfer learning. A sketch of the state-token idea follows this entry.
  - Link
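A minimal sketch of one ingredient of this line of work: learnable per-task state tokens attended jointly with image patch tokens, whose outputs serve as the state representation fed to a policy head. The class name, sizes, and the broader training setup (co-training objectives, token transfer procedure) are assumptions and are not reproduced here.

```python
import torch
import torch.nn as nn

class StateTokenEncoder(nn.Module):
    def __init__(self, n_tasks: int, patch_dim: int = 3 * 8 * 8,
                 d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.state_tokens = nn.Parameter(torch.zeros(n_tasks, d_model))  # one token per task
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches: torch.Tensor, task_id: int) -> torch.Tensor:
        # patches: (batch, n_patches, patch_dim) flattened image patches
        b = patches.shape[0]
        tok = self.state_tokens[task_id].expand(b, 1, -1)
        x = torch.cat([tok, self.patch_embed(patches)], dim=1)
        h = self.encoder(x)
        return h[:, 0]                     # the task's state representation

enc = StateTokenEncoder(n_tasks=3)
state = enc(torch.randn(4, 16, 3 * 8 * 8), task_id=0)  # feed to a policy head
```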
- Sapiens: Foundation for Human Vision Models, https://about.meta.com/realitylabs/codecavatars/sapiens
- General Flow as Foundation Affordance for Scalable Robot Learning, https://general-flow.github.io/