cs.CV(2025-12-23)
📊 23 papers in total | 🔗 4 with code
🎯 Topic Navigation
Pillar 9: Embodied Foundation Models (8)
Pillar 2: RL Algorithms & Architecture (5 🔗2)
Pillar 3: Spatial Perception & Semantics (4 🔗1)
Pillar 1: Robot Control (3 🔗1)
Pillar 5: Interaction & Reaction (1)
Pillar 8: Physics-based Animation (1)
Pillar 6: Video Extraction & Matching (1)
🔬 Pillar 9: Embodied Foundation Models (8 papers)
🔬 Pillar 2: RL Algorithms & Architecture (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 9 | Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition | Proposes multimodal alignment, translation, fusion, and transfer methods that improve understanding and recognition of complex inputs. | distillation, egocentric, multimodal | | |
| 10 | AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model | Proposes AMoE, an efficient agglomerative Mixture-of-Experts vision foundation model trained via multi-teacher distillation. | representation learning, distillation, foundation model | | |
| 11 | Active Intelligence in Video Avatars via Closed-loop World Modeling | Proposes the ORCA framework, which gives video avatars active intelligence through closed-loop world modeling. | world model | | |
| 12 | milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion | milliMamba: human pose estimation that is robust to specular reflections, using dual mmWave radar and multi-frame Mamba fusion. | Mamba | ✅ | |
| 13 | DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation | DDAVS: disentangles audio semantics and applies delayed bidirectional alignment for audio-visual segmentation. | contrastive learning, multimodal | ✅ | |
🔬 Pillar 3: Spatial Perception & Semantics (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS) | Uses 3D Gaussian Splatting to make annotation for 5D apple pose estimation more efficient. | 3D gaussian splatting, 3DGS, gaussian splatting | | |
| 15 | SirenPose: Dynamic Scene Reconstruction via Geometric Supervision | SirenPose: accurate, temporally consistent reconstruction of dynamic scenes via geometric supervision. | scene reconstruction, physically plausible, spatiotemporal | | |
| 16 | AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment | AlignPose: generalizable 6D pose estimation via multi-view feature-metric alignment. | 6D pose estimation | | |
| 17 | SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images | SmartSplat: a feature-aware Gaussian splatting (GS) image compression framework that compresses ultra-high-resolution images efficiently while reconstructing them at high quality. | 3D gaussian splatting, gaussian splatting, splatting | ✅ | |
🔬 Pillar 1: Robot Control (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 18 | LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation | Proposes LADLE-MM, a multimodal misinformation detector built on limited annotations and ensemble learning, suited to resource-constrained settings. | manipulation, multimodal | | |
| 19 | Dreamcrafter: Immersive Editing of 3D Radiance Fields Through Flexible, Generative Inputs and Outputs | Dreamcrafter: immersive editing of 3D radiance fields through flexible, generative inputs and outputs. | manipulation, 3D gaussian splatting, gaussian splatting | | |
| 20 | LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving | LEAD: minimizes learner-expert asymmetry in end-to-end driving and improves driving performance in the CARLA simulator. | sim-to-real, imitation learning | ✅ | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 21 | TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation | Proposes TAVID, which generates text-driven interactive audio-visual dialogue through cross-modal mapping. | dyadic interaction, multimodal | | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 22 | A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments | Studies how dual-view video inputs affect distraction detection in naturalistic driving environments and highlights the importance of fusion design. | spatiotemporal, multimodal | | |
🔬 Pillar 6: Video Extraction & Matching (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 23 | DETACH : Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning | Proposes the DETACH framework, which fuses exocentric video with ambient sensors through decomposed spatio-temporal alignment. | egocentric | | |