cs.CV (2026-04-10)
📊 12 papers in total | 🔗 4 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6 🔗2)
Pillar 2: RL Algorithms & Architecture (4)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 7: Motion Retargeting (1 🔗1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
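| 1 | Large-Scale Universal Defect Generation: Foundation Models and Datasets | Proposes UniDG, a large-scale universal defect generation model that addresses the scarcity of defect generation data. | foundation model, multimodal | ✅ | |
| 2 | Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection | Proposes ImageProtector, which uses visual prompt injection to prevent multimodal large language models from analyzing images. | large language model | | |
| 3 | Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization | Mosaic: multi-view ensemble optimization that strengthens multimodal jailbreak attacks against closed-source VLMs. | multimodal | | |
| 4 | PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos | PinpointQA: a dataset and benchmark for spatial understanding of small objects in indoor videos. | large language model, multimodal | ✅ | |
| 5 | Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts | Attributes vision-language model errors to arbitration failure rather than perceptual blindness, and investigates how visual-linguistic conflicts are resolved. | multimodal, visual grounding | | |
| 6 | SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos | SiMing-Bench: evaluates procedural correctness from continuous interactions in clinical skill videos. | large language model, multimodal | | |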
🔬 Pillar 2: RL Algorithms & Architecture (4 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Learning Vision-Language-Action World Models for Autonomous Driving | Proposes VLA-World, a model that combines predictive imagination with reflective reasoning to improve foresight and safety in autonomous driving. | reinforcement learning, world model, world models | | |
| 8 | Visually-Guided Policy Optimization for Multimodal Reasoning | Proposes VGPO, which strengthens visually guided multimodal reasoning and addresses the underuse of visual information. | reinforcement learning, multimodal | | |
| 9 | PhysInOne: Visual Physics Learning and Reasoning in One Suite | PhysInOne: builds a large-scale physical scene dataset to advance physical reasoning in AI systems. | world model, world models, embodied AI | | |
| 10 | VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning | VL-Calibration: decoupled confidence calibration for reasoning in large vision-language models. | reinforcement learning, multimodal, visual grounding | | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing | Proposes HM-Bench for evaluating multimodal large language models on hyperspectral remote sensing image understanding. | HSI, large language model, multimodal | ✅ | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
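| 12 | Envisioning the Future, One Step at a Time | Proposes an open-scene future prediction method based on a sparse trajectory diffusion model, enabling efficient and realistic long-horizon simulation. | motion prediction | ✅ | |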