cs.CV (2025-12-26)
📊 11 papers total | 🔗 1 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6)
Pillar 2: RL & Architecture (2)
Pillar 8: Physics-based Animation (2)
Pillar 3: Perception & Semantics (1 🔗1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception | iSHIFT: a lightweight slow-fast GUI agent with adaptive perception that improves interaction efficiency and accuracy | large language model, multimodal, visual grounding | | |
| 2 | See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning | Proposes bi-directional perceptual shaping to improve multimodal reasoning | multimodal | | |
| 3 | Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models | Proposes BadVSFM, a backdoor attack framework targeting prompt-driven video segmentation foundation models | foundation model | | |
| 4 | Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models | Proposes the Inherent-enhanced Multi-modal Calibration (IMC) framework to improve the robustness of medical multimodal large language models under noisy conditions | large language model | | |
| 5 | SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis | SLIM-Brain: a data- and training-efficient foundation model for fMRI analysis | foundation model | | |
| 6 | Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models | Proposes DIOR, a training-free conditional image embedding framework that leverages large vision-language models | foundation model | | |
🔬 Pillar 2: RL & Architecture (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition | Proposes the PAN framework, which uses human-centric graph representation learning for more effective multimodal action recognition | representation learning, spatiotemporal, multimodal | | |
| 8 | Yume-1.5: A Text-Controlled Interactive World Generation Model | Yume-1.5: a text-controlled interactive world generation model with improved real-time performance and controllability | linear attention, distillation | | |
🔬 Pillar 8: Physics-based Animation (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 9 | End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration | Proposes XET-V2X for end-to-end 3D spatiotemporal perception with multimodal fusion in V2X collaboration | spatiotemporal, multimodal | | |
| 10 | LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration | LongFly: a spatiotemporal context integration framework for long-horizon UAV vision-and-language navigation | spatiotemporal, VLN, multimodal | | |
🔬 Pillar 3: Perception & Semantics (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer | Proposes Reloc-VGGT, which uses a geometry-grounded Transformer for robust and efficient visual re-localization | VGGT, spatial relationship | ✅ | |