cs.CV (2025-09-20)
📊 15 papers in total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6 🔗1)
Pillar 3: Perception & Semantics (4)
Pillar 2: RL & Architecture (3 🔗1)
Pillar 1: Robot Control (1)
Pillar 4: Generative Motion (1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | KV-Efficient VLA: A Method to Speed up Vision Language Models with RNN-Gated Chunked KV Cache | KV-Efficient VLA: accelerates vision-language models with an RNN-gated chunked KV cache | vision-language-action VLA | | |
| 2 | MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation | MMPart: uses multimodal large language models for part-aware 3D generation | large language model | | |
| 3 | Animalbooth: multimodal feature enhancement for animal subject personalization | AnimalBooth: personalized image generation of animal subjects through multimodal feature enhancement | multimodal | | |
| 4 | Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model | Detects and simulates urban heat islands using a fine-tuned geospatial foundation model | foundation model | | |
| 5 | Advancing Reference-free Evaluation of Video Captions with Factual Analysis | Proposes VC-Inspector, a reference-free video caption evaluation framework based on factual analysis | large language model multimodal | | |
| 6 | Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence | Proposes the ActiSeg-NL benchmark to study action-prompted video segmentation under label noise, and proposes PMHM to improve robustness | multimodal | ✅ | |
🔬 Pillar 3: Perception & Semantics (4 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding | Text-Scene: proposes a scene-to-language parsing framework for 3D scene understanding | scene understanding affordance spatial relationship | | |
| 8 | ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting | Proposes the ST-GS framework, which improves 3D semantic occupancy prediction for vision-centric autonomous driving through spatial-temporal Gaussian splatting | gaussian splatting splatting scene understanding | | |
| 9 | MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging | MedGS: Gaussian-splatting-based reconstruction and interpolation for multi-modal 3D medical imaging | gaussian splatting splatting | | |
| 10 | SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving | SQS: enhances sparse perception models for autonomous driving via query-based splatting | splatting | | |
🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | Surgical-MambaLLM: Mamba2-enhanced Multimodal Large Language Model for VQLA in Robotic Surgery | Surgical-MambaLLM: a Mamba2-enhanced multimodal large language model for visual question localized-answering (VQLA) in robotic surgery | Mamba large language model multimodal | | |
| 12 | Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment | Learns hyperspectral images with text prompts to achieve efficient multimodal alignment | distillation scene understanding HSI | | |
| 13 | Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization | Proposes the CaRe-DPO framework, which improves caption generation quality for text-video retrieval through dual-group direct preference optimization | DPO direct preference optimization large language model | ✅ | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose | I2S framework: user identification from human-object interactions using 3D hand pose | manipulation bi-manual human-object interaction | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | HyPlaneHead: Rethinking Tri-plane-like Representations in Full-Head Image Synthesis | Proposes HyPlaneHead, which achieves high-quality full-head image synthesis through a hybrid-plane representation | penetration | | |