cs.CV (2025-09-20)

📊 15 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (6 🔗1) · Pillar 3: Perception & Semantics (4) · Pillar 2: RL & Architecture (3 🔗1) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	KV-Efficient VLA: A Method to Speed up Vision Language Models with RNN-Gated Chunked KV Cache	KV-Efficient VLA: accelerates vision-language models with an RNN-gated chunked KV cache	vision-language-action VLA
2	MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation	MMPart: uses multimodal large language models for part-aware 3D generation	large language model
3	Animalbooth: multimodal feature enhancement for animal subject personalization	AnimalBooth: personalized image generation of animal subjects through multimodal feature enhancement	multimodal
4	Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model	Detects and simulates urban heat islands with a fine-tuned geospatial foundation model	foundation model
5	Advancing Reference-free Evaluation of Video Captions with Factual Analysis	Proposes VC-Inspector, a reference-free evaluation framework for video captions based on factual analysis	large language model multimodal
6	Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence	Proposes the ActiSeg-NL benchmark to study action-prompted video segmentation under label noise, and proposes PMHM to improve robustness	multimodal	✅

🔬 Pillar 3: Perception & Semantics (4 papers)

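#	Title	One-Sentence Summary	Tags	🔗	⭐
7	Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding	Text-Scene: proposes a scene-to-language parsing framework for 3D scene understanding	scene understanding affordance spatial relationship
8	ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting	Proposes the ST-GS framework, which improves 3D semantic occupancy prediction for vision-centric autonomous driving via spatial-temporal Gaussian splatting	gaussian splatting splatting scene understanding
9	MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging	MedGS: Gaussian splatting-based reconstruction and interpolation for multi-modal 3D medical imaging	gaussian splatting splatting
10	SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving	SQS: enhances sparse perception models for autonomous driving via query-based splatting	splatting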

🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
11	Surgical-MambaLLM: Mamba2-enhanced Multimodal Large Language Model for VQLA in Robotic Surgery	Surgical-MambaLLM: a Mamba2-enhanced multimodal large language model for visual question localized-answering in robotic surgery	Mamba large language model multimodal
12	Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment	Learns hyperspectral images with text prompts to achieve efficient multimodal alignment	distillation scene understanding HSI
13	Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization	Proposes the CaRe-DPO framework, which improves caption generation quality for text-video retrieval via dual-group direct preference optimization	DPO direct preference optimization large language model	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
14	Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose	I2S framework: identifies users from human-object interactions using 3D hand pose	manipulation bi-manual human-object interaction

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
15	HyPlaneHead: Rethinking Tri-plane-like Representations in Full-Head Image Synthesis	Proposes HyPlaneHead, which achieves high-quality full-head image synthesis through a hybrid-plane representation	penetration
