cs.CV (2025-09-02)

📊 15 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 2: RL & Architecture (1) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery | StrokeVision-Bench: a multimodal video and 2D pose benchmark dataset for tracking stroke recovery | multimodal | |
| 2 | Toward a robust lesion detection model in breast DCE-MRI: adapting foundation models to high-risk women | Proposes a breast DCE-MRI lesion detection model for high-risk women, built on a medical slice Transformer and KAN | foundation model | |
| 3 | MedDINOv3: How to adapt vision foundation models for medical image segmentation? | MedDINOv3: a method for adapting vision foundation models to medical image segmentation | foundation model | |
| 4 | OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds | OmniActor: a generalist GUI and embodied agent for 2D and 3D worlds | generalist agent, large language model, multimodal | |
| 5 | A Multimodal Cross-View Model for Predicting Postoperative Neck Pain in Cervical Spondylosis Patients | Proposes the ABPDC and FPRAN models to predict postoperative neck pain recovery in cervical spondylosis patients | multimodal | |
| 6 | FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection | Proposes FusWay, a multimodal hybrid fusion approach for improving railway defect detection accuracy | multimodal | |
| 7 | Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture | Systematically analyzes the bottlenecks in MLLM spatial understanding, proposing the MulSeT benchmark and examining the influence of data and architecture | large language model, multimodal | |
| 8 | DIET-CP: Lightweight and Data Efficient Self Supervised Continued Pretraining | DIET-CP: a lightweight and data-efficient self-supervised continued pretraining method | foundation model | |
| 9 | Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks | Proposes the RocketScience benchmark, revealing the shortcomings of current VLMs in spatial relation understanding | chain-of-thought | |

🔬 Pillar 3: Perception & Semantics (3 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 10 | Omnidirectional Spatial Modeling from Correlated Panoramas | Proposes the CFpano dataset and a multimodal large language model to address panoramic image understanding | scene understanding, embodied AI, large language model | |
| 11 | FastVGGT: Training-Free Acceleration of Visual Geometry Transformer | FastVGGT: accelerates the Visual Geometry Transformer via training-free token merging, improving 3D vision efficiency | VGGT, foundation model | |
| 12 | Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery | Proposes Motion-Refined DINOSAUR for unsupervised multi-object discovery without pseudo-labels | optical flow | |

🔬 Pillar 8: Physics-based Animation (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding? | PixFoundation 2.0: evaluates the extent to which video multi-modal LLMs exploit motion information for visual grounding | spatiotemporal, large language model, visual grounding | |

🔬 Pillar 2: RL & Architecture (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | Faster and Better: Reinforced Collaborative Distillation and Self-Learning for Infrared-Visible Image Fusion | Proposes a reinforcement-learning-driven collaborative distillation and self-learning framework for infrared-visible image fusion | reinforcement learning, teacher-student distillation | |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | SynthGenNet: a self-supervised approach for test-time generalization using synthetic multi-source domain mixing of street view images | SynthGenNet: a self-supervised approach for test-time generalization via synthetic multi-source domain mixing of street-view images | sim-to-real, contrastive learning, distillation | |
