cs.CV (2025-09-02)

📊 15 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 2: RL & Architecture (1) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery | StrokeVision-Bench: a multimodal video and 2D pose benchmark dataset for tracking stroke recovery | multimodal | |
| 2 | Toward a robust lesion detection model in breast DCE-MRI: adapting foundation models to high-risk women | Proposes a breast DCE-MRI lesion detection model for high-risk women, built on a medical slice Transformer and KAN | foundation model | |
| 3 | MedDINOv3: How to adapt vision foundation models for medical image segmentation? | MedDINOv3: a method for adapting vision foundation models to medical image segmentation | foundation model | |
| 4 | OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds | OmniActor: a generalist GUI and embodied agent for 2D and 3D worlds | generalist agent, large language model, multimodal | |
| 5 | A Multimodal Cross-View Model for Predicting Postoperative Neck Pain in Cervical Spondylosis Patients | Proposes the ABPDC and FPRAN models to predict postoperative neck pain recovery in cervical spondylosis patients | multimodal | |
| 6 | FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection | Proposes FusWay, a multimodal hybrid fusion approach for improving railway defect detection accuracy | multimodal | |
| 7 | Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture | Systematically analyzes the bottlenecks in MLLM spatial understanding, proposing the MulSeT benchmark and examining the influence of data and architecture | large language model, multimodal | |
| 8 | DIET-CP: Lightweight and Data Efficient Self Supervised Continued Pretraining | DIET-CP: a lightweight and data-efficient self-supervised continued pretraining method | foundation model | |
| 9 | Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks | Proposes the RocketScience benchmark, revealing the shortcomings of current VLMs in spatial relation understanding | chain-of-thought | |

🔬 Pillar 3: Perception & Semantics (3 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 10 | Omnidirectional Spatial Modeling from Correlated Panoramas | Proposes the CFpano dataset and a multimodal large language model to address panoramic image understanding | scene understanding, embodied AI, large language model | |
| 11 | FastVGGT: Training-Free Acceleration of Visual Geometry Transformer | FastVGGT: accelerates the Visual Geometry Transformer via training-free token merging, improving 3D vision efficiency | VGGT, foundation model | |
| 12 | Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery | Proposes Motion-Refined DINOSAUR for unsupervised multi-object discovery without pseudo-labels | optical flow | |

🔬 Pillar 8: Physics-based Animation (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding? | PixFoundation 2.0: evaluates the extent to which video multi-modal LLMs exploit motion information for visual grounding | spatiotemporal, large language model, visual grounding | |

🔬 Pillar 2: RL & Architecture (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | Faster and Better: Reinforced Collaborative Distillation and Self-Learning for Infrared-Visible Image Fusion | Proposes a reinforcement-learning-driven collaborative distillation and self-learning framework for infrared-visible image fusion | reinforcement learning, teacher-student distillation | |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | SynthGenNet: a self-supervised approach for test-time generalization using synthetic multi-source domain mixing of street view images | SynthGenNet: a self-supervised approach for test-time generalization via synthetic multi-source domain mixing of street-view images | sim-to-real, contrastive learning, distillation | |
