cs.CV (2025-10-01)

📊 33 papers in total | 🔗 10 with code

🎯 Areas of interest

Pillar 9: Embodied Foundation Models (15 🔗3) · Pillar 2: RL & Architecture (8 🔗2) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (3 🔗3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1)

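🔬 Pillar 9: Embodied Foundation Models (15 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents	Proposes the PAL-UI framework, which uses an active look-back mechanism to improve the planning ability of vision-based GUI agents on long-horizon tasks.	large language model multimodal
2	A Deep Learning Pipeline for Epilepsy Genomic Analysis Using GPT-2 XL and NVIDIA H100	Proposes a deep learning pipeline based on GPT-2 XL and NVIDIA H100 for epilepsy genomic analysis.	large language model
3	Solar PV Installation Potential Assessment on Building Facades Based on Vision and Language Foundation Models	Proposes the SF-SPA framework, which uses vision-language models to assess the photovoltaic installation potential of building facades.	large language model foundation model
4	From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding	Proposes a framework that turns videos into indexed knowledge graphs by combining methods for multimodal content analysis and understanding.	multimodal
5	SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs	SPUS: a lightweight, parameter-efficient foundation model for partial differential equations.	foundation model
6	Graph Integrated Multimodal Concept Bottleneck Model	Proposes MoE-SGT, which strengthens multimodal concept bottleneck models with a graph Transformer and a mixture of experts to improve reasoning over complex concepts.	multimodal
7	Assessing Foundation Models for Mold Colony Detection with Limited Training Data	Evaluates the performance of foundation models for mold colony detection with limited training data.	foundation model
8	CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?	CardioBench: a standardized benchmark for evaluating the generalization of echocardiography foundation models.	foundation model
9	Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs	Proposes a training-free uncertainty guidance framework for MLLMs on complex visual tasks.	large language model multimodal
10	Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories	Proposes XMAS, which selects data efficiently for vision-language model fine-tuning via cross-modal alignment trajectories.	large language model	✅
11	IMAGEdit: Let Any Subject Transform	IMAGEdit: a training-free framework that transforms the appearance of any number of video subjects.	multimodal	✅
12	KeySG: Hierarchical Keyframe-Based 3D Scene Graphs	KeySG: builds 3D scene graphs from hierarchical keyframes, improving semantic richness and scalability.	large language model
13	ProtoMask: Segmentation-Guided Prototype Learning	ProtoMask: a segmentation-guided prototype learning method that improves prototype interpretability.	foundation model	✅
14	CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation	CML-Bench: a framework for evaluating and improving LLM-generated movie scripts.	large language model
15	Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation	Proposes COFA, which disentangles foreground and background features via online augmentation to improve generalization in vision-language navigation.	VLN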

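🔬 Pillar 2: RL & Architecture (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
16	Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation	Proposes an adaptive event stream slicing and knowledge distillation framework for open-vocabulary object detection with event cameras.	distillation open-vocabulary open vocabulary
17	Efficient Multi-modal Large Language Models via Progressive Consistency Distillation	Proposes the EPIC framework, which improves the efficiency of multimodal large language models through progressive consistency distillation.	distillation large language model
18	Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model	Proposes Gather-Scatter Mamba, which combines attention with a selective SSM to accelerate temporal propagation in video super-resolution.	Mamba state space model	✅
19	JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation	Proposes JEPA-T, a joint-embedding predictive architecture with text fusion that improves image generation.	flow matching open-vocabulary open vocabulary	✅
20	Can World Models Benefit VLMs for World Dynamics?	Proposes WorldLM, which uses world-model priors to strengthen vision-language models' understanding of world dynamics.	world model multimodal
21	EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory	EvoWorld: a panoramic world generation model that evolves with explicit 3D memory.	world model geometric consistency
22	Feature Identification for Hierarchical Contrastive Learning	Proposes two hierarchical contrastive learning methods that exploit hierarchical relations to improve fine-grained classification.	contrastive learning
23	POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency	Proposes POVQA, a data-efficient preference-optimized video question answering method that uses rationales to improve performance.	DPO direct preference optimization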

🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	Affordance-Guided Diffusion Prior for 3D Hand Reconstruction	Proposes an affordance-guided diffusion prior to handle severe occlusion in 3D hand reconstruction.	affordance HOI affordance-aware
25	PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset	Proposes PhraseStereo, the first open-vocabulary stereo image segmentation dataset, to advance multimodal semantic understanding.	open-vocabulary open vocabulary multimodal
26	Instant4D: 4D Gaussian Splatting in Minutes	Instant4D: reconstructs dynamic scenes from monocular video with 4D Gaussian splatting in minutes.	visual SLAM gaussian splatting splatting	✅
27	OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding	OTTER: open tagging for multimodal understanding via text-image representations.	open-vocabulary open vocabulary

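🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
28	EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels	EvoStruggle: a dataset capturing how struggle evolves during skill learning, intended to improve assistive systems.	manipulation	✅
29	Code2Video: A Code-centric Paradigm for Educational Video Generation	Proposes the Code2Video framework, which generates professional educational videos from executable code, improving controllability and teaching quality.	manipulation	✅
30	MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles	Proposes MathSticks, a matchstick-puzzle benchmark for visual symbolic compositional reasoning.	manipulation	✅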

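🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
31	Arbitrary Generative Video Interpolation	Proposes ArbInterp, which performs generative video frame interpolation at arbitrary timestamps and lengths.	spatiotemporal	✅
32	Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning	Proposes a LoRA-based mixture-of-experts model with adaptive shared experts to improve multi-task learning performance.	ASE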

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
33	BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration	BindWeave: subject-consistent video generation via cross-modal integration.	spatial relationship large language model multimodal
