cs.CV(2025-10-01)

📊 33 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗3)
Pillar 2: RL & Architecture (8 🔗2)
Pillar 3: Perception & Semantics (4 🔗1)
Pillar 1: Robot Control (3 🔗3)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

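# | Title | One-sentence summary | Tags | 🔗
1 | PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents | Proposes the PAL-UI framework, which uses an active look-back mechanism to improve the planning ability of vision-based GUI agents on long-horizon tasks. | large language model, multimodal
2 | A Deep Learning Pipeline for Epilepsy Genomic Analysis Using GPT-2 XL and NVIDIA H100 | Proposes a deep learning pipeline based on GPT-2 XL and NVIDIA H100 for epilepsy genomic analysis. | large language model
3 | Solar PV Installation Potential Assessment on Building Facades Based on Vision and Language Foundation Models | Proposes the SF-SPA framework, which uses vision-language models to assess the photovoltaic installation potential of building facades. | large language model, foundation model
4 | From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding | Proposes a video-to-indexed-knowledge-graph framework that combines methods for multimodal content analysis and understanding. | multimodal
5 | SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs | SPUS: a lightweight, parameter-efficient foundation model for partial differential equations. | foundation model
6 | Graph Integrated Multimodal Concept Bottleneck Model | Proposes MoE-SGT, which augments multimodal concept bottleneck models with a graph Transformer and a mixture of experts to improve reasoning over complex concepts. | multimodal
7 | Assessing Foundation Models for Mold Colony Detection with Limited Training Data | Evaluates foundation model performance on mold colony detection using a small amount of training data. | foundation model
8 | CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab? | CardioBench: a standardized benchmark for evaluating the generalization of echocardiography foundation models. | foundation model
9 | Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs | Proposes a training-free uncertainty guidance framework for MLLMs on complex visual tasks. | large language model, multimodal
10 | Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories | Proposes XMAS, which selects data efficiently for vision-language models via cross-modal alignment trajectories. | large language model
11 | IMAGEdit: Let Any Subject Transform | IMAGEdit: a training-free framework for transforming the appearance of any number of video subjects. | multimodal
12 | KeySG: Hierarchical Keyframe-Based 3D Scene Graphs | KeySG: builds 3D scene graphs from hierarchical keyframes, improving semantic richness and scalability. | large language model
13 | ProtoMask: Segmentation-Guided Prototype Learning | ProtoMask: a segmentation-guided prototype learning method that improves prototype interpretability. | foundation model
14 | CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation | CML-Bench: a framework for evaluating and improving LLM-generated movie scripts. | large language model
15 | Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation | Proposes COFA, which disentangles foreground and background features through online augmentation to improve generalization in vision-language navigation. | VLN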

🔬 Pillar 2: RL & Architecture (8 papers)

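# | Title | One-sentence summary | Tags | 🔗
16 | Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation | Proposes an adaptive event stream slicing and knowledge distillation framework for open-vocabulary object detection with event cameras. | distillation, open-vocabulary, open vocabulary
17 | Efficient Multi-modal Large Language Models via Progressive Consistency Distillation | Proposes the EPIC framework, which improves the efficiency of multimodal large language models through progressive consistency distillation. | distillation, large language model
18 | Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model | Proposes Gather-Scatter Mamba, which combines attention with a selective SSM to accelerate temporal propagation in video super-resolution. | Mamba, state space model
19 | JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation | Proposes JEPA-T, a joint-embedding predictive architecture with text fusion that improves image generation. | flow matching, open-vocabulary, open vocabulary
20 | Can World Models Benefit VLMs for World Dynamics? | Proposes WorldLM, which uses world model priors to strengthen vision-language models' understanding of world dynamics. | world model, multimodal
21 | EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory | EvoWorld: a panoramic world generation model that evolves using explicit 3D memory. | world model, geometric consistency
22 | Feature Identification for Hierarchical Contrastive Learning | Proposes two hierarchical contrastive learning methods that exploit hierarchical relationships to improve fine-grained classification. | contrastive learning
23 | POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency | Proposes POVQA, a data-efficient, preference-optimized video question answering method that uses rationales to improve performance. | DPO, direct preference optimization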

🔬 Pillar 3: Perception & Semantics (4 papers)

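# | Title | One-sentence summary | Tags | 🔗
24 | Affordance-Guided Diffusion Prior for 3D Hand Reconstruction | Proposes an affordance-based diffusion prior to address severe occlusion in 3D hand reconstruction. | affordance, HOI, affordance-aware
25 | PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset | Proposes PhraseStereo, the first open-vocabulary stereo image segmentation dataset, to advance multimodal semantic understanding. | open-vocabulary, open vocabulary, multimodal
26 | Instant4D: 4D Gaussian Splatting in Minutes | Instant4D: reconstructs dynamic scenes from monocular video with 4D Gaussian splatting in minutes. | visual SLAM, gaussian splatting, splatting
27 | OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding | OTTER: open tagging for multimodal understanding via text-image representations. | open-vocabulary, open vocabulary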

🔬 Pillar 1: Robot Control (3 papers)

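# | Title | One-sentence summary | Tags | 🔗
28 | EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels | EvoStruggle: builds a dataset of how struggle evolves during skill learning, intended to improve assistive systems. | manipulation
29 | Code2Video: A Code-centric Paradigm for Educational Video Generation | Proposes the Code2Video framework, which generates professional educational videos from executable code, improving controllability and teaching quality. | manipulation
30 | MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles | Proposes MathSticks, a matchstick-puzzle benchmark for visual symbolic compositional reasoning. | manipulation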

🔬 Pillar 8: Physics-based Animation (2 papers)

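# | Title | One-sentence summary | Tags | 🔗
31 | Arbitrary Generative Video Interpolation | Proposes ArbInterp, which performs generative video frame interpolation at arbitrary timestamps and lengths. | spatiotemporal
32 | Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning | Proposes a LoRA-based mixture-of-experts model with adaptive shared experts to improve multi-task learning. | ASE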

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags | 🔗
33 | BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration | BindWeave: subject-consistent video generation through cross-modal integration. | spatial relationship, large language model, multimodal
