cs.CV (2025-10-06)

📊 29 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 2: RL & Architecture (8 🔗3) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (2) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

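| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 1 | Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior | Proposes the Pathology-CoT framework, which learns a visual chain-of-thought agent from expert WSI diagnosis behavior. | foundation model, chain-of-thought |
| 2 | ActiveMark: on watermarking of visual foundation models via massive activations | Proposes ActiveMark to address watermark protection for visual foundation models. | foundation model |
| 3 | A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification | Proposes S²Fin, a spatial-spectral-frequency interactive network, to improve multimodal remote sensing image classification accuracy. | multimodal |
| 4 | Factuality Matters: When Image Generation and Editing Meet Structured Visuals | Addresses factuality in structured visual generation and editing; proposes the StructBench benchmark and a multimodal fusion model. | multimodal, chain-of-thought |
| 5 | MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models | MedCLM: learns localization and reasoning in medical vision-language models through a CoT curriculum. | visual grounding, chain-of-thought |
| 6 | VChain: Chain-of-Visual-Thought for Reasoning in Video Generation | VChain: a chain of visual thought for reasoning in video generation. | multimodal |
| 7 | Character Mixing for Video Generation | Proposes the CCE and CCA frameworks for video generation that mixes characters across worlds, addressing style degradation. | multimodal |
| 8 | Visual Representations inside the Language Model | Analyzes the visual representations inside multimodal large language models, revealing perception bottlenecks and directions for improvement. | multimodal |
| 9 | Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics | Proposes a Transformer-based method for identifying people from conversational dynamics, improving identification accuracy in natural interaction settings. | multimodal |
| 10 | ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion | Proposes a blendshape-guided diffusion model for identity-preserving, precise expression generation. | foundation model |
| 11 | VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery | Proposes the VaseVQA-3D dataset and the VaseVLM model to address data scarcity and insufficient domain knowledge in visual question answering for 3D cultural artifacts. | multimodal |
| 12 | Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting | VLMCountBench reveals significant shortcomings of vision-language models on compositional counting tasks. | embodied AI |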

🔬 Pillar 2: RL & Architecture (8 papers)

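| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 13 | Benchmark on Monocular Metric Depth Estimation in Wildlife Setting | Builds a benchmark for monocular depth estimation in wildlife settings and evaluates existing methods. | MAE, depth estimation, monocular depth |
| 14 | Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models | A comprehensive analysis of post-training methods for video large multimodal models, aimed at improving video reasoning. | reinforcement learning, reward design, spatiotemporal |
| 15 | Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction | Proposes an object-centric representation learning method that improves 3D scene graph prediction accuracy. | representation learning, open-vocabulary |
| 16 | Conditional Representation Learning for Customized Tasks | Proposes conditional representation learning (CRL) to extract image representations with task-specific semantics for customized tasks. | representation learning, large language model |
| 17 | A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation | Compares ViTs and CNNs on few-shot rigid transformation and fundamental matrix estimation, showing how the choice of architecture depends on data scale. | contrastive learning, scene reconstruction, foundation model |
| 18 | ERDE: Entropy-Regularized Distillation for Early-exit | Proposes an entropy-regularized knowledge distillation method for early exit, improving image classification efficiency on edge devices. | distillation |
| 19 | Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation | Proposes AT-BPTT, which improves dataset distillation performance through automatic inner-loop optimization. | distillation |
| 20 | EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents | EduPersona: a benchmark for evaluating the subjective abilities of virtual student agents. | teacher-student, large language model |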

🔬 Pillar 3: Perception & Semantics (4 papers)

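| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 21 | Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction | Proposes the PG-Occ framework, which performs open-vocabulary 3D occupancy prediction with a progressive Gaussian Transformer. | scene understanding, open-vocabulary |
| 22 | Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning | Proposes an open-vocabulary learning method based on bounded distribution estimation that improves generalization by generating data for unseen classes. | open-vocabulary |
| 23 | See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models | Proposes a time-reversed scene reconstruction method based on vision-language models that uses thermal imaging traces to infer past scene states. | scene reconstruction |
| 24 | AvatarVTON: 4D Virtual Try-On for Animatable Avatars | AvatarVTON: proposes the first 4D virtual try-on framework for animatable avatars. | optical flow |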

🔬 Pillar 1: Robot Control (2 papers)

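| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 25 | General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks | Proposes a visual goal-conditioned reinforcement learning method based on object-agnostic masks, improving generalization and efficiency. | sim-to-real, reinforcement learning, open-vocabulary |
| 26 | Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization | Proposes an automated dual-robot scanning system for high-precision 3D digitization of cultural heritage. | manipulation, motion planning |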

🔬 Pillar 6: Video Extraction (2 papers)

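| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 27 | Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors | EgoSurg: reconstructs arbitrary-view egocentric replays of operating room workflows from ambient sensors. | egocentric |
| 28 | SegMASt3R: Geometry Grounded Segment Matching | SegMASt3R: uses a 3D foundation model for geometry-aware image segment matching. | feature matching, foundation model |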

🔬 Pillar 5: Interaction & Reaction (1 paper)

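| # | Title | One-sentence summary | Tags |
|---|-------|----------------------|------|
| 29 | Read the Room: Inferring Social Context Through Dyadic Interaction Recognition in Cyber-physical-social Infrastructure Systems | Proposes a depth-sensor-based group interaction recognition method to enhance social awareness in cyber-physical-social infrastructure systems. | dyadic interaction |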
