cs.CV (2025-10-06)

📊 29 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗4) | Pillar 2: RL & Architecture (8 🔗3) | Pillar 3: Perception & Semantics (4 🔗1) | Pillar 1: Robot Control (2) | Pillar 6: Video Extraction (2) | Pillar 5: Interaction & Reaction (1)

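🔬 Pillar 9: Embodied Foundation Models (12 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior	Proposes the Pathology-CoT framework, which learns a visual chain-of-thought agent from expert WSI diagnostic behavior.	foundation model chain-of-thought
2	ActiveMark: on watermarking of visual foundation models via massive activations	Proposes ActiveMark to address watermark protection for visual foundation models.	foundation model
3	A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification	Proposes S²Fin, a spatial-spectral-frequency interactive network, to improve multimodal remote sensing image classification accuracy.	multimodal	✅
4	Factuality Matters: When Image Generation and Editing Meet Structured Visuals	Addresses factuality in structured visual generation and editing, proposing the StructBench benchmark and a multimodal fusion model.	multimodal chain-of-thought
5	MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models	MedCLM: learns localization and reasoning in medical vision-language models through a CoT curriculum.	visual grounding chain-of-thought
6	VChain: Chain-of-Visual-Thought for Reasoning in Video Generation	VChain: a chain of visual thought for reasoning in video generation.	multimodal
7	Character Mixing for Video Generation	Proposes the CCE and CCA frameworks for video generation that mixes characters across worlds, addressing style degradation.	multimodal	✅
8	Visual Representations inside the Language Model	Analyzes the visual representations inside multimodal large language models, revealing bottlenecks in their perception and directions for improvement.	multimodal
9	Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics	Proposes a Transformer-based method for identifying people from conversational dynamics, improving identification accuracy in natural interaction settings.	multimodal
10	ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion	Proposes a blendshape-guided diffusion model for identity-preserving, precise expression generation.	foundation model	✅
11	VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery	Proposes the VaseVQA-3D dataset and the VaseVLM model to address data scarcity and insufficient domain knowledge in visual question answering on 3D cultural artifacts.	multimodal	✅
12	Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting	VLMCountBench exposes significant shortcomings of vision-language models on compositional counting tasks.	embodied AI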

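🔬 Pillar 2: RL & Architecture (8 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
13	Benchmark on Monocular Metric Depth Estimation in Wildlife Setting	Builds a benchmark for monocular metric depth estimation in wildlife settings and evaluates existing methods.	MAE depth estimation monocular depth
14	Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models	Comprehensively analyzes post-training methods for video large multimodal models, aimed at improving video reasoning.	reinforcement learning reward design spatiotemporal	✅
15	Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction	Proposes an object-centric representation learning method that improves 3D scene graph prediction accuracy.	representation learning open-vocabulary open vocabulary	✅
16	Conditional Representation Learning for Customized Tasks	Proposes conditional representation learning (CRL), which extracts image representations with task-specific semantics for customized tasks.	representation learning large language model	✅
17	A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation	Compares ViTs and CNNs on few-shot rigid transformation and fundamental matrix estimation, showing how the choice of architecture depends on data scale.	contrastive learning scene reconstruction foundation model
18	ERDE: Entropy-Regularized Distillation for Early-exit	Proposes an entropy-regularized knowledge distillation method for early exit, improving image classification efficiency on edge devices.	distillation
19	Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation	Proposes AT-BPTT, which improves dataset distillation performance through automatic inner-loop optimization.	distillation
20	EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents	EduPersona: a benchmark for evaluating the subjective abilities of virtual student agents.	teacher-student large language model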

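🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
21	Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction	Proposes the PG-Occ framework, which uses a progressive Gaussian Transformer for open-vocabulary 3D occupancy prediction.	scene understanding open-vocabulary open vocabulary	✅
22	Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning	Proposes an open-vocabulary learning method based on bounded distribution estimation, improving generalization by generating data for unseen classes.	open-vocabulary open vocabulary
23	See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models	Proposes a time-reversed scene reconstruction method based on vision-language models that infers past scene states from thermal imaging traces.	scene reconstruction
24	AvatarVTON: 4D Virtual Try-On for Animatable Avatars	AvatarVTON: presented as the first 4D virtual try-on framework for animatable avatars.	optical flow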

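🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
25	General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks	Proposes a visual goal-conditioned reinforcement learning method based on object-agnostic masks, improving generalization and efficiency.	sim-to-real reinforcement learning open-vocabulary
26	Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization	Proposes an automated dual-robot scanning system for high-precision 3D digitization of cultural heritage.	manipulation motion planning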

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
27	Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors	EgoSurg: reconstructs arbitrary-view egocentric replays of operating room workflows from ambient sensors.	egocentric
28	SegMASt3R: Geometry Grounded Segment Matching	SegMASt3R: uses a 3D foundation model for geometry-aware image segment matching.	feature matching foundation model

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
29	Read the Room: Inferring Social Context Through Dyadic Interaction Recognition in Cyber-physical-social Infrastructure Systems	Proposes a depth-sensor-based dyadic interaction recognition method to improve social awareness in cyber-physical-social infrastructure systems.	dyadic interaction
