cs.CV (2026-04-09)

📊 87 papers in total | 🔗 25 with code

🎯 Interest area navigation

Pillar 9: Embodied Foundation Models (31 🔗10) Pillar 2: RL Algorithms & Architecture (21 🔗6) Pillar 3: Spatial Perception & Semantics (19 🔗6) Pillar 6: Video Extraction & Matching (4 🔗1) Pillar 4: Generative Motion (3) Pillar 1: Robot Control (3 🔗2) Pillar 8: Physics-based Animation (3) Pillar 7: Motion Retargeting (2) Pillar 5: Interaction & Reaction (1)

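🔬 Pillar 9: Embodied Foundation Models (31 papers)

#	Title	One-sentence summary	Tags	🔗
1	Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning	Brain3D: decodes EEG signals into 3D visual representations via multimodal reasoning.	large language model multimodal
2	EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience	Proposes the EEG2Vision framework, which uses low-density EEG signals for high-quality visual reconstruction and improves the application potential of brain-computer interfaces.	large language model multimodal
3	HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models	HAWK: head importance-aware visual token pruning for multimodal models.	large language model multimodal	✅
4	Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts	Reveals the "seeing but not thinking" phenomenon in multimodal MoE models and proposes a routing-guided intervention to improve visual reasoning.	multimodal
5	SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation	Proposes SyncBreaker, a multimodal adversarial attack framework targeting audio-driven talking head generation.	multimodal	✅
6	DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection	Proposes DBMF, a dual-branch multimodal framework that improves OOD detection on medical images.	multimodal
7	$\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization	Proposes SCADA-UL, which uses adversarial optimization to address leakage of source-domain information in source-free domain adaptation.	zero-shot transfer	✅
8	Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment	Proposes PaveGPT, which achieves comprehensive automated pavement condition assessment through domain instruction tuning.	foundation model
9	DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather	DinoRADE: full-spectral radar-camera fusion using vision foundation model features for multi-class object detection in adverse weather.	foundation model	✅
10	Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images	Uses pretrained DINOv3 for annotation-efficient adnexal mass segmentation in cine images.	foundation model	✅
11	Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models	Proposes LogitProd, a plug-and-play logit fusion method for pathology foundation models that improves downstream task performance.	foundation model
12	Weight Group-wise Post-Training Quantization for Medical Foundation Model	A weight group-wise post-training quantization method for medical foundation models that speeds up inference on end devices.	foundation model
13	AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation	AVGen-Bench: a task-driven benchmark for multi-granular evaluation of text-to-audio-video generation.	large language model multimodal
14	What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric	Proposes a VLM-based framework for evaluating semantic scanpath similarity, covering the semantic information that traditional methods ignore.	foundation model multimodal
15	SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection	SciFigDetect: the first benchmark for detecting AI-generated scientific figures, revealing the shortcomings of existing detectors on scientific images.	multimodal zero-shot transfer	✅
16	Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding	Proposes Bridge-STG, which decouples spatio-temporal alignment to improve multimodal large language models on video grounding.	large language model multimodal
17	Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation	Proposes Tarot-SAM3, a training-free SAM3 framework for any referring expression segmentation.	large language model multimodal
18	AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models	AgriChain: visually grounded, expert-verified reasoning for interpretable agricultural vision-language models.	multimodal chain-of-thought	✅
19	ParseBench: A Document Parsing Benchmark for AI Agents	Proposes ParseBench to address semantic correctness in document parsing.	visual grounding	✅
20	Phantasia: Context-Adaptive Backdoors in Vision Language Models	Proposes Phantasia, a context-adaptive backdoor attack on vision-language models.	multimodal
21	PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models	PokeGym: a visually driven benchmark for evaluating vision-language models on long-horizon tasks.	visual grounding
22	Revisiting Radar Perception With Spectral Point Clouds	Proposes spectral point clouds to improve the generalization of radar perception models across sensors.	foundation model
23	DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning	Proposes DiffVC, a diffusion-based non-autoregressive framework for video captioning.	multimodal
24	AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding	AdaSpark: an adaptive sparsity framework for efficient long-video understanding.	large language model
25	Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments	Proposes the FI3Det framework, which uses vision-language models for few-shot incremental 3D object detection in dynamic indoor environments.	multimodal	✅
26	PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation	PanoSAM2: lightweight distortion- and memory-aware adaptations of SAM2 for 360 video object segmentation.	embodied AI
27	RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs	Proposes RemoteAgent, which fine-tunes an agentic MLLM with reinforcement learning to address vague-intent understanding in remote sensing.	large language model
28	Unified Multimodal Uncertain Inference	Proposes UMUI, a unified multimodal uncertain inference framework that addresses the challenge of calibrated probabilistic reasoning across modalities.	multimodal
29	MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments	MARINER: a 3E-driven benchmark for fine-grained perception and complex reasoning in open-water environments.	large language model multimodal	✅
30	3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding	Proposes 3D-VCD, which mitigates hallucination in 3D embodied agents through visual contrastive decoding.	multimodal
31	Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring	Proposes LeanGate, which accelerates Transformer-based monocular SLAM via geometric utility scoring.	foundation model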

🔬 Pillar 2: RL Algorithms & Architecture (21 papers)

#	Title	One-sentence summary	Tags	🔗
32	ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning	Proposes ABMamba, an efficient multimodal large language model for video captioning built on Mamba with aligned hierarchical bidirectional scan.	Mamba state space model large language model
33	Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization	Proposes Faithful GRPO, which improves visual spatial reasoning in multimodal language models via constrained policy optimization.	reinforcement learning spatial relationship multimodal
34	MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models	MotionScape: a real-world, highly dynamic UAV video dataset for world models.	world model world models visual SLAM	✅
35	OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks	Proposes G$^2$RPO to address reward imbalance in multimodal visual tasks.	reinforcement learning large language model multimodal
36	TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning	ToolCAD: proposes a reinforcement-learning-based tool-using large language model for text-to-CAD generation.	reinforcement learning large language model chain-of-thought
37	OceanMAE: A Foundation Model for Ocean Remote Sensing	Proposes OceanMAE, an ocean remote sensing foundation model that incorporates physical information and improves performance on ocean tasks.	masked autoencoder MAE foundation model
38	Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models	Proposes the HDPO framework, which improves the meta-cognitive ability and efficiency of agentic multimodal models in tool use.	reinforcement learning multimodal
39	Self-Improving 4D Perception via Self-Distillation	Proposes SelfEvo, a self-distillation framework that continually improves multi-view 4D reconstruction models without annotations.	distillation depth estimation VGGT	✅
40	Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models	Orion-Lite: transfers LLM reasoning ability to efficient vision-only driving models through knowledge distillation.	distillation vision-language-action VLA
41	DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics	DailyArt: discovers articulation from a single static image via latent dynamics.	world model world models latent dynamics	✅
42	Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning	Proposes the TG-DP framework, which improves audio-visual representation learning by decoupling the optimization paths for reconstruction and alignment.	representation learning visual pre-training multimodal	✅
43	Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data	Proposes Fundus-R1, a fundus-reading multimodal large language model with knowledge-aware reasoning, trained on public data.	reinforcement learning large language model multimodal
44	Small Vision-Language Models are Smart Compressors for Long Video Understanding	Proposes Tempo, which uses small vision-language models to compress long videos efficiently and significantly improves long-video understanding.	distillation large language model multimodal
45	HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment	Proposes HST-HGN to address driver fatigue assessment.	Mamba state space model
46	EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization	EditCaption: human-aligned instruction synthesis for image editing via supervised fine-tuning and direct preference optimization.	DPO direct preference optimization
47	Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology	Proposes a one-class representation learning method for detecting rare malignant cells, addressing extreme class imbalance in computational cytology.	representation learning contrastive learning
48	Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator	Uni-ViGU: a unified video generation and understanding framework built on a diffusion-based video generator.	flow matching multimodal	✅
49	Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection	Proposes MDDCNet, which combines deformable convolutions with Mamba to improve multi-scale traffic object detection accuracy.	Mamba	✅
50	LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving	LMGenDrive: end-to-end driving that combines multimodal understanding with generative world modeling.	world model world models multimodal
51	Needle in a Haystack: One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology	Proposes a one-class representation learning solution for detecting rare malignant cells in computational cytology.	representation learning contrastive learning
52	InstrAct: Towards Action-Centric Understanding in Instructional Videos	InstrAct: proposes a pretraining framework for action-centric understanding of instructional videos.	contrastive learning foundation model

🔬 Pillar 3: Spatial Perception & Semantics (19 papers)

#	Title	One-sentence summary	Tags	🔗
53	Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification	Reveals the causes and mechanisms of performance degradation of medical multimodal large language models in image classification.	semantic mapping semantic map large language model
54	Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting	Proposes a method for atmospheric downscaling and arbitrary-resolution forecasting based on generative 3D Gaussian splatting and a scale-aware Transformer.	3D gaussian splatting gaussian splatting splatting	✅
55	DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction	DP-DeGauss: dynamic probabilistic Gaussian decomposition for egocentric 4D scene reconstruction.	scene reconstruction scene understanding egocentric
56	OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance	Proposes OVS-DINO to address boundary awareness in open-vocabulary segmentation.	open-vocabulary open vocabulary foundation model
57	Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation	Proposes a training-free direct segmentation method for open-vocabulary semantic segmentation that requires no logits optimization.	semantic map open-vocabulary open vocabulary
58	GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting	GEAR: a geometry-motion alternating refinement framework based on Gaussian splatting for articulated object modeling.	gaussian splatting splatting
59	OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation	OV-Stitcher: proposes a global context-aware, training-free framework for open-vocabulary semantic segmentation.	open-vocabulary open vocabulary
60	Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation	Proposes DcSConv, a self-supervised monocular depth estimation framework that addresses object scale ambiguity caused by depth variation.	depth estimation monocular depth
61	Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach	Proposes a monocular depth estimation method based on diffusion-enhanced depth restoration, improving feature representation.	depth estimation monocular depth	✅
62	SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction	SurfelSplat: learns efficient and generalizable Gaussian surfel representations for sparse-view surface reconstruction.	3D gaussian splatting 3DGS gaussian splatting
63	CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning	Proposes CrashSight, an infrastructure-centric video benchmark for traffic crash scene understanding.	scene understanding visual grounding	✅
64	LINE: LLM-based Iterative Neuron Explanations for Vision Models	LINE: an LLM-based iterative neuron explanation method for analyzing vision models.	open-vocabulary open vocabulary large language model
65	SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations	SceneScribe-1M: a large-scale video dataset with geometric and semantic annotations that promotes the integration of 3D perception and video generation.	depth estimation monocular depth scene reconstruction
66	ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video	ReconPhys: proposes a fast feed-forward framework that reconstructs appearance and physical attributes from monocular video.	3D gaussian splatting gaussian splatting splatting
67	Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction	Scal3R: a scalable test-time training method for large-scale 3D reconstruction.	scene reconstruction	✅
68	InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding	Proposes InstAP, which improves the spatial-temporal understanding of vision-language models through instance-aware pretraining.	scene understanding
69	Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting	Proposes a weather forecasting framework based on generative 3D Gaussians to address high-resolution output.	3D gaussian splatting gaussian splatting splatting	✅
70	OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation	OV-Stitcher: proposes a global context-aware, training-free framework for open-vocabulary semantic segmentation.	open-vocabulary open vocabulary
71	CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning	Proposes CrashSight to address traffic crash scene understanding.	scene understanding visual grounding	✅

🔬 Pillar 6: Video Extraction & Matching (4 papers)

#	Title	One-sentence summary	Tags	🔗
72	UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding	Proposes UniversalVTG, a lightweight universal foundation model for video temporal grounding.	Ego4D foundation model multimodal
73	E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation	Proposes E-3DPSM, which improves the accuracy and stability of monocular egocentric 3D human pose estimation with event cameras.	egocentric human motion
74	ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets	ETCH-X: improves the robustness and expressiveness of body fitting to clothed humans with composable datasets.	SMPL SMPL-X	✅
75	GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics	Proposes Skelebones to address animation control of non-rigid surfaces.	motion matching

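🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-sentence summary	Tags
76	Coordinate-Based Dual-Constrained Autoregressive Motion Generation	Proposes CDAMD, a coordinate-based, dual-constrained autoregressive motion generation framework that improves text-to-motion generation quality.	text-to-motion motion synthesis motion generation
77	Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics	Phantom: physics-infused video generation via joint modeling of visual and latent physical dynamics.	physically plausible
78	Guiding a Diffusion Model by Swapping Its Tokens	Proposes Self-Swap Guidance, which guides a diffusion model by swapping its tokens, improving image quality and prompt alignment.	classifier-free guidance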

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗
79	BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields	Proposes BLaDA to address the coupling of semantics and pose in functional dexterous grasping.	manipulation dexterous manipulation 3D gaussian splatting	✅
80	LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation	LAMP: uses image editing as general 3D priors for open-world manipulation.	manipulation reinforcement learning imitation learning	✅
81	Visually-grounded Humanoid Agents	Proposes a visually grounded humanoid agent framework that enables autonomous behavior in 3D scenes.	humanoid embodied AI

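🔬 Pillar 8: Physics-based Animation (3 papers)

#	Title	One-sentence summary	Tags
82	LPM 1.0: Video-based Character Performance Model	LPM 1.0: proposes a video-based character performance model that addresses the trilemma of high expressiveness, real-time inference, and identity stability.	interactive character multimodal
83	ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks	ImVideoEdit: image-learning video editing via 2D spatial difference attention blocks.	spatiotemporal
84	Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation	Stitch4D: sparse multi-location 4D urban reconstruction via spatio-temporal interpolation.	spatiotemporal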

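🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
85	3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience	3DrawAgent: teaches an LLM to draw in 3D space using contrastive experience.	spatial relationship large language model
86	EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition	Proposes the EPIR framework, which improves micro-expression recognition performance and reduces computational complexity through efficient patch tokenization, integration, and representation learning.	spatial relationship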

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
87	PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction	Proposes PolySLGen for online multimodal speaking-listening reaction generation in polyadic interaction.	dyadic interaction embodied AI multimodal
