cs.CV (2026-04-09)

📊 87 papers in total | 🔗 25 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (31 🔗10) · Pillar 2: RL & Architecture (21 🔗6) · Pillar 3: Perception & Semantics (19 🔗6) · Pillar 6: Video Extraction (4 🔗1) · Pillar 4: Generative Motion (3) · Pillar 1: Robot Control (3 🔗2) · Pillar 8: Physics-based Animation (3) · Pillar 7: Motion Retargeting (2) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (31 papers)

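# | Title | One-line summary | Tags | 🔗
1 | Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning | Brain3D: decodes EEG signals into 3D visual representations via multimodal reasoning. | large language model, multimodal
2 | EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience | Proposes the EEG2Vision framework, which achieves high-quality visual reconstruction from low-density EEG signals and improves the application potential of brain-computer interfaces. | large language model, multimodal
3 | HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models | HAWK: head importance-aware visual token pruning for multimodal models. | large language model, multimodal
4 | Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts | Reveals the "seeing but not thinking" phenomenon in multimodal MoE models and proposes a routing-guided intervention to improve visual reasoning. | multimodal
5 | SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation | Proposes SyncBreaker, a multimodal adversarial attack framework targeting audio-driven talking head generation. | multimodal
6 | DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection | Proposes DBMF, a dual-branch multimodal framework that improves OOD detection on medical images. | multimodal
7 | $\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization | Proposes SCADA-UL, which uses adversarial optimization to address leakage of source-domain information in source-free domain adaptation. | zero-shot transfer
8 | Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment | Proposes PaveGPT, which achieves comprehensive automated pavement condition assessment through domain-specific instruction tuning. | foundation model
9 | DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather | DinoRADE: full-spectral radar-camera fusion with vision foundation model features for multi-class object detection in adverse weather. | foundation model
10 | Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images | Uses a pretrained DINOv3 for annotation-efficient adnexal mass segmentation in cine images. | foundation model
11 | Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models | Proposes LogitProd, a plug-and-play logit fusion method for pathology foundation models that improves downstream task performance. | foundation model
12 | Weight Group-wise Post-Training Quantization for Medical Foundation Model | A weight group-wise post-training quantization method for medical foundation models that speeds up inference on end devices. | foundation model
13 | AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation | AVGen-Bench: a task-driven benchmark for multi-granular evaluation of text-to-audio-video generation. | large language model, multimodal
14 | What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric | Proposes a vision-language-model-based framework for evaluating semantic scanpath similarity, covering the semantic information that traditional methods ignore. | foundation model, multimodal
15 | SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection | SciFigDetect: the first benchmark for detecting AI-generated scientific figures, revealing the shortcomings of existing detectors on scientific images. | multimodal, zero-shot transfer
16 | Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding | Proposes Bridge-STG, which decouples spatio-temporal alignment to improve multimodal large language models on video grounding. | large language model, multimodal
17 | Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation | Proposes Tarot-SAM3, a training-free SAM3 framework for segmenting any referring expression. | large language model, multimodal
18 | AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models | AgriChain: interpretable agricultural vision-language models based on visually grounded, expert-verified reasoning. | multimodal, chain-of-thought
19 | ParseBench: A Document Parsing Benchmark for AI Agents | Proposes ParseBench to address semantic correctness in document parsing. | visual grounding
20 | Phantasia: Context-Adaptive Backdoors in Vision Language Models | Proposes Phantasia, a context-adaptive backdoor attack on vision-language models. | multimodal
21 | PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models | PokeGym: a visually driven benchmark for evaluating vision-language models on long-horizon tasks. | visual grounding
22 | Revisiting Radar Perception With Spectral Point Clouds | Proposes spectral point clouds to improve the generalization of radar perception models across sensors. | foundation model
23 | DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning | Proposes DiffVC, a non-autoregressive video captioning framework based on diffusion models. | multimodal
24 | AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding | AdaSpark: an adaptive sparsity framework for efficient long-video understanding. | large language model
25 | Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments | Proposes the FI3Det framework, which uses vision-language models for few-shot incremental 3D object detection in dynamic indoor environments. | multimodal
26 | PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation | PanoSAM2: lightweight distortion- and memory-aware adaptations of SAM2 for 360 video object segmentation. | embodied AI
27 | RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs | Proposes RemoteAgent, which fine-tunes agentic MLLMs with reinforcement learning to handle vague user intents in remote sensing. | large language model
28 | Unified Multimodal Uncertain Inference | Proposes UMUI, a unified multimodal uncertain inference framework, to tackle probabilistically calibrated inference across modalities. | multimodal
29 | MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments | MARINER: a 3E-driven benchmark for fine-grained perception and complex reasoning in open-water environments. | large language model, multimodal
30 | 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding | Proposes 3D-VCD, which mitigates hallucination in 3D embodied agents via visual contrastive decoding. | multimodal
31 | Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring | Proposes LeanGate, which accelerates Transformer-based monocular SLAM via geometric utility scoring. | foundation model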

🔬 Pillar 2: RL & Architecture (21 papers)

# | Title | One-line summary | Tags | 🔗
32 | ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning | Proposes ABMamba, an efficient multimodal large language model for video captioning based on Mamba with aligned hierarchical bidirectional scan. | Mamba, state space model, large language model
33 | Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization | Proposes Faithful GRPO, which improves visual spatial reasoning in multimodal language models via constrained policy optimization. | reinforcement learning, spatial relationship, multimodal
34 | MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models | MotionScape: a real-world, highly dynamic UAV video dataset for world models. | world model, world models, visual SLAM
35 | OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks | Proposes G$^2$RPO to address reward imbalance across multimodal visual tasks. | reinforcement learning, large language model, multimodal
36 | TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning | ToolCAD: proposes a reinforcement-learning-based tool-using large language model for text-to-CAD generation. | reinforcement learning, large language model, chain-of-thought
37 | OceanMAE: A Foundation Model for Ocean Remote Sensing | Proposes OceanMAE, a physics-informed foundation model for ocean remote sensing that improves performance on ocean tasks. | masked autoencoder, MAE, foundation model
38 | Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models | Proposes the HDPO framework, which improves the meta-cognitive ability and efficiency of tool use in agentic multimodal models. | reinforcement learning, multimodal
39 | Self-Improving 4D Perception via Self-Distillation | Proposes SelfEvo, a self-distillation framework that continually improves multi-view 4D reconstruction models without annotations. | distillation, depth estimation, VGGT
40 | Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models | Orion-Lite: uses knowledge distillation to transfer LLM reasoning into efficient vision-only autonomous driving models. | distillation, vision-language-action, VLA
41 | DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics | DailyArt: discovers articulation from a single static image via latent dynamics. | world model, world models, latent dynamics
42 | Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning | Proposes the TG-DP framework, which improves audio-visual representation learning by decoupling the reconstruction and alignment optimization paths. | representation learning, visual pre-training, multimodal
43 | Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data | Proposes Fundus-R1, a fundus-image multimodal large language model with knowledge-aware reasoning, trained on public data. | reinforcement learning, large language model, multimodal
44 | Small Vision-Language Models are Smart Compressors for Long Video Understanding | Proposes Tempo, which uses small vision-language models to compress long videos efficiently and significantly improves long-video understanding. | distillation, large language model, multimodal
45 | HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment | Proposes HST-HGN to address driver fatigue assessment. | Mamba, state space model
46 | EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization | EditCaption: human-aligned instruction synthesis for image editing via supervised fine-tuning and direct preference optimization. | DPO, direct preference optimization
47 | Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology | Proposes a one-class representation learning method for detecting rare malignant cells, addressing extreme class imbalance in computational cytology. | representation learning, contrastive learning
48 | Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator | Uni-ViGU: a diffusion-based unified framework for video generation and understanding. | flow matching, multimodal
49 | Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection | Proposes MDDCNet, which combines deformable convolutions with Mamba to improve multi-scale traffic object detection accuracy. | Mamba
50 | LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving | LMGenDrive: combines multimodal understanding and generative world modeling for end-to-end autonomous driving. | world model, world models, multimodal
51 | Needle in a Haystack: One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology | Proposes a one-class representation learning solution for detecting rare malignant cells in computational cytology. | representation learning, contrastive learning
52 | InstrAct: Towards Action-Centric Understanding in Instructional Videos | InstrAct: proposes a pretraining framework for action-centric understanding of instructional videos. | contrastive learning, foundation model

🔬 Pillar 3: Perception & Semantics (19 papers)

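# | Title | One-line summary | Tags | 🔗
53 | Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification | Reveals the causes and mechanisms of performance degradation of medical multimodal large language models in image classification. | semantic mapping, semantic map, large language model
54 | Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting | Proposes an atmospheric downscaling and arbitrary-resolution forecasting method based on generative 3D Gaussian splatting and a scale-aware Transformer. | 3D gaussian splatting, gaussian splatting, splatting
55 | DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction | DP-DeGauss: dynamic probabilistic Gaussian decomposition for egocentric 4D scene reconstruction. | scene reconstruction, scene understanding, egocentric
56 | OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance | Proposes OVS-DINO to address boundary awareness in open-vocabulary segmentation. | open-vocabulary, open vocabulary, foundation model
57 | Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation | Proposes a training-free direct segmentation method for open-vocabulary semantic segmentation that requires no logits optimization. | semantic map, open-vocabulary, open vocabulary
58 | GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting | GEAR: a Gaussian-splatting-based geometry-motion alternating refinement framework for articulated object modeling. | gaussian splatting, splatting
59 | OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation | OV-Stitcher: proposes a global context-aware, training-free framework for open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary
60 | Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation | Proposes DcSConv, a self-supervised monocular depth estimation framework that resolves the object-scale ambiguity caused by depth variation. | depth estimation, monocular depth
61 | Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach | Proposes a monocular depth estimation method based on diffusion-enhanced depth restoration that strengthens feature representation. | depth estimation, monocular depth
62 | SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction | SurfelSplat: learns efficient and generalizable Gaussian surfel representations for sparse-view surface reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting
63 | CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning | Proposes CrashSight, an infrastructure-centric video benchmark for traffic crash scene understanding. | scene understanding, visual grounding
64 | LINE: LLM-based Iterative Neuron Explanations for Vision Models | LINE: a vision model analysis method based on LLM-driven iterative neuron explanations. | open-vocabulary, open vocabulary, large language model
65 | SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations | SceneScribe-1M: a large-scale video dataset with geometric and semantic annotations that promotes the integration of 3D perception and video generation. | depth estimation, monocular depth, scene reconstruction
66 | ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video | ReconPhys: proposes a fast feed-forward framework that reconstructs appearance and physical attributes from monocular video. | 3D gaussian splatting, gaussian splatting, splatting
67 | Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction | Scal3R: a scalable test-time training method for large-scale 3D reconstruction. | scene reconstruction
68 | InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding | Proposes InstAP, which improves the spatial-temporal understanding of vision-language models through instance-aware pre-training. | scene understanding
69 | Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting | Proposes a weather forecasting framework based on generative 3D Gaussian models to address high-resolution output. | 3D gaussian splatting, gaussian splatting, splatting
70 | OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation | OV-Stitcher: proposes a global context-aware, training-free framework for open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary
71 | CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning | Proposes CrashSight to address traffic accident scene understanding. | scene understanding, visual grounding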

🔬 Pillar 6: Video Extraction (4 papers)

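# | Title | One-line summary | Tags | 🔗
72 | UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding | Proposes UniversalVTG, a lightweight universal foundation model for video temporal grounding. | Ego4D, foundation model, multimodal
73 | E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation | Proposes E-3DPSM, which improves the accuracy and stability of monocular egocentric 3D human pose estimation with event cameras. | egocentric, human motion
74 | ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets | ETCH-X: uses composable datasets to improve the robustness and expressiveness of clothed-human body models. | SMPL, SMPL-X
75 | GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics | Proposes Skelebones to address animation control of non-rigid surfaces. | motion matching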

🔬 Pillar 4: Generative Motion (3 papers)

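# | Title | One-line summary | Tags | 🔗
76 | Coordinate-Based Dual-Constrained Autoregressive Motion Generation | Proposes CDAMD, a coordinate-based, dual-constrained autoregressive motion generation framework that improves text-to-motion generation quality. | text-to-motion, motion synthesis, motion generation
77 | Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics | Phantom: physics-infused video generation via joint modeling of visual and latent physical dynamics. | physically plausible
78 | Guiding a Diffusion Model by Swapping Its Tokens | Proposes Self-Swap Guidance, which guides diffusion models by swapping tokens to improve image quality and prompt alignment. | classifier-free guidance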

🔬 Pillar 1: Robot Control (3 papers)

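# | Title | One-line summary | Tags | 🔗
79 | BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields | Proposes BLaDA to address the coupling of semantics and pose in functional dexterous grasping. | manipulation, dexterous manipulation, 3D gaussian splatting
80 | LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation | LAMP: uses image editing as a general 3D prior for open-world manipulation. | manipulation, reinforcement learning, imitation learning
81 | Visually-grounded Humanoid Agents | Proposes a visually grounded humanoid agent framework that enables autonomous behavior in 3D scenes. | humanoid, embodied AI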

🔬 Pillar 8: Physics-based Animation (3 papers)

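# | Title | One-line summary | Tags | 🔗
82 | LPM 1.0: Video-based Character Performance Model | LPM 1.0: proposes a video-based character performance model that addresses the trilemma of high expressiveness, real-time inference, and identity stability. | interactive character, multimodal
83 | ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks | ImVideoEdit: image-learning-based video editing via 2D spatial difference attention blocks. | spatiotemporal
84 | Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation | Stitch4D: sparse multi-location 4D urban reconstruction via spatio-temporal interpolation. | spatiotemporal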

🔬 Pillar 7: Motion Retargeting (2 papers)

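# | Title | One-line summary | Tags | 🔗
85 | 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience | 3DrawAgent: teaches LLMs to draw in 3D space using contrastive experience. | spatial relationship, large language model
86 | EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition | Proposes the EPIR framework, which improves micro-expression recognition and reduces computational complexity through efficient tokenization, integration, and representation learning. | spatial relationship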

🔬 Pillar 5: Interaction & Reaction (1 paper)

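# | Title | One-line summary | Tags | 🔗
87 | PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction | Proposes PolySLGen for online multimodal speaking-listening reaction generation in multi-party interaction. | dyadic interaction, embodied AI, multimodal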
