cs.CV (2026-03-02)

📊 42 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗3) · Pillar 2: RL & Architecture (12 🔗3) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 1: Robot Control (5) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

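# | Title | One-sentence summary | Tags | 🔗
1 | PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification | Proposes PathMoE to address multimodal information integration in pediatric brain tumor classification. | foundation model multimodal
2 | ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models | ATA: bridges implicit reasoning through attention-guided and action-guided inference for vision-language-action models. | vision-language-action VLA visual grounding
3 | Unifying Language-Action Understanding and Generation for Autonomous Driving | LinkVLA: unifies language-action understanding and generation, improving instruction-following performance and efficiency in autonomous driving. | vision-language-action VLA instruction following
4 | Adaptive Confidence Regularization for Multimodal Failure Detection | Proposes the Adaptive Confidence Regularization (ACR) framework for failure detection in multimodal models. | multimodal
5 | Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction | Proposes DIMAFx, an explainable disentangled multimodal framework for cancer survival prediction. | multimodal
6 | NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis | Proposes NICO-RAG, which uses multimodal hypergraph retrieval-augmented generation to help understand the nicotine public health crisis. | multimodal
7 | Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications | Cryo-Bench: a benchmark for evaluating geospatial foundation models on cryosphere applications. | foundation model
8 | VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models | VidDoS: a universal denial-of-service attack on video large language models. | large language model
9 | DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving | Proposes the DriveCombo benchmark to evaluate the compositional traffic rule reasoning of multimodal large models in autonomous driving. | large language model multimodal
10 | InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning | Proposes InterCoG, which achieves spatially precise image editing through interleaved chain-of-grounding reasoning. | multimodal visual grounding
11 | Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory | Proposes SDAM, a training-free spatio-temporal decoupled reasoning video segmentation method that improves segmentation stability. | large language model multimodal
12 | Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance | Kiwi-Edit: versatile video editing via instruction and reference guidance. | instruction following
13 | From Pixels to Patches: Pooling Strategies for Earth Embeddings | Proposes better pooling strategies for pixel-level Earth observation embeddings, improving geographic generalization. | foundation model
14 | MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation | MealRec: multi-granularity sequential modeling with hierarchical diffusion models for micro-video recommendation. | multimodal
15 | Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation | Proposes an efficient test-time depth completion method based on low-rank decoder adaptation. | foundation model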

🔬 Pillar 2: RL & Architecture (12 papers)

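# | Title | One-sentence summary | Tags | 🔗
16 | LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving | Proposes LaST-VLA, which uses latent spatio-temporal reasoning to address the semantic decoupling problem of vision-language-action models in autonomous driving. | reinforcement learning world model vision-language-action
17 | From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents | Proposes MM-Mem, which distills pyramidal multimodal memory via a semantic information bottleneck for long-horizon video agents. | distillation large language model multimodal
18 | Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation | Sketch2Colab: sketch-driven multi-human animation generation via controllable flow distillation. | distillation physically plausible human motion
19 | Generative Visual Chain-of-Thought for Image Editing | Proposes the Generative Visual Chain-of-Thought (GVCoT) framework for understanding fine-grained spatial instructions in complex image-editing scenes. | reinforcement learning chain-of-thought
20 | LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation | LiftAvatar: expression-controlled 3D Gaussian avatar animation via kinematic-space completion. | distillation 3D gaussian splatting gaussian splatting
21 | WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories | WorldStereo: bridges camera-guided video generation and scene reconstruction via 3D geometric memories. | world model scene reconstruction
22 | Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration | Proposes DATPRL-IR for multi-domain all-in-one image restoration, improving generalization. | representation learning large language model multimodal
23 | Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference | Proposes Preference Score Distillation (PSD), which uses 2D reward models to align text-to-3D generation with human preference. | distillation classifier-free guidance
24 | Towards Principled Dataset Distillation: A Spectral Distribution Perspective | Proposes Class-Aware Spectral Distribution Matching (CSDM) to address the performance degradation of dataset distillation on long-tailed datasets. | distillation
25 | Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning | Proposes Cross-modal Identity Mapping (CIM), which uses reinforcement learning to minimize information loss in modality conversion and improve image captioning quality. | reinforcement learning
26 | MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention | MixerCSeg: an efficient mixer architecture for crack segmentation via decoupled Mamba attention. | Mamba
27 | CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions | CoopDiff: a diffusion-based cooperative perception framework that improves robustness under multiple corruption conditions. | teacher-student scene understanding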

🔬 Pillar 3: Perception & Semantics (8 papers)

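# | Title | One-sentence summary | Tags | 🔗
28 | Sparse View Distractor-Free Gaussian Splatting | Proposes a prior-based, distractor-free Gaussian splatting method for sparse views. | 3D gaussian splatting 3DGS gaussian splatting
29 | Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera | Proposes Stereo-Inertial Poser, which uses a stereo camera and sparse IMUs for high-accuracy, shape-aware motion capture. | monocular depth foot skating human motion
30 | WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments | WildCross: a cross-modal, large-scale benchmark for place recognition and metric depth estimation in natural environments. | depth estimation metric depth scene understanding
31 | OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution | OnlineX: proposes active-to-stable state evolution for unified online 3D reconstruction and understanding. | 3D gaussian splatting 3DGS gaussian splatting
32 | SimRecon: SimReady Compositional Scene Reconstruction from Real Videos | SimRecon: proposes a method for reconstructing simulation-ready compositional scenes from real videos. | scene reconstruction
33 | PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts | PromptStereo: zero-shot stereo matching via structure and motion prompts. | monocular depth foundation model
34 | Radiometrically Consistent Gaussian Surfels for Inverse Rendering | Proposes RadioGS, an inverse rendering method based on radiometrically consistent Gaussian surfels, to address the difficulty of modeling indirect illumination. | gaussian splatting splatting
35 | TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding | TopoMaskV3: a 3D mask head for road topology understanding that uses dense offset and height predictions, significantly improving performance. | height map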

🔬 Pillar 1: Robot Control (5 papers)

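# | Title | One-sentence summary | Tags | 🔗
36 | MVR: Multi-view Video Reward Shaping for Reinforcement Learning | Proposes the Multi-view Video Reward Shaping (MVR) framework to improve reinforcement learning performance on complex motion tasks. | humanoid humanoid locomotion locomotion
37 | Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection | Proposes the REFORM framework, which improves the generalization of multimodal manipulation detection by modeling the reasoning process. | manipulation reinforcement learning multimodal
38 | Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation | Pri4R: uses privileged 4D representations to learn world dynamics, improving the manipulation performance of vision-language-action models. | manipulation spatiotemporal vision-language-action
39 | ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks | Proposes ORGAN, based on cycle-consistent GANs, for unsupervised object-centric representation learning, particularly on complex real-world scenes. | manipulation representation learning
40 | DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis | Proposes DOCFORGE-BENCH to address the evaluation of document forgery detection. | manipulation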

🔬 Pillar 8: Physics-based Animation (1 paper)

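# | Title | One-sentence summary | Tags | 🔗
41 | Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models | Proposes a token reduction method based on local and global context optimization to improve the efficiency of video large language models. | spatiotemporal large language model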

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags | 🔗
42 | LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization | Proposes the LEAR framework, which uses event cameras for edge-aware LiDAR localization. | motion representation
