cs.CV (2026-03-03)

📊 47 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (18 🔗4) · Pillar 9: Embodied Foundation Models (14 🔗4) · Pillar 3: Perception & Semantics (9 🔗2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction (1)

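🔬 Pillar 2: RL & Architecture (18 papers)

# | Title | One-sentence summary | Tags | 🔗
1 | TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval | Proposes TRACE, which achieves universal multimodal retrieval through task-adaptive reasoning and representation learning. | representation learning, large language model, multimodal
2 | Chain of World: World Model Thinking in Latent Motion | Proposes the Chain-of-World VLA model to address visual dynamics prediction and temporal causal modeling in embodied intelligence. | world model, latent dynamics, motion latent
3 | VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning | Proposes VSearcher, which uses reinforcement learning to equip multimodal models with long-horizon, multi-turn web search capability. | reinforcement learning, large language model, multimodal
4 | MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization | Proposes MoD-DPO, which mitigates cross-modal hallucinations in omni-modal LLMs through modality-decoupled preference optimization. | DPO, direct preference optimization, large language model
5 | Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation | Proposes Generalizable Knowledge Distillation (GKD) to improve the cross-domain generalization of semantic segmentation models. | representation learning, distillation, foundation model
6 | Beyond Language Modeling: An Exploration of Multimodal Pretraining | Explores multimodal pretraining beyond language modeling to achieve synergy between vision and language. | world model, foundation model, multimodal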
7 | Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting | MVD-HuGaS: single-image 3D human reconstruction based on multi-view diffusion models and Gaussian splatting. | distillation, gaussian splatting, splatting
8 | Kling-MotionControl Technical Report | Kling-MotionControl: a unified DiT-based framework for robust, precise, and expressive character animation. | distillation, motion retargeting, motion representation
9 | Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective | Proposes IB-IUMAD to address catastrophic forgetting in incremental unified multimodal anomaly detection. | Mamba, multimodal
10 | SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data | Proposes the SGMA framework to address modality imbalance in semantic segmentation of incomplete multimodal remote sensing data. | contrastive learning, multimodal
11 | Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing | Proposes RL3DEdit, which achieves multi-view consistent 3D scene editing through geometry-guided reinforcement learning. | reinforcement learning, VGGT, foundation model
12 | CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration | Proposes CAWM-Mamba, a unified model for infrared-visible image fusion and compound adverse weather restoration. | Mamba, SSM, multimodal
13 | Specificity-aware reinforcement learning for fine-grained open-world classification | Proposes SpeciaRL to address overly generic LMM predictions in open-world fine-grained classification. | reinforcement learning, multimodal
14 | From "What" to "How": Constrained Reasoning for Autoregressive Image Generation | Proposes CoR-Painter, which guides autoregressive image generation with constrained reasoning to resolve spatial ambiguity. | reinforcement learning, spatial relationship, chain-of-thought
15 | ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling | ShareVerse: a multi-agent consistent video generation framework for shared world modeling. | world model, geometric consistency
16 | DREAM: Where Visual Understanding Meets Text-to-Image Generation | DREAM: a unified framework that combines visual understanding with text-to-image generation. | representation learning, depth estimation, multimodal
17 | ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion | ITO: unified image-text representations through synergized multiple alignment and training-time fusion. | representation learning, contrastive learning, multimodal
18 | NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining | NeighborMAE: masked autoencoder pretraining that exploits spatial dependencies between neighboring remote sensing images. | masked autoencoder

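🔬 Pillar 9: Embodied Foundation Models (14 papers)

# | Title | One-sentence summary | Tags | 🔗
19 | Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing | Proposes RADAR, a training-free method that mitigates hallucinations of multimodal LLMs in remote sensing scenarios. | large language model, multimodal, visual grounding
20 | LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval | LLandMark: a multi-agent framework for landmark-aware multimodal interactive video retrieval. | large language model, multimodal
21 | UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? | UniG2U-Bench: evaluates how much generation capability improves understanding capability in unified models for multimodal understanding. | multimodal
22 | BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology | BRIGHT: a collaborative generalist-specialist foundation model for breast pathology. | foundation model
23 | Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention | Proposes an anomaly detection method based on foundation-model synthesis and wavelet-domain attention to improve industrial anomaly detection performance. | foundation model
24 | GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights | GloPath: an entity-centric foundation model for glomerular lesion assessment and clinicopathological insights. | foundation model
25 | Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models | Proposes Think-as-You-See to address reasoning over streaming video. | chain-of-thought
26 | iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding | iGVLM: dynamic instruction-guided vision encoding for question-aware multimodal understanding. | multimodal
27 | On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding | For action understanding, proposes a Generation-Assisted Discriminative (GAD) classifier that improves the performance and efficiency of multimodal large language models. | large language model, multimodal
28 | TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation | TagaVLM: proposes a topology-aware global action reasoning framework that improves vision-language navigation performance. | VLN
29 | MIBURI: Towards Expressive Interactive Gesture Synthesis | MIBURI: proposes an online causal framework for generating expressive interactive gestures. | large language model
30 | LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory | LoGeR: long-sequence video geometric reconstruction using a hybrid memory module. | foundation model
31 | 3D-DRES: Detailed 3D Referring Expression Segmentation | Proposes the 3D-DRES task and the DetailRefer dataset for fine-grained 3D referring expression segmentation. | visual grounding
32 | Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs | Proposes the VC-STaR framework, which uses visual contrast to improve reasoning in vision-language models. | large language model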

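🔬 Pillar 3: Perception & Semantics (9 papers)

# | Title | One-sentence summary | Tags | 🔗
33 | SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding | SemGS: a general-purpose feed-forward network for semantic 3D Gaussian splatting from sparse views, aimed at generalizable scene understanding. | 3D gaussian splatting, gaussian splatting, splatting
34 | Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis | Proposes multimodal-prior-guided importance sampling for hierarchical Gaussian splatting in sparse-view novel view synthesis. | 3D gaussian splatting, 3DGS, gaussian splatting
35 | HDINO: A Concise and Efficient Open-Vocabulary Detector | Proposes HDINO, a concise and efficient open-vocabulary object detector that requires neither manual annotation nor dense cross-modal feature extraction. | open-vocabulary, open vocabulary
36 | VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats | Proposes VIRGi for fast recoloring of 3D scenes. | 3D gaussian splatting, 3DGS, gaussian splatting
37 | R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild | R3GW: proposes a relightable 3D Gaussian model for reconstructing and rendering real outdoor scenes. | 3D gaussian splatting, 3DGS, gaussian splatting
38 | Any Resolution Any Geometry: From Multi-View To Multi-Patch | Proposes an ultra-high-resolution geometry Transformer for joint monocular high-resolution depth and surface normal estimation. | scene understanding, VGGT
39 | Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels | Track4World: proposes a feed-forward method for pixel-level dense 3D tracking in world coordinates. | scene flow, VGGT
40 | Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement | Proposes the AiM framework, which analyzes the parts of moving articulated objects without prior knowledge through dynamic-static disentanglement. | 3DGS
41 | Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction | NEMF: a neural electromagnetic field method for high-resolution material parameter reconstruction. | NeRF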

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-sentence summary | Tags | 🔗
42 | DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction | DuoMo: a dual motion diffusion model for human motion reconstruction in world coordinates. | motion diffusion, foot skating, human motion
43 | COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design | COP-GEN: a latent diffusion Transformer generative model for Copernicus Earth observation data. | physically plausible, multimodal

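🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-sentence summary | Tags | 🔗
44 | NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing | NOVA: sparse control and dense synthesis for pair-free video editing. | motion reconstruction
45 | Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild | DrPose: improves the naturalness of single-image 3D human reconstruction through direct reward fine-tuning on poses. | human motion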

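🔬 Pillar 1: Robot Control (1 paper)

# | Title | One-sentence summary | Tags | 🔗
46 | Utonia: Toward One Encoder for All Point Clouds | Utonia: a unified Transformer encoder for all point clouds that enables cross-domain knowledge transfer. | manipulation, vision-language-action, foundation model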

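🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-sentence summary | Tags | 🔗
47 | Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation | Proposes Synthetic-Child, which uses AIGC to generate synthetic data and address privacy concerns in child posture estimation. | SMPL, SMPL-X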
