cs.CV (2026-03-03)

📊 47 papers in total | 🔗 12 with code

🎯 Navigation by interest area

Pillar 2: RL Algorithms & Architecture (18 🔗4) · Pillar 9: Embodied Foundation Models (14 🔗4) · Pillar 3: Perception & Semantics (9 🔗2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 2: RL Algorithms & Architecture (18 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval	Proposes TRACE, which achieves universal multimodal retrieval through task-adaptive reasoning and representation learning.	representation learning large language model multimodal
2	Chain of World: World Model Thinking in Latent Motion	Proposes the Chain-of-World VLA model to address visual dynamics prediction and temporal causal modeling in embodied intelligence.	world model latent dynamics motion latent	✅
3	VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning	Proposes VSearcher, which uses reinforcement learning to give multimodal models long-horizon, multi-turn web search capability.	reinforcement learning large language model multimodal
4	MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization	Proposes MoD-DPO, which mitigates cross-modal hallucinations in omni-modal LLMs through modality-decoupled preference optimization.	DPO direct preference optimization large language model
5	Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation	Proposes generalizable knowledge distillation (GKD) to improve the cross-domain generalization of semantic segmentation models.	representation learning distillation foundation model	✅
6	Beyond Language Modeling: An Exploration of Multimodal Pretraining	Explores multimodal pretraining beyond language modeling, aiming for synergy between vision and language.	world model foundation model multimodal
7	Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting	MVD-HuGaS: single-image 3D human reconstruction based on a multi-view diffusion model and Gaussian splatting.	distillation gaussian splatting splatting
8	Kling-MotionControl Technical Report	Kling-MotionControl: a unified DiT-based framework for robust, precise, and expressive character animation.	distillation motion retargeting motion representation
9	Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective	Proposes IB-IUMAD to address catastrophic forgetting in incremental unified multimodal anomaly detection.	Mamba multimodal
10	SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data	Proposes the SGMA framework to address modality imbalance in semantic segmentation of incomplete multimodal remote sensing data.	contrastive learning multimodal
11	Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing	Proposes RL3DEdit, which achieves multi-view consistent 3D scene editing through geometry-guided reinforcement learning.	reinforcement learning VGGT foundation model
12	CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration	Proposes CAWM-Mamba, a unified model for infrared-visible image fusion and compound adverse weather restoration.	Mamba SSM multimodal	✅
13	Specificity-aware reinforcement learning for fine-grained open-world classification	Proposes SpeciaRL to address the prediction-generalization problem of LMMs in fine-grained open-world classification.	reinforcement learning multimodal	✅
14	From "What" to "How": Constrained Reasoning for Autoregressive Image Generation	Proposes CoR-Painter, which guides autoregressive image generation with constrained reasoning to resolve spatial ambiguity.	reinforcement learning spatial relationship chain-of-thought
15	ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling	ShareVerse: a multi-agent consistent video generation framework for shared world modeling.	world model geometric consistency
16	DREAM: Where Visual Understanding Meets Text-to-Image Generation	DREAM: a unified framework that combines visual understanding with text-to-image generation.	representation learning depth estimation multimodal
17	ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion	ITO: unified image-text representations via synergized multiple alignment and training-time fusion.	representation learning contrastive learning multimodal
18	NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining	NeighborMAE: masked autoencoder pretraining that exploits spatial dependencies between neighboring remote sensing images.	masked autoencoder

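🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing	Proposes RADAR, a training-free method that mitigates hallucinations of multimodal LLMs in remote sensing scenes.	large language model multimodal visual grounding	✅
20	LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval	LLandMark: a multi-agent framework for landmark-aware multimodal interactive video retrieval.	large language model multimodal
21	UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?	UniG2U-Bench: evaluates how much generation capability improves understanding capability in unified models for multimodal understanding.	multimodal
22	BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology	BRIGHT: a collaborative generalist-specialist foundation model for breast pathology.	foundation model
23	Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention	Proposes an anomaly detection method based on foundation-model synthesis and wavelet-domain attention, improving industrial anomaly detection performance.	foundation model
24	GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights	GloPath: an entity-centric foundation model for glomerular lesion assessment and clinicopathological insights.	foundation model
25	Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models	Proposes Think-as-You-See to address reasoning over streaming video.	chain-of-thought	✅
26	iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding	iGVLM: dynamic instruction-guided vision encoding for question-aware multimodal understanding.	multimodal
27	On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding	For action understanding, proposes a generation-assisted discriminative classifier (GAD) that improves the performance and efficiency of multimodal large language models.	large language model multimodal
28	TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation	TagaVLM: a topology-aware global action reasoning framework that improves vision-language navigation performance.	VLN	✅
29	MIBURI: Towards Expressive Interactive Gesture Synthesis	MIBURI: an online causal framework for generating expressive interactive gestures.	large language model
30	LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory	LoGeR: geometric reconstruction of long video sequences using a hybrid memory module.	foundation model
31	3D-DRES: Detailed 3D Referring Expression Segmentation	Proposes the 3D-DRES task and the DetailRefer dataset for fine-grained 3D referring expression segmentation.	visual grounding
32	Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs	Proposes the VC-STaR framework, which uses visual contrast to improve reasoning in vision-language models.	large language model	✅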

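🔬 Pillar 3: Perception & Semantics (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding	SemGS: a feed-forward semantic 3D Gaussian splatting network that operates on sparse views, for generalizable scene understanding.	3D gaussian splatting gaussian splatting splatting
34	Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis	Proposes multimodal-prior-guided importance sampling for hierarchical Gaussian splatting in sparse-view novel view synthesis.	3D gaussian splatting 3DGS gaussian splatting
35	HDINO: A Concise and Efficient Open-Vocabulary Detector	Proposes HDINO, a concise and efficient open-vocabulary object detector that requires neither manual annotation nor dense cross-modal feature extraction.	open-vocabulary open vocabulary	✅
36	VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats	Proposes VIRGi for fast recoloring of 3D scenes.	3D gaussian splatting 3DGS gaussian splatting
37	R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild	R3GW: relightable 3D Gaussians for reconstructing and rendering real-world outdoor scenes.	3D gaussian splatting 3DGS gaussian splatting
38	Any Resolution Any Geometry: From Multi-View To Multi-Patch	Proposes an ultra-high-resolution geometry Transformer for joint monocular high-resolution depth and surface normal estimation.	scene understanding VGGT
39	Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels	Track4World: a feed-forward method for dense, pixel-level 3D tracking in world coordinates.	scene flow VGGT
40	Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement	Proposes the AiM framework, which performs prior-free part analysis of moving articulated objects through dynamic-static disentanglement.	3DGS	✅
41	Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction	NEMF: a neural electromagnetic field method for high-resolution material parameter reconstruction.	NeRF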

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
42	DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction	DuoMo: a dual motion diffusion model for human motion reconstruction in world coordinates.	motion diffusion foot skating human motion	✅
43	COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design	COP-GEN: a latent diffusion Transformer generative model for Copernicus Earth observation data.	physically plausible multimodal

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
44	NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing	NOVA: sparse control and dense synthesis for pair-free video editing.	motion reconstruction
45	Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild	DrPose: direct reward fine-tuning on poses to improve the naturalness of single-image 3D human reconstruction.	human motion	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
46	Utonia: Toward One Encoder for All Point Clouds	Utonia: a unified Transformer encoder for all point clouds that enables cross-domain knowledge transfer.	manipulation vision-language-action foundation model

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
47	Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation	Proposes Synthetic-Child, which uses AIGC to generate synthetic data and address privacy concerns in child posture estimation.	SMPL SMPL-X
