cs.CV (2026-03-02)

📊 42 papers in total | 🔗 8 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (15 🔗3) · Pillar 2: RL & Architecture (12 🔗3) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 1: Robot Control (5) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 7: Motion Retargeting (1)

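🔬 Pillar 9: Embodied Foundation Models (15 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification	Proposes PathMoE to address multimodal information integration in pediatric brain tumor classification	foundation model multimodal
2	ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models	ATA: bridges implicit reasoning with attention-guided and action-guided inference for vision-language-action models	vision-language-action VLA visual grounding
3	Unifying Language-Action Understanding and Generation for Autonomous Driving	LinkVLA: unifies language-action understanding and generation to improve instruction-following performance and efficiency in autonomous driving	vision-language-action VLA instruction following
4	Adaptive Confidence Regularization for Multimodal Failure Detection	Proposes the Adaptive Confidence Regularization (ACR) framework for failure detection in multimodal models	multimodal	✅
5	Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction	Proposes the DIMAFx framework for explainable, disentangled multimodal cancer survival prediction	multimodal
6	NICO-RAG: Multimodal Hypergraph Retrieval-Augmented Generation for Understanding the Nicotine Public Health Crisis	Proposes the NICO-RAG framework, which uses multimodal hypergraph retrieval-augmented generation to help understand the nicotine public health crisis	multimodal
7	Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications	Cryo-Bench: a benchmark for evaluating geospatial foundation models on cryosphere applications	foundation model	✅
8	VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models	VidDoS: a universal denial-of-service attack on video large language models	large language model
9	DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving	Proposes the DriveCombo benchmark to evaluate compositional traffic-rule reasoning by multimodal large models in autonomous driving	large language model multimodal
10	InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning	Proposes the InterCoG framework, which achieves spatially precise image editing through interleaved chain-of-grounding reasoning	multimodal visual grounding
11	Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory	Proposes SDAM, a training-free spatio-temporal decoupled reasoning video segmentation method that improves segmentation stability	large language model multimodal
12	Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance	Kiwi-Edit: versatile video editing via instruction and reference guidance	instruction following	✅
13	From Pixels to Patches: Pooling Strategies for Earth Embeddings	Proposes better pooling strategies for pixel-level Earth observation embeddings, improving geographic generalization	foundation model
14	MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation	MealRec: multi-granularity sequential modeling with hierarchical diffusion models for micro-video recommendation	multimodal
15	Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation	Proposes an efficient test-time depth completion method based on low-rank decoder adaptation	foundation model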

🔬 Pillar 2: RL & Architecture (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
16	LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving	Proposes LaST-VLA, which uses latent spatio-temporal reasoning to address semantic decoupling in vision-language-action models for autonomous driving	reinforcement learning world model vision-language-action
17	From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents	Proposes MM-Mem, which distills pyramidal multimodal memory through a semantic information bottleneck for long-horizon video agents	distillation large language model multimodal	✅
18	Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation	Sketch2Colab: sketch-conditioned multi-human animation generation via controllable flow distillation	distillation physically plausible human motion
19	Generative Visual Chain-of-Thought for Image Editing	Proposes the Generative Visual Chain-of-Thought (GVCoT) framework for understanding fine-grained spatial instructions in complex image-editing scenes	reinforcement learning chain-of-thought
20	LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation	LiftAvatar: expression-controlled 3D Gaussian avatar animation via kinematic-space completion	distillation 3D gaussian splatting gaussian splatting
21	WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories	WorldStereo: bridges camera-guided video generation and scene reconstruction via 3D geometric memories	world model scene reconstruction
22	Learning Domain-Aware Task Prompt Representations for Multi-Domain All-in-One Image Restoration	Proposes DATPRL-IR for multi-domain all-in-one image restoration, improving generalization	representation learning large language model multimodal	✅
23	Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference	Proposes Preference Score Distillation (PSD), which uses 2D reward models to align text-to-3D generation with human preference	distillation classifier-free guidance
24	Towards Principled Dataset Distillation: A Spectral Distribution Perspective	Proposes Class-Aware Spectral Distribution Matching (CSDM) to address the performance degradation of dataset distillation on long-tailed datasets	distillation
25	Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning	Proposes Cross-modal Identity Mapping (CIM), which uses reinforcement learning to minimize information loss in modality conversion and improve image captioning quality	reinforcement learning
26	MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention	MixerCSeg: an efficient mixer architecture for crack segmentation via decoupled Mamba attention	Mamba	✅
27	CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions	CoopDiff: a diffusion-based cooperative perception framework that improves robustness under multiple corruption conditions	teacher-student scene understanding

🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
28	Sparse View Distractor-Free Gaussian Splatting	Proposes a prior-based, sparse-view, distractor-free Gaussian splatting method	3D gaussian splatting 3DGS gaussian splatting
29	Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera	Proposes Stereo-Inertial Poser, which uses a stereo camera and sparse IMUs for high-accuracy, shape-aware motion capture	monocular depth foot skating human motion
30	WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments	WildCross: a cross-modal, large-scale benchmark for place recognition and metric depth estimation in natural environments	depth estimation metric depth scene understanding	✅
31	OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution	OnlineX: proposes active-to-stable state evolution for unified online 3D reconstruction and understanding	3D gaussian splatting 3DGS gaussian splatting
32	SimRecon: SimReady Compositional Scene Reconstruction from Real Videos	SimRecon: proposes a method for reconstructing simulation-ready compositional scenes from real videos	scene reconstruction
33	PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts	PromptStereo: zero-shot stereo matching via structure and motion prompts	monocular depth foundation model
34	Radiometrically Consistent Gaussian Surfels for Inverse Rendering	Proposes RadioGS, an inverse rendering method based on radiometrically consistent Gaussian surfels, to address the difficulty of modeling indirect illumination	gaussian splatting splatting
35	TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding	TopoMaskV3: a 3D mask head with dense offset and height predictions for road topology understanding, reported to significantly improve performance	height map

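🔬 Pillar 1: Robot Control (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
36	MVR: Multi-view Video Reward Shaping for Reinforcement Learning	Proposes the Multi-view Video Reward Shaping (MVR) framework to improve reinforcement learning performance on complex motion tasks	humanoid humanoid locomotion locomotion
37	Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection	Proposes the REFORM framework, which models the reasoning process to improve the generalization of multimodal manipulation detection	manipulation reinforcement learning multimodal
38	Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation	Pri4R: uses privileged 4D representations to learn world dynamics, improving the manipulation performance of vision-language-action models	manipulation spatiotemporal vision-language-action
39	ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks	Proposes ORGAN, based on cycle-consistent GANs, for unsupervised object-centric representation learning, performing especially well on complex real-world scenes	manipulation representation learning
40	DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis	Proposes DOCFORGE-BENCH to address the evaluation of document forgery detection	manipulation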

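🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
41	Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models	Proposes a token reduction method based on local and global context optimization to improve the efficiency of video large language models	spatiotemporal large language model	✅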

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
42	LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization	Proposes the LEAR framework, which uses event cameras for edge-aware LiDAR localization	motion representation
