cs.CV (2026-04-08)

📊 37 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (10 🔗4) · Pillar 3: Perception & Semantics (9 🔗3) · Pillar 9: Embodied Foundation Models (9 🔗4) · Pillar 8: Physics-based Animation (4 🔗2) · Pillar 4: Generative Motion (2) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction (1) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 2: RL & Architecture (10 papers)

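# | Title | One-sentence summary | Tags | 🔗
1 | Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models | Q-Zoom: a query-aware adaptive perception framework for efficient multimodal large language models | distillation, large language model, multimodal
2 | Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization | Proposes MAPO, which bridges the gap between reasoning and action in multimodal agents and improves image understanding | reinforcement learning, large language model, multimodal
3 | FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching | FlowInOne: a unified multimodal generation framework that converts all modalities into visual flows for image-in, image-out generation | flow matching, multimodal, instruction following
4 | INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling | INSPATIO-WORLD: a real-time 4D world simulator based on spatiotemporal autoregressive modeling | world model, world models, distillation
5 | BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment | Proposes BRIDGE, which aligns multimodal queries via reinforcement learning to improve cross-modal retrieval over text corpora | reinforcement learning, multimodal
6 | URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection | Proposes URMF, which improves the robustness of multimodal sarcasm detection through uncertainty-aware multimodal fusion | contrastive learning, multimodal
7 | Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning | Proposes GAR, a training-free sound source localization framework based on MLLM meta-reasoning, targeting localization in complex scenes | contrastive learning, feature matching, large language model
8 | Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training | Proposes a sparse ViT model with self-supervised pre-training for energy-frontier heterogeneous neutrino detectors | masked autoencoder, physically plausible, multimodal
9 | Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction | Proposes MARMamba, a lightweight Mamba-based model for efficient CT metal artifact reduction | Mamba
10 | VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography | VAMAE: vessel-aware masked autoencoders for self-supervised pre-training on OCT angiography images | masked autoencoder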

🔬 Pillar 3: Perception & Semantics (9 papers)

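# | Title | One-sentence summary | Tags | 🔗
11 | AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors | AnchorSplat: a feed-forward Gaussian splatting method built on 3D geometric priors for scene-level reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting
12 | DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting | Proposes the DOC-GS framework, which improves sparse-view Gaussian splatting reconstruction quality through dual-domain observation and calibration | 3D gaussian splatting, 3DGS, gaussian splatting
13 | LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation | Proposes LiftFormer, based on lifting theory and frame theory, for monocular depth estimation with more accurate depth prediction in edge regions | depth estimation, monocular depth, metric depth
14 | VGGT-SLAM++ | VGGT-SLAM++: an accurate, efficient, and scalable visual SLAM system that incorporates VGGT geometric information | visual odometry, visual SLAM, elevation map
15 | From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented Gaussians | Proposes a surface reconstruction method based on oriented Gaussians to address the difficulty of extracting surfaces from 3DGS | 3D gaussian splatting, 3DGS, gaussian splatting
16 | 4D Vessel Reconstruction for Benchtop Thrombectomy Analysis | Proposes a vessel reconstruction method based on 4D Gaussian splatting for benchtop (in vitro) thrombectomy analysis | gaussian splatting, splatting
17 | Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training | Mem3R: streaming 3D reconstruction via test-time training and hybrid memory, improving long-sequence consistency | depth estimation
18 | Synthetic Dataset Generation for Partially Observed Indoor Objects | Proposes a Unity-based virtual scanning framework for generating synthetic datasets of partially observed indoor objects | scene reconstruction
19 | LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video | LiveStre4m: a feed-forward method for generating novel views in real time from unposed multi-view video | scene reconstruction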

🔬 Pillar 9: Embodied Foundation Models (9 papers)

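# | Title | One-sentence summary | Tags | 🔗
20 | Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation | Proposes a component-based multimodal knowledge augmentation method for interpreting oracle bone script | multimodal, visual grounding
21 | BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving | BATON: a multimodal benchmark dataset for observing bidirectional automation transitions in naturalistic driving | multimodal
22 | DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification | Proposes DINO-QPM, improving the classification accuracy and global interpretability of visual foundation models | foundation model
23 | USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification | USCNet: Transformer-based multimodal fusion with segmentation guidance for urolithiasis classification | multimodal
24 | RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details | RefineAnything: multimodal region-specific refinement for reconstructing perfect local details | multimodal
25 | Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning | Proposes a framework that enhances MLLM spatial understanding through active 3D scene exploration, for multi-perspective reasoning | large language model, multimodal, chain-of-thought
26 | Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation | Proposes adversarial smuggling attacks, exposing security vulnerabilities in MLLM content moderation | large language model, multimodal
27 | RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection | Proposes the RASR framework, which improves fake news video detection through retrieval-augmented semantic reasoning | large language model, multimodal
28 | ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation | ModuSeg: decouples object discovery from semantic retrieval to achieve training-free weakly supervised semantic segmentation | foundation model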

🔬 Pillar 8: Physics-based Animation (4 papers)

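# | Title | One-sentence summary | Tags | 🔗
29 | Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data | Proposes LIANet, a coordinate-based spatiotemporal neural representation of Earth observation data | spatiotemporal, foundation model
30 | EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling | EventFace: event-based face recognition via structure-driven spatiotemporal modeling | spatiotemporal
31 | Fast Spatial Memory with Elastic Test-Time Training | Proposes a fast spatial memory based on elastic test-time training for long-sequence 4D reconstruction | spatiotemporal
32 | Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer | Proposes the OG-ReG Transformer, which mimics human visual cognition to improve video action understanding | spatiotemporal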

🔬 Pillar 4: Generative Motion (2 papers)

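# | Title | One-sentence summary | Tags | 🔗
33 | MoRight: Motion Control Done Right | MoRight: a decoupled motion control framework for controllable, causally consistent video generation | physically plausible
34 | Not all tokens contribute equally to diffusion learning | DARE: improves semantic guidance in diffusion models through distribution-aware rectification and spatial ensembling, optimizing text-to-video generation | classifier-free guidance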

🔬 Pillar 1: Robot Control (1 paper)

# | Title | One-sentence summary | Tags | 🔗
35 | PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing | PhyEdit: real-world object manipulation via physically constrained image editing | manipulation, world model, world models

🔬 Pillar 6: Video Extraction (1 paper)

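# | Title | One-sentence summary | Tags | 🔗
36 | Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency | Proposes entropy-guided scale adaptability and flow-field local consistency to improve local feature matching | feature matching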

🔬 Pillar 7: Motion Retargeting (1 paper)

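# | Title | One-sentence summary | Tags | 🔗
37 | CWRNN-INVR: A Coupled WarpRNN based Implicit Neural Video Representation | Proposes CWRNN-INVR, an implicit neural video representation based on coupled WarpRNN, improving video reconstruction quality | motion representation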
