cs.CV (2026-04-08)

📊 37 papers total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (10 🔗4) · Pillar 3: Spatial Perception & Semantics (9 🔗3) · Pillar 9: Embodied Foundation Models (9 🔗4) · Pillar 8: Physics-based Animation (4 🔗2) · Pillar 4: Generative Motion (2) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction & Matching (1) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 2: RL Algorithms & Architecture (10 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models	Q-Zoom: a query-aware adaptive perception framework for efficient multimodal large language models	distillation large language model multimodal	✅
2	Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization	Proposes MAPO, which bridges the gap between reasoning and action in multimodal agents and improves image understanding	reinforcement learning large language model multimodal
3	FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching	FlowInOne: proposes a unified multimodal generation framework that converts all modalities into visual flows for image-in, image-out generation	flow matching multimodal instruction following
4	INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling	INSPATIO-WORLD: a real-time 4D world simulator based on spatiotemporal autoregressive modeling	world model world models distillation
5	BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment	Proposes BRIDGE, which aligns multimodal queries via reinforcement learning to improve cross-modal retrieval over text corpora	reinforcement learning multimodal	✅
6	URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection	Proposes URMF, which improves the robustness of multimodal sarcasm detection through uncertainty-aware multimodal fusion	contrastive learning multimodal
7	Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning	Proposes GAR, a training-free sound source localization framework based on MLLM meta-reasoning, to handle localization in complex scenes	contrastive learning feature matching large language model	✅
8	Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training	Proposes a sparse ViT model with self-supervised pre-training for energy-frontier heterogeneous neutrino detectors	masked autoencoder physically plausible multimodal
9	Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction	Proposes MARMamba, a lightweight Mamba-based model for efficient CT metal artifact reduction	Mamba	✅
10	VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography	VAMAE: vessel-aware masked autoencoders for self-supervised pre-training on OCT angiography images	masked autoencoder

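🔬 Pillar 3: Spatial Perception & Semantics (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
11	AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors	AnchorSplat: proposes a feed-forward Gaussian splatting method based on 3D geometric priors for scene-level reconstruction	3D gaussian splatting 3DGS gaussian splatting
12	DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting	Proposes the DOC-GS framework, which improves sparse-view Gaussian splatting reconstruction quality through dual-domain observation and calibration	3D gaussian splatting 3DGS gaussian splatting
13	LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation	Proposes LiftFormer, based on lifting theory and frame theory, for monocular depth estimation with improved depth prediction accuracy in edge regions	depth estimation monocular depth metric depth
14	VGGT-SLAM++	VGGT-SLAM++: an accurate, efficient, and scalable visual SLAM system that incorporates VGGT geometric information	visual odometry visual SLAM elevation map
15	From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented Gaussians	Proposes a surface reconstruction method based on oriented Gaussians to address the difficulty of extracting surfaces from 3DGS	3D gaussian splatting 3DGS gaussian splatting
16	4D Vessel Reconstruction for Benchtop Thrombectomy Analysis	Proposes a vessel reconstruction method based on 4D Gaussian splatting for in vitro thrombectomy analysis	gaussian splatting splatting	✅
17	Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training	Mem3R: streaming 3D reconstruction via test-time training and hybrid memory, improving long-sequence consistency	depth estimation	✅
18	Synthetic Dataset Generation for Partially Observed Indoor Objects	Proposes a Unity-based virtual scanning framework for generating synthetic datasets of partially observed indoor objects	scene reconstruction
19	LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video	LiveStre4m: a feed-forward method for generating novel views in real time from unposed multi-view video	scene reconstruction	✅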

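🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
20	Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation	Proposes a component-grounded multimodal knowledge augmentation method for oracle bone script interpretation	multimodal visual grounding
21	BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving	BATON: a multimodal benchmark dataset for observing bidirectional automation transitions in naturalistic driving	multimodal
22	DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification	Proposes DINO-QPM, which improves the classification accuracy and global interpretability of visual foundation models	foundation model
23	USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification	USCNet: Transformer-based multimodal fusion with segmentation guidance for urolithiasis classification	multimodal	✅
24	RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details	RefineAnything: multimodal region-specific refinement for reconstructing perfect local details	multimodal	✅
25	Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning	Proposes a framework that enhances MLLM spatial understanding through active 3D scene exploration, for multi-perspective reasoning	large language model multimodal chain-of-thought
26	Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation	Proposes adversarial smuggling attacks that expose security vulnerabilities in MLLM content moderation	large language model multimodal	✅
27	RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection	Proposes the RASR framework, which improves fake news video detection through retrieval-augmented semantic reasoning	large language model multimodal
28	ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation	ModuSeg: decouples object discovery from semantic retrieval for training-free weakly supervised semantic segmentation	foundation model	✅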

🔬 Pillar 8: Physics-based Animation (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data	Proposes LIANet, a coordinate-based spatiotemporal neural representation of Earth observation data	spatiotemporal foundation model	✅
30	EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling	EventFace: event-based face recognition via structure-driven spatiotemporal modeling	spatiotemporal
31	Fast Spatial Memory with Elastic Test-Time Training	Proposes fast spatial memory based on elastic test-time training for long-sequence 4D reconstruction	spatiotemporal
32	Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer	Proposes the OG-ReG Transformer, which mimics human visual cognition to improve video action understanding	spatiotemporal	✅

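🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	MoRight: Motion Control Done Right	MoRight: proposes a disentangled motion control framework for controllable, causally consistent video generation	physically plausible
34	Not all tokens contribute equally to diffusion learning	DARE: improves semantic guidance in diffusion models through distribution-aware correction and spatial ensembling, optimizing text-to-video generation	classifier-free guidance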

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
35	PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing	PhyEdit: real-world object manipulation via physically constrained image editing	manipulation world model world models

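🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
36	Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency	Proposes entropy-guided scale adaptability and flow-field local consistency to improve local feature matching performance	feature matching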

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
37	CWRNN-INVR: A Coupled WarpRNN based Implicit Neural Video Representation	Proposes CWRNN-INVR, an implicit neural video representation based on coupled WarpRNN, to improve video reconstruction quality	motion representation	✅
