cs.CV (2026-03-05)

📊 43 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗4) · Pillar 2: RL Algorithms & Architecture (12 🔗4) · Pillar 3: Spatial Perception & Semantics (8 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction & Matching (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

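| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models | Proposes MASQuant to address modality misalignment and cross-modal computational invariance in quantizing multimodal large language models. | large language model, multimodal | |
| 2 | Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary | Evaluates GPT-5's capability as a multimodal clinical reasoner in a landscape-style study. | foundation model, multimodal, chain-of-thought | |
| 3 | Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models | Proposes a complexity-aware adaptive inference framework that improves the efficiency and reliability of VLA models on complex tasks. | vision-language-action, VLA | |
| 4 | NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries | NaiLIA: multimodal nail design retrieval based on dense intent descriptions and palette queries. | foundation model, multimodal | |
| 5 | UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark | UniM: a unified any-to-any interleaved multimodal benchmark intended to advance multimodal large language models. | large language model, multimodal | |
| 6 | Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild | Evaluates the reliability of multimodal large language models for zero-shot anomaly detection in surveillance scenarios, revealing their conservative bias. | large language model, multimodal | |
| 7 | Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline | Proposes the MM-Lifelong dataset and the ReMA model to address the memory bottleneck and localization collapse in multimodal lifelong understanding. | multimodal | |
| 8 | Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model | Tell2Adapt: a unified framework for source-free domain adaptation using a vision foundation model. | foundation model | |
| 9 | VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters | Proposes VisionPangu, a compact 1.7B-parameter multimodal assistant that improves detailed image description. | multimodal | |
| 10 | Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation | Proposes the FedMEPD framework to address modality heterogeneity and personalized modeling in multimodal brain tumor segmentation. | multimodal | |
| 11 | Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models | Proposes MPCAttack, a multi-paradigm collaborative adversarial attack that improves the transferability of adversarial examples against multimodal large language models. | large language model | |
| 12 | Revisiting Shape from Polarization in the Era of Vision Foundation Models | Using high-quality polarization data and domain adaptation, a lightweight model outperforms vision foundation models at single-view surface normal estimation. | foundation model | |
| 13 | HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token | HALP: detects hallucinations in vision-language models without generating a single token. | multimodal | |
| 14 | Layer by layer, module by module: Choose both for optimal OOD probing of ViT | Proposes a layer- and module-selective OOD probing method for ViT that optimizes performance under distribution shift. | foundation model | |
| 15 | A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset | Proposes a color-attention RT-DETR for detecting blue emergency vehicle lights in a 360-degree multi-camera system. | multimodal | |
| 16 | MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents | MultiHaystack: builds a large-scale cross-modal retrieval and reasoning benchmark to evaluate the performance bottlenecks of MLLMs in complex scenarios. | large language model, multimodal | |
| 17 | Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection | Proposes PFS, a post-fusion stabilizer that improves the robustness of multimodal 3D detection under domain shift and sensor failure. | multimodal | |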

🔬 Pillar 2: RL Algorithms & Architecture (12 papers)

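| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | Mario: Multimodal Graph Reasoning with Large Language Models | Proposes the Mario framework to address consistency and preference issues in multimodal graph reasoning. | contrastive learning, large language model, multimodal | |
| 19 | Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum | Wiki-R1: incentivizes multimodal reasoning for knowledge-based VQA through a data and sampling curriculum. | reinforcement learning, large language model, multimodal | |
| 20 | 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding | Proposes 3D-RFT, which improves video-based 3D scene understanding through reinforcement fine-tuning. | reinforcement learning, scene understanding, large language model | |
| 21 | ICHOR: A Robust Representation Learning Approach for ASL CBF Maps with Self-Supervised Masked Autoencoders | Proposes ICHOR, a robust representation learning approach for ASL CBF maps based on self-supervised masked autoencoders. | representation learning, masked autoencoder | |
| 22 | MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis | Proposes selective repulsive knowledge distillation for mobile fetal ultrasound analysis, outperforming large models. | distillation, foundation model | |
| 23 | Dark3R: Learning Structure from Motion in the Dark | Dark3R: proposes a structure-from-motion framework for extremely low-light conditions, overcoming the limits of conventional methods. | distillation, feature matching, foundation model | |
| 24 | Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model | Proposes CompACT, a compact discrete tokenizer that accelerates decision planning in world models. | policy learning, world model | |
| 25 | DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization | DeformTrace: temporal forgery localization using a deformable state space model and relay tokens. | SSM, state space model | |
| 26 | Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning | Proposes Prompt-Driven Noise Generation to address realistic noise generation for sRGB images. | representation learning | |
| 27 | Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation | Proposes DCR, which guides diffusion-based reconstruction with contrastive signals to improve the discriminability and detail awareness of CLIP visual representations. | contrastive learning, large language model | |
| 28 | Interpretable Perception and Reasoning for Audiovisual Geolocation | Proposes the AVG framework, which uses interpretable audiovisual perception and reasoning for high-precision geolocation. | flow matching, large language model, multimodal | |
| 29 | When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On | Proposes Implicit Error Counting (IEC) for RL post-training in settings that lack reference answers, such as virtual try-on. | reinforcement learning, reward design | |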

🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

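| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields | Proposes a neural-radiance-field-based method for 3D scene understanding of gas plumes in LWIR hyperspectral images. | NeRF, neural radiance field, scene reconstruction | |
| 31 | SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction | Proposes SSR-GS, which separates specular reflection in glossy surface reconstruction to improve reconstruction quality under complex lighting. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 32 | DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction | Proposes DSA-SRGS, a super-resolution Gaussian splatting method for dynamic sparse-view DSA reconstruction. | gaussian splatting, splatting | |
| 33 | GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction | GloSplat: a joint pose-appearance optimization method for faster and more accurate 3D reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 34 | CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception | Proposes CATNet to address latency and noise interference in cooperative perception, improving perception performance in complex traffic scenes. | scene understanding | |
| 35 | FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation | Proposes FC-VFI, a faithful and consistent video frame interpolation method for high-FPS slow-motion video generation. | optical flow | |
| 36 | Any to Full: Prompting Depth Anything for Depth Completion in One Stage | Any2Full: one-stage prompted depth completion that improves the accuracy and efficiency of robot perception. | depth estimation, monocular depth, Depth Anything | |
| 37 | OWL: A Novel Approach to Machine Perception During Motion | Proposes the OWL function, which uses visual motion cues to enable machine perception during motion. | scene reconstruction | |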

🔬 Pillar 1: Robot Control (3 papers)

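| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 38 | Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems | Proposes a digital-twin-driven textile classification and foreign object recognition system for automated sorting. | manipulation, dual-arm, motion planning | |
| 39 | Video-based Locomotion Analysis for Fish Health Monitoring | Proposes a YOLOv11-based multi-object tracking system for locomotion analysis in fish health monitoring. | locomotion | |
| 40 | RealWonder: Real-Time Physical Action-Conditioned Video Generation | RealWonder: the first real-time video generation system conditioned on physical actions. | manipulation, optical flow | |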

🔬 Pillar 8: Physics-based Animation (2 papers)

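| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 41 | Accelerating Text-to-Video Generation with Calibrated Sparse Attention | CalibAtt: accelerates text-to-video generation with calibrated sparse attention. | spatiotemporal | |
| 42 | Orthogonal Spatial-temporal Distributional Transfer for 4D Generation | Proposes Orster, an orthogonal spatial-temporal distributional transfer framework that addresses data scarcity in 4D generation. | spatiotemporal | |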

🔬 Pillar 6: Video Extraction & Matching (1 paper)

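| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 43 | SURE: Semi-dense Uncertainty-REfined Feature Matching | SURE: proposes a semi-dense uncertainty-refined feature matching framework that improves the reliability of image matching. | feature matching | |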
