| # | Title | Summary | Tags | Picked |
|---|-------|---------|------|--------|
| 1 | MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models | Proposes MASQuant to address modality misalignment and cross-modal computational invariance in multimodal LLM quantization. | large language model, multimodal | ✅ |
| 2 | Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary | Evaluates GPT-5's capabilities as a multimodal clinical reasoner: a landscape commentary. | foundation model, multimodal, chain-of-thought | |
| 3 | Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models | Proposes a complexity-aware adaptive inference framework that improves the efficiency and reliability of VLA models on complex tasks. | vision-language-action, VLA | |
| 4 | NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries | NaiLIA: multimodal nail design retrieval based on dense intent descriptions and palette queries. | foundation model, multimodal | |
| 5 | UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark | UniM: a unified any-to-any interleaved multimodal benchmark for advancing multimodal large language models. | large language model, multimodal | ✅ |
| 6 | Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild | Evaluates the reliability of multimodal LLMs for zero-shot anomaly detection in surveillance settings and reveals their conservative bias. | large language model, multimodal | |
| 7 | Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline | Proposes the MM-Lifelong dataset and the ReMA model to tackle the memory bottleneck and grounding collapse in multimodal lifelong understanding. | multimodal | |
| 8 | Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model | Tell2Adapt: a unified framework for source-free unsupervised domain adaptation via vision foundation models. | foundation model | ✅ |
| 9 | VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters | Proposes VisionPangu, a compact 1.7B-parameter multimodal assistant with improved fine-grained image description. | multimodal | |
| 10 | Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation | Proposes the FedMEPD framework to address modality heterogeneity and personalized modeling in multimodal brain tumor segmentation. | multimodal | |
| 11 | Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models | Proposes MPCAttack, a multi-paradigm collaborative adversarial attack that improves the transferability of adversarial examples against multimodal LLMs. | large language model | ✅ |
| 12 | Revisiting Shape from Polarization in the Era of Vision Foundation Models | With high-quality polarization data and domain adaptation, a lightweight model outperforms vision foundation models on single-view surface normal estimation. | foundation model | |
| 13 | HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token | HALP: detects hallucinations in vision-language models without generating a single token. | multimodal | |
| 14 | Layer by layer, module by module: Choose both for optimal OOD probing of ViT | Proposes layer- and module-selective OOD probing for ViT, optimizing performance under distribution shift. | foundation model | |
| 15 | A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset | Proposes a color-attention RT-DETR for blue emergency vehicle light detection with a 360-degree multi-camera system. | multimodal | |
| 16 | MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents | MultiHaystack: a large-scale multimodal retrieval and reasoning benchmark that probes MLLM performance bottlenecks in complex scenarios. | large language model, multimodal | |
| 17 | Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection | Proposes PFS, a post-fusion stabilizer that improves the robustness of multimodal 3D detection under domain shift and sensor failure. | multimodal | |