cs.CV (2026-04-01)

📊 47 papers | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗5) · Pillar 2: RL Algorithms & Architecture (14 🔗1) · Pillar 3: Spatial Perception & Semantics (9 🔗1) · Pillar 7: Motion Retargeting (4) · Pillar 1: Robot Control (3) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

# | Title | One-line Takeaway | Tags
1 | CL-VISTA: Benchmarking Continual Learning in Video Large Language Models | Proposes the CL-VISTA benchmark for evaluating continual learning in video large language models. | large language model, foundation model, multimodal
2 | Multimodal Language Models Cannot Spot Spatial Inconsistencies | Proposes a multi-view spatial-consistency evaluation that exposes the weakness of multimodal LLMs in 3D reasoning. | large language model, multimodal
3 | First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models | Proposes First Logit Boosting to mitigate object hallucination in large vision-language models. | multimodal, visual grounding
4 | Foundation Model-guided Iteratively Prompting and Pseudo-Labeling for Partially Labeled Medical Image Segmentation | Proposes the IPnP framework, using foundation-model-guided iterative prompting and pseudo-labeling to tackle partially labeled medical image segmentation. | foundation model
5 | YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction | YieldSAT: a multimodal benchmark dataset for high-resolution crop yield prediction. | multimodal
6 | Towards Viewpoint-Robust End-to-End Autonomous Driving with 3D Foundation Model Priors | Leverages 3D foundation-model priors to achieve viewpoint-robust end-to-end autonomous driving. | foundation model
7 | Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding | Proposes the Think, Act, Build framework, using VLMs for zero-shot 3D visual grounding without point-cloud preprocessing. | visual grounding
8 | DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale | Proposes DVGT-2 for online vision-geometry-action planning in large-scale autonomous driving. | vision-language-action, VLA
9 | An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models | Proposes SurgSTU-Pipeline for generating fine-grained spatial-temporal surgical-video understanding datasets to improve VLM performance. | large language model, multimodal
10 | The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation | A training-free scheme combining MLLMs with SAM3 for video object segmentation under motion-centric language expressions. | large language model, multimodal
11 | Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge | Proposes TEP: tracking-enhanced prompting for complex video object segmentation; 1st place in the 5th PVUW MOSE Challenge. | large language model, multimodal
12 | ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration | ONE-SHOT: compositional human-environment video synthesis via spatially decoupled motion injection and hybrid context integration. | foundation model
13 | JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation | Proposes JAMMEval for reliable evaluation of Japanese vision-language models. | visual grounding
14 | TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection | Proposes TF-SSD: training-free co-salient object detection via a synergic mask filter. | foundation model

🔬 Pillar 2: RL Algorithms & Architecture (14 papers)

# | Title | One-line Takeaway | Tags
15 | A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation | CheXOne: a reasoning-enabled vision-language foundation model for chest X-ray interpretation. | reinforcement learning, foundation model, visual grounding
16 | LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation | LinguDistill: recovers linguistic ability in vision-language models via selective cross-modal distillation. | distillation, multimodal, visual grounding
17 | STAR: Mitigating Cascading Errors in Spatial Reasoning via Turn-point Alignment and Segment-level DPO | STAR: mitigates cascading errors in spatial reasoning via turn-point alignment and segment-level DPO. | DPO, direct preference optimization, large language model
18 | Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding | Proposes query-conditioned evidential keyframe sampling to improve MLLM long-form video understanding. | reinforcement learning, large language model, multimodal
19 | KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering | Proposes the KG-CMI framework, using a knowledge graph to enhance cross-modal interaction for medical visual question answering. | Mamba, multimodal
20 | MATHENA: Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy | MATHENA: a Mamba-based network for hierarchical tooth-anatomy estimation and holistic evaluation. | Mamba, SSM, state space model
21 | Mine-JEPA: In-Domain Self-Supervised Learning for Mine-Like Object Classification in Side-Scan Sonar | Mine-JEPA: in-domain self-supervised learning for mine-like object classification in side-scan sonar. | JEPA, foundation model
22 | AceTone: Bridging Words and Colors for Conditional Image Grading | AceTone: a multimodal conditional image color-grading method that bridges text and color. | reinforcement learning, VQ-VAE, multimodal
23 | MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning | MAESIL: a masked autoencoder for enhanced self-supervised medical image learning. | masked autoencoder, VQ-VAE
24 | VADMamba++: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale Space | VADMamba++: efficient video anomaly detection via hybrid modeling in grayscale space. | Mamba, optical flow
25 | TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning | Proposes TTA-Vid, a generalized test-time adaptation method for video reasoning. | reinforcement learning, multimodal
26 | Learnability-Guided Diffusion for Dataset Distillation | Proposes learnability-guided diffusion to address redundancy in dataset distillation. | distillation
27 | FreqPhys: Repurposing Implicit Physiological Frequency Prior for Robust Remote Photoplethysmography | FreqPhys: exploits an implicit physiological frequency prior for robust remote photoplethysmography signal extraction. | representation learning, PULSE
28 | A Benchmark of State-Space Models vs. Transformers and BiLSTM-based Models for Historical Newspaper OCR | Proposes a state-space-model OCR architecture that balances efficiency and accuracy on historical newspapers. | Mamba, SSM

🔬 Pillar 3: Spatial Perception & Semantics (9 papers)

# | Title | One-line Takeaway | Tags
29 | Diff3R: Feed-forward 3D Gaussian Splatting with Uncertainty-aware Differentiable Optimization | Diff3R: combines feed-forward prediction with uncertainty-aware optimization to improve 3D Gaussian splatting rendering quality. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization | DirectFisheye-GS: enables native fisheye-camera input in Gaussian splatting via cross-view joint optimization. | 3D gaussian splatting, 3DGS, gaussian splatting
31 | TRiGS: Temporal Rigid-Body Motion for Scalable 4D Gaussian Splatting | TRiGS: proposes temporal rigid-body motion for 4D Gaussian splatting, tackling long-horizon dynamic scene reconstruction. | gaussian splatting, splatting, scene reconstruction
32 | Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation | Proposes MoA-DepthCLIP for monocular depth estimation. | depth estimation, monocular depth
33 | RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection | Proposes RegFormer, achieving efficient weakly supervised human-object interaction detection via transferable relational modeling. | scene understanding, human-object interaction, HOI
34 | ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction | Proposes an autoregressive Gaussian splatting framework for 3D object generation. | gaussian splatting, splatting
35 | Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction | Proposes neural harmonic textures to improve primitive-based neural reconstruction for high-quality novel view synthesis. | 3D gaussian splatting, gaussian splatting, splatting
36 | Autoregressive Appearance Prediction for 3D Gaussian Avatars | Proposes a 3D Gaussian avatar model to address appearance instability in avatar driving. | 3D gaussian splatting, gaussian splatting, splatting
37 | TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking | TRACE: a high-fidelity 3D scene editing framework based on tangible reconstruction and geometry-aligned contextual video masking. | 3DGS

🔬 Pillar 7: Motion Retargeting (4 papers)

# | Title | One-line Takeaway | Tags
38 | Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting | Proposes DANCEMATCH, achieving dance fingerprinting via quantised structure-preserving motion representations. | motion representation
39 | Sparkle: A Robust and Versatile Representation for Point Cloud based Human Motion Capture | Proposes Sparkle, a robust and versatile new representation for point-cloud-based human motion capture. | human motion
40 | PrivHAR-Bench: A Graduated Privacy Benchmark Dataset for Video-Based Action Recognition | PrivHAR-Bench: a graduated privacy benchmark dataset for video-based action recognition. | human motion
41 | Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations | Reliev3R: frees feed-forward reconstruction models from dependence on multi-view geometric annotations. | geometric consistency

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line Takeaway | Tags
42 | EgoSim: Egocentric World Simulator for Embodied Interaction Generation | EgoSim: an egocentric world simulator for embodied interaction generation. | manipulation, egocentric, cross-embodiment
43 | DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving | DLWM: dual latent world models enable holistic Gaussian-centric pre-training for autonomous driving. | motion planning, world model, world models
44 | Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation | Proposes the FADE framework, attacking propagation-based multi-object trackers via query state manipulation. | manipulation

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line Takeaway | Tags
45 | A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video | Proposes a 4D representation from monocular laparoscopic video, enabling training-free agentic surgical reasoning. | spatiotemporal, large language model, multimodal

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line Takeaway | Tags
46 | ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data | ReMoGen: real-time human interaction-to-reaction generation via modular learning from diverse data. | motion generation

🔬 Pillar 6: Video Extraction & Matching (1 paper)

# | Title | One-line Takeaway | Tags
47 | Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using Open-Source Photogrammetry | Generates and validates sub-metre lunar digital elevation models using open-source photogrammetry. | feature matching