cs.CV(2026-01-08)

📊 31 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗1) · Pillar 2: RL & Architecture (7 🔗1) · Pillar 3: Perception & Semantics (6) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (1) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

# | Title | Key point | Tags | 🔗
1 | GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models | Proposes GeM-VG, a multimodal large language model for generalized multi-image visual grounding. | large language model, multimodal, visual grounding
2 | SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models | Proposes SOVABench, a vehicle surveillance action retrieval benchmark for evaluating multimodal large language models. | large language model, multimodal, instruction following
3 | Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models | Proposes the Forge-and-Quench framework, which uses understanding to improve image generation fidelity. | multimodal, instruction following
4 | Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable | AgentCompress: task-aware compression reduces the research cost of large language model agents. | large language model
5 | VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice | VideoAuto-R1: efficient video auto reasoning via "thinking once, answering twice". | large language model, multimodal, chain-of-thought
6 | Atlas 2 -- Foundation models for clinical deployment | Atlas 2: pathology vision foundation models for clinical deployment, balancing performance, robustness, and efficiency. | foundation model
7 | Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics | Proposes ProtoScore to address prototypicality bias in multimodal evaluation. | multimodal
8 | Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering | Proposes Vision-Language Introspection, which mitigates hallucinations in multimodal large language models via interpretable bi-causal steering. | large language model, multimodal
9 | Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing | Re-Align: a structured reasoning-guided framework for in-context image generation and editing. | multimodal, chain-of-thought
10 | AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection | Proposes the AIVD framework, which achieves accurate and efficient industrial visual detection through edge-cloud collaboration. | large language model, multimodal
11 | MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing | Proposes MiLDEAgent to tackle fine-grained editing of multi-layer design documents. | multimodal, instruction following
12 | All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction | Proposes RepMD, which improves detection of ever-shifting harmful memes via design concept reproduction. | large language model, multimodal
13 | Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform | Scales vision language models to handle pharmaceutical long-form video reasoning on an industrial GenAI platform. | multimodal
14 | Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition | Proposes a skeletonization-based adversarial perturbation method that attacks the mathematical text recognition of large vision language models. | foundation model

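🔬 Pillar 2: RL & Architecture (7 papers)

# | Title | Key point | Tags | 🔗
15 | QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer | Proposes QNeRF, a quantum-computer-based novel view synthesis method that matches or exceeds classical NeRF with fewer parameters. | representation learning, NeRF, neural radiance field
16 | RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes | Proposes RL-AWB, which uses deep reinforcement learning for auto white balance correction in low-light night-time scenes. | reinforcement learning, deep reinforcement learning
17 | UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition | UniLiPs: a unified LiDAR pseudo-labeling method using geometry-grounded dynamic scene decomposition. | MAE, geometric consistency, foundation model
18 | DB-MSMUNet:Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation | DB-MSMUNet: a dual-branch multi-scale Mamba UNet for pancreatic CT scan segmentation. | Mamba, state space model
19 | VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control | VerseCrafter: proposes a dynamic, realistic video world model with 4D geometric control, enabling precise control of camera and object motion. | world model
20 | CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models | Proposes the CounterVid framework, which mitigates action and temporal hallucinations in video-language models via counterfactual video generation. | direct preference optimization, multimodal
21 | HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation | Proposes the HUR-MACL model to address low segmentation accuracy for small organs in head and neck multi-organ segmentation. | Mamba, distillation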

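🔬 Pillar 3: Perception & Semantics (6 papers)

# | Title | Key point | Tags | 🔗
22 | ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting | ProFuse: efficient cross-view context fusion for open-vocabulary 3D Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting
23 | OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction | OceanSplat: object-aware Gaussian splatting reconstruction of underwater scenes using trinocular consistency. | 3D gaussian splatting, gaussian splatting, splatting
24 | DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation | DivAS: proposes an interactive 3D segmentation method for NeRFs based on depth-weighted voxel aggregation. | NeRF, neural radiance field, foundation model
25 | Pixel-Perfect Visual Geometry Estimation | Proposes the Pixel-Perfect visual geometry model, which achieves high-quality point cloud reconstruction free of flying points via pixel-space generative modeling. | depth estimation, monocular depth, foundation model
26 | ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos | ObjectForesight: proposes an object-centric dynamics model that predicts future 3D object trajectories from human videos. | affordance, egocentric, geometric consistency
27 | MoE3D: A Mixture-of-Experts Module for 3D Reconstruction | Proposes the MoE3D module, which uses a mixture-of-experts model to improve depth boundary quality in 3D reconstruction. | VGGT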

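🔬 Pillar 1: Robot Control (2 papers)

# | Title | Key point | Tags | 🔗
28 | RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation | RoboVIP: augments multi-view video generation with visual identity prompting to improve robot manipulation performance. | manipulation, vision-language-action
29 | Plenoptic Video Generation | PlenopticDreamer: proposes a multi-view video generation framework that preserves spatio-temporal consistency. | manipulation, dreamer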

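🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key point | Tags | 🔗
30 | From Rays to Projections: Better Inputs for Feed-Forward View Synthesis | Proposes projection-based conditioning inputs that improve the geometric consistency and robustness of feed-forward view synthesis. | geometric consistency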

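🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | Key point | Tags | 🔗
31 | Decentralized Privacy-Preserving Federal Learning of Computer Vision Models on Edge Devices | Studies decentralized privacy-preserving federated learning of computer vision models on edge devices. | OMOMO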
