cs.CV (2025-10-30)

📊 29 papers in total | 🔗 3 with code

🎯 Area-of-Interest Navigation

Pillar 9: Embodied Foundation Models (13, 🔗 1) · Pillar 3: Perception & Semantics (7) · Pillar 1: Robot Control (5, 🔗 1) · Pillar 2: RL & Architecture (3, 🔗 1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

| # | Title | One-Sentence Takeaway | Tags | 🔗 |
|---|-------|-----------------------|------|----|
| 1 | All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles | A survey of next-generation multimodal object detection for autonomous vehicles that fuses LLMs/VLMs | large language model, multimodal | |
| 2 | OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research | OracleAgent: a multimodal reasoning agent system for oracle bone script research | large language model, multimodal | |
| 3 | AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception | AD-SAM: fine-tunes the SAM vision foundation model for autonomous driving perception | foundation model | |
| 4 | ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection | ProstNFound+: a prospective study on micro-ultrasound prostate cancer detection with medical foundation models | foundation model | |
| 5 | SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation | SpinalSAM-R1: a vision-language multimodal interactive system for spine CT segmentation | multimodal | |
| 6 | MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation | Proposes MoME, a mixture of visual-language medical experts for medical image segmentation | large language model, foundation model | |
| 7 | WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios | WOD-E2E: a Waymo open dataset targeting challenging long-tail scenarios in end-to-end driving | large language model, multimodal | |
| 8 | Semantic Frame Aggregation-based Transformer for Live Video Comment Generation | Proposes SFAT, a semantic-frame-aggregation Transformer for generating live video comments | multimodal | |
| 9 | OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes | OmniX: uses unified panoramic generation and perception to produce graphics-ready 3D scenes | multimodal | |
| 10 | SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models | Proposes SteerVLM to strengthen control over vision-language models | multimodal | |
| 11 | Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition | Proposes representation-level counterfactual calibration to tackle context bias in zero-shot recognition | multimodal | |
| 12 | Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models | Proposes the AoT-PsyPhyBENCH benchmark to evaluate how well vision-language models understand the temporal direction of video | multimodal | |
| 13 | ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts | ConceptScope: quantifies and identifies dataset bias via disentangled visual concept representations | foundation model | |

🔬 Pillar 3: Perception & Semantics (7 papers)

| # | Title | One-Sentence Takeaway | Tags | 🔗 |
|---|-------|-----------------------|------|----|
| 14 | JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting | Proposes JOGS, which jointly optimizes pose estimation and 3D Gaussian Splatting without pre-calibrated inputs | 3D gaussian splatting, gaussian splatting, splatting | |
| 15 | The Impact and Outlook of 3D Gaussian Splatting | A survey of 3D Gaussian Splatting: reviews progress, distills insights, and looks ahead to future applications (the standard rendering equation is recapped just below this table) | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 16 | DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting | Proposes DC4GS, directional-consistency-driven adaptive density control that improves the reconstruction quality and efficiency of 3D Gaussian Splatting | 3D gaussian splatting, gaussian splatting, splatting | |
| 17 | HEIR: Learning Graph-Based Motion Hierarchies | Proposes HEIR, which learns graph-based motion hierarchies for data-driven motion modeling | 3D gaussian splatting, gaussian splatting, splatting | |
| 18 | Towards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2 | Uses deep-learning optical flow to make sea ice drift estimation from RADARSAT-2 imagery more reliable | optical flow | |
| 19 | AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency | MOVAI: an AI-driven framework for temporally consistent, high-quality text-to-video generation | scene understanding, multimodal | |
| 20 | MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models | MoTDiff: estimates high-resolution motion trajectories from a single blurred image using diffusion models | optical flow | |
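
Entries 14-17 above all build on 3D Gaussian Splatting (3DGS). As general background only (this is the standard 3DGS formulation, not taken from any of the papers listed here), the renderer computes a pixel color $C$ by alpha-blending the depth-sorted Gaussians $\mathcal{N}$ that cover the pixel, where $c_i$ is the color of Gaussian $i$ and $\alpha_i$ is its opacity modulated by its projected 2D footprint evaluated at that pixel:

$$
C = \sum_{i \in \mathcal{N}} c_i\,\alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right)
$$

The pose-optimization and density-control papers above (14, 16) optimize or restructure the Gaussians feeding this blending step; the one-line summaries give only their high-level ideas.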

🔬 Pillar 1: Robot Control (5 papers)

| # | Title | One-Sentence Takeaway | Tags | 🔗 |
|---|-------|-----------------------|------|----|
| 21 | ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning | ThinkMorph: visual manipulation abilities emerge from multimodal interleaved chain-of-thought reasoning | manipulation, multimodal, chain-of-thought | |
| 22 | Emu3.5: Native Multimodal Models are World Learners | Emu3.5: a native multimodal model that learns about the world by predicting the next vision-and-language state | manipulation, reinforcement learning, world model | |
| 23 | Self-Improving Vision-Language-Action Models with Data Generation via Residual RL | Proposes the PLD framework, which self-improves vision-language-action models via residual reinforcement learning and data generation | manipulation, reinforcement learning, vision-language-action | |
| 24 | Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving | Proposes CATG, which uses constraint-aware flow matching for end-to-end autonomous driving trajectory generation to address the mode collapse of imitation learning | manipulation, imitation learning, flow matching | |
| 25 | Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | Evaluates the zero-shot reasoning ability of video models: proposes the MME-CoF benchmark and analyzes the reasoning limitations of Veo-3 | manipulation | |

🔬 Pillar 2: RL & Architecture (3 papers)

| # | Title | One-Sentence Takeaway | Tags | 🔗 |
|---|-------|-----------------------|------|----|
| 26 | Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning | Proposes IRD, an incremental relation distillation framework for continual learning of human-object interactions in the open world | representation learning, distillation, human-object interaction | |
| 27 | The Quest for Generalizable Motion Generation: Data, Model, and Evaluation | Proposes the ViMoGen framework, which transfers knowledge from video generation to improve the generalization of 3D human motion generation models | flow matching, motion generation, multimodal | |
| 28 | EgoExo-Con: Exploring View-Invariant Video Temporal Understanding | Proposes the EgoExo-Con benchmark and the View-GRPO framework to improve view-invariant temporal understanding in video LLMs | reinforcement learning, egocentric | |

🔬 Pillar 6: Video Extraction (1 paper)

| # | Title | One-Sentence Takeaway | Tags | 🔗 |
|---|-------|-----------------------|------|----|
| 29 | CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark | Proposes CRAG-MM, a comprehensive multi-modal, multi-turn RAG benchmark for wearable-device scenarios | egocentric | |
