UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

📄 arXiv: 2604.02190v1 📥 PDF

Authors: Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang

Categories: cs.CV, cs.RO

Published: 2026-04-02

Comments: code has been released at https://github.com/xiaomi-research/unidrivevla

🔗 Code / Project: GitHub


💡 One-Sentence Takeaway

Proposes UniDriveVLA, a unified Vision-Language-Action model that resolves the conflict between spatial perception and semantic reasoning in autonomous driving via expert decoupling

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: Vision-Language-Action, Autonomous Driving, Spatial Perception, Semantic Reasoning, Expert Decoupling, Mixture-of-Transformers, Three-Stage Training, Deep Learning

📋 Key Points

  1. Existing VLA models are forced to compromise between spatial perception and semantic reasoning, which limits their performance.
  2. UniDriveVLA decouples these capabilities into dedicated experts for driving understanding, scene perception, and action planning.
  3. UniDriveVLA achieves state-of-the-art results on nuScenes and Bench2Drive, demonstrating its broad applicability.

📝 Abstract (Translated Summary)

Vision-Language-Action (VLA) models have recently emerged in autonomous driving, aiming to leverage rich world knowledge to improve the cognitive capabilities of driving systems. However, existing VLA models face a critical dilemma between spatial perception and semantic reasoning, which limits their performance. To address this, the paper proposes UniDriveVLA, a unified driving Vision-Language-Action model based on Mixture-of-Transformers that resolves the perception-reasoning conflict via expert decoupling. The model comprises three experts responsible for driving understanding, scene perception, and action planning, coordinated through masked joint attention. Combined with a sparse perception paradigm and a three-stage progressive training strategy, UniDriveVLA performs strongly across a range of perception, prediction, and understanding tasks, demonstrating its broad applicability as a unified model for autonomous driving.

🔬 Method Details

Problem definition: The paper targets the conflict between spatial perception and semantic reasoning that existing Vision-Language-Action models face in driving tasks. Existing methods compromise between the two, leading to suboptimal performance.

Core idea: UniDriveVLA separates spatial perception from semantic reasoning through an expert-decoupling design, avoiding the optimization conflict that arises when both objectives share model parameters.

Technical framework: The model consists of three experts responsible for driving understanding, scene perception, and action planning. The experts are coordinated through masked joint attention, so each task can be optimized independently.

Key innovation: The main contribution is the expert-decoupling mechanism, which lets spatial perception and semantic reasoning be optimized independently and thus avoids the mutual interference seen in shared-parameter designs, significantly improving overall performance.
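The digest does not specify how the masked joint attention between experts is implemented. As a purely illustrative sketch, the coordination can be modeled as a block-wise attention mask over the concatenated token groups of the three experts; the routing shown here (the planning expert reading all tokens, the other two staying within their own groups) is a hypothetical choice, not the paper's actual mask pattern:

```python
import numpy as np

def build_joint_attention_mask(lengths, allowed):
    """Build a block-wise boolean attention mask over expert token groups.

    lengths: dict mapping expert name -> number of tokens it contributes
    allowed: dict mapping expert name -> set of experts it may attend to
    Returns an (N, N) boolean array where True means attention is permitted.
    """
    offsets, start = {}, 0
    for name, n_tokens in lengths.items():
        offsets[name] = (start, start + n_tokens)
        start += n_tokens
    mask = np.zeros((start, start), dtype=bool)
    for query_expert, (qs, qe) in offsets.items():
        for key_expert in allowed[query_expert]:
            ks, ke = offsets[key_expert]
            mask[qs:qe, ks:ke] = True  # unmask this query/key block
    return mask

# Hypothetical token counts and routing for the three experts.
lengths = {"understanding": 4, "perception": 6, "planning": 2}
allowed = {
    "understanding": {"understanding"},
    "perception": {"perception"},
    "planning": {"understanding", "perception", "planning"},
}
mask = build_joint_attention_mask(lengths, allowed)
```

In a real Mixture-of-Transformers layer such a mask would be passed to the joint attention so that gradients from one expert's queries only flow through the key/value blocks it is allowed to read.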

Key design: The model adopts a sparse perception paradigm combined with a three-stage progressive training strategy, strengthening spatial perception while preserving semantic reasoning. Parameter settings and loss designs were tuned empirically to achieve the best performance.
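The digest does not detail what each of the three training stages covers. The mechanics of progressive training can be sketched as a schedule that controls which expert branches are trainable at each stage; the stage names and expert groupings below are hypothetical placeholders, not the paper's actual schedule:

```python
# Hypothetical three-stage schedule: which expert branches receive
# gradients at each stage. Actual stage contents in UniDriveVLA may differ.
STAGES = [
    {"name": "stage1_perception_pretrain", "trainable": {"perception"}},
    {"name": "stage2_alignment", "trainable": {"perception", "understanding"}},
    {"name": "stage3_joint_finetune",
     "trainable": {"perception", "understanding", "planning"}},
]

def set_trainable(experts, trainable_experts):
    """Return per-expert requires_grad flags for one training stage."""
    return {name: (name in trainable_experts) for name in experts}

experts = ["understanding", "perception", "planning"]
flags_per_stage = [set_trainable(experts, s["trainable"]) for s in STAGES]
```

In a framework such as PyTorch, these flags would be applied by toggling `requires_grad` on each expert's parameter group before the corresponding stage begins.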


📊 Experimental Highlights

UniDriveVLA achieves state-of-the-art performance in both open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. It also significantly outperforms existing baselines on 3D detection, online mapping, and motion forecasting, demonstrating strong all-round capability.

🎯 Application Scenarios

UniDriveVLA has broad application potential in autonomous driving, improving decision-making in complex environments. Its distinctive decoupling of perception and reasoning may offer new directions for future intelligent transportation systems and autonomous-driving technology.

📄 Abstract (Original)

Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla