RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Authors: Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang

Category: cs.AI

Published: 2026-03-03

Comments: Submitted to KDD 2026

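💡 One-Sentence Takeaway

RAPO: expanding the exploration capability of LLM agents through retrieval-augmented policy optimization

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: Agentic reinforcement learning, large language models, exploration strategy, retrieval augmentation, policy optimization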

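📋 Key Points

Existing Agentic RL methods rely on on-policy exploration, which limits the agent's ability to discover new reasoning perspectives and holds back performance gains.
RAPO introduces a retrieval mechanism that widens the agent's exploration during training, letting it learn from off-policy data.
Experiments report a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, along with 1.2x faster training.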

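📝 Abstract (Translated from Chinese)

Agentic reinforcement learning (Agentic RL) has shown remarkable potential for agents built on large language models (LLMs). This line of work enables LLM agents to handle complex tasks through multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a purely on-policy paradigm for exploration, which confines exploration to the agent's own outputs and hinders the discovery of new reasoning perspectives that could drive further improvement. Although some recent work incorporates auxiliary off-policy signals to enhance exploration, it typically uses complete off-policy trajectories for trajectory-level policy estimation, overlooking the need for fine-grained, step-level exploration dynamics within agentic rollouts. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a new RL framework that introduces retrieval to explicitly expand exploration during training. To this end, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy that lets the agent keep reasoning over retrieved off-policy step-level traces. It dynamically extends the agent's reasoning receptive field, enabling broader exploration conditioned on external behaviors. We then introduce a Retrieval-aware Policy Optimization mechanism that calibrates the policy gradient estimate with a retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on 14 datasets across three agentic reasoning tasks while delivering 1.2x faster training.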

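🔬 Method Details

Problem definition: Existing Agentic RL methods rely mainly on on-policy exploration, meaning the agent can learn and improve only from behaviors it generates itself. The drawback is that the agent struggles to step outside its own patterns of reasoning and cannot discover new, more effective reasoning paths and strategies, which limits its ability to solve complex problems. Some methods do bring in off-policy data, but they usually focus on trajectory-level policy estimation and ignore the fine-grained, step-level exploration dynamics within agentic rollouts.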

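Core idea: RAPO explicitly expands the agent's exploration space by introducing a retrieval mechanism. During reasoning, the agent can retrieve and use external, off-policy step-level traces. This breaks the restriction of on-policy exploration and exposes the agent to more diverse behavior patterns and reasoning strategies. The approach is akin to giving the agent an "external memory" that lets it draw on the experience of other agents.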
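
The "external memory" idea can be pictured with a toy store of step-level traces. The following is only a minimal sketch: the digest does not say how RAPO indexes or retrieves traces, so the `StepTrace` layout, the `StepTraceBuffer` class, and the token-overlap (Jaccard) scoring are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StepTrace:
    """One off-policy reasoning step (hypothetical record layout)."""
    context: str  # the situation in which the step was taken
    action: str   # what the other agent did at that step


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for whatever retriever is actually used."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


class StepTraceBuffer:
    """Toy 'external memory' holding individual steps, not whole trajectories."""

    def __init__(self) -> None:
        self._traces: List[StepTrace] = []

    def add(self, trace: StepTrace) -> None:
        self._traces.append(trace)

    def retrieve(self, query: str, k: int = 1) -> List[StepTrace]:
        # Rank stored steps by how similar their context is to the query.
        ranked = sorted(
            self._traces,
            key=lambda t: jaccard(query, t.context),
            reverse=True,
        )
        return ranked[:k]


buffer = StepTraceBuffer()
buffer.add(StepTrace("find the capital of France", "search('capital of France')"))
buffer.add(StepTrace("compute 17 * 23", "calculator('17 * 23')"))
top = buffer.retrieve("what is the capital of France", k=1)
```

Storing single steps rather than full trajectories is what makes step-level reuse possible in this sketch; a real system would presumably use a learned or embedding-based retriever instead of token overlap.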

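Technical framework: RAPO consists of two main phases: Hybrid-policy Agentic Rollout and Retrieval-aware Policy Optimization. In the Hybrid-policy Agentic Rollout phase, the agent not only generates actions from its own policy but also retrieves relevant off-policy step-level traces and folds them into its reasoning, which extends its reasoning receptive field. In the Retrieval-aware Policy Optimization phase, RAPO uses the retrieved information to calibrate the policy gradient estimate: a retrieval reward and importance shaping stabilize training and prioritize retrievals that lead to illuminating exploration.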
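
A hybrid-policy rollout of this kind might be sketched as the loop below. This is an illustration under stated assumptions rather than the authors' procedure: `policy_step`, `retrieve_step`, and `use_retrieval` are hypothetical callables, and the digest does not describe how RAPO decides when to retrieve.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (source, text); source is "on" or "off"


def hybrid_rollout(
    policy_step: Callable[[List[str]], str],
    retrieve_step: Callable[[List[str]], str],
    use_retrieval: Callable[[int], bool],
    max_steps: int,
) -> List[Step]:
    """Interleave self-generated steps with retrieved off-policy steps."""
    history: List[str] = []
    rollout: List[Step] = []
    for t in range(max_steps):
        if use_retrieval(t):
            text, source = retrieve_step(history), "off"
        else:
            text, source = policy_step(history), "on"
        # Both kinds of step join the context, so later on-policy steps
        # are conditioned on the retrieved external behavior.
        history.append(text)
        rollout.append((source, text))
    return rollout


# Toy usage: the agent's own steps, with a retrieved step injected at t == 1.
rollout = hybrid_rollout(
    policy_step=lambda h: f"own-step-{len(h)}",
    retrieve_step=lambda h: f"retrieved-step-{len(h)}",
    use_retrieval=lambda t: t == 1,
    max_steps=3,
)
```

Tagging each step with its source matters for the second phase: an optimizer can only treat retrieved steps differently from self-generated ones if the rollout records which is which.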

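Key innovation: RAPO's most important contribution is bringing retrieval into the exploration process of Agentic RL. Compared with traditional on-policy methods, RAPO can learn from off-policy data and therefore explores a larger space. Compared with existing off-policy methods, RAPO focuses on step-level exploration dynamics and can control the agent's exploration at a finer granularity. RAPO also includes a Retrieval-aware Policy Optimization mechanism intended to keep training stable and efficient.

Key design: RAPO's key design elements are: (1) the Hybrid-policy Agentic Rollout strategy, which determines how the agent selects and uses retrieved off-policy traces; (2) a retrieval reward function, which assesses the quality of retrieved traces and guides the agent toward more valuable ones; and (3) an importance shaping mechanism, which adjusts the policy gradient estimate to balance the influence of on-policy and off-policy data. Specific parameter settings, network architecture, and similar details are not given in the abstract and remain unknown.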
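
Because the abstract gives no formulas, the sketch below only illustrates how a retrieval reward and importance shaping could enter an advantage computation. Everything in it is an assumption: the group-normalized (GRPO-style) baseline, the success-gated `bonus`, and the shaping function `p / (p + gamma)` are hypothetical choices made for illustration, not RAPO's actual definitions.

```python
import math
from typing import List


def total_reward(task_reward: float, used_retrieval: bool, bonus: float = 0.2) -> float:
    """Hypothetical retrieval reward: a bonus only when a retrieval-using rollout succeeds."""
    return task_reward + (bonus if used_retrieval and task_reward > 0 else 0.0)


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized advantages (an assumed baseline, not confirmed by the source)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def shaped_weight(prob: float, gamma: float = 0.1) -> float:
    """Hypothetical importance shaping for retrieved (off-policy) tokens.

    prob is the current policy's probability of the retrieved token. The
    weight stays below 1, so unlikely off-policy tokens cannot blow up
    the gradient estimate.
    """
    return prob / (prob + gamma)


# Four rollouts for one prompt: (task reward, whether retrieval was used).
rewards = [
    total_reward(1.0, True),
    total_reward(1.0, False),
    total_reward(0.0, True),
    total_reward(0.0, False),
]
advantages = group_advantages(rewards)
```

In this toy setup the successful rollout that used retrieval receives the largest advantage, which is one simple way to "prioritize" helpful retrievals, while the bounded weight on retrieved tokens is one simple way to keep the off-policy contribution from destabilizing training.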

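📊 Experimental Highlights

Experiments show that RAPO achieves an average improvement of 5.0% on 14 datasets across three agentic reasoning tasks, with 1.2x faster training. These results suggest that RAPO effectively expands the agent's exploration and improves both performance and efficiency. The specific baseline models and datasets are not given in the abstract and remain unknown.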

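🎯 Application Scenarios

RAPO could be applied to a range of agentic tasks that require complex reasoning and decision-making, such as intelligent assistants, game AI, and robot control. By expanding the agent's exploration, RAPO may help agents solve complex problems more effectively and improve their performance and robustness. It could also support knowledge sharing and collaboration between agents, since one agent can learn from traces produced by others.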

📄 Abstract (Original)

Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration, restricting exploration to the agent's self-generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically utilize full off-policy trajectories for trajectory-level policy estimation, overlooking the necessity for the fine-grained, step-level exploratory dynamics within agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off-policy step-level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves an +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.
