SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Authors: Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Categories: cs.AI, cs.CL, cs.LG
Published: 2025-06-30 (updated: 2025-07-01)
Note: Work in Progress
💡 One-sentence takeaway
Proposes SPIRAL, a self-play framework that develops broadly transferable reasoning abilities through zero-sum games, without human-curated problems or rewards.
🎯 Matched area: Pillar 2: RL Algorithms & Architecture
Keywords: self-play, zero-sum games, reinforcement learning, reasoning ability, multi-agent systems, online training, role-conditioned advantage estimation, autonomous learning
📋 Key points
- Existing methods depend on human-curated problems and rewards, which limits the autonomous development of reasoning ability.
- SPIRAL has the model improve continuously by playing zero-sum games against itself, removing the need for human supervision.
- Training Qwen3-4B-Base on Kuhn Poker alone improves math by 8.6% and general reasoning by 8.4%, outperforming SFT on expert game trajectories.
📝 Abstract (translated)
Recent advances in reinforcement learning show that language models can acquire sophisticated reasoning abilities through training, but these methods rely on human-curated question-answer pairs and domain-specific reward engineering. We propose SPIRAL, a self-play framework in which a model learns by playing multi-turn zero-sum games against continuously improving versions of itself, removing the need for human supervision. Through self-play, SPIRAL generates an endless stream of progressively harder problems, as the model must constantly adapt to stronger opponents. We implement a fully online, multi-turn, multi-agent reinforcement learning system and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly.
🔬 Method details
Problem definition: This work targets the reliance of existing reinforcement learning methods on human supervision, which limits how autonomously a model's reasoning ability can develop. Because these methods need human-curated problems and hand-engineered rewards, training is inefficient and hard to scale.
Core idea: In SPIRAL, the model plays multi-turn zero-sum games against continuously improving versions of itself. Self-play produces a progression of ever harder challenges, so the model learns autonomously and its reasoning ability keeps improving.
Technical framework: SPIRAL is built as a fully online, multi-turn, multi-agent reinforcement learning system around self-play, with role-conditioned advantage estimation (RAE) and the online training pipeline as its main modules. In every game the model must adapt to its opponent's current strategy, creating a dynamic learning environment; a minimal rollout sketch is given below.
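To make the setup concrete, here is a minimal sketch of how one self-play rollout might look, assuming a single shared policy that plays both roles of a two-player zero-sum text game. The `policy` and `game` objects and their methods (`act`, `reset`, `step`, `final_rewards`) are illustrative interfaces, not the authors' actual code.

```python
# Hypothetical sketch of one SPIRAL-style self-play rollout. One shared set of
# weights plays both sides; trajectories are kept per role so each side can be
# credited with its own zero-sum outcome.
def self_play_episode(policy, game):
    obs = game.reset()
    trajectories = {0: [], 1: []}            # one trajectory per role (player 0 / player 1)
    while not game.done:
        role = game.current_player           # roles alternate across turns
        action = policy.act(obs, role=role)  # same model, role-conditioned prompt
        trajectories[role].append((obs, action))
        obs = game.step(action)
    rewards = game.final_rewards()           # zero-sum: rewards[0] == -rewards[1]
    return trajectories, rewards
```

Because the opponent is always the current model, the difficulty of the rollouts rises automatically as the policy improves.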
Key innovation: Self-play gives SPIRAL an effectively infinite training curriculum, so the model can keep raising its reasoning ability without any human intervention. This is fundamentally different from training regimes that rely on manually designed problems and rewards.
Key design: SPIRAL uses role-conditioned advantage estimation (RAE) to stabilize multi-agent training, ensuring the model learns effectively from both sides of each game. The training objective and rollout infrastructure are adapted to the multi-turn adversarial setting; a sketch of how RAE might be implemented is given below.
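The digest describes RAE only at this level, so the following is one plausible implementation: a running baseline kept per (game, role) pair and subtracted from each episode's return before the policy-gradient update. The class name, method name, and EMA decay constant are assumptions for illustration, not the paper's exact formulation.

```python
class RoleConditionedAdvantage:
    """Illustrative sketch of role-conditioned advantage estimation (RAE):
    one running baseline per (game, role) pair. The EMA decay and the exact
    conditioning variables are assumptions."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baseline = {}                       # (game, role) -> running mean return

    def update_and_advantage(self, game: str, role: int, episode_return: float) -> float:
        key = (game, role)
        if key not in self.baseline:
            self.baseline[key] = episode_return  # initialize on first episode for this role
        else:
            self.baseline[key] = (self.decay * self.baseline[key]
                                  + (1.0 - self.decay) * episode_return)
        # Subtracting a role-specific baseline centers the learning signal for each
        # side separately, which is what keeps the two-sided update stable.
        return episode_return - self.baseline[key]


# Usage with the rollout sketch above (policy/game objects omitted):
#   rae = RoleConditionedAdvantage()
#   trajs, rewards = self_play_episode(policy, game)
#   for role in (0, 1):
#       adv = rae.update_and_advantage("kuhn_poker", role, rewards[role])
#       # each (state, action) in trajs[role] is weighted by `adv` in the policy-gradient loss
```

Conditioning the baseline on the role matters because the two sides of an asymmetric game (e.g. first vs. second player in Kuhn Poker) have different expected returns, so a single shared baseline would bias one side's gradient.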
📊 Experimental highlights
Training Qwen3-4B-Base on Kuhn Poker with SPIRAL improves math performance by 8.6% and general reasoning by 8.4%, outperforming SFT on 25,000 expert game trajectories. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further strengthens reasoning, as each game develops distinct skills, and applying SPIRAL to the already-strong DeepSeek-R1-Distill-Qwen-7B still yields a 2.0% average improvement, underscoring how effectively zero-sum games transfer reasoning ability.
🎯 Application scenarios
Potential applications of SPIRAL include education, game AI, and automated reasoning systems. By enabling autonomous learning and stronger reasoning, it could support smarter dialogue systems, decision-support tools, and automated solvers for complex problems, giving the framework broad practical value and future impact.
📄 Abstract (original)
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.