Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment
Authors: Aleksander Boruch-Gruszecki, Yangtian Zi, Zixuan Wu, Tejas Oberoi, Carolyn Jane Anderson, Joydeep Biswas, Arjun Guha
Categories: cs.LG, cs.PL
Published: 2025-08-06
Comments: 18 pages, 19 figures. For artifacts, see https://agnostics.abgru.me
💡 One-Sentence Takeaway
Agnostics tackles the post-training bottleneck for low-resource programming languages with a universal, language-agnostic learning environment.
🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models
Keywords: low-resource programming languages, post-training, reinforcement learning, language-agnostic, model performance improvement
📋 Key Points
- Existing large language models perform poorly on low-resource programming languages, which lack both sufficient training data and post-training infrastructure.
- Agnostics is a language-agnostic post-training pipeline that judges code solely by its externally observable behavior, eliminating most of the per-language engineering needed for multi-language support.
- Across five low-resource languages, Agnostics substantially improves model performance, setting new state-of-the-art results.
📝 Abstract (translated)
Large language models (LLMs) excel at high-resource languages such as Python and JavaScript but struggle with low-resource ones. Beyond the shortage of pre-training data, post-training itself is a bottleneck. This paper introduces Agnostics, a language-agnostic post-training pipeline that judges code solely by its externally observable behavior, eliminating per-language engineering. Concretely, Agnostics uses an LLM to rewrite existing unit-test datasets into an I/O format, supplies a short configuration describing how to compile and run the target language, and applies reinforcement learning with verifiable rewards in a robust code execution environment. The approach yields substantial gains on five low-resource languages, advancing model performance.
🔬 Method Details
Problem definition: The paper targets the post-training bottleneck that large language models face on low-resource programming languages. Existing approaches require building new datasets and test harnesses for every language, which is inefficient.
Core idea: Agnostics judges code by its externally observable behavior, removing language-specific engineering and enabling language-agnostic post-training (illustrated below).
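To make this concrete, here is a hypothetical I/O-format test record (the field names are illustrative, not the schema Agnostics actually uses). Because the problem is specified entirely by stdin/stdout behavior, one verifier can check a solution written in Lua, OCaml, Fortran, or any other language:

```python
# Hypothetical I/O-format test record (field names are illustrative).
# A unit test like `assert add(2, 3) == 5` is rewritten into a
# stdin/stdout pair, so correctness is judged purely by observable
# behavior rather than by a language-specific test harness.
test_case = {
    "prompt": "Read two integers from stdin and print their sum.",
    "stdin": "2 3\n",
    "expected_stdout": "5\n",
}
```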
Technical framework: Agnostics consists of three main components. First, an LLM rewrites existing unit-test datasets into an I/O format; second, a short configuration file tells the verifier how to compile and run the target language (a sketch of such a configuration follows); third, reinforcement learning with verifiable rewards is applied in a robust code execution environment.
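The paper describes the per-language setup as "as simple as editing a short YAML file." A minimal sketch of what such a configuration might contain, using OCaml as an example (all key names here are assumptions, not the released format):

```yaml
# Hypothetical per-language configuration (key names are assumptions,
# not the schema Agnostics actually releases). The verifier only needs
# to know how to turn a source file into a runnable program.
language: ocaml
file_extension: .ml
compile: "ocamlfind ocamlopt -linkpkg {source} -o {binary}"
run: "{binary}"
timeout_seconds: 10
```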
Key innovation: Agnostics' main novelty is its language-agnostic post-training pipeline, in which a single verifier can test solutions in any language, greatly reducing engineering complexity compared with existing approaches.
Key design: Agnostics drives training with a short YAML configuration file, keeping the pipeline flexible and extensible, and uses a verifiable-reward mechanism during reinforcement learning to improve learning efficiency (a minimal reward sketch follows).
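As a rough sketch of the verifiable-reward idea (our own simplification, not the released training code): the verifier runs the candidate program on each I/O case in a sandboxed subprocess and grants reward only when every observed output matches the expected one.

```python
import subprocess

def io_reward(run_cmd: list[str], cases: list[dict], timeout: float = 10.0) -> float:
    """Binary verifiable reward: 1.0 iff the program's stdout matches the
    expected output on every I/O test case, else 0.0."""
    for case in cases:
        try:
            result = subprocess.run(
                run_cmd,                      # e.g. ["./solution"] after compiling
                input=case["stdin"],          # feed the test's stdin
                capture_output=True,
                text=True,
                timeout=timeout,              # kill runaway programs
            )
        except subprocess.TimeoutExpired:
            return 0.0
        if result.returncode != 0:
            return 0.0                        # crash or nonzero exit: no reward
        if result.stdout.strip() != case["expected_stdout"].strip():
            return 0.0                        # wrong observable behavior
    return 1.0
```

The binary, behavior-only reward is what makes the pipeline language-agnostic: nothing in the function above depends on which language produced the executable.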
📊 Experimental Highlights
On five low-resource languages (Lua, Julia, R, OCaml, and Fortran), Agnostics substantially improves Qwen-3 4B, bringing it to parity with 16B-70B open-weight models. Among models with ≤16B parameters, Agnostics sets new state-of-the-art pass@1 rates on MultiPL-E and on a newly introduced multi-language version of LiveCodeBench.
🎯 Application Scenarios
The approach has broad application potential, especially in science and engineering, where it can support development and testing in many low-resource programming languages. By simplifying post-training, it can speed up adapting models to new languages and foster cross-language programming research and practice.
📄 Abstract (original)
Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le} 16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version LiveCodeBench that we introduce. We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.