Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs

Authors: Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo

Categories: cs.CL, cs.AI, cs.LG, cs.LO, cs.NE

Published: 2025-08-05 (updated: 2025-08-27)

Notes: 27 pages total (10-page main paper + 17-page appendix), 12 figures, 6 tables. Submitted to ICML 2025 (under review)

Venue: ICML 2025

🔗 Code/Project: GitHub (https://github.com/brando90/putnam-axiom)

💡 One-Sentence Summary

Introduces Putnam-AXIOM to address the saturation of mathematical reasoning benchmarks for LLMs.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: mathematical reasoning, large language models, benchmarking, dynamic evaluation, functional variants, memorization

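📋 Key Points

Existing mathematical reasoning benchmarks are close to saturation and affected by training-set contamination, which makes their evaluation results unreliable.
The paper proposes the Putnam-AXIOM benchmark, which generates new problems as functional variants and so provides a contamination-resilient evaluation framework.
Experiments show that strong models perform markedly worse on the variation set, which suggests memorization and underlines the value of dynamic benchmarks.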

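📝 Abstract (Summary)

Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some models exceeding 90% accuracy, and training-set contamination is an increasingly serious problem. This paper introduces Putnam-AXIOM, a benchmark of 522 university-level competition problems from the William Lowell Putnam Mathematical Competition, together with 100 functional variants generated by programmatic perturbation. The variation protocol can generate an unlimited number of new, equally difficult instances, providing a contamination-resilient test bed. OpenAI's o1-preview scores 41.9% on the original set, and its accuracy falls by 19.6 percentage points (a 46.8% relative decrease) on the variation set. These results suggest memorization and underline the need for dynamic benchmarks.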

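🔬 Method Details

Problem definition: Existing mathematical reasoning benchmarks are saturated and contaminated by training data, which makes their evaluation results unreliable. The paper sets out to address both problems.

Core idea: Pair the Putnam-AXIOM benchmark with functional variants to obtain a dynamic, contamination-resilient evaluation framework that measures the mathematical reasoning ability of LLMs more accurately.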

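Technical framework: The benchmark has two parts: the Original set (522 competition problems) and the Variation set (100 functional variants). The variants are generated by programmatically perturbing variables and constants, and the same protocol can produce an unlimited stream of new instances.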
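To make the idea of a functional variant concrete, here is a minimal sketch. It is not taken from the paper's code: the `Instance` type, the `make_variation` function, and the toy summation template are all hypothetical. It only illustrates the general pattern of sampling new constants and variable names and recomputing the ground-truth answer from them.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    question: str
    answer: int


def make_variation(rng: random.Random) -> Instance:
    """Hypothetical template: perturb two constants and the variable name."""
    a = rng.randint(2, 9)               # perturbed constant
    n = rng.randint(10, 50)             # perturbed constant
    var = rng.choice(["k", "m", "j"])   # perturbed variable name
    question = f"Compute the sum of {a}*{var} for {var} = 1, ..., {n}."
    # The answer is recomputed from the sampled constants, so every
    # generated instance carries its own ground truth.
    answer = a * n * (n + 1) // 2
    return Instance(question, answer)


rng = random.Random(0)
for _ in range(3):
    inst = make_variation(rng)
    print(inst.question, "->", inst.answer)
```

Seeding the generator makes each variation reproducible, while a fresh seed yields an instance that cannot have appeared verbatim in any training corpus.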

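Key innovation: The central contribution is the functional variant generation mechanism. Because the benchmark can keep producing unseen problems, a model gains little from having memorized the published originals.

Key design: Alongside "boxed" final-answer accuracy, the paper uses Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates the evaluation of natural-language proofs.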
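The summary does not spell out how TFA is computed. The sketch below assumes the usual teacher-forcing reading: the model is always conditioned on the ground-truth prefix of the reference solution, and the score is the fraction of reference tokens it predicts correctly. The `predict_next` interface and the `toy_model` stand-in are hypothetical, and the paper's exact definition may differ.

```python
from typing import Callable, Sequence


def teacher_forced_accuracy(
    predict_next: Callable[[Sequence[str]], str],
    prompt: Sequence[str],
    reference: Sequence[str],
) -> float:
    """Fraction of reference tokens predicted correctly when the model
    is always conditioned on the ground-truth prefix (teacher forcing)."""
    if not reference:
        raise ValueError("reference solution must be non-empty")
    correct = 0
    for i, target in enumerate(reference):
        # Feed the prompt plus the TRUE previous tokens, never the
        # model's own earlier predictions.
        prefix = list(prompt) + list(reference[:i])
        if predict_next(prefix) == target:
            correct += 1
    return correct / len(reference)


def toy_model(prefix: Sequence[str]) -> str:
    """Hypothetical stand-in for an LLM: repeats the last token it saw."""
    return prefix[-1] if prefix else ""


prompt = ["Solve", ":"]
reference = ["x", "x", "=", "2"]
tfa = teacher_forced_accuracy(toy_model, prompt, reference)
print(tfa)  # 0.25: only the second "x" is predicted correctly
```

With a real LLM, the per-position predictions would come from a single forward pass over the prompt and reference, which is what makes a metric of this kind cheap compared with sampling and grading full solutions.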

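📊 Experimental Highlights

OpenAI's o1-preview, the strongest model evaluated, reaches 41.9% accuracy on the original set. On the paired variations its accuracy falls by 19.6 percentage points, a 46.8% relative decrease. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and underline the importance of dynamic benchmarks.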
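The two drop figures quoted in the abstract are easy to conflate, so the arithmetic is worth checking. The sketch below assumes the 19.6 is an absolute drop in percentage points, which is the reading that reproduces the stated 46.8% relative decrease.

```python
original_acc = 41.9    # % accuracy on the Original set (from the abstract)
absolute_drop = 19.6   # percentage points lost on the Variations

# Accuracy implied on the variation set.
variation_acc = original_acc - absolute_drop

# Relative decrease: the lost points as a share of the original accuracy.
relative_drop = absolute_drop / original_acc * 100

print(f"variation accuracy: {variation_acc:.1f}%")  # 22.3%
print(f"relative decrease:  {relative_drop:.1f}%")  # 46.8%
```

So 46.8% is the relative size of the drop, not the accuracy on the variation set, which works out to about 22.3%.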

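🎯 Applications

The work offers a new evaluation standard for the mathematical reasoning ability of large language models, with potential use in education, scientific research, and automated reasoning. Putnam-AXIOM may also encourage more efficient methods for training and evaluating models.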

📄 Abstract (Original)

Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy drops by 19.6% (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
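The abstract reports non-overlapping 95% confidence intervals without saying how they were built. As a rough sketch, the check could look like the following, assuming a normal-approximation (Wald) interval for a binomial proportion. The helper names and the counts are hypothetical illustrations, not results from the paper.

```python
import math


def wald_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% interval for a binomial proportion."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts for illustration only, not figures from the paper.
orig = wald_ci(60, 100)
var = wald_ci(35, 100)
print(orig, var, intervals_overlap(orig, var))
```

Non-overlap is a conservative signal: two intervals can overlap slightly even when the difference between the accuracies is statistically significant.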
