LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

Authors: Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani

Categories: cs.CL, cs.AI, cs.LG

Published: 2026-04-02

Notes: Project page: https://livemathematicianbench.github.io/

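💡 One-Sentence Summary

Introduces LiveMathematicianBench, a benchmark for evaluating LLMs on research-level mathematical reasoning.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: mathematical reasoning, large language models, benchmark, theorem proving, knowledge reasoning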

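📋 Key Points

Existing mathematical reasoning benchmarks rely on synthetic settings and suffer from data contamination, so they do not faithfully reflect how well LLMs reason about research-level mathematics.
LiveMathematicianBench builds a dynamic benchmark from recent arXiv papers and uses proof-sketch-guided distractors to make evaluation harder and more realistic.
Experiments show that the best current models are far from saturating LiveMathematicianBench, and that accuracy drops sharply under substitution-resistant evaluation, which supports the benchmark's usefulness.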

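📝 Abstract (Translated from Chinese)

This paper presents LiveMathematicianBench, a dynamic multiple-choice benchmark for evaluating the research-level mathematical reasoning of large language models (LLMs). The benchmark is built from arXiv papers published after model training cutoffs, which avoids data contamination. Because it evaluates models on newly published theorems, it provides a realistic testbed that goes beyond memorized patterns. The benchmark introduces a logical taxonomy of thirteen theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across forms of reasoning. It uses a proof-sketch-guided distractor generation pipeline, which draws on high-level proof strategies to construct plausible but invalid answer choices that reflect misleading proof directions, making the benchmark more sensitive to genuine understanding. It also introduces a substitution-resistant mechanism to separate answer recognition from substantive reasoning. The evaluation shows that the benchmark is far from saturated: the best model, Gemini-3.1-pro-preview, reaches only 43.5% accuracy. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol shows that access to proof sketches yields consistent accuracy gains, suggesting that models can use high-level proof strategies when reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.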

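🔬 Method Details

Problem definition: Existing mathematical reasoning benchmarks are limited. They use synthetic data or include material the model may already have seen during training (data contamination), so they cannot accurately measure research-level mathematical reasoning in LLMs. They also struggle to tell whether a model actually understands the underlying mathematics or merely recognizes the answer through surface-level matching.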

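Core idea: LiveMathematicianBench aims to be a mathematical reasoning benchmark that is dynamic, contamination-resistant, and able to isolate substantive reasoning. It avoids contamination by drawing theorems from arXiv papers published after model training cutoffs. It raises difficulty with a proof-sketch-guided distractor generation pipeline that constructs plausible but incorrect answers. It adds a substitution-resistant mechanism so that a model cannot score well through answer recognition alone.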
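The contamination-avoidance rule above amounts to a date filter. The following is a minimal sketch of that filter, not the authors' pipeline: the `Paper` class, the identifiers, the dates, and the cutoff are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    arxiv_id: str    # hypothetical identifier, for illustration only
    published: date  # publication date of the paper


def select_uncontaminated(papers, training_cutoff):
    """Keep only papers published strictly after the training cutoff."""
    return [p for p in papers if p.published > training_cutoff]


# Toy data: identifiers and dates are made up.
papers = [
    Paper("paper-a", date(2025, 6, 1)),
    Paper("paper-b", date(2026, 1, 15)),
    Paper("paper-c", date(2026, 3, 20)),
]
cutoff = date(2025, 12, 31)  # assumed training cutoff
eligible = select_uncontaminated(papers, cutoff)
print([p.arxiv_id for p in eligible])
```

In practice the cutoff differs per model, so a filter like this would presumably be applied against the latest cutoff among the models being compared.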

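Technical framework: LiveMathematicianBench consists of the following main modules:
1. Theorem selection: select theorems from newly published mathematics papers on arXiv.
2. Logical classification: assign each theorem to one of 13 logical types (e.g., implication, equivalence, existence, uniqueness).
3. Proof sketch generation: produce a high-level proof sketch for each theorem that describes the key steps and strategy of the proof.
4. Distractor generation: use the proof sketch to generate plausible but invalid answer choices as distractors.
5. Substitution-resistant mechanism: add a mechanism that prevents a model from scoring well through simple answer recognition.
6. Evaluation protocol: use a dual-mode protocol that covers evaluation with and without access to the proof sketch.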
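The modules above suggest a per-item record plus a prompt builder for the two evaluation modes. The sketch below is one possible shape for that data, not the benchmark's actual schema: `BenchmarkItem`, `build_question`, the field names, and the placeholder strings are all hypothetical, and the five-option layout is an assumption inferred from the reported 20% random baseline.

```python
import random
from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    theorem: str       # question derived from a newly published theorem
    category: str      # one of the 13 logical types, e.g. "existence"
    proof_sketch: str  # high-level proof strategy
    correct: str       # the valid answer choice
    distractors: list = field(default_factory=list)  # plausible but invalid


def build_question(item, with_sketch, seed=0):
    """Return (prompt, gold_label) for one multiple-choice question."""
    rng = random.Random(seed)  # seeded so the option order is reproducible
    options = [item.correct] + list(item.distractors)
    rng.shuffle(options)
    labels = "ABCDE"[: len(options)]
    lines = [f"Category: {item.category}", f"Question: {item.theorem}"]
    if with_sketch:  # dual-mode protocol: sketch shown only in this mode
        lines.append(f"Proof sketch: {item.proof_sketch}")
    lines += [f"{lab}. {opt}" for lab, opt in zip(labels, options)]
    gold = labels[options.index(item.correct)]
    return "\n".join(lines), gold


# Placeholder content only; no real theorem is reproduced here.
item = BenchmarkItem(
    theorem="<theorem statement with the conclusion masked>",
    category="existence",
    proof_sketch="<high-level proof strategy>",
    correct="<valid conclusion>",
    distractors=[f"<plausible but invalid conclusion {i}>" for i in range(1, 5)],
)
prompt_with, answer = build_question(item, with_sketch=True)
prompt_without, _ = build_question(item, with_sketch=False)
print(prompt_with)
```

Using the same seed in both modes keeps the option order identical, so any accuracy difference between the two prompts can be attributed to the proof sketch rather than to option position.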

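Key innovations:
1. Dynamic benchmark: newly published papers are used to avoid data contamination.
2. Proof-sketch-guided distractor generation: raises difficulty and measures reasoning ability more accurately.
3. Substitution-resistant mechanism: prevents a model from scoring well through simple answer recognition.
4. Logical classification: supports fine-grained evaluation of how models perform on different logical types.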

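Key design considerations:
1. Granularity of the proof sketch: the sketch must be detailed enough to guide distractor generation, but not so detailed that it leaks the answer.
2. Similarity of distractors: distractors should resemble the correct answer on the surface while being logically wrong, which makes the evaluation harder.
3. Implementation of the substitution-resistant mechanism: the concrete implementation is not described in this summary, but it must effectively prevent a model from scoring well through simple answer recognition.
4. Evaluation metric: accuracy is the main metric, with additional analysis of performance by logical type.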
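The metric in the last item, accuracy overall and by logical type, can be computed as in the sketch below. It is illustrative only: the function name and record layout are hypothetical, and the records are toy values rather than benchmark results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (category, predicted_label, gold_label) tuples.

    Returns (overall_accuracy, {category: accuracy}).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, predicted, gold in records:
        totals[category] += 1
        hits[category] += int(predicted == gold)
    if not totals:
        raise ValueError("no records to score")
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_category


# Toy predictions, not real model outputs.
records = [
    ("implication", "A", "A"),
    ("implication", "B", "C"),
    ("existence", "D", "D"),
    ("uniqueness", "E", "B"),
]
overall, per_category = accuracy_by_category(records)
print(overall, per_category)
```

Reporting the per-category breakdown next to the overall number shows whether a model's errors concentrate in particular reasoning forms, which a single aggregate accuracy would hide.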

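📊 Experimental Highlights

Even the best-performing LLM in the evaluation, Gemini-3.1-pro-preview, reaches only 43.5% accuracy on LiveMathematicianBench. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, and Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. These results indicate that the benchmark poses a substantial challenge to the mathematical reasoning of current LLMs. Access to proof sketches yields consistent accuracy gains, suggesting that models can use high-level proof strategies when reasoning.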
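To put the reported numbers in context, the short sketch below computes each score's margin over random guessing. The accuracies are the ones quoted in this summary; the five-option count is an assumption inferred from the 20% baseline, and the helper name is hypothetical.

```python
def chance_accuracy(num_options):
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_options


NUM_OPTIONS = 5  # assumption: inferred from the reported 20% random baseline

# Accuracies (percent) as quoted in the summary above.
reported = {
    ("Gemini-3.1-pro-preview", "standard"): 43.5,
    ("GPT-5.4", "substitution-resistant"): 30.6,
    ("Gemini-3.1-pro-preview", "substitution-resistant"): 17.6,
}

baseline = chance_accuracy(NUM_OPTIONS)
margins = {key: round(acc - baseline, 1) for key, acc in reported.items()}
for (model, setting), margin in margins.items():
    print(f"{model} [{setting}]: {margin:+.1f} points vs. chance")
```

A negative margin, as for Gemini-3.1-pro-preview under substitution-resistant evaluation, means the model did worse than uniform guessing in that setting.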

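🎯 Application Scenarios

LiveMathematicianBench can be used to evaluate and improve the reasoning of LLMs in mathematics, science, and engineering. It can help drive LLM applications in theorem proving, mathematical modeling, and scientific discovery, and more broadly strengthen AI on complex problem solving.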

📄 Abstract (Original)

Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
