Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?

📄 arXiv: 2509.00629v1 📥 PDF

Authors: Md Tanzib Hosain, Md Kishor Morol

Category: cs.CL

Published: 2025-08-30

Note: Accepted in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Student Research Workshop), 2025

🔗 Code/Project: GITHUB


💡 One-sentence takeaway

Proposes a multi-turn, self-refining single-agent language model with retrieval to tackle hard competitive programming problems.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: language models, programming problems, multi-turn reasoning, self-judging, retrieval, algorithmic thinking, ICPC benchmark

📋 Key points

  1. Existing language models perform poorly on hard programming problems, with limited ability in algorithmic thinking and code generation.
  2. This paper proposes an inference technique that combines multi-turn self-judging, reflection, and retrieval to improve LM performance on programming tasks.
  3. Experiments show the best inference technique raises the pass@1 rate from 19.1% to 42.2%, and that with a few specific instructions o1 solves 17 of the 18 previously unsolvable problems.

📝 Abstract (summary)

Solving hard algorithmic problems in competitive programming is extremely challenging even for humans. Although this domain has received little attention as an evaluation setting, this paper presents the ICPC benchmark, consisting of 254 International Collegiate Programming Contest tasks. With zero-shot chain-of-thought prompting, o1 achieves only a 19.1% pass@1 rate; the best inference technique, which combines multi-turn self-judging, reflection, and retrieval, raises this to 42.2%. A human-in-the-loop study further probes the model's remaining limitations, finding that with a few specific instructions o1 can solve 17 of the 18 previously unsolvable problems. The work lays groundwork for LMs with grounded, imaginative, and algorithmic thinking, and the code and data are open-sourced.

🔬 Method details

Problem definition: the paper targets the low pass rate of language models on hard programming problems; existing methods fall notably short in algorithmic thinking and code generation, and struggle with the demands of competitive programming.

Core idea: combine multi-turn self-judging and reflection with retrieval, using the history of earlier attempts to strengthen the model's reasoning and solve programming problems more effectively.

Technical framework: the overall pipeline has three main stages: data preparation, multi-turn inference, and retrieval. The ICPC benchmark supplies the problems and tests; at inference time, the model applies the multi-turn self-judge and reflection mechanism; retrieval over episodic information (earlier attempts and their outcomes) then enriches the model's context.

Key innovation: the most important technical contribution is coupling multi-turn self-judging with retrieval into a new inference framework, a fundamental departure from traditional single-turn inference that markedly improves solve rates.

Key design: rather than any additional training, the design rests on how each turn balances self-judge feedback against retrieved episodic information, ensuring the model fully exploits the history of earlier attempts during inference.
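The multi-turn loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: `generate`, `judge`, and `reflect` are stand-ins for the LM call, the self-judge (e.g. running the sample tests), and the reflection step.

```python
# Hypothetical sketch of a multi-turn self-judge loop with reflection and
# retrieval over episodic information. All callables are stand-in stubs,
# not the paper's actual components.
from dataclasses import dataclass


@dataclass
class Episode:
    """One attempt: the candidate code, the judge's verdict, and a reflection."""
    code: str
    passed: bool
    reflection: str


def solve_with_self_refinement(problem, generate, judge, reflect, max_turns=4):
    """Iteratively generate, self-judge, and reflect, retrieving prior episodes
    back into the prompt context on every turn."""
    episodes = []  # episodic memory of earlier attempts
    for turn in range(max_turns):
        # Retrieval step: surface reflections on prior attempts as context.
        context = "\n".join(
            f"Attempt {i}: {e.reflection}" for i, e in enumerate(episodes)
        )
        code = generate(problem, context)
        passed = judge(code)  # self-judge, e.g. run sample unit tests
        episodes.append(Episode(code, passed, reflect(code, passed)))
        if passed:
            return code, turn + 1
    return None, max_turns
```

The loop terminates either when the self-judge accepts a candidate or when the turn budget is exhausted; the episodic memory is what distinguishes this from single-turn inference.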

📊 Experimental highlights

With the best inference technique, o1's pass@1 rate rises from 19.1% to 42.2%, and with a few specific instructions it solves 17 of the 18 problems previously unsolvable by any model or technique, a substantial performance gain.
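The pass@1 figures above follow the standard pass@k convention for code benchmarks. For reference, this is the unbiased estimator popularized by the HumanEval evaluation setup (not something introduced by this paper): with n generated samples per problem, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k).

```python
# Standard unbiased pass@k estimator (HumanEval convention), shown here only
# to make the reported pass@1 metric concrete.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k drawn samples passes,
    given n generated samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single sample per problem (n = k = 1), pass@1 reduces to the plain solve rate, which is how the 19.1% and 42.2% figures should be read.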

🎯 Application scenarios

Potential application areas include education, programming-assistance tools, and automated code generation. By improving LM performance on programming tasks, the work can give students and developers more effective learning and working support, and may play a significant role in future software development and algorithm design.

📄 Abstract (original)

Among the hardest tasks for humans are those found in competitive programming, where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain for assessing language models (LMs), however, it has not received enough attention. This study presents the ICPC benchmark, which consists of 254 international collegiate programming contest (ICPC) tasks. Each problem includes official analysis, reference code, and sample, high-quality unit, and hidden tests. With these resources, we are able to develop and evaluate a variety of LM inference techniques for competitive programming. With zero-shot chain-of-thought prompting, we find that o1 only achieves a 19.1\% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judging with reflection and retrieval over episodic information, raises this to 42.2\%. Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions. Our quantitative findings and qualitative research provide a footstep toward LMs with grounded, imaginative, and algorithmic thinking. We open-source our code and data at https://github.com/kraritt/zolve.