AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth

Authors: Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin

Category: cs.CL

Published: 2026-03-02

💡 One-Sentence Takeaway

AdaPonderLM is a gated pondering language model with token-wise adaptive depth that improves inference efficiency.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: adaptive computation time, recurrent Transformer, language model, early exit, inference acceleration

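📋 Key Points

Existing recurrent Transformer language models spend a fixed amount of computation at inference, wasting compute on easy tokens and lacking token-level adaptivity.
AdaPonderLM learns token-level early exiting and combines it with a KV reuse mechanism, making computation adaptive per token and improving inference efficiency.
Experiments show that AdaPonderLM reduces inference compute while keeping language modeling perplexity and downstream accuracy comparable to the baselines.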

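📝 Abstract (Translated from Chinese)

This paper proposes AdaPonderLM, a self-supervised recurrent language model that learns token-level early exiting without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. The analysis shows that the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, indicating that AdaPonderLM allocates compute to the right tokens rather than merely reducing average depth.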

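🔬 Method Details

Problem definition: Existing recurrent Transformer language models usually run a fixed number of iterations at inference, so every token receives the same amount of computation regardless of difficulty. This wastes compute on easy tokens and may under-compute hard ones. The key problem is therefore how to make computation depth adaptive at the token level.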

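Core idea: AdaPonderLM lets the model decide, per token, how many iterations to run based on how hard the token is. Easy tokens can halt early and save compute; hard tokens can run more iterations to improve accuracy. This adaptivity comes from a learned token-level early-exit policy.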
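
One way to check whether depth tracks difficulty is to group tokens by the iteration at which they halted and compare the mean negative log-likelihood (NLL) per group, NLL being the difficulty proxy the abstract uses. The sketch below is illustrative only: the function name `mean_nll_by_depth` and the input numbers are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict

def mean_nll_by_depth(target_probs, depths):
    """Average per-token NLL for each halting depth.

    target_probs: probability the model assigned to the correct next token
    depths: number of recurrent iterations each token ran before halting
    Both inputs are hypothetical illustration data, not paper results.
    """
    buckets = defaultdict(list)
    for p, d in zip(target_probs, depths):
        buckets[d].append(-math.log(p))  # NLL of one token
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}

# Made-up example: tokens that halted later were predicted with lower probability.
stats = mean_nll_by_depth([0.9, 0.5, 0.1, 0.8], [1, 2, 3, 1])
for depth, nll in stats.items():
    print(f"depth {depth}: mean NLL {nll:.3f}")
```

If the halting policy behaves adaptively, the mean NLL should rise with depth, as it does in this toy input.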

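Technical framework: AdaPonderLM builds on a recurrent Transformer architecture and adds a gate at every recurrent iteration. The gate reads the current token state and decides whether the token needs further computation. Concretely, the model uses iteration-specific MLP gates together with a monotonic halting mask to decide when each token stops recurring. To keep training and inference consistent and to obtain real speedups, it also introduces a KV reuse mechanism that reuses the cached key/value states of halted tokens.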
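
The gated loop described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: each gate is reduced to a single linear layer plus sigmoid rather than a full MLP, the halting rule "score above `threshold`" is an assumed decision rule, and `step_fn` is a hypothetical stand-in for the shared recurrent Transformer block.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_score(hidden, weights, bias):
    """Toy gate: one linear layer + sigmoid (the paper uses an MLP)."""
    return sigmoid(sum(h * w for h, w in zip(hidden, weights)) + bias)

def ponder(hiddens, step_fn, gates, threshold=0.5):
    """Run up to len(gates) iterations with a monotonic halting mask.

    hiddens: list of per-token hidden vectors
    step_fn: hypothetical shared recurrent block (vector -> vector)
    gates: one (weights, bias) pair per iteration, i.e. iteration-specific
    Returns the final hidden vectors and the iterations each token ran.
    """
    hiddens = [list(h) for h in hiddens]
    halted = [False] * len(hiddens)
    depth = [0] * len(hiddens)
    for weights, bias in gates:
        for i, h in enumerate(hiddens):
            if halted[i]:
                continue  # monotonic mask: a halted token never resumes
            hiddens[i] = step_fn(h)
            depth[i] += 1
            if gate_score(hiddens[i], weights, bias) > threshold:
                halted[i] = True
    return hiddens, depth

# Usage with a toy block that adds 1 to every coordinate.
gates = [([1.0], -2.5)] * 3
final, depth = ponder([[0.0], [2.0], [1.0]], lambda h: [x + 1 for x in h], gates)
print(depth)  # [3, 1, 2]
```

Here the three gates happen to share the same parameters for brevity; in an iteration-specific setup each entry of `gates` would carry its own weights.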

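Key innovation: AdaPonderLM makes computation depth adaptive per token. Compared with a recurrent Transformer that runs a fixed number of iterations, it adjusts the amount of computation to each token's difficulty, which improves compute efficiency. Its KV reuse mechanism additionally keeps training and inference consistent and turns the saved iterations into real speedups.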

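Key design: AdaPonderLM has three main components: 1) iteration-specific MLP gates that decide whether each token stops recurring; 2) a monotonic halting mask that guarantees a token never resumes computation once it has halted; 3) a KV reuse mechanism that reuses cached key/value states for halted tokens. For the loss, the model uses the standard language modeling objective and learns the token-level early-exit policy in a self-supervised way.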
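
The KV reuse component can be pictured as a per-token cache that is refreshed only while a token is still active. The sketch below is a bookkeeping illustration, not the paper's implementation: `kv_fn` is a hypothetical function that would compute a token's key/value at a given iteration, and every token is assumed to run at least one iteration.

```python
def build_kv_per_iteration(depths, kv_fn, max_iters):
    """Return the K/V entry each token exposes to attention at each iteration.

    depths: iterations each token runs before halting (each assumed >= 1)
    kv_fn: hypothetical (token_index, iteration) -> key/value entry
    Active tokens compute a fresh entry; halted tokens reuse the entry
    cached at their last active iteration.
    """
    cache = [None] * len(depths)
    per_iter = []
    computed = 0  # number of fresh K/V computations actually performed
    for t in range(max_iters):
        for i, d in enumerate(depths):
            if t < d:
                cache[i] = kv_fn(i, t)  # token still active: recompute
                computed += 1
            # else: token halted, keep the cached entry untouched
        per_iter.append(list(cache))
    return per_iter, computed

# Usage: tag each entry with (token, iteration) to see which ones are reused.
views, computed = build_kv_per_iteration([3, 1, 2], lambda i, t: (i, t), 3)
print(views[2])   # [(0, 2), (1, 0), (2, 1)]
print(computed)   # 6
```

Because later tokens always attend to the same cached entry for a halted token, the attention inputs seen during training with the mask match those seen at inference when the computation is actually skipped.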

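📊 Experimental Highlights

AdaPonderLM is evaluated on Pythia backbones from 70M to 410M parameters (pretraining) and up to 2.8B (continued pretraining). Across these models it reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. More importantly, under iso-FLOPs the learned halting policy consistently outperforms fixed pruning, indicating that the model directs compute to the tokens that matter most.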
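
To see how a distribution of halting depths maps to a compute reduction, the following back-of-the-envelope sketch computes the fraction of full-depth compute used. It assumes every iteration costs the same per token and ignores gate overhead and attention over reused K/V; the depth distribution is made up for illustration and is not taken from the paper.

```python
def relative_compute(depths, max_iters):
    """Fraction of full-depth compute used.

    Assumes a uniform cost per token per iteration (a simplification).
    """
    return sum(depths) / (len(depths) * max_iters)

# Hypothetical mix: 70 tokens run all 3 iterations, 30 tokens halt after 2.
depths = [3] * 70 + [2] * 30
ratio = relative_compute(depths, max_iters=3)
print(f"compute used: {ratio:.0%}, saved: {1 - ratio:.0%}")
```

Two policies with the same `relative_compute` are the kind of pair an iso-FLOPs comparison would set side by side: the budget is equal, and only the choice of which tokens receive the extra iterations differs.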

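🎯 Application Scenarios

AdaPonderLM is most relevant in resource-constrained settings such as mobile devices or edge computing. By adapting computation depth per token, it can lower compute cost (about 10% in the reported experiments) while preserving quality, which makes large language models easier to deploy in such settings. The same idea could potentially extend to other domains that benefit from adaptive computation, such as image recognition and speech recognition.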

📄 Abstract (Original)

Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
