Successor Heads: Recurring, Interpretable Attention Heads In The Wild

Authors: Rhys Gould, Euan Ong, George Ogden, Arthur Conmy

Categories: cs.LG, cs.AI, cs.CL

Published: 2023-12-14

Comments: 12 main text pages, with appendix

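💡 One-Sentence Summary

Identifies and explains successor heads: attention heads in large language models that increment tokens with a natural ordering.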

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, mechanistic interpretability, attention heads, successor heads, mod-10 features, vector arithmetic, model editing

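📋 Key Points

Little is currently understood about the internal operations of large language models, and results from toy models have not yet carried over to frontier models.
The paper introduces successor heads, attention heads that increment tokens with a natural ordering, and analyzes them with mechanistic interpretability methods.
Successor heads appear in LLMs of different architectures and sizes, and the 'mod-10 features' underlying them can be used to edit head behavior.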

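📝 Abstract (Translated from Chinese)

This paper introduces successor heads, attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, a successor head increments 'Monday' to 'Tuesday'. We explain successor head behavior with an approach based on mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research has found interpretable language model components in small toy models. However, results from toy models have not yet turned into insights that explain the internals of frontier models, and little is currently known about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations shared across architectures. They form in LLMs with as few as 31 million and as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs of different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and to gain insight into numeric representations in LLMs. In addition, we study the behavior of successor heads on natural language data and identify interpretable polysemanticity in a Pythia successor head.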

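🔬 Method Details

Problem definition: Existing methods struggle to explain the internal mechanisms of large language models (LLMs), in particular how they handle tokens with an ordering relation, such as numbers, days, and months. Prior work has focused mainly on small toy models, and its results are hard to generalize to larger, more complex LLMs. A method is therefore needed for understanding how LLMs process such ordered tokens.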

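Core idea: The paper identifies and analyzes successor heads in LLMs in order to understand how the models handle tokens with a natural ordering. A successor head is an attention head whose function is to increment a token to the next one in its sequence (for example, "Monday" to "Tuesday"). Studying the behavior of these heads gives insight into the abstract representations and computations inside LLMs.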

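Technical framework: The study proceeds in five main steps: 1) identify successor heads in LLMs of different architectures and sizes (for example GPT-2, Pythia, and Llama-2); 2) analyze the attention patterns and activations of these heads to understand how they implement incrementation; 3) identify the underlying 'mod-10 features' on which the incrementation is based; 4) edit these features with vector arithmetic and observe the effect on head behavior; 5) study the behavior of successor heads on natural language data to identify possible polysemanticity.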
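Step 1 can be pictured with a toy scoring function. The sketch below is an illustration only, not the paper's procedure, which this summary does not spell out: it assumes each head can be summarized by a hypothetical matrix `logit_map[i, j]` giving the score for output token `j` when the head reads input token `i`, with tokens listed in their natural order. The matrices are synthetic, and the names `successor_score` and `make_toy_heads` are invented for this example.

```python
import numpy as np


def successor_score(logit_map: np.ndarray) -> float:
    """Fraction of tokens i whose highest-scoring output is token i + 1.

    The last token has no successor in the list, so it is skipped.
    """
    n = logit_map.shape[0]
    predicted = logit_map[: n - 1].argmax(axis=1)
    return float(np.mean(predicted == np.arange(1, n)))


def make_toy_heads(n_tokens: int = 12, seed: int = 0) -> dict:
    """Build synthetic stand-ins for three heads (assumed, not real model weights)."""
    rng = np.random.default_rng(seed)

    def noise() -> np.ndarray:
        return 0.1 * rng.standard_normal((n_tokens, n_tokens))

    shift = np.eye(n_tokens, k=1)  # ones at positions (i, i + 1)
    return {
        "successor-like": 5.0 * shift + noise(),
        "copy-like": 5.0 * np.eye(n_tokens) + noise(),
        "random": rng.standard_normal((n_tokens, n_tokens)),
    }


heads = make_toy_heads()
scores = {name: successor_score(m) for name, m in heads.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In this toy setup the successor-like head scores 1.0 and the copy-like head scores 0.0; scanning every head of a real model with a score of this kind and keeping the high scorers is one plausible way to shortlist candidates for closer analysis.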

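Key innovation: The paper identifies and explains successor heads in LLMs and reveals their internal 'mod-10 features', offering a new perspective on how LLMs process ordered tokens. It also shows that these features can be edited with vector arithmetic to control head behavior, which provides a new approach to model controllability and interpretability.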

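Key design: The main design choices are: 1) using attention patterns and activations to identify successor heads; 2) identifying 'mod-10 features' by analyzing the heads' weight matrices; 3) editing these features with vector arithmetic (for example, addition and subtraction); 4) designing experiments to verify that the edited heads behave as expected. The paper does not detail parameter settings, loss functions, or network structures, because its focus is on analyzing the internal mechanisms of existing models.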
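The feature-editing idea in points 2 and 3 can be illustrated with a deliberately simplified model of number representations. The sketch below is an assumption-laden toy, not the paper's method: it supposes that the representation of a number from 0 to 99 is the sum of a hypothetical "tens" direction and a hypothetical "mod-10" direction, with all directions orthonormal, and it decodes by the largest dot product. Real model features are learned and need not be this clean.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32

# 20 orthonormal directions (reduced QR of a random matrix):
# 10 hypothetical "tens" features and 10 hypothetical "mod-10" features.
basis, _ = np.linalg.qr(rng.standard_normal((DIM, 20)))
tens_feats = basis[:, :10].T   # shape (10, DIM)
mod10_feats = basis[:, 10:].T  # shape (10, DIM)


def represent(n: int) -> np.ndarray:
    """Toy representation of n in 0..99: tens feature plus mod-10 feature."""
    return tens_feats[n // 10] + mod10_feats[n % 10]


ALL_REPS = np.stack([represent(n) for n in range(100)])


def decode(vec: np.ndarray) -> int:
    """Return the number whose toy representation best matches vec."""
    return int(np.argmax(ALL_REPS @ vec))


def set_last_digit(vec: np.ndarray, old_digit: int, new_digit: int) -> np.ndarray:
    """Edit by vector arithmetic: remove one mod-10 feature, add another."""
    return vec - mod10_feats[old_digit] + mod10_feats[new_digit]


edited = set_last_digit(represent(37), old_digit=7, new_digit=2)
print(decode(edited))  # 32
```

Swapping only the mod-10 component changes the last digit and leaves the tens component untouched, which is the intuition behind editing behavior through these features. The toy does not model carries, so moving from 39 to 40 would need the tens feature to change as well.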

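📊 Experimental Highlights

Successor heads are found in LLMs of many architectures and sizes, including GPT-2, Pythia, and Llama-2, with parameter counts ranging from 31 million to 12 billion. The paper identifies 'mod-10 features' and shows that editing them with vector arithmetic changes successor head behavior. It also finds interpretable polysemanticity in a Pythia successor head.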

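🎯 Applications

The findings can be applied to improving the interpretability and controllability of large language models. Understanding how successor heads work gives finer control over model text generation; for example, a model could be edited to handle date and time information more accurately. The work may also suggest new directions for building more efficient numerical reasoning models.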

📄 Abstract (Original)

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
