Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations

📄 arXiv: 2604.01639v1

Authors: Shou-Tzu Han, Rodrigue Rizk, KC Santosh

Category: cs.CL

Published: 2026-04-02

Note: Preprint. Under review at COLM 2026


💡 One-Sentence Takeaway

A mechanistic diagnostic framework is proposed to explain and address large language models' fragility to meaning-preserving surface perturbations.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, mechanistic diagnostics, semantic perturbation, mathematical reasoning, layer-wise divergence amplification, model robustness, activation patching, failure taxonomy

📋 Key Points

  1. Current large language models exhibit high answer-flip rates under meaning-preserving perturbations, revealing their fragility.
  2. The paper proposes the Mechanistic Perturbation Diagnostics (MPD) framework, which combines several analysis techniques to trace the mechanistic basis of model failures.
  3. Experiments show that Llama-3's localized failures are recoverable: targeted repair (steering vectors and layer fine-tuning) restores 12.2% of them, while recovery is much weaker for the other models (7.2% for Qwen's entangled failures, 5.2% for Mistral's distributed ones).

📝 Abstract (Translated Summary)

Large language models perform strongly on mathematical reasoning benchmarks yet are unexpectedly fragile to meaning-preserving surface perturbations. This paper systematically evaluates three open-weight language models and finds answer-flip rates as high as 28.8%-45.1% on 677 GSM8K problems. To trace the mechanistic basis of these failures, it introduces the Mechanistic Perturbation Diagnostics (MPD) framework, which combines several analysis methods, proposes a new metric of layer-wise divergence amplification, the Cascading Amplification Index (CAI), and validates a failure taxonomy through targeted repair experiments.

🔬 Method Details

Problem definition: The paper targets the fragility of large language models under meaning-preserving perturbations; existing methods fail to effectively identify and repair these failures.
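Per the original abstract, the meaning-preserving perturbations are name substitution and number-format paraphrasing. A toy sketch of both transforms; the function names and word lists here are illustrative, not the paper's actual substitution tables:

```python
import re

# Illustrative name pool; the paper's actual substitution list is not specified.
NAME_SWAPS = {"Natalia": "Priya", "Weng": "Mateo", "Betty": "Yuki"}

def swap_names(problem: str) -> str:
    """Replace person names with alternatives, leaving the math untouched."""
    for old, new in NAME_SWAPS.items():
        problem = re.sub(rf"\b{old}\b", new, problem)
    return problem

def paraphrase_numbers(problem: str) -> str:
    """Rewrite digits as number words (toy lookup), preserving the values."""
    words = {"2": "two", "16": "sixteen", "48": "forty-eight"}
    return re.sub(r"\b(\d+)\b", lambda m: words.get(m.group(1), m.group(1)), problem)

q = "Natalia sold clips to 48 of her friends in April."
print(swap_names(q))          # same arithmetic, different name
print(paraphrase_numbers(q))  # same value, different surface form
```

A model answering the original and its variant differently counts as an answer flip, the paper's unit of fragility.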

Core idea: Introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining several analysis techniques to probe the mechanisms behind model failures and to ground a new failure taxonomy.

Technical framework: MPD comprises four components: logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI), unified into a single diagnostic pipeline.
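The logit-lens component can be sketched as projecting each layer's residual-stream state through the unembedding matrix and recording where the perturbed run's top token first departs from the clean run's. The toy dimensions and random "model" below are stand-ins, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, vocab = 6, 16, 50
W_U = rng.normal(size=(d_model, vocab))  # toy unembedding matrix

def logit_lens_top_tokens(hidden_states):
    """Project each layer's hidden state through W_U and take the argmax token."""
    return [int(np.argmax(h @ W_U)) for h in hidden_states]

clean = [rng.normal(size=d_model) for _ in range(n_layers)]
# Perturbed run: identical early layers, drifting from layer 3 onward.
pert = clean[:3] + [rng.normal(size=d_model) for _ in range(3)]

clean_top = logit_lens_top_tokens(clean)
pert_top = logit_lens_top_tokens(pert)
first_div = next(
    (i for i, (a, b) in enumerate(zip(clean_top, pert_top)) if a != b), None
)  # the "first divergence layer" baseline that CAI is compared against
```

The paper reports that flipped samples diverge at significantly earlier layers than stable ones; this per-layer top-token trace is what such a comparison operates on.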

Key innovation: CAI is a novel metric that quantifies layer-wise divergence amplification; it outperforms the first-divergence-layer baseline as a failure predictor for two of the three architectures, sharpening the picture of where model fragility arises.
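The paper does not spell out CAI's formula in this summary, so the following is one plausible instantiation under a loudly stated assumption: that "cascading amplification" means the ratio by which clean-vs-perturbed divergence grows from one layer to the next, averaged over depth. Values above 1 would indicate a divergence that cascades and amplifies:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two next-token distributions (smoothed)."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def cascading_amplification_index(clean_dists, pert_dists):
    """ASSUMED form of CAI: mean ratio of successive per-layer divergences.
    > 1 means clean-vs-perturbed divergence is amplified with depth."""
    d = np.array([kl(c, p) for c, p in zip(clean_dists, pert_dists)])
    ratios = d[1:] / (d[:-1] + 1e-12)
    return float(ratios.mean())
```

Any index of this shape yields one scalar per sample, which is what lets the paper use CAI as a per-sample failure predictor and score it with AUC.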

Key design: Repair strategies are matched to failure type: steering vectors and layer fine-tuning are applied to recover localized failures, with markedly weaker recovery for distributed and entangled ones.
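The steering-vector repair can be sketched as adding a difference-of-means direction (mean activation on correct runs minus mean on flipped runs) into the residual stream at the diagnosed layer. The layer choice, the scale `alpha`, and the toy activations below are hypothetical, not the paper's configuration:

```python
import numpy as np

def fit_steering_vector(correct_acts, flipped_acts):
    """Difference of class means at one layer: points from 'flipped' toward 'correct'."""
    return np.mean(correct_acts, axis=0) - np.mean(flipped_acts, axis=0)

def apply_steering(hidden, v, alpha=1.0):
    """Add the steering direction into the residual stream at inference time."""
    return hidden + alpha * v

rng = np.random.default_rng(1)
correct = rng.normal(loc=1.0, size=(32, 8))   # toy activations, d_model=8
flipped = rng.normal(loc=-1.0, size=(32, 8))
v = fit_steering_vector(correct, flipped)     # points roughly +2.0 per dimension

h = flipped[0]
steered = apply_steering(h, v, alpha=1.0)     # nudged toward the 'correct' cluster
```

This kind of single-direction, single-layer intervention is exactly what should work for localized failures and fail for distributed or entangled ones, which is the taxonomy the repair experiments validate.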

🖼️ Key Figures

fig_0
fig_1
fig_2

📊 Experimental Highlights

Across the three models, answer-flip rates under meaning-preserving perturbations reach 28.8%-45.1%. CAI performs well as a failure predictor, with AUC up to 0.679. Targeted repair recovers 12.2% of Llama-3's localized failures.
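The AUC comparison (CAI versus first-divergence-layer as per-sample flip predictors) is a standard rank-based ROC AUC over scores and flip labels. A dependency-free sketch with illustrative numbers, not the paper's data:

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: probability that a random positive (flipped) sample
    outscores a random negative (stable) one; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: CAI-like scores that mostly rank flipped samples (label 1) higher.
scores = [2.1, 1.8, 1.5, 0.9, 0.7, 0.4]
labels = [1,   1,   0,   1,   0,   0]
print(roc_auc(scores, labels))  # 8/9 ≈ 0.889
```

An AUC of 0.679, as reported for CAI, therefore means the metric ranks a flipped sample above a stable one about 68% of the time, a modest but real predictive signal.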

🎯 Applications

Potential application areas include natural language processing, question answering systems, and educational technology. Hardening language models against such perturbations would improve their reliability and accuracy in deployment, supporting broader adoption of intelligent systems.

📄 Abstract (Original)

Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.