Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation

📄 arXiv: 2508.09666v2

Authors: Ziyang Ma, Qingyue Yuan, Linhai Zhang, Deyu Zhou

Category: cs.CL

Published: 2025-08-13 (updated: 2025-08-15)

Note: Preprint


💡 One-Line Takeaway

Proposes SLowED, a distillation method that preserves the safety of small language models during chain-of-thought distillation.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: small language models, chain-of-thought, distillation training, model safety, reasoning ability, low-entropy masking, slow tuning

📋 Key Points

  1. Existing chain-of-thought distillation methods improve the reasoning ability of small language models but overlook the negative effect that this training has on model safety.
  2. The proposed SLowED method combines a Slow Tuning module and a Low-Entropy Masking module, constraining weight updates and masking unnecessary learning targets so that model safety is preserved.
  3. In experiments on three small language models, SLowED performs well on both reasoning benchmarks and safety evaluation: safety is retained while reasoning capability improves comparably to existing distillation methods.

📝 Abstract (Translated)

Existing chain-of-thought distillation methods enhance the reasoning ability of small language models mainly by exploiting high-quality rationales generated by powerful large language models. However, the negative impact of this training on the safety of small language models has received little research attention. This paper proposes a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), which preserves the safety of small language models by scaling down the magnitude of weight changes and masking low-entropy tokens. Experimental results show that SLowED retains model safety while improving reasoning capability comparably to existing distillation methods.

🔬 Method Details

Problem definition: This work addresses the safety degradation that small language models can suffer during chain-of-thought distillation. Existing methods improve reasoning ability but may leave the model more vulnerable to harmful inputs.

Core idea: SLowED shapes the learning process by scaling down the magnitude of weight changes (Slow Tuning) and masking low-entropy tokens (Low-Entropy Masking), so that reasoning ability improves while safety is maintained.

Technical framework: SLowED consists of two modules. Slow Tuning constrains weight changes so that optimization stays in the neighborhood of the initial weights; Low-Entropy Masking excludes low-entropy tokens, treated as unnecessary learning targets, from fine-tuning. In the overall pipeline, Slow Tuning is applied first, followed by Low-Entropy Masking for refined training.
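As a rough illustration of the Slow Tuning idea, the sketch below takes an ordinary gradient step and then shrinks the resulting displacement from the initial weights, keeping optimization near the initial weight distribution. The function name, the `scale` hyperparameter, and the exact form of the shrinkage are assumptions for illustration; the paper's actual update rule may differ.

```python
def slow_tuning_step(w_init, w, grad, lr=0.1, scale=0.1):
    """Hypothetical Slow Tuning update (illustrative, not the
    paper's exact rule): take a plain gradient step, then scale
    the total displacement from the initial weights w_init down
    by `scale`, so the weights stay near their starting point."""
    w_stepped = [wi - lr * g for wi, g in zip(w, grad)]  # plain SGD step
    return [w0 + scale * (ws - w0)                       # shrink displacement
            for w0, ws in zip(w_init, w_stepped)]

# toy example: the constrained update barely moves the weights
w_new = slow_tuning_step([1.0, -1.0], [1.0, -1.0], [0.5, 0.5])
print(w_new)  # close to [0.995, -1.005]
```

With `scale=0.1`, the update moves each weight one tenth as far from its initial value as a plain SGD step would, which is the "neighboring space" intuition in the abstract.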

Key innovation: The core novelty of SLowED lies in combining the two strategies: Slow Tuning keeps optimization near the initial weights, while Low-Entropy Masking excludes low-entropy tokens to avoid unnecessary learning interference. This contrasts sharply with existing methods, which adjust weights without such constraints and learn from all tokens.

Key design: In Slow Tuning, the magnitude of weight changes is strictly controlled to keep the model stable; in Low-Entropy Masking, an entropy threshold determines which tokens are masked out. The loss function and network setup used in the experiments are designed to keep training both effective and safe.
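A minimal sketch of token-level Low-Entropy Masking, assuming entropy is computed from the model's per-token predictive distribution and tokens whose entropy falls below a threshold are dropped from the cross-entropy loss. The function names and the threshold value `tau` are illustrative assumptions, not the paper's actual settings.

```python
import math

def token_entropy(dist):
    """Shannon entropy of one token's predictive distribution."""
    return -sum(p * math.log(p + 1e-12) for p in dist)

def low_entropy_mask(probs, tau=0.5):
    """True = keep the token in the loss; False = entropy below
    the (assumed) threshold tau, so the token is masked out as an
    unnecessary learning target."""
    return [token_entropy(d) >= tau for d in probs]

def masked_nll(probs, targets, tau=0.5):
    """Cross-entropy averaged only over the non-masked tokens."""
    mask = low_entropy_mask(probs, tau)
    nll = [-math.log(d[t] + 1e-12) for d, t in zip(probs, targets)]
    kept = [l for l, m in zip(nll, mask) if m]
    return sum(kept) / max(len(kept), 1)

# a near-certain (low-entropy) token is excluded; an uncertain one is kept
probs = [[0.25, 0.25, 0.25, 0.25],   # high entropy -> kept
         [0.97, 0.01, 0.01, 0.01]]   # low entropy  -> masked
print(low_entropy_mask(probs))  # [True, False]
```

The intuition is that tokens the model already predicts with near certainty contribute little useful signal, so excluding them narrows fine-tuning to the targets that actually matter.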

📊 Experimental Highlights

Results show that SLowED performs well on both reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench): it retains the safety of small language models while improving their reasoning capability comparably to existing distillation methods (exact margins are not reported here).

🎯 Application Scenarios

Potential application areas include dialogue systems, text generation, and question answering in natural language processing. By improving the reasoning ability of small language models while preserving their safety, SLowED can reduce a deployed model's vulnerability to harmful inputs, improving user trust and system reliability.

📄 Abstract (Original)

Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects on SLM safety brought by the training, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and probably impact the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of model weight changes to optimize the model weights in the neighboring space near the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs and comparably improves their reasoning capability compared to existing distillation methods. Furthermore, our ablation study presents the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model's safety in the early stage and the latter prolonging the safe training epochs.