Structured Multidimensional Representation Learning for Large Language Models

Authors: Alaa El Ichi, Khalide Jbilou, Mohamed El Guide, Franck Dufrenois

Categories: cs.CL, math.NA

Published: 2026-03-05

Comments: 25 pages, 6 figures. Preprint of a journal submission

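💡 One-sentence takeaway

Proposes a tensor Transformer built on the L-product that cuts parameter redundancy in large language models and, according to the authors, improves generalization.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: large language models, model compression, tensor decomposition, spectral decomposition, Transformer, natural language processing, discrete cosine transform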

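📋 Key points

Transformer models carry very large parameter counts and redundancy in the embedding dimension, which limits their use in resource-constrained settings.
The paper proposes an L-product-based tensor factorization that reshapes token representations into spectral tensor slices and computes in the transform domain, reducing the parameter count.
Experiments on text classification show that encoder parameters can be reduced while accuracy stays competitive: on IMDB the compressed model matches or improves on the baseline, and on AG News it loses a little accuracy at moderate width.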

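📝 Abstract (translated from Chinese)

Transformer architectures perform well on pattern recognition and natural language processing tasks, but scaling them brings substantial parameter growth and redundancy in the embedding dimension. This paper proposes a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, the authors obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-Transformers while preserving standard Transformer semantics. They prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, so that at a fixed total embedding size the encoder parameters shrink to roughly 1/p of the original (up to lower-order terms such as biases and normalization parameters). When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves on the standard baseline under compression. On AG News at moderate width there is a small accuracy decrease in exchange for a 4× encoder reduction, and at BERT-base width (d=768) performance returns to parity.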

🔬 Method details

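Problem definition: When scaled up, existing Transformer models suffer from very large parameter counts and redundancy in the embedding dimension. This raises compute and storage costs and limits deployment on resource-constrained devices. Existing approaches typically compress models through pruning, quantization, and similar techniques, which can degrade performance or require complex optimization procedures.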

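Core idea: The paper applies a structured spectral factorization to the Transformer embedding space. Using the L-product, token representations are reshaped into spectral tensor slices, and attention and feed-forward operations are carried out in the transform domain. The encoder thereby decomposes into several independent spectral sub-Transformers, which lowers the parameter count and introduces a frequency-related inductive bias that the authors report improves generalization.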

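Technical framework: The L-Transformer architecture consists of the following steps: 1) reshape the token embeddings into a third-order tensor; 2) apply the spectral transform to the tensor to obtain p spectral tensor slices; 3) in the transform domain, apply an independent sub-Transformer to each spectral slice; 4) apply the inverse transform to the sub-Transformer outputs to obtain the final token representations. The whole process preserves standard Transformer semantics and can be trained end to end.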
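The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `dct_matrix` and `l_transformer_layer_sketch`, the particular way the embedding axis is reshaped into a tensor, and the use of arbitrary per-slice callables in place of full sub-Transformers are all choices made for the example.

```python
import numpy as np

def dct_matrix(p):
    """Orthonormal DCT-II matrix of size p x p (row k is frequency k)."""
    k = np.arange(p)[:, None]
    j = np.arange(p)[None, :]
    C = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * j + 1) * k / (2 * p))
    C[0, :] /= np.sqrt(2.0)  # rescale the constant row so that C @ C.T == I
    return C

def l_transformer_layer_sketch(X, slice_ops, p):
    """X: (n_tokens, d) embeddings; slice_ops: p callables, one per spectral slice."""
    n, d = X.shape
    assert d % p == 0, "embedding size must be divisible by p"
    C = dct_matrix(p)
    # Step 1: reshape (n, d) into a third-order tensor (n, d/p, p).
    # The memory layout used here is an assumption of this sketch.
    T = X.reshape(n, d // p, p)
    # Step 2: apply the DCT along the third mode to get p spectral slices.
    T_hat = T @ C.T
    # Step 3: process every slice independently (stand-in for a sub-Transformer).
    out_hat = np.stack([slice_ops[k](T_hat[:, :, k]) for k in range(p)], axis=-1)
    # Step 4: invert the transform (C is orthogonal, so its inverse is C.T).
    out = out_hat @ C
    return out.reshape(n, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
identity_ops = [lambda S: S] * 4
Y = l_transformer_layer_sketch(X, identity_ops, p=4)
print(np.allclose(Y, X))  # identity slices return the input unchanged
```

With identity slice operations the layer is a pure transform followed by its inverse, which is a quick way to check that steps 2 and 4 cancel. Replacing each callable with a width-d/p attention and feed-forward block would give the structure the summary describes.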

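Key innovation: The central technical contribution is the use of the L-product to factorize the embedding space into several independent spectral subspaces. This factorization lowers the parameter count and introduces a frequency-related inductive bias over the embedding. Compared with the compression approaches mentioned above, it requires no separate optimization procedure and is compatible with existing training pipelines.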
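For readers unfamiliar with the L-product, the sketch below shows its standard definition for third-order tensors: transform both operands along the third mode with an invertible matrix, multiply matching frontal slices, and transform back. The helper names `mode3` and `l_product` are hypothetical, and the example uses a random orthogonal matrix as a stand-in for the transform; the paper instantiates it with a DCT.

```python
import numpy as np

def mode3(A, M):
    """Apply the matrix M to every tube A[i, j, :] of a third-order tensor."""
    return A @ M.T

def l_product(A, B, M):
    """L-product of A (n1 x n2 x p) and B (n2 x n3 x p) for an invertible p x p matrix M."""
    A_hat = mode3(A, M)
    B_hat = mode3(B, M)
    # Face-wise product: an ordinary matrix product on each frontal slice.
    C_hat = np.einsum("ikp,kjp->ijp", A_hat, B_hat)
    # Return to the original domain with the inverse transform.
    return mode3(C_hat, np.linalg.inv(M))

rng = np.random.default_rng(0)
p = 4
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))  # orthogonal stand-in transform
A = rng.standard_normal((2, 3, p))
B = rng.standard_normal((3, 5, p))
print(l_product(A, B, Q).shape)  # (2, 5, 4)
```

Because the slices never interact in the transform domain, each of the p frontal slices can carry its own, smaller set of weights, which is where the independent sub-Transformers come from.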

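Key design: The paper instantiates the spectral transform with a real-valued Discrete Cosine Transform (DCT), which keeps the whole model fully differentiable. It also describes slice-dependent frequency scaling, enabled by the inductive bias over embedding frequencies, which the authors report improves generalization. At a fixed total embedding size, the encoder parameters shrink to roughly 1/p of the baseline, where p is the number of sub-Transformers (a reduction of about 1 − 1/p, for example 75% for p=4), up to lower-order terms such as biases and normalization parameters.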
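The 1/p figure can be checked with a back-of-the-envelope weight count. The sketch below assumes a conventional encoder layer with four d × d attention projections and a feed-forward block whose hidden width is 4d, ignores biases and normalization parameters, and assumes each sub-Transformer has its own weights at width d/p. The function names are hypothetical and the layer shape is an assumption, not taken from the paper.

```python
def encoder_layer_weights(d, ffn_mult=4):
    """Weight-matrix entries in one standard encoder layer (biases and norms ignored)."""
    attention = 4 * d * d                  # query, key, value and output projections
    feed_forward = 2 * d * (ffn_mult * d)  # up- and down-projection
    return attention + feed_forward

def l_transformer_layer_weights(d, p, ffn_mult=4):
    """p independent sub-layers, each operating at width d / p."""
    assert d % p == 0, "embedding size must be divisible by p"
    return p * encoder_layer_weights(d // p, ffn_mult)

d, p = 768, 4
baseline = encoder_layer_weights(d)
tensorized = l_transformer_layer_weights(d, p)
print(baseline, tensorized, tensorized / baseline)  # ratio is 0.25, i.e. a 75% reduction
```

Every weight matrix scales quadratically with width, so p layers of width d/p cost p · (d/p)² = d²/p in place of d², which gives the 1/p ratio regardless of the feed-forward multiplier.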

📊 Experimental highlights

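Experiments on the IMDB and AG News text classification tasks show that the L-Transformer can substantially reduce encoder parameters (up to 75% for p=4) while keeping accuracy competitive. On IMDB, the tensorized encoder matches or improves on the standard baseline under compression. On AG News, moderate-width models lose a small amount of accuracy in exchange for a 4× smaller encoder, while at BERT-base width (d=768) performance returns to parity.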

🎯 Application scenarios

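The approach could apply to a range of natural language processing tasks, particularly in resource-constrained settings such as on-device text classification or machine translation. A lower parameter count reduces compute and storage requirements, which would let large language models be deployed on more devices. The frequency-related inductive bias may also help in few-shot learning and domain generalization, although the reported experiments cover only text classification on IMDB and AG News.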

📄 Abstract (original)

Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which yields approximately 1/p reduction (up to lower-order terms such as biases and normalization parameters) in encoder parameters under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG News at moderate width we observe a small accuracy decrease in exchange for a 4× encoder reduction; at BERT-base width (d=768), performance returns to parity.
