Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation

Authors: Qirui Li, Guangcong Zheng, Qi Zhao, Jie Li, Bin Dong, Yiwu Yao, Xi Li

Category: cs.CV

Published: 2025-08-18

🔗 Code/Project: PROJECT_PAGE

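💡 One-Sentence Takeaway

Proposes Compact Attention, a structured sparse attention framework that accelerates video generation

🎯 Matched Area: Pillar 8: Physics-based Animation

Keywords: video generation, self-attention, spatio-temporal sparsity, deep learning, transformer, computational efficiency, dynamic tiling, automated configuration search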

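📋 Key Points

Existing sparse attention methods do not fully exploit the spatio-temporal redundancy in video data, which leaves attention computation inefficient.
Compact Attention optimizes attention computation through adaptive tiling, temporally varying windows, and automated configuration search.
On single-GPU setups, the method speeds up attention computation by 1.6~2.5x while keeping visual quality comparable to full-attention baselines.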

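📝 Abstract (Translated from Chinese)

The computational demands of self-attention pose a major challenge for transformer-based video generation, especially when synthesizing ultra-long sequences. Existing approaches such as factorized attention and fixed sparse patterns do not fully exploit the spatio-temporal redundancy in video data. Through a systematic analysis of video diffusion transformers, the paper finds that attention matrices exhibit structured yet heterogeneous sparsity patterns, with specialized heads dynamically attending to distinct spatio-temporal regions. To address this, the paper proposes Compact Attention, a hardware-aware acceleration framework with three innovations: adaptive tiling strategies, temporally varying windows, and an automated configuration search algorithm. The method achieves a 1.6~2.5x speedup in attention computation on single-GPU setups while maintaining visual quality comparable to full-attention baselines.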

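🔬 Method Details

Problem definition: The paper targets the high computational cost of self-attention in transformer-based video generation. Existing approaches such as factorized attention and fixed sparse patterns do not effectively exploit the spatio-temporal redundancy of video data, so attention remains a performance bottleneck.

Core idea: Compact Attention adapts the sparsity of the attention computation to the structured sparsity observed in video data, improving computational efficiency while maintaining generation quality.

Technical framework: The framework has three main modules: adaptive tiling strategies, temporally varying windows, and an automated configuration search algorithm. Adaptive tiling approximates diverse spatial interaction patterns through dynamic tile grouping; temporally varying windows adjust the sparsity level according to frame proximity; the automated configuration search optimizes the sparse patterns while preserving critical attention pathways.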
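As a rough illustration of how a temporally varying window could be expressed at tile granularity, the Python sketch below builds a boolean tile-to-tile mask in which the allowed spatial window shrinks as the temporal distance between frames grows. The names `tile_mask` and `density`, the `windows` dictionary, and all sizes are hypothetical choices for this example, not the paper's implementation, and the dynamic tile grouping is not modeled here.

```python
from itertools import product

def tile_mask(T, H, W, windows):
    """Return an N x N boolean mask over tiles, N = T * H * W.

    windows maps a temporal distance dt (in frames) to (rh, rw), the
    spatial half-window in tiles allowed at that distance. Temporal
    distances missing from the dict are masked out entirely.
    """
    tiles = list(product(range(T), range(H), range(W)))
    mask = []
    for tq, hq, wq in tiles:
        row = []
        for tk, hk, wk in tiles:
            win = windows.get(abs(tq - tk))
            keep = (win is not None
                    and abs(hq - hk) <= win[0]
                    and abs(wq - wk) <= win[1])
            row.append(keep)
        mask.append(row)
    return mask

def density(mask):
    """Fraction of tile pairs that are kept (1.0 means full attention)."""
    return sum(sum(row) for row in mask) / (len(mask) * len(mask[0]))

# Hypothetical configuration: wide window inside a frame, narrower
# windows for neighbouring frames, nothing beyond two frames away.
windows = {0: (3, 3), 1: (1, 1), 2: (0, 0)}
mask = tile_mask(T=4, H=4, W=4, windows=windows)
print(f"kept tile pairs: {density(mask):.2%}")
```

A block-sparse attention kernel would skip every tile pair marked `False`, so the printed density approximates the share of tile pairs that would still be computed under this assumed configuration.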

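Key innovation: The main innovation is the combination of dynamic sparse patterns with adaptive strategies, which keeps attention computation both efficient and flexible and avoids the rigid constraints and significant overhead of existing sparse attention methods.

Key design: The design parameterizes dynamic tiling and the temporally varying windows so that sparsity can be adjusted across frames, and uses an optimized loss function to balance computational efficiency against generation quality.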
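The search procedure itself is not detailed in this summary. One plausible shape for it, shown here purely as an assumption, is a greedy loop that shrinks the window for each temporal distance as long as the retained attention mass (recall) stays above a threshold. The function `search_windows`, the `mass` profile, and the threshold `tau` are all hypothetical.

```python
def search_windows(mass, tau):
    """Greedy window search under a recall constraint (illustrative only).

    mass[dt][r] is the share of a head's attention mass that falls at
    temporal distance dt and spatial tile distance r; all entries are
    assumed to sum to about 1. Returns (radius, recall), where
    radius[dt] is the largest ring kept and -1 means the whole
    temporal offset is skipped.
    """
    radius = [len(rings) - 1 for rings in mass]
    recall = sum(sum(rings) for rings in mass)
    while True:
        # Candidate moves: drop the outermost kept ring of one offset.
        moves = [(mass[dt][r], dt) for dt, r in enumerate(radius) if r >= 0]
        if not moves:
            break
        loss, dt = min(moves)
        # The cheapest move already violates the threshold, so stop.
        if recall - loss < tau:
            break
        recall -= loss
        radius[dt] -= 1
    return radius, recall

# Made-up profile: mass concentrates near the query tile and in
# nearby frames, as the structured-sparsity observation suggests.
mass = [
    [0.40, 0.20, 0.05, 0.01],  # same frame
    [0.15, 0.06, 0.02, 0.01],  # one frame apart
    [0.06, 0.02, 0.01, 0.01],  # two frames apart
]
radius, recall = search_windows(mass, tau=0.9)
print(radius, round(recall, 2))  # [2, 1, 0] 0.92
```

This sketch ranks moves only by the attention mass they discard. A cost-aware variant would also weigh how many tiles each ring contains, since outer rings save more computation.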

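📊 Experimental Highlights

Experiments show that Compact Attention achieves a 1.6~2.5x speedup in attention computation on single-GPU setups, with visual quality comparable to full-attention baselines. This indicates that the method is effective for video generation.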
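Because the reported 1.6~2.5x figure applies to attention computation, the end-to-end gain depends on how much of the total runtime attention occupies. The sketch below applies Amdahl's law to make that relationship concrete; the attention fractions used are hypothetical values chosen for illustration, not measurements from the paper.

```python
def end_to_end_speedup(attn_fraction, attn_speedup):
    """Amdahl's law: overall speedup when only attention is accelerated.

    attn_fraction is the share of total runtime spent in attention
    (between 0 and 1); attn_speedup is the speedup of attention alone.
    """
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

# Hypothetical attention shares combined with the reported range.
for fraction in (0.5, 0.8):
    low = end_to_end_speedup(fraction, 1.6)
    high = end_to_end_speedup(fraction, 2.5)
    print(f"attention share {fraction:.0%}: {low:.2f}x to {high:.2f}x overall")
```

The overall speedup approaches the attention speedup only as attention comes to dominate runtime, which is the regime of long sequences that the paper targets.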

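🎯 Application Scenarios

Potential applications include video generation, real-time video processing, and long-video understanding. By making video generation more efficient, Compact Attention could enable faster processing in a range of practical settings.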

📄 Abstract (Original)

The computational demands of self-attention mechanisms pose a critical challenge for transformer-based video generation, particularly in synthesizing ultra-long sequences. Current approaches, such as factorized attention and fixed sparse patterns, fail to fully exploit the inherent spatio-temporal redundancies in video data. Through systematic analysis of video diffusion transformers (DiT), we uncover a key insight: Attention matrices exhibit structured, yet heterogeneous sparsity patterns, where specialized heads dynamically attend to distinct spatiotemporal regions (e.g., local pattern, cross-shaped pattern, or global pattern). Existing sparse attention methods either impose rigid constraints or introduce significant overhead, limiting their effectiveness. To address this, we propose Compact Attention, a hardware-aware acceleration framework featuring three innovations: 1) Adaptive tiling strategies that approximate diverse spatial interaction patterns via dynamic tile grouping, 2) Temporally varying windows that adjust sparsity levels based on frame proximity, and 3) An automated configuration search algorithm that optimizes sparse patterns while preserving critical attention pathways. Our method achieves 1.6~2.5x acceleration in attention computation on single-GPU setups while maintaining comparable visual quality with full-attention baselines. This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation. Project Page: https://yo-ava.github.io/Compact-Attention.github.io/
