Post-Training Quantization for Video Matting
Authors: Tianrui Zhu, Houyuan Chen, Ruihao Gong, Michele Magno, Haotong Qin, Kai Zhang
Categories: cs.CV, cs.AI
Published: 2025-06-12
💡 One-Sentence Takeaway
Proposes a post-training quantization framework that makes computationally heavy video matting models deployable on resource-constrained devices.
🎯 Matched Area: Pillar 3: Spatial Perception & Semantics (Perception & Semantics)
Keywords: video matting, post-training quantization, model compression, optical flow assistance, statistical calibration, computer vision, resource-constrained devices
📋 Key Points
- Existing video matting models are computationally expensive, and deploying them on resource-constrained devices makes it hard to preserve both accuracy and temporal coherence.
- Proposes a two-stage PTQ strategy that combines block-reconstruction optimization with a global calibration of quantization parameters to minimize accuracy loss.
- Experiments show PTQ4VM achieves state-of-the-art accuracy across bit-widths; the 4-bit model approaches full-precision performance while saving 8x FLOPs.
📝 Abstract (Summary)
Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices is challenging. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its infancy for video matting, facing major hurdles in preserving accuracy and temporal coherence. This paper therefore proposes a novel and general PTQ framework tailored to video matting models, marking the first systematic attempt in this domain. The contributions include a two-stage PTQ strategy built on block-reconstruction optimization, a statistically driven global affine calibration method, and an optical-flow-assisted component that markedly improves the model's ability to distinguish moving foregrounds in complex scenes.
🔬 Method Details
Problem definition: The paper targets the computational cost and accuracy degradation that arise when video matting models are deployed on resource-constrained devices. Existing post-training quantization methods fall noticeably short at preserving model performance, especially on video tasks.
Core idea: A novel post-training quantization framework with a two-stage strategy: a fast, stable initial quantization is followed by a global calibration of the quantization parameters to minimize accuracy loss. The design aims to capture local dependencies effectively while reducing statistical distortion.
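For context, here is a minimal sketch of the asymmetric uniform quantizer that such PTQ pipelines typically start from; this is illustrative background, not code from the paper, and the min-max initialization shown is an assumption (PTQ methods like the one summarized here refine these parameters rather than keeping min-max):

```python
import torch

def uniform_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Fake-quantize a tensor: round onto an n_bits integer grid, then map
    back to float. Scale/zero-point come from min-max here for simplicity."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale  # dequantized value used during calibration
```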
Technical framework: The pipeline has two main stages. Stage one performs fast initial quantization via block-reconstruction-based optimization; stage two applies Global Affine Calibration (GAC) to adjust the quantization parameters. An Optical Flow Assistance (OFA) component is additionally introduced, using temporal and semantic priors to guide the quantization process.
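Stage one resembles block-reconstruction PTQ in the BRECQ style; the abstract does not spell out the exact procedure, so the following is a hedged sketch in which `q_block.quant_params()` is a hypothetical accessor for the block's learnable quantization tensors:

```python
import torch
import torch.nn.functional as F

def reconstruct_block(fp_block, q_block, calib_inputs, steps=500, lr=1e-3):
    """Stage-one sketch: tune a quantized block's quantization parameters so
    its output matches the full-precision block on a small calibration set."""
    opt = torch.optim.Adam(q_block.quant_params(), lr=lr)  # hypothetical accessor
    for _ in range(steps):
        for x in calib_inputs:
            with torch.no_grad():
                target = fp_block(x)               # full-precision reference
            loss = F.mse_loss(q_block(x), target)  # local reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
```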
Key innovation: The central technical contribution is the statistically driven Global Affine Calibration, which lets the network compensate for cumulative statistical distortion caused by factors such as neglected BN-layer effects, reducing the error of existing PTQ methods on video matting tasks by up to 20%.
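The paper's exact GAC objective is not given in the abstract; one plausible statistics-matching reading, offered purely as an assumption-labeled sketch, is to fit a per-channel affine correction so the quantized model's activation statistics match the full-precision model's:

```python
import torch

def global_affine_calibration(fp_acts: torch.Tensor, q_acts: torch.Tensor):
    """One possible reading of GAC (an assumption, not the paper's exact
    method): fit y = a * x + b per channel so quantized activation mean/std
    match full-precision ones, compensating cumulative statistical drift.
    fp_acts, q_acts: (N, C, H, W) activations collected on calibration data."""
    dims = (0, 2, 3)                                      # batch + spatial dims
    a = fp_acts.std(dims) / q_acts.std(dims).clamp(min=1e-8)  # per-channel scale
    b = fp_acts.mean(dims) - a * q_acts.mean(dims)            # per-channel shift
    # apply as: q_acts * a.view(1, -1, 1, 1) + b.view(1, -1, 1, 1)
    return a, b
```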
Key design: The quantization strategy is adaptively tuned; the loss design accounts for model behavior across different scenes; and the network incorporates optical flow information to strengthen its recognition of dynamic foregrounds.
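How OFA injects flow priors is likewise not detailed in the abstract; one plausible instantiation, stated here as an assumption rather than the paper's design, is to up-weight the calibration loss where flow magnitude is large so quantization preserves moving-foreground detail:

```python
import torch

def flow_weighted_loss(q_out, fp_out, flow, alpha: float = 1.0):
    """Hypothetical OFA-style loss: emphasize reconstruction error in regions
    with strong motion, as indicated by optical-flow magnitude.
    flow: (N, 2, H, W) optical flow between consecutive frames."""
    motion = flow.norm(dim=1, keepdim=True)          # (N, 1, H, W) magnitude
    weight = 1.0 + alpha * motion / motion.amax().clamp(min=1e-8)
    return (weight * (q_out - fp_out) ** 2).mean()
```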
📊 Experimental Highlights
Experiments show that PTQ4VM surpasses existing quantization methods in accuracy across bit-widths; notably, at 4 bits its performance approaches the full-precision model while achieving 8x FLOP savings, a substantial efficiency gain.
🎯 Application Scenarios
Potential applications include film production, virtual reality, and augmented reality, enabling efficient video matting on resource-constrained devices and improving user experience. Looking ahead, the framework may spur quantization research for other computer vision tasks and broaden deployment on smart devices.
📄 Abstract (Original)
Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, even reducing the error of existing PTQ methods on video matting tasks up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model's ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves the state-of-the-art accuracy performance across different bit-widths compared to the existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8x FLOP savings.