DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis

Authors: Zengqi Zhao, Weidi Xia, Peter Wei, Yan Zhang, Yiyi Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Simiao Ren

Category: cs.CV

Published: 2026-03-02

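💡 One-Sentence Summary

DOCFORGE-BENCH is proposed as a unified zero-shot benchmark that addresses how document forgery detectors are evaluated.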

🎯 Matched Area: Pillar 1: Robot Control

Keywords: document forgery detection, zero-shot learning, calibration, dataset evaluation, generative AI

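📋 Key Points

Existing document forgery detection methods perform poorly across diverse document types, especially when no labeled training data is available.
DOCFORGE-BENCH evaluates 14 methods under a unified zero-shot protocol and identifies calibration, rather than representation capacity, as the main weakness.
No evaluated method works reliably across document types, and adapting a single threshold on N=10 in-domain images recovers 39-55% of the Oracle-F1 gap.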

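📝 Abstract (Translated from the Chinese)

We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection. It evaluates 14 methods on eight datasets covering text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations, DOCFORGE-BENCH applies every method with its pretrained weights and no domain adaptation, which mirrors real deployments where labeled document training data is unavailable. The central finding is a pervasive calibration failure that single-threshold protocols hide: methods reach moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift. Tampered regions occupy only 0.27-4.17% of the pixels in document images, far less than in natural image benchmarks, so the standard threshold tau=0.5 is severely miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment supports this: adapting a single threshold on N=10 in-domain images recovers 39-55% of the Oracle-F1 gap, which indicates that threshold adaptation is the key missing step for practical deployment. Overall, no evaluated method works reliably across diverse document types, underscoring that document forgery detection remains an unsolved problem.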

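🔬 Method Details

Problem definition: The paper addresses how document forgery detectors should be evaluated. Existing methods are typically deployed without labeled in-domain data, and under that constraint their fixed decision thresholds are miscalibrated.

Core idea: DOCFORGE-BENCH evaluates methods zero-shot, using their pretrained weights with no domain adaptation, to reflect the conditions of real deployments.

Framework: The benchmark has three main parts: dataset selection, method evaluation, and calibration experiments. The datasets cover text tampering, receipt forgery, and identity document manipulation.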

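Key contribution: The main contribution is identifying and emphasizing the calibration problem. The AUC-F1 gap stems from a shift in the score distribution, not from a lack of discriminative ability.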
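To make the AUC-F1 gap concrete, here is a minimal sketch on synthetic data. The score distributions (Gaussians centered at 0.10 and 0.25), the 2% tampering rate, and the helper names `roc_auc`, `f1_at`, and `oracle_f1` are illustrative assumptions, not the benchmark's code or data. The sketch only shows how scores can rank pixels well while almost none of them cross tau=0.5.

```python
import random

def roc_auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; ties are not averaged in this sketch."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def f1_at(scores, labels, tau):
    """Pixel F1 when every pixel with score >= tau is flagged as tampered."""
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        if s >= tau:
            if y:
                tp += 1
            else:
                fp += 1
        elif y:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def oracle_f1(scores, labels, steps=100):
    """Best F1 over the threshold grid 0.00, 0.01, ..., 1.00."""
    return max(f1_at(scores, labels, k / steps) for k in range(steps + 1))

def clip01(x):
    """Clamp a score into [0, 1]."""
    return min(1.0, max(0.0, x))

# Hypothetical score map: 2% tampered pixels, both classes scored well below 0.5.
rng = random.Random(0)
labels = [1] * 100 + [0] * 4900
scores = [clip01(rng.gauss(0.25 if y else 0.10, 0.08)) for y in labels]

auc = roc_auc(scores, labels)
f1_fixed = f1_at(scores, labels, 0.5)
f1_oracle = oracle_f1(scores, labels)
print(f"Pixel-AUC={auc:.3f}  F1@0.5={f1_fixed:.3f}  Oracle-F1={f1_oracle:.3f}")
```

The ranking metric stays high because tampered pixels tend to score above untampered ones, while the fixed-threshold F1 collapses because the whole score distribution sits below 0.5.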

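Key design: Methods are evaluated at a fixed threshold, which shows that the standard tau=0.5 is severely miscalibrated. Adapting a single threshold on N=10 in-domain images recovers 39-55% of the Oracle-F1 gap, and the controlled calibration experiment confirms that this works.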
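One plausible way to adapt a single threshold is sketched below. The abstract does not say how the threshold is chosen, so picking the tau that maximizes pooled F1 on the calibration images is an assumption, as are the synthetic score maps and the helper names `fit_threshold`, `pool`, and `make_image`.

```python
import random

def f1_at(scores, labels, tau):
    """Pixel F1 when every pixel with score >= tau is flagged as tampered."""
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        if s >= tau:
            if y:
                tp += 1
            else:
                fp += 1
        elif y:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def pool(images):
    """Concatenate per-image (scores, labels) pairs into two flat lists."""
    scores = [s for img_scores, _ in images for s in img_scores]
    labels = [y for _, img_labels in images for y in img_labels]
    return scores, labels

def fit_threshold(calib_images, steps=100):
    """Pick the single tau on a 0.01 grid that maximizes pooled calibration F1."""
    scores, labels = pool(calib_images)
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda tau: f1_at(scores, labels, tau))

def make_image(rng, n_pixels=400, n_tampered=12):
    """Hypothetical score map: 3% tampered pixels, scores mostly below 0.5."""
    labels = [1] * n_tampered + [0] * (n_pixels - n_tampered)
    scores = [min(1.0, max(0.0, rng.gauss(0.25 if y else 0.10, 0.08)))
              for y in labels]
    return scores, labels

rng = random.Random(0)
calib = [make_image(rng) for _ in range(10)]     # N=10 labeled in-domain images
held_out = [make_image(rng) for _ in range(20)]  # images the threshold never saw

tau = fit_threshold(calib)
test_scores, test_labels = pool(held_out)
f1_fixed = f1_at(test_scores, test_labels, 0.5)
f1_adapted = f1_at(test_scores, test_labels, tau)
print(f"adapted tau={tau:.2f}  F1@0.5={f1_fixed:.3f}  F1@tau={f1_adapted:.3f}")
```

No model weights change in this procedure. Only one scalar is fitted, which is why a handful of labeled images can be enough.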

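📊 Experimental Highlights

No evaluated method works reliably across diverse document types. Threshold adaptation recovers 39-55% of the Oracle-F1 gap, which underlines the importance of calibration in document forgery detection.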
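The "share of the Oracle-F1 gap recovered" can be read as the fraction of the distance between fixed-threshold F1 and Oracle-F1 that the adapted threshold closes. The sketch below assumes that reading. The function name `gap_recovered` and the example numbers are made up for illustration and are not results from the paper.

```python
def gap_recovered(f1_fixed, f1_adapted, f1_oracle):
    """Fraction of the (oracle - fixed) F1 gap closed by the adapted threshold."""
    gap = f1_oracle - f1_fixed
    if gap <= 0:
        return 0.0  # no gap to recover
    return (f1_adapted - f1_fixed) / gap

# Made-up example: F1 rises from 0.02 to 0.20, with an oracle ceiling of 0.42.
share = gap_recovered(0.02, 0.20, 0.42)
print(f"{share:.0%} of the Oracle-F1 gap recovered")
```

Under this definition, a value of 1.0 would mean the adapted threshold matches the oracle threshold, and 0.0 would mean it does no better than tau=0.5.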

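🎯 Application Scenarios

The work offers a practical evaluation tool for document security, particularly when labeled data is unavailable. As generative AI advances, DOCFORGE-BENCH could serve as a baseline for studying newer forgery attacks, although the original abstract notes that all eight datasets predate generative AI editing.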

📄 Abstract (Original)

We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation -- a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images -- an order of magnitude less than in natural image benchmarks -- making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation -- not retraining -- is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
