RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

Authors: Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile, Zach Reavis, David Magnotti, Wayne Fullen

Categories: cs.CR, cs.AI, cs.CL, cs.LG, cs.SE

Published: 2026-04-02

Comments: 11 pages, 10 figures. To be submitted to CAMLIS 2026

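💡 One-sentence takeaway

RuleForge: automated generation and validation of web vulnerability detection rules at scale

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: web vulnerability detection, automated rule generation, large language models, confidence validation, cybersecurity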

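📋 Key points

Security teams develop vulnerability detection mechanisms by hand far more slowly than new vulnerabilities appear, which makes automation essential.
RuleForge automatically generates JSON-based detection rules from Nuclei templates and uses an LLM for confidence validation to improve rule quality.
In production, RuleForge's validation approach reaches an AUROC of 0.75 and cuts false positives by 67% compared with synthetic-test-only validation.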

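🔬 Method details

Problem definition: The paper addresses security teams' inability to respond promptly and effectively to the large volume of newly disclosed web vulnerabilities. Existing practice relies on hand-written detection rules, which is slow and cannot cover every vulnerability. Existing automated approaches are also weak at rule validation and tend to produce many false positives and false negatives.

Core idea: Take structured vulnerability descriptions (Nuclei templates) as input, generate detection rules automatically, and use a large language model (LLM) as a judge to validate the confidence of each generated rule, improving both rule quality and throughput. A continuous feedback loop then refines rule generation and validation over time.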

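Technical framework: RuleForge consists of four main modules: 1) Rule generator: extracts information from Nuclei templates and produces candidate detection rules. 2) LLM validator: uses an LLM to assess each candidate rule's sensitivity and specificity. 3) Feedback integration module: collects the LLM's assessments and human feedback to improve rule generation and validation. 4) Rule deployment module: deploys rules that pass validation to production.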
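To make the idea of a JSON-based detection rule concrete, here is a minimal Python sketch of how such a rule might be evaluated against an HTTP request. The abstract does not publish RuleForge's rule schema, so the field names (`method`, `path_regex`, `header_contains`), the example rule, and the function `rule_matches` are all hypothetical and chosen only for illustration.

```python
import json
import re

# Hypothetical schema: the source does not specify RuleForge's rule format.
RULE_JSON = r'''
{
  "id": "example-path-traversal",
  "method": "GET",
  "path_regex": "\\.\\./",
  "header_contains": {}
}
'''


def rule_matches(rule: dict, request: dict) -> bool:
    """Return True only if the request satisfies every condition in the rule."""
    if rule.get("method") and rule["method"] != request.get("method"):
        return False
    if rule.get("path_regex") and not re.search(rule["path_regex"], request.get("path", "")):
        return False
    # Header names are compared case-insensitively, values by substring.
    headers = {k.lower(): v for k, v in request.get("headers", {}).items()}
    for name, needle in rule.get("header_contains", {}).items():
        if needle not in headers.get(name.lower(), ""):
            return False
    return True


rule = json.loads(RULE_JSON)
malicious = {"method": "GET", "path": "/static/../../etc/passwd", "headers": {}}
benign = {"method": "GET", "path": "/static/logo.png", "headers": {}}

print(rule_matches(rule, malicious), rule_matches(rule, benign))
```

A rule that is too loose (for example a `path_regex` of `.*`) would match the benign request as well, which is the kind of specificity failure the validation stage is meant to catch.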

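Key innovation: The main contribution is using an LLM as a judge to validate rules. Conventional rule validation typically relies on synthetic tests, which struggle to reflect real network traffic and therefore let false positives through. An LLM's semantic understanding allows a better assessment of whether a rule is actually effective. The paper also proposes a 5x5 generation strategy and a continuous feedback loop to raise rule quality further.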

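Key design: Prompt design for the LLM validator is central. The prompts must steer the LLM to assess each rule along two dimensions, sensitivity and specificity, and the authors stress the role of domain expertise in writing them. The 5x5 generation strategy produces five candidate rules in parallel and gives each candidate up to five refinement attempts. The continuous feedback loop keeps collecting LLM assessments and human feedback and uses them to improve rule generation and validation.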
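The 5x5 strategy described above can be sketched as a small control loop. The constants come from the abstract; everything else is an assumption made for illustration: the `generate` and `validate` callables, the shape of the feedback they exchange, the use of threads for parallelism, and the reading of "up to five refinement attempts" as five generate/validate rounds in total. The toy stubs replace the real LLM generator and validator.

```python
from concurrent.futures import ThreadPoolExecutor

N_CANDIDATES = 5  # parallel candidates (from the abstract)
MAX_ATTEMPTS = 5  # attempts per candidate (from the abstract)


def refine_candidate(seed, generate, validate):
    """Run one candidate through up to MAX_ATTEMPTS generate/validate rounds."""
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        rule = generate(seed, feedback)
        passed, feedback = validate(rule)
        if passed:
            return rule
    return None  # candidate never passed validation


def five_by_five(generate, validate):
    """Refine N_CANDIDATES candidates in parallel and keep those that pass."""
    with ThreadPoolExecutor(max_workers=N_CANDIDATES) as pool:
        results = list(
            pool.map(lambda s: refine_candidate(s, generate, validate), range(N_CANDIDATES))
        )
    return [rule for rule in results if rule is not None]


def toy_generate(seed, feedback):
    # Toy generator: each round of feedback bumps the revision counter.
    revision = 0 if feedback is None else feedback["revision"] + 1
    return {"seed": seed, "revision": revision}


def toy_validate(rule):
    # Toy criterion: higher seeds need more revisions before they pass.
    passed = rule["revision"] >= rule["seed"] * 2
    return passed, {"revision": rule["revision"]}


survivors = five_by_five(toy_generate, toy_validate)
print(survivors)
```

With these stubs, candidates 0, 1, and 2 pass within the attempt budget while candidates 3 and 4 exhaust it, which shows why running several candidates in parallel raises the chance that at least one usable rule emerges.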

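📊 Experimental highlights

With LLM validation, RuleForge achieves an AUROC of 0.75 in production and reduces false positives by 67% compared with synthetic-test-only validation. The 5x5 generation strategy and the continuous feedback loop support systematic improvement in rule quality. The authors also present extensions that generate rules from unstructured data sources and a proof-of-concept agentic workflow for multi-event-type detection.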
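For readers less familiar with the metrics, the sketch below shows what an AUROC of 0.75 and a 67% reduction mean numerically. It assumes the confidence score is being evaluated as a discriminator between acceptable rules (label 1) and unacceptable rules (label 0), which the abstract does not spell out. All scores, labels, and counts are made-up toy values chosen to reproduce the headline figures; they are not data from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return 1 - after / before


# Toy data, NOT from the paper: four acceptable and four unacceptable rules.
scores = [0.9, 0.8, 0.6, 0.25, 0.7, 0.5, 0.3, 0.2]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(auroc(scores, labels))                   # 12 of 16 pairs are ordered correctly
print(round(relative_reduction(300, 99), 2))   # toy counts: 300 false positives fall to 99
```

An AUROC of 0.5 would mean the confidence score ranks rules no better than chance and 1.0 would mean perfect ranking, so 0.75 indicates a useful but imperfect signal, consistent with the paper's emphasis on human-in-the-loop review.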

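🎯 Application scenarios

RuleForge can be applied to protecting large-scale web applications, helping security teams quickly deploy detection rules for newly disclosed vulnerabilities and reduce security risk. The approach may also carry over to other vulnerability detection and security analysis tasks, such as malware detection and intrusion detection.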

📄 Abstract (original)

Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
