Named Entity Recognition of Historical Text via Large Language Model

📄 arXiv: 2508.18090v1

Authors: Shibingfeng Zhang, Giovanni Colavizza

Categories: cs.DL, cs.AI, cs.CL

Published: 2025-08-25


💡 One-Sentence Takeaway

Using large language models to tackle named entity recognition in historical texts

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: named entity recognition, large language models, historical texts, information extraction, zero-shot learning, few-shot learning, natural language processing

📋 Key Points

  1. Traditional named entity recognition methods depend on large amounts of annotated data, which is scarce for historical texts, making the development of NER systems challenging.
  2. This paper applies large language models (LLMs) to NER on historical texts using zero-shot and few-shot prompting strategies, reducing the dependence on task-specific training data.
  3. Experiments on the HIPE-2022 dataset show that LLMs perform reasonably well on the NER task; although they do not match fully supervised models, the results are promising.

📝 Abstract (Summary)

Large language models have demonstrated remarkable versatility across natural language processing tasks, including named entity recognition (NER), the task of identifying and classifying proper names in text. Traditional NER approaches rely on large amounts of annotated data, but annotations for historical texts are scarce and costly to produce. This paper investigates the feasibility of applying large language models to NER in historical documents using zero-shot and few-shot prompting strategies. Experiments show that, although performance falls short of fully supervised models, LLMs have strong potential for information extraction in low-resource settings.

🔬 Method Details

Problem definition: The paper addresses named entity recognition (NER) in historical texts, where existing methods face challenges such as scarce annotated data and the variability of historical language.

Core idea: Apply large language models (LLMs) with zero-shot and few-shot prompting strategies to reduce the need for large annotated datasets and improve NER performance on historical texts.

Technical framework: The overall approach comprises data preprocessing, model selection, prompt design, and result evaluation, with the LLM performing the entity recognition.
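The prompt-design step can be sketched as follows. This is a minimal, illustrative implementation, not the paper's exact prompts: the instruction wording, the example sentence, and the `span -> type` output format are assumptions; only the coarse entity types (pers, loc, org, prod, time) follow the HIPE-2022 tagset. Zero-shot prompting is the same template with an empty example list.

```python
# Hypothetical few-shot NER prompt builder for historical text.
# The demonstration sentence and output format are illustrative assumptions.
FEW_SHOT_EXAMPLES = [
    ("M. de Lesseps arriva à Paris en 1869.",
     [("M. de Lesseps", "pers"), ("Paris", "loc"), ("1869", "time")]),
]

def build_ner_prompt(sentence, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt asking the model to list entities as
    'span -> type' pairs; pass examples=[] for the zero-shot variant."""
    lines = [
        "Identify named entities (pers, loc, org, prod, time) in the sentence.",
        "Answer with one 'span -> type' pair per line, or 'NONE'.",
        "",
    ]
    for text, ents in examples:
        lines.append(f"Sentence: {text}")
        lines.extend(f"{span} -> {label}" for span, label in ents)
        lines.append("")
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)

def parse_response(response):
    """Parse the model's 'span -> type' lines back into (span, type) tuples."""
    entities = []
    for line in response.strip().splitlines():
        if "->" in line:
            span, _, label = line.rpartition("->")
            entities.append((span.strip(), label.strip()))
    return entities
```

The resulting string would be sent to whichever LLM is under evaluation; the parser then maps the free-text answer back to structured predictions for scoring.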

Key innovation: The study's novelty lies in applying large language models to low-resource historical-text NER, offering an information-extraction approach distinct from traditional supervised learning.

Key design: The prompts are tailored to the characteristics of historical text so that the model can handle the variability of archaic language, and generation settings are tuned to improve recognition accuracy.

📊 Experimental Highlights

Experiments show that the LLM-based NER approach achieves reasonably strong performance on the HIPE-2022 dataset. While it does not reach the level of fully supervised models, its performance in this low-resource setting is encouraging and demonstrates the potential of LLMs for historical text processing.
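NER results like those above are conventionally scored with strict entity-level precision, recall, and F1, where a prediction counts only if both the span and the type match the gold annotation exactly. A minimal sketch of that metric (not the official HIPE-2022 scorer; the tuple representation of entities is an assumption):

```python
def entity_f1(gold, pred):
    """Strict entity-level P/R/F1.

    gold, pred: sets of (doc_id, start, end, type) tuples; an exact
    tuple match counts as a true positive (hypothetical representation).
    """
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this strict criterion, boundary errors common in historical text (e.g. predicting "Lesseps" instead of "M. de Lesseps") count as both a false positive and a false negative, which is part of why LLM scores trail supervised models.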

🎯 Application Scenarios

Potential applications include the digitization of historical documents, information retrieval systems, and data mining in archaeology and historical research. Improving NER on historical texts enables more effective extraction and organization of information, supporting deeper analysis and understanding of historical data, with clear practical value and future impact.

📄 Abstract (Original)

Large language models have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying proper names in text, such as people, organizations, locations, dates, and other specific entities. NER plays a crucial role in extracting information from unstructured textual data, enabling downstream applications such as information retrieval from unstructured text. Traditionally, NER is addressed using supervised machine learning approaches, which require large amounts of annotated training data. However, historical texts present a unique challenge, as the annotated datasets are often scarce or nonexistent, due to the high cost and expertise required for manual labeling. In addition, the variability and noise inherent in historical language, such as inconsistent spelling and archaic vocabulary, further complicate the development of reliable NER systems for these sources. In this study, we explore the feasibility of applying LLMs to NER in historical documents using zero-shot and few-shot prompting strategies, which require little to no task-specific training data. Our experiments, conducted on the HIPE-2022 (Identifying Historical People, Places and other Entities) dataset, show that LLMs can achieve reasonably strong performance on NER tasks in this setting. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising. These findings suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora, where traditional supervised methods are infeasible.