Named-entity recognition for early modern textual documents: a review of capabilities and challenges with strategies for the future

Marco Humbel, Julianne Nyhan*, Andreas Vlachidis, Kim Sloan, Alexandra Ortolja-Baird

*Corresponding author for this work

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Purpose: By mapping out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of the technique going forward.

    Design/methodology/approach: Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER and establishes the state of the art of the technique in the context of the early modern, digitally augmented research field. It also presents a new case study of NER research undertaken by Enlightenment Architectures: Sir Hans Sloane's Catalogues of his Collections (2016–2021), a Leverhulme-funded research project and collaboration between the British Museum and University College London, with contributing expertise from the British Library and the Natural History Museum.

    Findings: Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period. The authors also draw attention to the situated nature of authority files, and current conceptualisations of NER, leading them to the conclusion that more robust reporting and critical analysis of NER approaches and findings is required. 

    Research limitations/implications: This article examines NER as applied to early modern textual sources, which are mostly studied by Humanists. As addressed in this article, detailed reporting of NER processes and outcomes is not necessarily valued by the disciplines of the Humanities, with the result that it can be difficult to locate relevant data and metrics in project outputs. The authors have tried to mitigate this by contacting projects discussed in this paper directly, to further verify the details they report here. 

    Practical implications: The authors suggest that a forum is needed where tools are evaluated according to community standards. Within the wider NER community, the MUC and CoNLL corpora are used for such experimental set-ups and are accompanied by a conference series; they may be seen as a useful model for this. The ultimate nature of such a forum must be discussed with the whole research community of the early modern domain.

    Social implications: NER is an algorithmic intervention that transforms data according to certain rules, patterns or training data and ultimately affects how the authors interpret the results. The creation, use and promotion of algorithmic technologies like NER is not a neutral process, and neither is their output. This paper calls for a more critical understanding of the role and impact of NER on early modern documents and research, and for a sharper focus on some of the data- and human-centric aspects of NER routines that are currently overlooked.

    Originality/value: This article presents a state-of-the-art snapshot of NER, its applications and potential, in the context of early modern research. It also seeks to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of NER going forward. It draws attention to the situated nature of authority files, and current conceptualisations of NER, and concludes that more robust reporting of NER approaches and findings is urgently required. The Appendix sets out a comprehensive summary of digital tools and resources surveyed in this article.

    Original language: English
    Pages (from-to): 1223-1247
    Number of pages: 25
    Journal: Journal of Documentation
    Volume: 77
    Issue number: 6
    Early online date: 7 Jun 2021
    Publication status: Published - 11 Oct 2021

    Keywords

    • artificial intelligence
    • data criticism
    • data ethics
    • digital humanities
    • information extraction
    • named-entity recognition
