NER in Archival Finding Aids

Bibliographic Details
Main Author: Cunha, Luís Filipe da Costa (author)
Other Authors: Ramalho, José Carlos (author)
Format: conferencePaper
Language: eng
Published: 2021
Online Access: http://hdl.handle.net/1822/73504
Country: Portugal
Oai: oai:repositorium.sdum.uminho.pt:1822/73504
Description
Summary: At present, the vast majority of Portuguese archives with an online presence use a software solution, e.g. Digitarq or Archeevo, to manage their finding aids. Most of these finding aids are written in natural language without any annotation that would enable a machine to identify named entities, geographical locations or even some dates. Identifying them would allow the machine to create smart browsing tools on top of those record contents, such as entity linking and record linking. In this work we created a set of datasets to train Machine Learning algorithms to find those named entities and geographical locations. After training several algorithms, we tested them on several datasets and recorded their precision and accuracy. These results allowed us to draw some conclusions about the precision this approach can achieve in this context and about what to do with the results: do we have enough precision and accuracy to create toponymic and anthroponymic indexes for archival finding aids? Is this approach suitable in this context? These are some of the questions we intend to answer in this paper.