Early experiments on automatic annotation of Portuguese medieval texts

Bibliographic Details
Main Author: Bico, M. I. (author)
Other Authors: Baptista, J. (author), Batista, F. (author), Cardeira, E. (author)
Format: conferenceObject
Language: eng
Published: 2022
Subjects:
Online Access: http://hdl.handle.net/10071/26157
Country: Portugal
Oai: oai:repositorio.iscte-iul.pt:10071/26157
Description
Summary: This paper presents the challenges encountered, and the solutions adopted, in the lemmatization and part-of-speech (PoS) tagging of a corpus of Old Portuguese texts (up to 1525), paving the way for the automatic annotation of these medieval texts. A highly granular tagset, previously devised for Modern Portuguese, was adapted to this end. A large text (∼155 thousand words) was manually annotated for PoS and lemmata and used to train an initial PoS-tagger model. When applied to two other texts, the resulting model attained 91.2% precision on a textual variant of the training text and 67.4% on a new, unseen text. A second model was then trained on the data from the previous three texts and applied to two further unseen texts, achieving a precision of 77.3% and 82.4%, respectively.
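
The summary reports one precision figure per evaluated text but does not say how the figure was computed. As an illustration only, the sketch below assumes precision means the share of tokens whose predicted PoS tag matches the manual annotation. The function name, the token examples, and the tags (`DET`, `N`, `V`, `ADJ`) are hypothetical and are not taken from the paper's tagset or corpus.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (token, PoS tag)


def tagging_precision(gold: List[Tagged], predicted: List[Tagged]) -> float:
    """Share of tokens whose predicted tag equals the gold tag.

    Assumes both sequences are aligned token by token, i.e. the tagger
    did not change the tokenization of the manually annotated text.
    """
    if not gold:
        raise ValueError("gold annotation is empty")
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(
        1
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted)
        if gold_tag == pred_tag
    )
    return correct / len(gold)


# Toy data, invented for this sketch (not from the annotated corpus).
gold = [("el", "DET"), ("rei", "N"), ("mandou", "V")]
predicted = [("el", "DET"), ("rei", "N"), ("mandou", "ADJ")]

print(f"{tagging_precision(gold, predicted):.1%}")
```

Under this reading, a score such as 67.4% on an unseen text would mean that roughly two in three tokens received the tag a human annotator assigned.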