Authorship attribution in Portuguese using character N-grams

Bibliographic details
Main author: Markov, Ilia (author)
Other authors: Baptista, Jorge (author), Pichardo-Lagunas, Obdulia (author)
Format: article
Language: eng
Published: 2018
Subjects:
Full text: http://hdl.handle.net/10400.1/11987
Country: Portugal
OAI: oai:sapientia.ualg.pt:10400.1/11987
Description
Abstract: For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches evaluated on the same corpus.
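
Illustrative sketch (not part of the catalogued record, and not the authors' method): the abstract refers to character n-grams as features for attributing a text to an author. The Python sketch below shows one simple way this idea can work. It counts overlapping character n-grams per author and assigns a new text to the author whose n-gram counts are most similar by cosine similarity. The function names, the default n-gram length of 3, the lowercasing step, and the toy sentences are all assumptions made for illustration; the paper itself compares several n-gram types, feature representations, and machine-learning algorithms that are not reproduced here.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of length n (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(texts, n=3):
    """Count character n-grams over a list of texts (lowercasing is an assumption)."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t.lower(), n))
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(text, author_profiles, n=3):
    """Return the author whose profile is most similar to the text."""
    query = profile([text], n)
    return max(author_profiles, key=lambda a: cosine(query, author_profiles[a]))


# Toy data, invented purely for illustration.
profiles = {
    "A": profile(["o gato dorme no sofá"]),
    "B": profile(["the cat sleeps on the sofa"]),
}
predicted = attribute("o gato come", profiles)
```

In this toy setup, `predicted` names the profile that shares the most character trigrams with the query text. A realistic experiment would need many documents per author and a held-out test set.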