Named Entity Recognition using Machine Learning techniques

Bibliographic Details
Main Author: Miranda, Nuno (author)
Other Authors: Raminhos, Ricardo (author), Seabra, Pedro (author), Sequeira, João (author), Gonçalves, Teresa (author), Quaresma, Paulo (author)
Format: article
Language: eng
Published: 2012
Online Access: http://hdl.handle.net/10174/4418
Country: Portugal
Oai: oai:dspace.uevora.pt:10174/4418
Description
Summary: Extracting knowledge through keywords, and relating contents that share keywords, is an important asset in any content management system. However, this kind of information extraction cannot be performed manually, given the growing amount of textual content of varying quality made available by multiple creators and distributors of information. This paper presents and evaluates a prototype for named entity recognition that uses orthographic and morphological word attributes as input and Support Vector Machines as the machine learning technique for identifying entities in new documents. Because the documents are written in Portuguese and no part-of-speech tagger was freely available for that language, a tagging model was also developed using SVMTool, a simple and effective generator of sequential taggers based on Support Vector Machines. This required adapting the Bosque 8.0 corpus so that every word carries its own POS tag: in the original corpus, several words were joined into one token with a single tag, while others were split and received more than one tag.
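To make the idea of "orthographic word attributes" concrete, the following is a minimal Python sketch of how such attributes might be computed per token before being passed to a classifier such as an SVM. The summary does not list the attributes the authors actually used, so the feature names, the helper functions `orthographic_features` and `sentence_features`, and the example sentence are all assumptions made for illustration.

```python
# Hypothetical sketch: the feature set below is an assumption, not the
# attribute list used in the paper.

def orthographic_features(token: str) -> dict:
    """Return simple surface-form attributes of a single token."""
    return {
        "is_capitalized": token[:1].isupper(),   # first character is upper case
        "is_all_caps": token.isupper(),          # every cased character is upper case
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "length": len(token),
        "suffix3": token[-3:].lower(),           # crude stand-in for a morphological cue
    }


def sentence_features(tokens: list[str]) -> list[dict]:
    """Compute per-token features, adding two position-dependent attributes."""
    rows = []
    for i, tok in enumerate(tokens):
        feats = orthographic_features(tok)
        feats["is_first"] = i == 0
        feats["prev_capitalized"] = i > 0 and tokens[i - 1][:1].isupper()
        rows.append(feats)
    return rows


# Invented example sentence, already tokenized.
tokens = ["O", "João", "Silva", "visitou", "Lisboa", "em", "2012", "."]
rows = sentence_features(tokens)
for tok, feats in zip(tokens, rows):
    print(tok, feats)
```

Each dictionary would then typically be converted into a numeric vector (for example by one-hot encoding the string-valued attributes) and combined with the POS tag of the token before training the classifier.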