Summary: | Extracting knowledge through keywords and linking content items that share keywords is an important asset in any content management system. However, this kind of information extraction cannot be performed manually, given the growing amount of textual content of varying quality made available by multiple creators and distributors of information. This paper presents and evaluates a prototype for named entity recognition that uses orthographic and morphological word attributes as input and Support Vector Machines as the machine learning technique for identifying entities in new documents. Because the documents are written in Portuguese and no part-of-speech tagger was freely available for that language, a POS tagging model was also developed using SVMTool, a simple and effective generator of sequential taggers based on Support Vector Machines. This required adapting the Bosque 8.0 corpus by assigning a POS tag to every word, since in the original corpus several words were joined into a single token with one tag, while others were split into parts that gave rise to more than one tag.
|
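As an illustration of the kind of input the summary describes, the following minimal sketch builds orthographic attributes for each token and combines them with POS tags. It is not the prototype's actual feature set: the function names, the chosen attributes, the boundary markers `<s>` and `</s>`, and the tag labels in the usage example are all assumptions made for illustration.

```python
import string


def orthographic_features(token: str) -> dict:
    """Hypothetical orthographic attributes for a single token."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "is_punct": bool(token) and all(ch in string.punctuation for ch in token),
        "suffix3": token[-3:].lower(),
        "length": len(token),
    }


def sentence_features(tokens, pos_tags):
    """Combine each token's orthographic attributes with its own POS tag
    and the POS tags of its immediate neighbours (assumed context window)."""
    rows = []
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        feats = orthographic_features(tok)
        feats["pos"] = pos
        feats["prev_pos"] = pos_tags[i - 1] if i > 0 else "<s>"
        feats["next_pos"] = pos_tags[i + 1] if i + 1 < len(pos_tags) else "</s>"
        rows.append(feats)
    return rows


# Usage: the tag labels below are illustrative placeholders.
tokens = ["Lisboa", "é", "a", "capital", "de", "Portugal", "."]
pos_tags = ["PROP", "V", "ART", "N", "PRP", "PROP", "PU"]
rows = sentence_features(tokens, pos_tags)
```

Each resulting dictionary could then be vectorized and passed to an SVM classifier that predicts an entity label per token; that step is omitted here because the summary does not specify the encoding used.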