Using IR techniques to improve Automated Text Classification

Bibliographic details
Main author: Gonçalves, Teresa (author)
Other authors: Quaresma, Paulo (author)
Format: article
Language: eng
Published in: 2011
Subjects:
Full text: http://hdl.handle.net/10174/2557
Country: Portugal
Oai: oai:dspace.uevora.pt:10174/2557
Description
Abstract: This paper studies the pre-processing phase of automated text classification. We apply linear Support Vector Machines to datasets written in English and European Portuguese: the Reuters dataset and the Portuguese Attorney General's Office dataset, respectively. The study can be seen as a search for the best document representation along three axes: feature reduction (using linguistic information), feature selection (using word frequencies), and term weighting (using information retrieval measures).
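
The three pre-processing axes named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy corpus, the stop-word list, the `min_df` threshold, and all function names are assumptions made for illustration. Stop-word removal stands in for the linguistic feature reduction, a document-frequency cut-off stands in for frequency-based selection, and tf-idf is just one common information retrieval weighting. The SVM training step is omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "the court ruled on the appeal",
    "the market prices rose on the news",
    "the court heard the market dispute",
]
STOP_WORDS = {"the", "on"}


def reduce_features(text):
    """Feature reduction (assumed form): lowercase, split, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def select_features(token_docs, min_df=2):
    """Feature selection (assumed form): keep terms found in >= min_df documents."""
    df = Counter(t for doc in token_docs for t in set(doc))
    return {t for t, n in df.items() if n >= min_df}


def weight_terms(token_docs, vocab):
    """Term weighting (assumed form): tf * log(N / df), one tf-idf variant."""
    n_docs = len(token_docs)
    df = Counter(t for doc in token_docs for t in set(doc) if t in vocab)
    vectors = []
    for doc in token_docs:
        tf = Counter(t for t in doc if t in vocab)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


tokens = [reduce_features(d) for d in DOCS]
vocab = select_features(tokens)
vectors = weight_terms(tokens, vocab)
```

Each entry of `vectors` is a sparse term-to-weight mapping for one document, the kind of representation that could then be passed to a linear classifier.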