Polylingual text classification in the legal domain

Bibliographic Details
Main Author: Gonçalves, Teresa (author)
Other Authors: Quaresma, Paulo (author)
Format: article
Language: eng
Published: 2012
Subjects:
Online Access: http://hdl.handle.net/10174/4582
Country: Portugal
OAI: oai:dspace.uevora.pt:10174/4582
Description
Summary: With globalization, a large number of documents are written in different languages. If these polylingual documents are already organized into existing categories, one can build a learning model to classify newly arrived polylingual documents. Although a naive approach can treat the problem as multiple independent monolingual text classification problems, it fails to use the opportunity offered by polylingual training documents to improve the effectiveness of the classifier. This paper proposes a method to combine different monolingual classifiers in order to obtain a new classifier that is as good as the best monolingual one and that can also deliver the best possible performance measures (precision, recall and F1). The proposed methodology was applied to, and evaluated on, a corpus of legal documents from the EUR-Lex site. The results were quite good, indicating that combining different monolingual classifiers may be a promising approach to reach the best performance for each category independently of the language.
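
The summary does not state how the monolingual classifiers are combined. As a rough illustration only, the sketch below assumes one simple possibility: for each category, keep the predictions of whichever monolingual classifier scored the highest F1 on held-out validation data. The function names, language codes, category names and toy labels are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Labels = List[int]  # binary labels per document: 1 = in category, 0 = not


def f1_score(y_true: Labels, y_pred: Labels) -> float:
    """F1 for one binary category; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_language_per_category(
    val_true: Dict[str, Labels],
    val_pred: Dict[str, Dict[str, Labels]],
) -> Dict[str, str]:
    """For each category, choose the language whose classifier has the best validation F1.

    Assumed combination rule (not the paper's); ties go to the
    alphabetically first language code.
    """
    return {
        category: max(
            sorted(val_pred),
            key=lambda lang: f1_score(y_true, val_pred[lang][category]),
        )
        for category, y_true in val_true.items()
    }


def combine(
    test_pred: Dict[str, Dict[str, Labels]],
    best_lang: Dict[str, str],
) -> Dict[str, Labels]:
    """Build the combined output from the chosen language's predictions per category."""
    return {cat: test_pred[lang][cat] for cat, lang in best_lang.items()}


# Toy validation data (hypothetical): gold labels per category,
# then predictions per language and category.
val_true = {"agriculture": [1, 0, 1, 0], "trade": [0, 1, 1, 0]}
val_pred = {
    "en": {"agriculture": [1, 0, 1, 0], "trade": [0, 0, 1, 1]},
    "pt": {"agriculture": [1, 1, 0, 0], "trade": [0, 1, 1, 0]},
}
best_lang = pick_language_per_category(val_true, val_pred)

# Predictions of each monolingual classifier on three new documents.
test_pred = {
    "en": {"agriculture": [1, 1, 0], "trade": [0, 0, 1]},
    "pt": {"agriculture": [0, 1, 0], "trade": [1, 0, 1]},
}
combined = combine(test_pred, best_lang)
print(best_lang)
print(combined)
```

With this rule, the combined classifier matches the best monolingual validation F1 in every category by construction. Whether that carries over to unseen documents is an empirical question.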