A large Portuguese corpus on-line: cleaning and preprocessing


Bibliographic Details
Main Author: Généreux, Michel (author)
Other Authors: Hendrickx, Iris (author), Mendes, Amália (author)
Format: conferenceObject
Language: eng
Published: 2019
Online Access: http://hdl.handle.net/10451/37430
Country: Portugal
Oai: oai:repositorio.ul.pt:10451/37430
Description
Summary: We present a newly available on-line resource for Portuguese: a corpus of 310 million words, a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. Here we report on work carried out on the corpus prior to its publication on-line. We focus on the processes and tools used for cleaning, preparation, and annotation to make the corpus suitable for linguistic inquiry.