Resumo: | Parallel corpora are valuable resources on natural language processing and, in special, on the translation area. They can be used not only by translators, but also analyzed and processed by computers to learn and extract information about the languages. In this document, we talk about some processes related with the parallel corpora life cycle. We will focus on the parallel corpora word alignment. The necessity for a robust word aligner arrived with the TerminUM project which goal is to gather parallel corpora from different sources, align, analyze and use them to create bilingual resources like terminology or translation memories for machine translation. Aligner, an open-source word aligner developed by Djoerd Hiemstra. Its results were interesting but it worked only for small sized corpora. The work done began with the reengineering of Twente-Aligner, followed by the analysis of the alignment results and the development of several tools based on the extracted probabilistic dictionaries. The re-engineering process was based on formal methods: the algorithms and data structures were formalized, optimized and re-implemented. The timings and alignment results were analysed. The speed improvement derived from the re-engineering process and the scale-up derived of the alignment by chunks, permitted the alignment of bigger corpora. Bigger corpora makes dictionaries quality raise, and this makes new problems and new ideas possible. The probabilistic dictionaries created by the alignment process were used in different tasks. A first pair of tools was developed to search the dictionaries and their relation to the corpora. The probabilistic dictionaries were used to calculate a measure of how two sentences are translations of each other. This naive measure was used to prototype tools for aligning word sequences, to extract multiword terminology from corpora, and a “by example” machine translation software.
|