Validation of Automated Protein Annotation

Bibliographic Details
Main Author: Couto, Francisco M. (author)
Other Authors: Silva, Mário J. (author), Coutinho, Pedro M. (author)
Format: report
Language: por
Published: 2009
Subjects:
Online Access: http://hdl.handle.net/10451/14256
Country: Portugal
Oai: oai:repositorio.ul.pt:10451/14256
Description
Summary: Given the large amount of data stored in biological databases, managing their uncertainty and incompleteness is a non-trivial problem. To cope with the volume of sequences being produced, a significant number of genes and proteins have been functionally characterized by automated tools. However, these tools have also produced a significant number of misannotations that are now present in the databases. This paper proposes a new approach for validating automated annotations, which uses the large amount of publicly available information to compare automated annotations with preexisting curated annotations. To test the proposed approach, we developed a novel unsupervised method for filtering misannotations produced by automated annotation systems. We evaluated our method using the automated annotations submitted to BioCreAtIvE, a joint evaluation of state-of-the-art text-mining systems in biology. The method scored each of these annotations, and those scoring below a certain threshold were discarded. The results show a small trade-off in recall for a large improvement in precision. For example, we were able to discard 44.6%, 66.8%, and 81% of the misannotations while maintaining 96.9%, 84.2%, and 47.8% of the correct annotations, respectively. Moreover, we were able to outperform each individual submission to BioCreAtIvE by properly adjusting the threshold. These results show the effectiveness of our approach in assisting curators of large biological databases in the use of contemporary tools for automatic identification of annotations.
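
The threshold step described in the summary can be illustrated with a minimal sketch. The record does not say how scores are computed, so the sketch assumes each annotation already carries a score; the `Annotation` type, the function names, the protein and term labels, and all numbers are hypothetical and are not taken from the report. It shows only how discarding annotations below a threshold trades retained correct annotations against discarded misannotations.

```python
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    protein: str
    term: str
    score: float   # assumed precomputed; the scoring method is not given here
    correct: bool  # gold-standard label, used only for evaluation


def filter_by_threshold(annotations: List[Annotation], threshold: float) -> List[Annotation]:
    """Discard annotations scored below the threshold, keep the rest."""
    return [a for a in annotations if a.score >= threshold]


def evaluate(annotations: List[Annotation], threshold: float) -> Tuple[float, float, float]:
    """Return (fraction of correct annotations kept,
    fraction of misannotations discarded,
    precision of the kept set)."""
    kept = filter_by_threshold(annotations, threshold)
    total_correct = sum(a.correct for a in annotations)
    total_wrong = len(annotations) - total_correct
    kept_correct = sum(a.correct for a in kept)
    kept_wrong = len(kept) - kept_correct

    correct_kept = kept_correct / total_correct if total_correct else 0.0
    wrong_discarded = (total_wrong - kept_wrong) / total_wrong if total_wrong else 0.0
    precision = kept_correct / len(kept) if kept else 0.0
    return correct_kept, wrong_discarded, precision


# Invented toy data: three correct annotations and three misannotations.
toy = [
    Annotation("protein-1", "term-A", 0.9, True),
    Annotation("protein-2", "term-B", 0.8, True),
    Annotation("protein-3", "term-C", 0.6, False),
    Annotation("protein-4", "term-D", 0.4, True),
    Annotation("protein-5", "term-E", 0.3, False),
    Annotation("protein-6", "term-F", 0.1, False),
]

if __name__ == "__main__":
    for t in (0.0, 0.5, 0.7):
        kept_frac, discarded_frac, prec = evaluate(toy, t)
        print(f"threshold={t:.1f} correct kept={kept_frac:.2f} "
              f"misannotations discarded={discarded_frac:.2f} precision={prec:.2f}")
```

Raising the threshold in this toy data discards more misannotations at the cost of losing some correct annotations, which is the same kind of trade-off the summary reports for the BioCreAtIvE submissions.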