Data Enrichment for Data Mining Applied to Bioinformatics and Cheminformatics Domains

Increasingly more complex problems are being addressed in life sciences. Acquiring all the data that may be related to the problem in question is paramount. Equally important is to know how the data is related to each other and to the problem itself. On the other hand, there are large amounts of dat...

ver descrição completa

Detalhes bibliográficos
Autor principal: Luís Ricardo Marques Oliveira (author)
Formato: masterThesis
Idioma:eng
Publicado em: 2021
Assuntos:
Texto completo:https://hdl.handle.net/10216/137172
País:Portugal
Oai:oai:repositorio-aberto.up.pt:10216/137172
Descrição
Resumo:Increasingly more complex problems are being addressed in life sciences. Acquiring all the data that may be related to the problem in question is paramount. Equally important is to know how the data is related to each other and to the problem itself. On the other hand, there are large amounts of data and information available on the Web. Researchers are already using Data Mining and Machine Learning as a valuable tool in their researches, albeit the usual procedure is to look for the information based on induction models. So far, despite the great successes already achieved using Data Mining and Machine Learning, it is not easy to integrate this vast amount of available information in the inductive process with propositional algorithms. Our main motivation is to address the problem of integrating domain information into the inductive process of propositional Data Mining and Machine Learning techniques by enriching the training data to be used in inductive logic programming systems. The algorithms of propositional machine learning are very dependent on data attributes. It still is hard to identify which attributes are more suitable for a particular task in the research. It is also hard to extract relevant information from the enormous quantity of data available. We will concentrate the available data, derive features that ILP algorithms can use to induce descriptions, solving the problems. We are creating a web platform to obtain relevant bioinformatics (particularly Genomics) and Cheminformatics problems. It fetches the data from public repositories with genomics, protein and chemical data. After the data enrichment, Prolog systems use inductive logic programming to induce rules and solve specific Bioinformatics and Cheminformatics case studies. To assess the impact of the data enrichment with ILP, we compare with the results obtained solving the same cases using propositional algorithms.