Grid data mining by means of learning classifier systems and distributed model induction


Bibliographic details
Main author: Santos, Manuel Filipe (author)
Other authors: Mathew, Wesley (author), Santos, Henrique Dinis dos (author)
Format: conferencePaper
Language: eng
Published: 2011
Full text: http://hdl.handle.net/1822/15195
Country: Portugal
OAI: oai:repositorium.sdum.uminho.pt:1822/15195
Description
Abstract: This paper introduces a distributed data mining approach suited to grid computing environments, based on a supervised learning classifier system. Different methods of merging data mining models generated at different distributed sites are explored. Centralized Data Mining (CDM) is the conventional method of mining distributed data. In CDM, data stored at distributed locations has to be collected and stored in a central repository before the data mining algorithm is executed. The CDM method is reliable; however, it is expensive (its computational, communication, and implementation costs are high). Alternatively, the Distributed Data Mining (DDM) approach is economical, but it has limitations in combining local models. In DDM, the data mining algorithm is executed at each site to induce a local model. The induced local models are then collected and combined to form a global data mining model. In this work, six different tactics are used for constructing the global model in DDM: Generalized Classifier Method (GCM); Specific Classifier Method (SCM); Weighed Classifier Method (WCM); Majority Voting Method (MVM); Model Sampling Method (MSM); and Centralized Training Method (CTM). Preliminary experimental tests were conducted with two synthetic data sets (eleven multiplexer and monks3) and one real-world data set (intensive care medicine). The initial results demonstrate that the performance of the DDM methods is competitive with that of the CDM methods.
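
The abstract names majority voting as one way to combine local models but does not define it, so the following is only a minimal sketch of generic majority voting, not the authors' MVM implementation. It assumes the usual 11-multiplexer layout (3 address bits, most significant first, followed by 8 data bits). The local models are hypothetical stand-ins rather than learning classifier systems: each one is correct except on a single address value, mimicking a site whose local data left a blind spot.

```python
from collections import Counter
from itertools import product


def multiplexer11(bits):
    """11-multiplexer target: the 3 address bits select one of the 8 data bits.

    Assumption: address bits come first, most significant bit first.
    """
    address = 4 * bits[0] + 2 * bits[1] + bits[2]
    return bits[3 + address]


def make_local_model(faulty_address):
    """Hypothetical local model: right everywhere except one address value."""
    def model(bits):
        address = 4 * bits[0] + 2 * bits[1] + bits[2]
        truth = multiplexer11(bits)
        # Flip the answer inside this model's blind spot.
        return 1 - truth if address == faulty_address else truth
    return model


def majority_vote(models, bits):
    """Global prediction: the class predicted by most local models.

    Ties go to the class that was voted for first (Counter ordering).
    """
    votes = Counter(model(bits) for model in models)
    return votes.most_common(1)[0][0]


def accuracy(predict, inputs):
    """Fraction of inputs on which predict matches the multiplexer target."""
    correct = sum(predict(x) == multiplexer11(x) for x in inputs)
    return correct / len(inputs)


# Three "sites", each with a different blind spot.
local_models = [make_local_model(a) for a in (0, 1, 2)]

# Exhaustive evaluation over all 2**11 input strings.
all_inputs = list(product((0, 1), repeat=11))

local_accuracies = [accuracy(m, all_inputs) for m in local_models]
global_accuracy = accuracy(lambda x: majority_vote(local_models, x), all_inputs)

print("local accuracies:", local_accuracies)
print("global accuracy:", global_accuracy)
```

Because the blind spots in this toy setup are disjoint, at most one model is wrong on any input, so the vote recovers the correct class everywhere. Real local models can share errors, in which case voting gives no such guarantee.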