Towards an automated classification of spreadsheets

Bibliographic details
Main author: Mendes, Jorge Cunha (author)
Other authors: Do, Kha N. (author), Saraiva, João (author)
Format: conferencePaper
Language: eng
Published in: 2016
Subjects:
Full text: http://hdl.handle.net/1822/70215
Country: Portugal
OAI: oai:repositorium.sdum.uminho.pt:1822/70215
Description
Abstract: Many spreadsheets in the wild have neither documentation nor categorization associated with them. This makes it difficult to apply spreadsheet research that targets specific spreadsheet domains, such as financial or database. In this paper, we introduce a methodology to automatically classify spreadsheets into different domains. We exploit existing data mining classification algorithms using spreadsheet-specific features. The algorithms were trained and validated with cross-validation using the EUSES corpus, reaching up to 89% accuracy. The best algorithm was applied to the larger Enron corpus in order to gain some insight from it and to demonstrate the usefulness of this work.