Towards an automated classification of spreadsheets

Bibliographic Details
Main Author: Mendes, Jorge Cunha (author)
Other Authors: Do, Kha N. (author), Saraiva, João (author)
Format: conferencePaper
Language: eng
Published: 2016
Online Access: http://hdl.handle.net/1822/70215
Country: Portugal
OAI: oai:repositorium.sdum.uminho.pt:1822/70215
Description
Summary: Many spreadsheets in the wild have neither documentation nor categorization associated with them. This makes it difficult to apply spreadsheet research that targets specific spreadsheet domains, such as financial or database. In this paper, we introduce a methodology to automatically classify spreadsheets into different domains. We exploit existing data mining classification algorithms using spreadsheet-specific features. The algorithms were trained and validated with cross-validation using the EUSES corpus, reaching an accuracy of up to 89%. The best algorithm was then applied to the larger Enron corpus to gain some insight into it and to demonstrate the usefulness of this work.