Summary: | Malware classification is challenging given the variety and growing volume of malware, as well as the number of available classification methods. As a result, it is not unusual for different classifiers to assign the same file to different malware types. Even the label assigned by a single classifier may change over time as methods are refined or new discoveries are made. When multiple independent classifiers are used, the past classifications of a file can help decide which classifier to trust. This dissertation aims to facilitate this analysis by collecting historical classification data on files that have already received their final classification, and by determining which machine learning algorithm best predicts the classification of a new file from this data. Besides the historical data, other characteristics are taken into account: the source of the file, its file type and its file size. The machine learning algorithms used are C4.5, Random Forests, Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM). This approach provided an alternative way of finding the correct malware classification of a file, given multiple classifiers, by taking its classification history into account.
|