Summary: | Human activity recognition algorithms are increasingly sought after due to their broad applicability in areas such as healthcare, safety and sports. Current work on human activity recognition is based predominantly on supervised learning algorithms and has achieved promising results. However, high performance comes at the cost of the large amount of labelled data required to train the model and learn its parameters: a higher volume of labelled data improves the algorithm's performance and the classifier's ability to generalise correctly to new, previously unseen data. The ground-truth labelling required by supervised algorithms must commonly be done manually by the user, which is tedious, time-consuming and difficult. On this account, we propose a semi-supervised active learning technique that partly automates the labelling process, considerably reducing both the labelling cost and the volume of labelled data required to obtain a high-performing classifier. This is achieved by selecting the most relevant samples for annotation and propagating their labels to similar samples. To accomplish this, several sample selection strategies were tested to identify the most valuable sample to label and add to the classifier's training set, so as to build a set representative of the entire dataset. This is followed by a semi-supervised stage that labels unlabelled samples predicted with high confidence, augmenting the training set without any extra labelling effort from the user. Lastly, five stopping criteria were tested, optimising the trade-off between the classifier's performance and the percentage of labelled data in its training set. Experiments were conducted on two datasets of real data, allowing us to validate the proposed method and compare it to methods from the literature, which were replicated.
The developed model reached accuracy values similar to supervised learning while reducing the required labelled data by more than 89% on both datasets.
|
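
The pipeline summarised above can be illustrated with a minimal sketch: a pool-based loop that alternates an active step (query the least-confident sample for a true label) with a semi-supervised step (pseudo-label pool samples the classifier predicts with high confidence). The synthetic dataset, logistic-regression model, confidence threshold and round count are illustrative assumptions, not the paper's actual setup, selection strategies or stopping criteria.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a sensor dataset (assumption, not the paper's data)
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=15, replace=False))  # small seed set
pool = [i for i in range(len(X)) if i not in labeled]       # unlabelled pool
pseudo = {}  # index -> pseudo-label assigned by the classifier itself

clf = LogisticRegression(max_iter=1000)
for _ in range(30):  # fixed round budget; the paper tests stopping criteria
    # Train on manually labelled samples plus current pseudo-labels
    if pseudo:
        idx = labeled + list(pseudo.keys())
        lab = np.concatenate([y[labeled], list(pseudo.values())])
    else:
        idx, lab = labeled, y[labeled]
    clf.fit(X[idx], lab)

    # Active step: query the least-confident pool sample; y plays the oracle
    proba = clf.predict_proba(X[pool])
    q = int(np.argmin(proba.max(axis=1)))
    labeled.append(pool.pop(q))

    # Semi-supervised step: re-derive pseudo-labels for confident samples
    proba = clf.predict_proba(X[pool])
    conf = proba.max(axis=1)
    pred = clf.predict(X[pool])
    pseudo = {pool[i]: pred[i] for i in np.where(conf > 0.99)[0]}

print(f"manually labelled fraction: {len(labeled) / len(X):.1%}")
```

With 15 seed labels plus 30 queries, only 45 of 500 samples (9%) need manual labels; everything else the classifier uses comes from confident self-labelling, mirroring the labelled-data reduction the abstract reports.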