Deep learning for activity recognition using audio and video

Bibliographic details
Main author: Reinolds, Francisco (author)
Other authors: Neto, Cristiana (author), Machado, José Manuel (author)
Format: article
Language: eng
Published: 2022
Full text: https://hdl.handle.net/1822/78007
Country: Portugal
OAI: oai:repositorium.sdum.uminho.pt:1822/78007
Description
Abstract: Neural networks have established themselves as powerhouses across many kinds of detection, from recognizing human activities to identifying emotions. Of the several types of analysis available, video is the most popular and successful; other modalities, although used less often, remain promising. In this article, audio and video analysis are compared for classifying violence in real-time streams. The study, which followed the CRISP-DM methodology, used several models available through PyTorch in order to test a diverse set of architectures and achieve robust results. The results showed why video analysis is so prevalent: video classification handily outperformed its audio counterpart, with the video models averaging 89% accuracy versus 76% for the audio models, a significant difference in performance. The study concluded that the applied methods are quite promising for detecting violence using both audio and video.
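The abstract describes PyTorch-based video classifiers for violence detection but does not name the specific architectures used. As a minimal sketch of the general setup, the toy model below (a hypothetical tiny 3D-CNN, not the authors' actual networks) shows how a PyTorch video classifier maps a clip tensor of shape (batch, channels, frames, height, width) to two logits for violent vs. non-violent:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in: the article does not specify its architectures,
# so this tiny 3D-CNN only illustrates the input/output shapes involved.
class TinyVideoClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # (B, 3, T, H, W) -> (B, 16, T, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global pool over time and space
        )
        self.fc = nn.Linear(16, num_classes)             # violent vs. non-violent logits

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.features(clip).flatten(1)               # (B, 16)
        return self.fc(x)

model = TinyVideoClassifier()
clip = torch.randn(1, 3, 8, 112, 112)  # one clip: 3 channels, 8 frames, 112x112 pixels
with torch.no_grad():
    logits = model(clip)
print(tuple(logits.shape))  # (1, 2): one score per class
```

In practice one would replace this toy network with a pretrained backbone (e.g. a 3D ResNet from torchvision) and fine-tune the final layer on labeled violent/non-violent clips.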