Speaker Diarization using Artificial Intelligence Techniques

Bibliographic Details
Main Author: Rosário, João Miguel Pinto Carrilho do (author)
Format: masterThesis
Language: eng
Published: 2020
Subjects:
Online Access: http://hdl.handle.net/10362/104277
Country: Portugal
OAI: oai:run.unl.pt:10362/104277
Description
Summary: The goal of Speaker Diarization (SD) is to answer the question "Who spoke when?" for a given audio recording in which two or more people take turns speaking. This task is paramount for Automatic Speech Recognition (ASR) applications, as it provides structured data that can improve recognition accuracy. Despite having been investigated for decades, diarization remains an unsolved problem. Current state-of-the-art methods focus either on designing probabilistic models such as Gaussian Mixture Models (GMM), where embeddings are extracted from feature matrices, or on employing Deep Neural Networks such as Recurrent Neural Networks (RNN), which are capable of extracting relevant features that differentiate each speaker and provide richer embeddings.

The proposed Speaker Diarization system relies on three modules, implemented so that the final system generalizes across different conditions while remaining efficient. The first module partitions the input audio into uniform utterances and removes any silence that is present; this ensures that the information passed to the next module contains only relevant features of a single speaker. The purpose of the second module is to extract the speaker-specific embeddings. Here, a recurrent neural network with Long Short-Term Memory (LSTM) cells is used, and experiments with networks of different sizes were conducted to better understand the benefits of recurrent neural networks. Finally, the third module applies a clustering algorithm to the extracted embeddings. At this stage, a comparison study was performed between four clustering algorithms (Spectral Clustering, DBSCAN, Hierarchical Clustering and K-Means), providing useful insight into how each algorithm performs when applied to speech data.

The results obtained for each individual module, and for the system as a whole, were satisfactory. The final Speaker Diarization system achieved a Diarization Error Rate (DER) of 7.44% on a test partition of the VoxCeleb2 dataset.
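
To make the first module concrete, the sketch below cuts a waveform into fixed-length utterances and drops silent ones using a simple energy threshold. The window length, hop size, and threshold are illustrative assumptions for this sketch, not the parameters used in the thesis.

```python
import numpy as np

def segment_speech(wave, sr=16000, win_s=1.6, hop_s=0.8, energy_thresh=1e-4):
    """Cut a waveform into fixed-length utterances, discarding silent ones.

    A toy energy-based stand-in for the first module; real systems would
    use a proper voice activity detector. All parameter values here are
    assumptions for demonstration.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    segments = []
    for start in range(0, len(wave) - win + 1, hop):
        chunk = wave[start:start + win]
        if np.mean(chunk ** 2) > energy_thresh:  # simple energy-based VAD
            segments.append(chunk)
    return np.stack(segments) if segments else np.empty((0, win))
```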
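
The second module can be pictured with the following PyTorch sketch of an LSTM network that maps a sequence of spectral frames to a single L2-normalised speaker embedding (d-vector). The layer sizes, feature dimension, and embedding dimension are assumptions chosen for illustration, not the architecture evaluated in the thesis.

```python
import torch
import torch.nn as nn

class DVectorLSTM(nn.Module):
    """Illustrative LSTM speaker-embedding network (hypothetical sizes)."""

    def __init__(self, n_mels=40, hidden=256, n_layers=3, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=n_layers, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):  # x: (batch, frames, n_mels)
        out, _ = self.lstm(x)
        emb = self.proj(out[:, -1])  # hidden state at the last frame
        return emb / emb.norm(dim=1, keepdim=True)  # L2-normalised d-vector

model = DVectorLSTM()
utterance = torch.randn(1, 160, 40)  # ~1.6 s of 40-dim log-mel frames
print(model(utterance).shape)        # torch.Size([1, 256])
```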
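
For the third module, all four clustering algorithms compared in the thesis are available in scikit-learn, so a minimal comparison harness might look like the sketch below. The embeddings, their dimensionality, and the hyperparameters are placeholders, not values from the thesis.

```python
import numpy as np
from sklearn.cluster import SpectralClustering, DBSCAN, AgglomerativeClustering, KMeans

# Hypothetical embeddings: one 256-dim d-vector per utterance.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((120, 256))

n_speakers = 2  # assumed known for the algorithms that require it

algorithms = {
    "Spectral": SpectralClustering(n_clusters=n_speakers, affinity="nearest_neighbors"),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5, metric="cosine"),
    "Hierarchical": AgglomerativeClustering(n_clusters=n_speakers),
    "K-Means": KMeans(n_clusters=n_speakers, n_init=10),
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(embeddings)
    # DBSCAN marks noise points with -1, hence the set difference.
    print(f"{name}: {len(set(labels) - {-1})} clusters found")
```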
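
Finally, the reported metric can be made concrete with a toy frame-level DER computation: DER sums missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. Note that the standard DER is computed over time with an optimal reference-to-hypothesis speaker mapping and an evaluation collar; this sketch omits both for brevity.

```python
import numpy as np

def frame_der(reference, hypothesis):
    """Toy frame-level Diarization Error Rate.

    reference, hypothesis: integer arrays of per-frame speaker labels,
    with -1 marking non-speech. Omits the collar and the optimal
    speaker mapping used in the standard time-based DER.
    """
    ref_speech = reference != -1
    hyp_speech = hypothesis != -1
    missed = np.sum(ref_speech & ~hyp_speech)       # speech labelled as silence
    false_alarm = np.sum(~ref_speech & hyp_speech)  # silence labelled as speech
    confusion = np.sum(ref_speech & hyp_speech & (reference != hypothesis))
    return (missed + false_alarm + confusion) / max(np.sum(ref_speech), 1)

ref = np.array([0, 0, 0, 1, 1, -1, 1, 1])
hyp = np.array([0, 0, 1, 1, 1, 1, 1, -1])
print(f"DER = {frame_der(ref, hyp):.2%}")  # DER = 42.86%
```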