In The Journal of the Acoustical Society of America
In this paper, an audio-driven, multimodal approach for speaker diarization in multimedia content is introduced and evaluated. The proposed algorithm is based on semi-supervised clustering of audio-visual embeddings, generated using deep learning techniques. The two modes, audio and video, are separately addressed; a long short-term memory Siamese neural network is employed to produce embeddings from audio, whereas a pre-trained convolutional neural network is deployed to generate embeddings from two-dimensional blocks representing the faces of speakers detected in video frames. In both cases, the models are trained using cost functions that favor smaller spatial distances between samples from the same speaker and greater spatial distances between samples from different speakers. A fusion stage, based on hypotheses derived from the established practices in television content production, is deployed on top of the unimodal sub-components to improve speaker diarization performance. The proposed methodology is evaluated against VoxCeleb, a large-scale dataset with hundreds of available speakers and AVL-SD, a newly developed, publicly available dataset aiming at capturing the peculiarities of TV news content under different scenarios. In order to promote reproducible research and collaboration in the field, the implemented algorithm is provided as an open-source software package.
Tsipas Nikolaos, Vrysis Lazaros, Konstantoudakis Konstantinos, Dimoulas Charalampos