TY - JOUR
T1 - Channel and channel subband selection for speaker diarization
AU - Ahmed, Ahmed Isam
AU - Chiverton, John P.
AU - Ndzi, David L.
AU - Al-Faris, Mahmoud M.
N1 - Funding Information:
The first author would like to thank the Higher Committee for Education Development in Iraq for funding his PhD study at the University of Portsmouth.
Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/9/1
Y1 - 2022/9/1
AB - Speaker diarization can be considered one of the complex problems in speaker recognition. A reliable diarization system should be able to accurately determine the variable-length utterances that a speaker contributes to multi-speaker conversations. This is a difficult problem since text-independent speaker identification and verification are yet to be improved to the point where they can be applied reliably. While efficient speaker modelling is important for diarization, the acoustical representation of speech is the basic entity that signifies a speaker. This representation should be distinctive enough to prevent a speaker's utterances from being lost in the acoustical congestion imposed by the rest of the talkers. For this purpose, it is proposed here that, for the case of multiple-microphone diarization, multiple speech signals be used in the acoustic feature extraction instead of combining the signals beforehand. The reason is to make optimal use of those signals in order to enrich the quality of the acoustical representation of the speaker. To this end, and since not all microphone signals (channels) may be desirable, two selection approaches are proposed in this work: a best-quality channel selection method and a novel approach for diverse channel selection. Furthermore, a novel method is proposed which retains the speech spectrum from selected least-reverberated subbands of the available channels’ spectra. A new model, referred to here as Averaged Joint Gradient (AJG), is introduced for this purpose. The proposed approach reduces the Diarization Error Rate (DER) in both of the diarization systems used in the evaluations. The first system is based on binary keys and achieves a maximum relative reduction in DER of 14%. The second is a Gaussian Mixture Model-Bayesian Information Criterion (GMM-BIC) based system, which achieves a maximum relative reduction in DER of 20%.
KW - Acoustic beamforming
KW - Channel selection
KW - Reverberation
KW - Speaker diarization
UR - http://www.scopus.com/inward/record.url?scp=85126614567&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2022.101367
DO - 10.1016/j.csl.2022.101367
M3 - Article
AN - SCOPUS:85126614567
SN - 0885-2308
VL - 75
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101367
ER -