Abstract
A number of enhancements to the front-end of i-vector based speaker verification and binary key based speaker diarization are introduced. These are achieved by revisiting the methods of acoustic feature extraction and feature combination, and by proposing source selection of the speech signal and a spatial feature transformation for speaker diarization. A new paradigm for extracting Mel-Frequency Cepstral Coefficient (MFCC) speech features is introduced, based on determining the cepstral coefficients from suitably selected subsets of the filters in the filter bank. The extraction of Linear Predictive Cepstral Coefficients (LPCC) is also addressed: the required estimate of the autocorrelation function is approximated as the inverse of the smoothed multitaper spectral estimates.
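The idea of computing cepstral coefficients from subsets of the filter bank can be illustrated with a minimal sketch. The abstract does not say how the subsets are chosen, so the even/odd split below is purely a hypothetical choice, as are the function names `dct_ii` and `subset_cepstra` and the made-up filter outputs. Only the general principle is shown: apply a DCT to the log energies of the selected filters rather than to the whole bank.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II of a 1-D vector, keeping the first n_coeffs terms."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]   # cepstral coefficient index
    m = np.arange(n)[None, :]          # position within the selected filters
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ x

def subset_cepstra(log_energies, subset, n_coeffs):
    """Cepstral coefficients computed from a chosen subset of filter outputs."""
    selected = np.asarray(log_energies, dtype=float)[list(subset)]
    return dct_ii(selected, n_coeffs)

# Made-up log energies for a 24-filter bank (illustration only).
log_energies = np.log(np.arange(1, 25, dtype=float))

# Hypothetical subsets: even-indexed and odd-indexed filters.
even = subset_cepstra(log_energies, range(0, 24, 2), n_coeffs=6)
odd = subset_cepstra(log_energies, range(1, 24, 2), n_coeffs=6)

# One possible way to form a feature vector: concatenate the two sets.
features = np.concatenate([even, odd])
```

Whether the subset coefficients are concatenated, averaged, or used separately is a design choice the abstract leaves open; concatenation is assumed here only to keep the example short.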
A Recurrent Neural Network (RNN) based weighted Principal Component Analysis (PCA) approach is introduced for feature fusion in addition to dimensionality reduction. This RNN based approach provides an eigendecomposition of weighted correlation and covariance matrices. The weighted PCA is found to provide a solution that can be robust to outliers and an efficient method for weighted-feature fusion.
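To make the notion of weighted PCA concrete, the sketch below eigendecomposes a weighted covariance matrix directly with `numpy.linalg.eigh`. This is a standard closed-form substitute, not the RNN based procedure of the thesis, and the per-sample weighting scheme, the function name `weighted_pca`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def weighted_pca(X, weights, n_components):
    """PCA from a weighted covariance matrix (one weight per sample).

    Uses a direct eigendecomposition; the RNN based solver described in
    the text is NOT implemented here.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalise weights to sum to 1
    mean = w @ X                             # weighted mean of the features
    Xc = X - mean
    cov = (Xc * w[:, None]).T @ Xc           # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]           # leading eigenvectors as columns
    return Xc @ components, components, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # synthetic feature vectors
weights = np.ones(200)
weights[:5] = 0.1                            # hypothetical: down-weight suspected outliers
projected, components, variances = weighted_pca(X, weights, n_components=2)
```

Down-weighting suspect samples limits their influence on the estimated covariance, which is one way a weighted formulation can become less sensitive to outliers.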
Two approaches for selecting among the signals of multiple microphones (channel selection) are proposed for speaker diarization in a meeting scenario. The first selects the most diverse signals based on the spatial diversity of the microphones. The second selects the best quality signals with reference to a signal obtained by combining all of the signals using beamforming. Additionally, a selection of the least reverberated subbands of the microphone signals is proposed, based on an estimate of the mean gradient of the spectrum of the speech frames. This subband selection is found to provide improvements comparable to those obtained when features are extracted from selected channels, but at a lower feature dimensionality.
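The second, quality based selection can be sketched as follows. The abstract does not specify the beamformer or the quality measure, so this example makes two loud assumptions: the reference is a plain average of the channels (a delay-and-sum beamformer with zero delays), and quality is the absolute normalised correlation between each channel and that reference. The function name `select_channels` and the synthetic signals are likewise hypothetical.

```python
import numpy as np

def select_channels(signals, n_select):
    """Rank channels by similarity to a combined reference and keep the best.

    signals: array of shape (n_channels, n_samples).
    The reference is a simple channel average, standing in for a real
    beamformer output (assumption for illustration).
    """
    signals = np.asarray(signals, dtype=float)
    reference = signals.mean(axis=0)
    ref_c = reference - reference.mean()
    scores = []
    for ch in signals:
        ch_c = ch - ch.mean()
        denom = np.linalg.norm(ch_c) * np.linalg.norm(ref_c)
        # Absolute normalised correlation with the reference, in [0, 1].
        scores.append(abs(ch_c @ ref_c) / denom if denom > 0 else 0.0)
    scores = np.array(scores)
    best = np.argsort(scores)[::-1][:n_select]
    return np.sort(best), scores

# Synthetic example: one source picked up by four microphones with
# different amounts of additive noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 5 * t)
noise_levels = [0.05, 0.1, 0.8, 0.8]
signals = np.stack([clean + s * rng.normal(size=t.size) for s in noise_levels])

selected, scores = select_channels(signals, n_select=2)
```

An unweighted average can itself be dominated by very noisy channels, so in practice the reference would come from a beamformer that aligns and weights the channels before combining them.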
An analysis is conducted to identify the reasons preventing the binary key based diarization system from operating on spatial features. Based on the results of this analysis, a nonlinear transformation of these features is found to be required to enable their integration into the system, which noticeably improves the diarization accuracy. Additionally, as an alternative to the uniform initialisation method usually used by this diarization system, six non-uniform initialisation methods are proposed and investigated.
Date of Award: Jul 2019
Supervisors: John Chiverton & David Ndzi