ISBN: 7801501144 (print)
In speaker recognition systems, frame selection, which decides which frames of a test utterance are useful and keeps only those, can increase recognition accuracy. In this paper, we present a new approach to frame selection using the log-likelihood ratio (LLR). It is based on the idea that if a frame contains speaker information, the log-likelihood score of the corresponding speaker model will be much larger than that of its competing models. For each frame, we therefore calculate the LLR between the largest and the second-largest scores across the speaker models and use it as a reference: frames with a small LLR are rejected and frames with a large LLR are kept. The algorithm is implemented in a GMM-based text-independent speaker identification system. We compare it with another frame selection approach based on the Jensen difference (JD). Experiments show that the JD approach reduces the error by about 39.34%, while our LLR approach reduces the error by about 46.32%.
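The selection rule described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `threshold` parameter, the use of a simple difference of log-likelihoods as the LLR, and the fallback to all frames when every frame is rejected are assumptions made for the example. The per-frame scores are taken as given (for instance, from one GMM per speaker).

```python
def select_frames(frame_scores, threshold):
    """Return indices of frames whose top-two score gap is at least threshold.

    frame_scores[t][s] is the log-likelihood of frame t under speaker model s.
    The threshold value is a hypothetical tuning parameter.
    """
    kept = []
    for t, scores in enumerate(frame_scores):
        best, second = sorted(scores, reverse=True)[:2]
        llr = best - second  # difference of logs = log of the likelihood ratio
        if llr >= threshold:
            kept.append(t)
    return kept


def identify(frame_scores, threshold):
    """Pick the speaker with the highest total score over the kept frames."""
    kept = select_frames(frame_scores, threshold)
    if not kept:
        # Assumption: fall back to all frames if every frame was rejected.
        kept = list(range(len(frame_scores)))
    n_speakers = len(frame_scores[0])
    totals = [sum(frame_scores[t][s] for t in kept) for s in range(n_speakers)]
    return max(range(n_speakers), key=totals.__getitem__)


# Toy scores for 3 frames and 3 speaker models (made-up numbers).
scores = [
    [-1.0, -5.0, -6.0],  # clear winner: large LLR, kept
    [-3.0, -3.1, -4.0],  # ambiguous: small LLR, rejected
    [-2.0, -6.0, -2.5],  # moderate LLR, kept
]
kept_frames = select_frames(scores, threshold=0.4)
speaker = identify(scores, threshold=0.4)
```

Ambiguous frames, where two speaker models score almost equally, contribute nothing to the final decision in this sketch, which is the intuition the abstract gives for rejecting small-LLR frames.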
ISBN: 7801501144 (print)
When the user has an accent different from the one the automatic speech recognition system was trained on, the performance of the system degrades. This is attributed to both acoustic and phonological differences between accents. The phonological differences between two accents are due to the different phoneme inventories of the two languages. Even for the same phoneme, foreign and native speakers produce different sounds. Since accented data is rare but monolingual data is abundant, we propose using the accented speaker's first-language data directly instead of accented data in the second language. Specifically, we adapt the native English phoneme models to accented phoneme models using first-language data in MLLR adaptation. The baseline performance is 35.25% phone accuracy when native English phone models are used to recognize Cantonese-accented English speech. We compare accent adaptation using accented data with adaptation using source-language data. On average, adaptation with accented data improves the phone accuracy by 69.98%, while adaptation with source-language data improves it by 70.13%. Both kinds of adaptation data thus give similar improvements, so non-accented data can be used for adaptation. An accent-adapted acoustic model can therefore be obtained rapidly, without collecting an accented speech database.
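For readers unfamiliar with MLLR, the standard mean transformation it estimates can be written as below. This is the textbook form of MLLR mean adaptation, given here as background; the abstract does not state which parameters were adapted or how many regression classes were used, so the symbols $A$, $b$, $W$, and $\xi_m$ are notation introduced for this note only.

```latex
% MLLR mean adaptation: one affine transform shared by all Gaussians m
% in a regression class, estimated to maximize the likelihood of the
% adaptation data (here, first-language speech).
\hat{\mu}_m = A\,\mu_m + b = W\,\xi_m,
\qquad
\xi_m = \begin{bmatrix} 1 \\ \mu_m \end{bmatrix},
\qquad
W = \begin{bmatrix} b & A \end{bmatrix}
```

Because a single $W$ is shared by many Gaussians, a small amount of adaptation data can shift all the phone models in its class at once.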
Nowadays, almost all speaker-independent (SI) speech recognition systems use continuous-density HMMs (CDHMMs) with multivariate Gaussian mixtures as observation densities to cover speaker variability. It has been shown that, given sufficient training data, the more mixtures are used in the HMM observation density, the better the system performs. However, an acoustic HMM with more Gaussian densities is more complex and slows down recognition. Another efficient way to handle speaker variation is to use speaker
We present a model-based noise compensation algorithm for robust speech recognition in nonstationary noisy environments. The effect of noise is split into a stationary part, compensated by parallel model combination, and a time-varying residual. The evolution of the residual noise parameters is represented by a set of state-space models, which are updated by Kalman prediction and the sequential maximum likelihood algorithm. Predictions of the residual noise parameters from different mixtures are fused, and the fused noise parameters are used to modify the linearized likelihood score of each mixture. Noise compensation proceeds in parallel with recognition. Experimental results demonstrate that the proposed algorithm improves recognition performance in highly nonstationary environments compared with parallel model combination alone.
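To make the idea of tracking a time-varying residual concrete, here is a minimal sketch of one predict-and-update step of a scalar Kalman filter. It is only an illustration of the generic technique: the random-walk state model, the direct noisy observation of the parameter, and all names and numbers are assumptions for the example. The abstract does not give the paper's actual state-space models, and the sequential maximum likelihood update and mixture fusion are not shown.

```python
def kalman_step(mean, var, obs, process_var, obs_var):
    """One predict/update step for a scalar random-walk state.

    Assumed model (hypothetical, for illustration only):
        x_t = x_{t-1} + w_t,  w_t ~ N(0, process_var)
        y_t = x_t + v_t,      v_t ~ N(0, obs_var)
    """
    # Predict: a random walk keeps the mean and inflates the variance.
    pred_mean = mean
    pred_var = var + process_var
    # Update: blend the prediction and the observation by the Kalman gain.
    gain = pred_var / (pred_var + obs_var)
    new_mean = pred_mean + gain * (obs - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


# Track a residual noise parameter that drifts upward (made-up observations).
mean, var = 0.0, 1.0
for y in [0.5, 1.0, 1.5, 2.0]:
    mean, var = kalman_step(mean, var, y, process_var=0.1, obs_var=0.5)
```

A larger `process_var` makes the estimate follow rapid changes in the noise more closely, at the cost of a noisier estimate.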
Minimum classification error (MCE) rate training is a discriminative training method that seeks to minimize an empirical estimate of the error probability derived over a training set. The segmental generalized probabilistic descent (GPD) algorithm for MCE uses the log-likelihood of the best path as a discriminant function to estimate the error probability. This paper shows that by using a discriminant function similar to the auxiliary function used in EM, we can obtain a "soft" version of GPD, in the sense that information about all possible paths is retained. Its complexity is similar to that of segmental GPD, and for certain parameter values the algorithm is equivalent to segmental GPD. By modifying the commonly used misclassification measure, we obtain an algorithm for embedded MCE training for continuous speech that does not require a separate N-best search to determine competing classes. Experimental results show an error rate reduction of 20% compared with maximum likelihood training.
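As background, the commonly used MCE misclassification measure and its sigmoid-smoothed loss can be sketched as below. This shows the usual formulation that the abstract says the paper modifies, not the paper's own variant; the discriminant scores are taken as given, and the function names and the `eta` and `gamma` parameters are naming assumptions for the example.

```python
import math


def misclassification_measure(scores, correct, eta=1.0):
    """d = -g_correct + (1/eta) * log(mean over competitors of exp(eta * g_j)).

    scores[j] is the discriminant value (e.g. a log-likelihood) for class j.
    d < 0 suggests a correct decision, d > 0 a misclassification.
    """
    competitors = [g for j, g in enumerate(scores) if j != correct]
    m = max(competitors)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(eta * (g - m)) for g in competitors) / len(competitors)
    return -scores[correct] + m + math.log(mean_exp) / eta


def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Sigmoid of the misclassification measure: a smooth 0/1 error count."""
    x = gamma * misclassification_measure(scores, correct, eta)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow for very negative x
    return e / (1.0 + e)


# Toy log-likelihoods for three classes (made-up numbers).
g = [-10.0, -12.0, -15.0]
loss_when_correct = mce_loss(g, correct=0)  # class 0 scores highest
loss_when_wrong = mce_loss(g, correct=2)    # class 2 scores lowest
```

As `eta` grows, the log-mean-exp term approaches the score of the single best competitor, so the measure reduces to a comparison between the correct class and its strongest rival.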
Construction of a recognizer in a new target language usually involves collection of a comprehensive database in that language as well as manual annotation and model training. For rapid development of new language rec...