A warped filter is presented as a new speech enhancement method to adjust formant bandwidths on a critical band scale. The warped filter enhances perceived loudness without adding signal energy by exploiting the psychoacoustic nature of the auditory system. The critical band concept in auditory theory states that when the energy in a signal remains constant, loudness increases when the energy spreads beyond a critical bandwidth. A warped filter is proposed and developed to elevate the perceived loudness of clean speech by applying nonlinear bandwidth expansion to the formant regions of vowels in accordance with the critical band scale. The filter has been inspired and motivated by the biological representation of loudness in the peripheral auditory system and the critical band concept of hearing.
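The critical band scale that the abstract refers to can be made concrete with a short sketch. The abstract does not say which frequency-to-Bark mapping the warped filter uses, so the widely cited Zwicker-Terhardt approximations are assumed here, and the function names are illustrative only.

```python
from math import atan


def hz_to_bark(freq_hz):
    """Critical band rate in Bark (Zwicker-Terhardt approximation, assumed)."""
    return 13.0 * atan(0.00076 * freq_hz) + 3.5 * atan((freq_hz / 7500.0) ** 2)


def critical_bandwidth_hz(freq_hz):
    """Critical bandwidth in Hz around a centre frequency (same assumption)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (freq_hz / 1000.0) ** 2) ** 0.69


if __name__ == "__main__":
    # Hypothetical formant-like centre frequencies, chosen only for illustration.
    for f in (500.0, 1500.0, 2500.0):
        print(f, round(hz_to_bark(f), 2), round(critical_bandwidth_hz(f), 1))
```

Because the critical bandwidth grows with frequency, a bandwidth expansion defined on this scale is nonlinear in Hz: higher formants need to be widened by more Hz than lower ones to spread beyond one critical band.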
In this paper, using neural network models, we demonstrate the presence of complementary speaker-specific information in the residual phase as compared to the conventional spectral features. The spectral features mainly represent the speaker-specific vocal tract system features. The proposed LP residual phase represents the speaker-specific excitation source information. Speaker recognition studies are conducted using the NIST 2003 speaker recognition evaluation database. The speaker recognition system using only spectral features gives an equal error rate (EER) of 15.5%, and using only LP residual phase information gives an EER of 22.0%. However, combining the evidence from the LP residual phase and the spectral features improves performance to an EER of 13.5%. This result clearly demonstrates the complementary nature of the speaker-specific information present in the LP residual phase.
The aim of acquiring knowledge about the emotional state of a speaker is to improve the robustness of speech recognition systems, as the mechanisms producing speech vary in the presence of emotions, and also to improve the machine's perception of a speaker's emotional state so as to respond to his/her requests more appropriately. The paper proposes an approach based on genetic algorithms to determine a set of features that will allow robust classification of positive and negative emotional states. Starting from a vector of 414 features, a subset of features is obtained that provides good discrimination between positive and negative states while maintaining low computational complexity.
We evaluate smoothing within the context of the MVA (mean subtraction, variance normalization, and ARMA filtering) post-processing scheme for noise-robust automatic speech recognition. MVA has shown great success in the past on the Aurora 2.0 and 3.0 corpora, even though it is computationally inexpensive. MVA is applied to many acoustic feature extraction methods, and is evaluated using Aurora 2.0. We evaluate MVA post-processing on MFCCs, LPCs, PLPs, RASTA, Tandem, modulation-filtered spectrogram, and modulation cross-correlogram features. We conclude that, while effectiveness does depend on the extraction method, the majority of features benefit significantly from MVA, and the smoothing ARMA filter is an important component. It appears that the effectiveness of normalization and smoothing depends on the domain in which it is applied, being most fruitfully applied just before being scored by a probabilistic model. Moreover, since it is both effective and simple, our ARMA filter should be considered a candidate method in most noise-robust speech recognition tasks.
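The three MVA steps named above can be sketched per feature dimension as follows. This is a minimal illustration, not the authors' implementation: the ARMA form used here (average of the previous `order` smoothed outputs and the current plus next `order` normalized inputs), the default order, and the choice to leave edge frames unsmoothed are all assumptions.

```python
from math import sqrt


def mva(features, order=2):
    """Apply mean subtraction, variance normalization and ARMA smoothing.

    features: list of frames, each a list of floats (one utterance).
    order: assumed ARMA order M; edge frames are left unsmoothed.
    """
    n = len(features)
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        track = [frame[d] for frame in features]
        mean = sum(track) / n
        var = sum((v - mean) ** 2 for v in track) / n
        std = sqrt(var) if var > 0.0 else 1.0  # guard against constant tracks
        norm = [(v - mean) / std for v in track]
        smoothed = norm[:]
        for t in range(order, n - order):
            past = sum(smoothed[t - order:t])      # autoregressive part
            future = sum(norm[t:t + order + 1])    # moving-average part
            smoothed[t] = (past + future) / (2 * order + 1)
        for t in range(n):
            out[t][d] = smoothed[t]
    return out
```

Applying the function to the feature matrix of each utterance just before scoring matches the abstract's observation that normalization and smoothing are most useful immediately ahead of the probabilistic model.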
Linear predictive coding (LPC) has been used to compress and encode speech signals for digital transmission at a low bit rate. The PARCOR parameters associated with LPC, which represent a vocal tract model based on a lattice filter structure, are considered for speech recognition. The use of FIR coefficients and the frequency response of the AR model were previously investigated. This paper reports a method to detect syllables from a continuous stream of speech. The system being developed slides a time window of 20 ms and calculates the PARCOR parameters continuously, feeding them to a syllable classifier. The syllable classifier is a supervised classifier that requires training. The training uses the TIMIT speech database, which contains the recordings of 630 speakers of 8 major dialects of American English. The voiced/unvoiced switch built into the LPC vocoder was modified to segment words included in the speech records. Preliminary results of classification are presented in the paper.
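The sliding-window PARCOR computation described above can be sketched with the Levinson-Durbin recursion. Only the 20 ms window comes from the abstract; the 10 ms hop, the model order of 10, the absence of a window function, and the sign convention of the reflection coefficients are assumptions made for illustration (16 kHz is the TIMIT sampling rate).

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def parcor(frame, order):
    """Reflection (PARCOR) coefficients via Levinson-Durbin.

    Sign conventions differ between texts; here k_m is the last
    coefficient of the order-m prediction-error filter A(z).
    """
    r = autocorrelation(frame, order)
    ks = [0.0] * order
    if r[0] <= 0.0:          # silent frame: return all zeros
        return ks
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        ks[m - 1] = k
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
        if err <= 0.0:       # numerically degenerate: stop early
            break
    return ks


def sliding_parcor(signal, sample_rate=16000, window_ms=20, hop_ms=10, order=10):
    """PARCOR vectors for each 20 ms window (hop and order are assumed)."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [parcor(signal[s:s + win], order)
            for s in range(0, len(signal) - win + 1, hop)]
```

Each returned vector would be one input to the syllable classifier; the classifier itself is not sketched because the abstract does not specify it.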
As broadband IP networks become common in homes and offices, the demand for IP telephony is increasing. Packet loss in IP telephony seriously degrades speech quality; hence, packet loss concealment (PLC) is required. The G.711 Appendix I PLC algorithm was recommended by ITU-T in 1999 and is widely used in IP telephony. Because the performance of the G.711 PLC scheme is not sufficient, we previously developed an improved algorithm based on LPC analysis and synthesis. In the improved method, past speech is decomposed into LPC parameters and a residual signal; the lost residual signal is recovered by the G.711 PLC scheme, and the lost speech signal is recovered by synthesis from the repeated LPC parameters and the recovered residual signal. In this paper, a sinusoidal model is introduced to predict the lost residual signal more accurately. The estimation and prediction performance are evaluated by objective measures, and the prediction is embedded in the improved G.711 PLC method.
We study the performance of an FIR cascade structure for adaptive linear prediction, in which each FIR filter stage is independently adapted using an LMS algorithm. The performance bound is derived for the cascade LMS predictor under some assumptions. We discover that it is possible for this bound to be better than that of the linear predictive coding (LPC) technique using the block-based Levinson-Durbin algorithm if the cascade LMS predictor is well selected. We show some examples of this bound for synthetic and real audio signals in which a cascade LMS predictor outperforms the LPC technique using the Levinson-Durbin algorithm.
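The cascade structure can be illustrated with a small sketch in which each stage is a short FIR predictor adapted by its own LMS update, and the prediction error of one stage is the input of the next. The number of stages, the taps per stage, the step size `mu`, and the test signal are assumptions for illustration; the abstract does not give the authors' settings.

```python
from math import sin


def cascade_lms(signal, stages=2, taps=2, mu=0.05):
    """Run a cascade of independently adapted LMS predictors.

    Returns the final-stage prediction error and the stage weights.
    """
    weights = [[0.0] * taps for _ in range(stages)]
    history = [[0.0] * taps for _ in range(stages)]  # past inputs per stage
    residual = []
    for x in signal:
        v = x
        for w, h in zip(weights, history):
            pred = sum(wi * hi for wi, hi in zip(w, h))
            e = v - pred                      # this stage's prediction error
            for i in range(taps):
                w[i] += mu * e * h[i]         # LMS update on this stage only
            h.insert(0, v)                    # newest input first
            h.pop()
            v = e                             # error feeds the next stage
        residual.append(v)
    return residual, weights


def demo_signal(n):
    """Hypothetical test input: a single sinusoid."""
    return [sin(0.3 * t) for t in range(n)]
```

Comparing the energy of `residual` with that of a block-based Levinson-Durbin predictor of the same total order is the kind of experiment the abstract describes; the sketch only provides the cascade side of that comparison.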
This work proposes a novel method of predicting formant frequencies from a stream of mel-frequency cepstral coefficients (MFCC) feature vectors. Prediction is based on modelling the joint density of MFCCs and formant frequencies using a Gaussian mixture model (GMM). Using this GMM and an input MFCC vector, two maximum a posteriori (MAP) prediction methods are developed. The first method predicts formants from the closest, in some sense, cluster to the input MFCC vector, while the second method takes a weighted contribution of formants predicted from all clusters. Experimental results are presented using the ETSI Aurora connected digit database and show that predicted formant frequencies are within 3.2% of reference formant frequencies.
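The two prediction methods can be written out using the standard conditional-Gaussian result for a joint GMM. The notation below is assumed rather than taken from the paper, and "closest cluster" is interpreted here as the cluster with the highest posterior probability given the MFCC vector, which the abstract leaves open.

```latex
% x: MFCC vector, f: formant vector, z = [x; f], K clusters (notation assumed)
p(\mathbf{z}) = \sum_{k=1}^{K} \alpha_k \,
  \mathcal{N}\!\left(\mathbf{z};\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad
\boldsymbol{\mu}_k =
  \begin{bmatrix} \boldsymbol{\mu}_k^{x} \\ \boldsymbol{\mu}_k^{f} \end{bmatrix},
\quad
\boldsymbol{\Sigma}_k =
  \begin{bmatrix}
    \boldsymbol{\Sigma}_k^{xx} & \boldsymbol{\Sigma}_k^{xf} \\
    \boldsymbol{\Sigma}_k^{fx} & \boldsymbol{\Sigma}_k^{ff}
  \end{bmatrix}

% Formant estimate from cluster k given the MFCC vector x
\hat{\mathbf{f}}_k(\mathbf{x}) = \boldsymbol{\mu}_k^{f}
  + \boldsymbol{\Sigma}_k^{fx} \left(\boldsymbol{\Sigma}_k^{xx}\right)^{-1}
    \left(\mathbf{x} - \boldsymbol{\mu}_k^{x}\right)

% Posterior probability of cluster k given x
h_k(\mathbf{x}) =
  \frac{\alpha_k \, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_k^{x}, \boldsymbol{\Sigma}_k^{xx}\right)}
       {\sum_{j=1}^{K} \alpha_j \, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_j^{x}, \boldsymbol{\Sigma}_j^{xx}\right)}

% Method 1: single (closest) cluster.  Method 2: posterior-weighted sum.
\hat{\mathbf{f}}_{(1)} = \hat{\mathbf{f}}_{k^{*}}(\mathbf{x}),
\quad k^{*} = \arg\max_{k} h_k(\mathbf{x});
\qquad
\hat{\mathbf{f}}_{(2)} = \sum_{k=1}^{K} h_k(\mathbf{x}) \, \hat{\mathbf{f}}_k(\mathbf{x})
```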