ISBN (print): 9781538625163
Vocal rehabilitation devices used by patients after laryngectomy produce unnatural-sounding speech. Our study aims to increase the quality of these synthetically generated voices by imparting human-like characteristics. A simplified source-filter model, linear predictive coding coefficients, and line spectral frequencies were used to model the vocal tract and manipulate the acoustic features of the resulting speech. Two different mapping functions were employed to convert between the features of a synthetically generated voice and those of a human voice: a Gaussian mixture model and a linear regression model. The models were trained on a set of 50 human and 50 synthetic voice utterances. Both mapping functions yielded significant changes in the transformed synthetic voices, and their spectra were similar to those of the human voices. The linear regression mapping produced slightly better results than the Gaussian mixture model mapping. Listening tests confirmed this result, but indicated that voices re-synthesized from the transformed model coefficients, while improved over the synthetic voice, still sounded unnatural. This may imply that the vocal tract model lacks information that produces the subjective perception of "artificial speech". Future work will investigate a more elaborate model that includes the speech-production excitation and radiation signals and the transformation of their features. Such models have the potential to improve the conversion of a synthetically generated electrolarynx voice into a human-sounding one.
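The mapping idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: LPC coefficients are extracted per utterance via the autocorrelation method, and a linear regression is fit from synthetic-voice features to human-voice features. The frame data, model order, and least-squares formulation here are all assumptions for demonstration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Autocorrelation-method LPC coefficients via a Toeplitz solve."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])  # a_1..a_p

rng = np.random.default_rng(0)
order = 8
# Stand-ins for 50 aligned synthetic/human utterance frames (real speech in the paper).
synth_frames = rng.standard_normal((50, 400))
human_frames = synth_frames + 0.1 * rng.standard_normal((50, 400))

X = np.array([lpc(f, order) for f in synth_frames])   # synthetic-voice features
Y = np.array([lpc(f, order) for f in human_frames])   # human-voice features

# Linear-regression mapping (weights plus bias) from synthetic to human feature space.
Xb = np.hstack([X, np.ones((len(X), 1))])
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
Y_hat = Xb @ W  # transformed synthetic features, to be re-synthesized
```

The trained matrix `W` plays the role of the paper's linear regression mapping; the GMM-based alternative would replace this single global transform with a mixture of locally linear ones.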
ISBN (print): 9781509045914
This paper compares three different features at various feature orders to determine the best feature for gunshot detection under adverse noise conditions. The compared features are LPC, LPCC, and MFCC, with orders from 8 to 30. All features were extracted from sounds at signal-to-noise ratios of 30, 20, 10, and 0 dB; the background noise was simulated with white noise. Experimental results indicate that the LPC coefficients are the most efficient features, especially at low noise levels. On the other hand, MFCC performed well in the noisier environments at 10 dB and 20 dB.
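The noise conditions above can be reproduced by scaling white noise to a target SNR before mixing. A minimal sketch (the test signal and sample rate are illustrative assumptions):

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=np.random.default_rng(1)):
    """Scale white noise so the mix has the requested signal-to-noise ratio in dB."""
    noise = rng.standard_normal(len(signal))
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

# One second of a 440 Hz tone at 16 kHz as a stand-in for a gunshot recording.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
for snr in (30, 20, 10, 0):  # the four SNR conditions from the paper
    noisy = add_white_noise(sig, snr)
```

Because the scale factor is computed from the empirical noise power, the mixed signal meets the target SNR exactly rather than only in expectation.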
ISBN (print): 9781479983490
Speech emotion recognition is one of the recent challenges in speech processing and Human-Computer Interaction (HCI), addressing various operational needs of real-world applications. Besides human facial expressions, speech has proven to be one of the most valuable modalities for automatic recognition of human emotions. Speech is a spontaneous medium for conveying emotion, providing in-depth information about the different cognitive states of a human being. In this context, a novel approach is introduced using a combination of prosody features (pitch, energy, zero-crossing rate), quality features (formant frequencies, spectral features, etc.), derived features (Mel-frequency cepstral coefficients (MFCC), linear predictive coding coefficients (LPCC)), and a dynamic feature (Mel-energy spectrum dynamic coefficients (MEDC)) for robust automatic recognition of a speaker's emotional state. A multilevel SVM classifier is used to identify seven discrete emotional states, namely anger, disgust, fear, happiness, neutral, sadness, and surprise, in five native Assamese languages. The overall results of the conducted experiments reveal that the combined-feature approach achieved an average accuracy of 82.26% in speaker-independent cases.
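The prosody features named above (pitch, energy, zero-crossing rate) can be computed per frame roughly as follows. This is a generic sketch, not the paper's extractor; the frame length, pitch range, and autocorrelation-based F0 estimate are assumptions.

```python
import numpy as np

def prosody_features(frame, fs):
    """Frame-level energy, zero-crossing rate, and autocorrelation pitch estimate."""
    energy = np.mean(frame ** 2)
    # Each sign change contributes |diff(sign)| = 2, so divide by 2 per crossing.
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    # Pitch: autocorrelation peak restricted to a plausible F0 range (50-400 Hz).
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / 400), int(fs / 50)
    f0 = fs / (lo + np.argmax(ac[lo:hi]))
    return np.array([energy, zcr, f0])

fs = 16000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 200 * t)  # 200 Hz test tone
feats = prosody_features(frame, fs)
```

A full system would stack these with the quality, derived (MFCC, LPCC), and dynamic (MEDC) features into one vector per utterance before feeding the multilevel SVM.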
In recent years, the accuracy of speech recognition (SR) has been one of the most active areas of research. Although SR systems work reasonably well in quiet conditions, they still suffer severe performance degradation in noisy conditions or over distorted channels. It is therefore necessary to search for more robust feature extraction methods that give better performance in adverse conditions. This paper investigates the performance of conventional and new hybrid speech feature extraction algorithms, Mel-frequency cepstral coefficients (MFCC), linear prediction coding coefficients (LPCC), perceptual linear prediction (PLP), and RASTA-PLP, in noisy conditions using a multivariate Hidden Markov Model (HMM) classifier. The behavior of the proposed system is evaluated on the TIDIGITS human voice corpus, recorded from 208 different adult speakers, in both training and testing. The theoretical basis for the speech processing and classifier procedures is presented, and the recognition results are reported as word recognition rates.
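The RASTA processing mentioned above band-pass filters each log-spectral trajectory over time to suppress slowly varying channel distortion. A minimal sketch using the commonly cited RASTA filter coefficients (applied here to a synthetic log-energy trajectory, not the full PLP pipeline):

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_traj):
    """Band-pass filter a log-spectral trajectory to remove slow channel drift."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator (zero DC gain)
    a = np.array([1.0, -0.98])                        # leaky integrator pole
    return lfilter(b, a, log_traj)

# A constant offset models convolutional channel distortion in the log domain;
# the filter should remove it while passing the modulation component.
traj = np.sin(np.arange(200) * 0.3) + 5.0
filtered = rasta_filter(traj)
```

Because the numerator coefficients sum to zero, the filter has zero gain at DC, which is what removes the constant channel offset after the initial transient dies out.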
The Wake-Up-Word Speech Recognition (WUW-SR) task is computationally very demanding, particularly the feature extraction stage, whose output is decoded with the corresponding Hidden Markov Models (HMMs) in the back-end stage of the WUW-SR system. The state-of-the-art WUW-SR system is based on three different sets of features: Mel-frequency cepstral coefficients (MFCC), linear predictive coding coefficients (LPC), and enhanced Mel-frequency cepstral coefficients (ENH_MFCC). In "Front-End of Wake-Up-Word Speech Recognition System Design on FPGA" [1], we presented an experimental FPGA design and implementation of a novel real-time spectrogram extraction processor that generates MFCC, LPC, and ENH_MFCC spectrograms simultaneously. In this paper, we present the details of converting the three sets of spectrograms, 1) MFCC, 2) LPC, and 3) ENH_MFCC, to their equivalent features. In the WUW-SR system, the recognizer's front-end is located at the terminal, which is typically connected over a data network to a remote back-end recognizer (e.g., a server). The WUW-SR system is shown in Figure 1. The three sets of speech features are extracted at the front-end, then compressed and transmitted to the server via a dedicated channel, where they are subsequently decoded.
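The compress-and-transmit step described above can be illustrated with a simple scheme. The abstract does not specify the codec, so the 8-bit linear quantization below is purely an assumed stand-in showing how feature frames shrink before transmission:

```python
import numpy as np

def quantize(frames):
    """Map float feature frames to 8-bit codes plus the (lo, hi) range needed to decode."""
    lo, hi = frames.min(), frames.max()
    q = np.round((frames - lo) / (hi - lo) * 255).astype(np.uint8)
    return q, lo, hi

def dequantize(q, lo, hi):
    """Server-side reconstruction of the feature frames from the 8-bit codes."""
    return q.astype(np.float64) / 255 * (hi - lo) + lo

rng = np.random.default_rng(2)
mfcc = rng.standard_normal((100, 13))  # stand-in for 100 frames of 13 MFCCs
q, lo, hi = quantize(mfcc)             # front-end: 1 byte per coefficient
restored = dequantize(q, lo, hi)       # back-end: decode before HMM scoring
```

This cuts the payload from 8 bytes to 1 byte per coefficient at the cost of a bounded quantization error of half a step, (hi - lo)/510.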