Speaker diarization systems attempt to assign temporal speech segments in a conversation to the appropriate speaker, and non-speech segments to non-speech. In essence, they answer the question "Who spoke when?". One inherent deficiency of most current systems is their inability to handle co-channel or overlapped speech. During the past few years, several studies have addressed the detection and separation of overlapped or co-channel speech. However, most of the proposed algorithms work only under specific conditions, are computationally expensive, and require both time-domain and frequency-domain analysis of the audio data. In this study, frame-based entropy analysis of the audio data in the time domain serves as the single feature for an overlapped speech detection algorithm. Overlapped speech segments are identified using Gaussian Mixture Modeling (GMM) together with well-known classification algorithms applied to two-speaker conversations. This methodology eliminates the need to set a hard threshold for each conversation or database. The LDC CALLHOME American English corpus is used to evaluate the suggested algorithm. The proposed method successfully detects 60.0% of the frames labeled as overlapped speech by the baseline (ground-truth) segmentation, while keeping a 5% false-alarm rate.
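As a rough illustration of the frame-based, time-domain entropy feature described above, the following minimal sketch estimates the Shannon entropy of each frame from a histogram of its sample amplitudes. The abstract does not say how the entropy is estimated, so the histogram estimator, the function names, and the frame, hop, and bin sizes are all assumptions made for illustration; the GMM classification stage is not shown.

```python
import math


def frame_entropy(frame, bins=32):
    """Shannon entropy (in bits) of the amplitude histogram of one frame.

    Assumption: a fixed-bin histogram over the frame's own amplitude range
    is used as the probability estimate.
    """
    lo, hi = min(frame), max(frame)
    if hi == lo:
        # A constant frame occupies a single bin, so its entropy is zero.
        return 0.0
    counts = [0] * bins
    for x in frame:
        idx = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(frame)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def entropy_track(signal, frame_len=200, hop=100):
    """Entropy of each full frame, taken every `hop` samples."""
    return [
        frame_entropy(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
```

The resulting one-dimensional feature sequence is the kind of input that a per-class GMM could then be fitted to, which is how a data-driven decision could replace a hand-set threshold.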
A classifier for utterance rejection in a hidden Markov model (HMM) based speech recognizer is presented. This classifier, termed the two-pass classifier, is a postprocessor to the HMM recognizer, and consists of a two-stage discriminant analysis. The first stage employs the generalized probabilistic descent (GPD) discriminative training framework, while the second stage performs linear discrimination combining the output of the first stage with HMM likelihood scores. In this fashion the classification power of the HMM is combined with that of the GPD stage, which is specifically designed for keyword/nonkeyword classification. Experimental results show that, on two separate databases, the two-pass classifier significantly outperforms a single-pass classifier based solely on the HMM likelihood scores.
In an effort to provide a more efficient representation of the acoustical speech signal in the pre-classification stage of a speech recognition system, we consider the application of the Best-Basis Algorithm of R.R. Coifman and M.L. Wickerhauser (1992). This combines the advantages of using a smooth, compactly supported wavelet basis with an adaptive time-scale analysis that depends on the problem at hand. We start by briefly reviewing areas within speech recognition where the wavelet transform has been applied with some success. Examples include pitch detection, formant tracking, and phoneme classification. Finally, our wavelet-based feature extraction system is described and its performance on a simple phonetic classification problem is given.
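To make the best-basis idea concrete, here is a minimal sketch of an entropy-driven best-basis search over a wavelet packet tree. It is not the authors' system: it swaps in the Haar filter pair for simplicity (the abstract calls for a smooth, compactly supported wavelet), uses the additive entropy cost -sum(c^2 log c^2), and assumes a signal whose length is a power of two. All function names are hypothetical.

```python
import math


def haar_split(x):
    """One orthonormal Haar step: returns (low-pass, high-pass) halves."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high


def entropy_cost(coeffs):
    """Additive information cost: -sum(c^2 * log(c^2)), skipping zeros."""
    return -sum(c * c * math.log(c * c) for c in coeffs if c != 0.0)


def best_basis(x, max_level):
    """Return (cost, nodes), where nodes is a list of (level, index, coeffs).

    A node is kept whole unless the best bases of its two children
    together have a strictly lower cost.
    """
    def search(node, level, index):
        own = entropy_cost(node)
        if level == max_level or len(node) < 2:
            return own, [(level, index, node)]
        low, high = haar_split(node)
        cost_low, basis_low = search(low, level + 1, 2 * index)
        cost_high, basis_high = search(high, level + 1, 2 * index + 1)
        if cost_low + cost_high < own:
            return cost_low + cost_high, basis_low + basis_high
        return own, [(level, index, node)]

    return search(list(x), 0, 0)
```

With this cost, a single impulse is left in the original sample basis, while a constant signal is decomposed all the way down so that its energy sits in one coefficient. This is the adaptive behavior the abstract refers to.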
We present in this paper a new binomial sine pulse (BSP) excitation signal for use in linear prediction-based speech codecs. The BSP excitation signal is a sine wave whose amplitude is modulated by a binomial signal. The binomial signal describes the various trends of excitation signals within a pitch period, and the pulsatance (angular frequency) of the BSP excitation signal coincides with the vibration frequency of the vocal folds. In the experiments, processing proceeds frame by frame, and the same excitation signal is placed at every pitch excitation moment in a frame. Speech codecs based on this new BSP excitation have the advantages of low complexity and low delay. Experimental results show that such a speech codec can provide highly intelligible synthesized speech below 3 kbps.
An algorithm is presented for isolated-word recognition that takes into account the duration variability among different utterances of the same word. The algorithm is based on extracting acoustical features from the speech signal and using them as the input to multilayer perceptron neural networks. The backpropagation algorithm is used to train the networks. A hidden Markov model (HMM) is used to extract temporal features (states) from the speech signal. The input vector to the network consists of 16 cepstral coefficients, two delta cepstral coefficients, and five elements to represent the state. The networks are trained to recognize the correct words and to reject the wrong words. The training set consists of ten words (digit zero to digit nine), each uttered seven times by three different speakers. The test set consists of three utterances of each of the ten words. The authors' results show the ability to recognize all of these words.
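The network input described above has 16 + 2 + 5 = 23 elements per frame. The sketch below assembles such a vector. The abstract does not say how the five state elements are encoded, so the one-hot encoding used here is an assumption, and the function and parameter names are hypothetical.

```python
def build_input_vector(cepstra, delta_cepstra, state, n_states=5):
    """Concatenate 16 cepstra, 2 delta cepstra, and a state code.

    Assumption: the HMM state is encoded one-hot over `n_states` elements.
    """
    if len(cepstra) != 16:
        raise ValueError("expected 16 cepstral coefficients")
    if len(delta_cepstra) != 2:
        raise ValueError("expected 2 delta cepstral coefficients")
    if not 0 <= state < n_states:
        raise ValueError("state index out of range")
    one_hot = [1.0 if i == state else 0.0 for i in range(n_states)]
    return list(cepstra) + list(delta_cepstra) + one_hot
```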
A new approach to temporal decomposition (TD) of speech, called "spectral stability based event localizing temporal decomposition" and abbreviated S²BEL-TD, is presented. The original TD method proposed by Atal (1983) is known to suffer from high computational cost and from instability in the number and locations of events. In S²BEL-TD, event localization is based on a maximum spectral stability criterion, which overcomes the event instability of Atal's method. S²BEL-TD also avoids the computationally costly singular value decomposition routine used in Atal's method, resulting in a computationally simpler TD algorithm. Simulation results show that an average spectral distortion of about 1.5 dB can be achieved with line spectral frequencies (LSF) as the spectral parameter. We have also shown that the temporal pattern of the speech excitation parameters can be described well using the S²BEL-TD technique.
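The abstract reports spectral distortion in dB without defining it. One commonly used definition is the root-mean-square difference between two log power spectra, which the sketch below computes for spectra sampled on the same uniform frequency grid. Whether the authors used exactly this form is an assumption, and deriving the power spectra from LSF parameters is not shown.

```python
import math


def spectral_distortion_db(ref_power, test_power):
    """RMS difference in dB between two power spectra.

    Both inputs are sequences of strictly positive power values sampled
    at the same frequencies.
    """
    if len(ref_power) != len(test_power) or not ref_power:
        raise ValueError("spectra must be non-empty and of equal length")
    squared = [
        (10.0 * math.log10(p) - 10.0 * math.log10(q)) ** 2
        for p, q in zip(ref_power, test_power)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

An average distortion figure such as the one quoted above would then be the mean of this per-frame value over all frames.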
ISBN: (print) 0769525210
In a high-dimensional feature space with finite samples, severe bias can be introduced in the nearest neighbor algorithm. In this paper, we propose a new classification method that performs the classification task based on the local probability center of each class. This prototype-based method classifies the query sample using two measures: the distance between the query and the local probability centers, and the posterior probability of the query. Although both measures are effective, the experiments show that the second one is better. The results show that this method substantially improves the classification performance of the nearest neighbor algorithm.
This paper presents problems related to the classification of singing voice quality. For this purpose, a database of singers' sample recordings is constructed, and parameters are extracted from the recorded voices of trained and untrained singers. The parameterization process is based on both voice source and formant analysis of the singing voice. The physical interpretation of these parameters is explained, and they are analyzed statistically in order to reduce their number. The statistical analysis is based on the Fisher statistic. In this way, a feature vector of a singing voice is formed. Decision systems based on neural networks and rough sets are used for voice type and voice quality classification. The results of automatic classification by both decision systems are compared, and the feasibility of automatically classifying voice type and quality is assessed. The proposed methodology provides a means of discerning trained from untrained singers.
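As an illustration of reducing a parameter set with the Fisher statistic, the sketch below scores each parameter by one common two-class form, the squared difference of class means divided by the sum of class variances, and ranks parameters by that score. The abstract does not give the exact formula, so this form is an assumption, and the function names and dictionary layout are hypothetical.

```python
import statistics


def fisher_statistic(class_a, class_b):
    """(mean_a - mean_b)^2 / (var_a + var_b), using sample variances.

    Assumes each class has at least two values and the variances are
    not both zero.
    """
    numerator = (statistics.mean(class_a) - statistics.mean(class_b)) ** 2
    denominator = statistics.variance(class_a) + statistics.variance(class_b)
    return numerator / denominator


def rank_features(trained, untrained):
    """Rank parameter names from most to least discriminative.

    `trained` and `untrained` map each parameter name to its list of
    measured values for that group of singers.
    """
    scores = {
        name: fisher_statistic(trained[name], untrained[name])
        for name in trained
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Keeping only the top-ranked names is one simple way to arrive at a shorter feature vector.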
An analysis and simulation results are presented comparing the performance of several types of high-order backward adaptive predictors with orders up to 100. Issues in high-order linear predictive coding (LPC) analysis, such as analysis methods, windowing, ill-conditioning, quantization noise effects, and computational complexity, are studied. The performance of the various analysis methods is compared with the conventional sequential formant-pitch predictor. The auto-correlation method (50th order) shows performance advantages over the sequential formant-pitch configurations. Several new backward high-order methods using covariance analysis and a lattice formulation show much better prediction gains than the auto-correlation method.
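For reference, the auto-correlation method mentioned above is usually solved with the Levinson-Durbin recursion, and prediction gain is the ratio of signal energy to residual energy in dB. The sketch below shows both in minimal form. It leaves out the windowing, conditioning safeguards, and backward adaptation (analysis on past reconstructed samples) that the abstract studies, and the function names are hypothetical.

```python
import math


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a (typically windowed) frame."""
    n = len(x)
    return [
        sum(x[i] * x[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, err): a[0] == 1 and a[1..order] are the prediction-error
    filter coefficients; err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        err *= 1.0 - reflection * reflection
    return a, err


def prediction_gain_db(r, order):
    """Prediction gain: 10 * log10(signal energy / residual energy)."""
    _, err = levinson_durbin(r, order)
    return 10.0 * math.log10(r[0] / err)
```

At orders as high as 50 or 100, the division by `err` becomes sensitive to ill-conditioning, which is one of the issues the abstract lists.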