This paper presents a novel method for estimating formant frequencies and bandwidths based on an underlying vocal tract model. A novel statistical model for vocal tract cross-sectional areas is developed, which allows computation of full likelihood functions. Modifications to the basic particle filter algorithm have also been developed to help combat both diversity depletion and convergence problems. The performance of the method is evaluated against a hand-labeled formant database (L. Deng et al., 2003).
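The abstract does not say which modifications were made to the particle filter, so the following is only a generic illustration of the diversity-depletion problem, not the authors' method. It sketches two textbook tools: the effective sample size, which shrinks as a few particles come to dominate the weights, and systematic resampling. The function names and the optional `u` offset are assumptions made for this example.

```python
import random


def effective_sample_size(weights):
    """Return 1 / sum(w_i^2) for the normalized weights.

    The value equals the particle count for uniform weights and
    approaches 1 when a single particle carries almost all the weight.
    """
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


def systematic_resample(weights, u=None):
    """Return particle indices drawn by systematic resampling.

    `u` is the single uniform offset in [0, 1); it is exposed here
    (an assumption of this sketch) so the result can be reproduced.
    """
    n = len(weights)
    total = sum(weights)
    if u is None:
        u = random.random()
    indices = []
    j = 0
    cumulative = weights[0] / total
    for i in range(n):
        position = (u + i) / n  # evenly spaced pointers into the CDF
        while position > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        indices.append(j)
    return indices
```

A common heuristic is to resample only when the effective sample size falls below a threshold such as half the particle count. Heavily weighted particles are then duplicated, which is exactly where diversity is lost and where modified schemes try to help.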
Some of the work on speech processing has focused on modeling speech as an AM-FM signal. The success of the AM-FM model motivated us to investigate a similar nonlinear model and examine its application in speaker identification. Tests are carried out to compare the performance of the novel cyclic correlation based method with popular speaker identification methods based on cepstra. These studies show that the performance of the proposed method is comparable to the cepstrum based approach at high signal-to-noise ratio, but the former outperforms the latter under noisy conditions.
In speech recognition, LPC cepstra derived from linear prediction and MFCCs derived from a Mel-frequency filter bank are widely used features, and the choice of feature extraction determines recognition performance. However, these are not regarded as the best possible features. In this paper, we introduce complex speech analysis of an analytic speech signal into HMM speech recognition. Complex speech analysis can estimate the speech spectrum more accurately at low frequencies; as a result, it is expected to perform well as a feature extractor in speech recognition. The MMSE-based time-varying complex AR speech analysis is adopted, and the estimated complex parameters are converted to LPCCs and MFCCs as feature vectors for HTK (HMM toolkit) in order to realize HMM speech recognition. Through continuous speech recognition experiments with the converted LPCCs and MFCCs, it was found that the complex speech analysis method did not outperform the real-valued one.
Three different types of pole-zero modeling have been investigated. The main concept underlying these methods is to fit a high-order pole predictor to the speech spectrum, and then to decompose the resulting predictor into a pole predictor and a zero predictor. To obtain the predictor parameters by Padé approximation, either the Trench algorithm or the Berlekamp-Massey algorithm can be used for the decomposition. In addition to the Padé approximation method, the generalized inverse method and the modified Durbin's method are also considered for pole-zero modeling. Preliminary results show that the proposed methods fit the speech spectral envelope, especially spectral nulls, better than the all-pole model.
We explore the use of the multi-frame GMM-based block quantiser for quantising line spectral frequencies for wideband speech coding. Its main advantages over vector quantisers are bitrate scalability and bitrate independent complexity. By concatenating multiple frames together, interframe correlation can be exploited by the KLT (Karhunen-Loeve transform), leading to better quantisation. A saving of up to 3 bits/frame can be achieved by switching the quantiser from memoryless mode to jointly quantising two frames, with only a moderate increase in complexity. This quantisation scheme achieves lower spectral distortion than the split-multistage vector quantiser in the AMR-WB speech codec, with transparent coding at 37 bits/frame.
In this paper, we discuss the internet low bit rate codec (iLBC) with an emphasis on the frame-independent long-term prediction. The frame-independent long-term prediction is a method to exploit pitch-lag correlations in the encoding of speech without suffering multiple-frame speech degradation in connection with transmission loss. We present mean opinion scores for the iLBC codec and show by means of signal examples how the nature of degradation in a predictive codec based on frame-independent long-term prediction differs from that of traditional CELP codecs.
In this paper, a new speech coding method is presented, based on the multipulse model of speech production. A closeness relationship between natural and synthetic utterances can be defined in terms of the speech signal itself: at some intervals, the coding error (that is, the difference between the natural and synthetic signals) is set to a particular value in order to meet pre-established signal-to-noise ratio criteria, whereas at other intervals equality of the envelopes of both signals is the only constraint on the synthesis process. With the closeness constraints defined in this way, the optimum multipulse sequence can be found by minimizing the sum of the absolute values of its samples.
A vector quantization based talker recognition system is described and evaluated. The system is based on constructing highly efficient short-term spectral representations of individual talkers using vector quantization codebook construction techniques. Although the approach is intrinsically text-independent, the system can be easily extended to text-dependent operation for improved performance and security by encoding specified training word utterances to form word prototypes. The system has been evaluated using a 100-talker database of 20,000 spoken digits. In a talker verification mode, average equal-error rate performance of 2.2% for text-independent operation and 0.3% for text-dependent operation is obtained for 7-digit long test utterances.
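As a rough illustration of how a codebook-per-talker system can make a decision, the sketch below scores test frames against each talker's codebook by average squared-error distortion and picks the smallest. The abstract does not state the distortion measure or decision rule, so both are assumptions here, as are the function names and the toy two-dimensional "spectral" vectors.

```python
def average_distortion(frames, codebook):
    """Mean squared-error distance from each frame to its nearest codeword."""
    total = 0.0
    for frame in frames:
        total += min(
            sum((a - b) ** 2 for a, b in zip(frame, codeword))
            for codeword in codebook
        )
    return total / len(frames)


def identify_talker(frames, codebooks):
    """Return the talker whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda name: average_distortion(frames, codebooks[name]))


def verify_talker(frames, codebook, threshold):
    """Accept the claimed identity if the distortion is below a threshold."""
    return average_distortion(frames, codebook) < threshold


# Toy codebooks (hypothetical data, two codewords per talker).
codebooks = {
    "A": [(0.0, 0.0), (1.0, 1.0)],
    "B": [(5.0, 5.0), (6.0, 6.0)],
}
test_frames = [(0.1, 0.0), (0.9, 1.2)]
best = identify_talker(test_frames, codebooks)
```

In a verification setting, sweeping the `threshold` trades false acceptances against false rejections; the equal-error rate quoted in the abstract is the point where the two are equal.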
This paper introduces a new method for approximating the excitation signal in a linear predictive coding (LPC) model over both the time and frequency domains. In the frequency domain, a new two-band excitation model is proposed for decomposing the signal spectrum into periodic and non-periodic components using the smoothed Wigner-Ville (SWV) distribution. This model provides high resolution in the time-frequency plane, which gives accurate information about pitch and voicing. The periodic component is then analysed with a new "multi-pulse"-like technique that approximates the signal in the time domain by repeating a representative pattern across each frame. It is shown that combining the spectral and temporal models improves significantly on the performance of our earlier excitation model and gives a good approximation to the signal for most speech sounds. Evaluation results show that the proposed coding method operating at 2400 b/s outperforms current standard coders at similar rates.
ISBN: (print) 0818679190
Efficient quantization methods for line spectrum pairs (LSP) are proposed that offer good performance with low complexity and low memory requirements. An adaptive quantization method that exploits the ordering property of LSP parameters is used in a scalar quantizer and in a vector-scalar hybrid quantizer. The maximum quantization range of each LSP parameter is varied adaptively based on the quantized value of the previous-order LSP parameter. The proposed scalar quantization algorithm needs 31 bits/frame to maintain transparent speech quality, which is 3 bits fewer than the conventional scalar quantization method with interframe prediction. The improved vector-scalar quantizer achieves an average spectral distortion of 1 dB using 26 bits/frame. The performance of the proposed quantization methods is also evaluated in the presence of transmission errors.
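The ordering-based range adaptation can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a uniform quantizer per parameter, and it assumes that the lower edge of each parameter's range is simply the quantized value of the previous parameter. The function name, the upper bounds, and the bit allocation are all hypothetical.

```python
def quantize_lsp_adaptive(lsp, upper_bounds, bits_per_param):
    """Scalar-quantize ordered LSPs with ranges adapted to the previous output.

    Because LSPs are strictly increasing, the i-th parameter cannot lie
    below the (i-1)-th one, so the range below the previous quantized
    value does not need to be covered by any quantizer level.
    """
    quantized = []
    lower = 0.0  # LSP frequencies (radians) start above zero
    for value, upper, bits in zip(lsp, upper_bounds, bits_per_param):
        levels = 2 ** bits
        step = (upper - lower) / levels
        # Uniform quantizer over [lower, upper), index clamped to a valid level.
        index = int((value - lower) / step)
        index = max(0, min(levels - 1, index))
        reconstruction = lower + (index + 0.5) * step
        quantized.append(reconstruction)
        lower = reconstruction  # next range starts at this quantized value
    return quantized


# Hypothetical 4th-order example with 3 bits per parameter.
lsp = [0.3, 0.6, 1.0, 1.5]
upper_bounds = [0.8, 1.2, 1.8, 2.5]
q = quantize_lsp_adaptive(lsp, upper_bounds, [3, 3, 3, 3])
```

Narrowing each range means the same number of levels gives a finer step, which is one way a scheme like this can save bits. The cost is that a transmission error in one parameter shifts the ranges of every later parameter in the frame.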