With the growing demand for wireless and Internet access, interoperability between these two networks is becoming increasingly important for modern communications. This paper discusses a transcoding system that translates the coding parameters directly between the GSM and G.729 speech coders. The GSM coder is used for mobile communications, while G.729 is the most favored speech coder for Internet communications. For Internet/wireless gateway servers, the proposed transcoding system clearly requires much less computation than the conventional decode-then-encode (DTE) approach.
Speaker verification systems basically consist of three stages: feature extraction, feature processing, and comparison of the processed features from the enrolled speaker's voice with those from the voice to be verified. Many features have been used in the first stage, but the literature has not yet established which of them are best. Starting from the biometrics hypothesis, which states that each individual has physical characteristics that distinguish him or her from others, this paper compares 12 classical, widely used parameters in order to investigate that hypothesis. The results indicate that the parameters directly correlated with the speaker's anatomy are among the best for developing speaker verification systems.
The purpose of a voice conversion (VC) system is to change the perceived speaker identity of a speech signal. We propose an algorithm based on converting the LPC spectrum and predicting the residual as a function of the target envelope parameters. We conduct listening tests based on speaker discrimination of same/different pairs to measure how accurately the converted voices match the desired target voices. To establish the level of human performance as a baseline, we first measure the ability of listeners to discriminate between original speech utterances under three conditions: normal, fundamental-frequency- and duration-normalized, and LPC-coded. Additionally, the spectral parameter conversion function is tested in isolation by listening to source, target, and converted speakers as LPC-coded speech. The results show that speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discrimination between LPC-coded utterances. However, the level of discrimination for converted utterances produced by the full VC system is significantly below that of speaker discrimination for natural speech.
ISBN: (print) 0780366336
This paper presents a real-time 64-mono-syllable recognition LSI. The LSI accepts 11.6 ms speech frames and outputs a 6-bit symbol code for each frame by the end of the next frame, in a pipelined manner. The recognition method is based on the Hidden Markov Model (HMM) and is speaker-independent. An on-chip learning mechanism has also been designed, but in the present implementation the circuit is off-chip because of LSI area restrictions. The LSI is fabricated by VDEC Rohm in a 0.6 μm CMOS process on a 4.5 mm × 4.5 mm chip.
The speech wave passes through the glottis and the vocal tract and finally arrives at the lips. The shapes of these organs modify the sounds we hear. A "tube model" can represent the vocal tract, and the various cross-sections progressing from the glottis to the lips characterize differing sounds. Such cross-sections may be linked to the patient's state, which may be NORMAL or ABNORMAL (damaged by cancer, unfortunate surgery, or simply old age). The author is interested in using vowel sounds as a possible diagnostic tool. Cross-sections of the vocal tract model can be calculated from the coefficients of a linear prediction coder (LPC) trained with a suitable vowel. The paper shows that an artificial neural network version of the LPC gives lower error than the original, which should make it more useful for calculating cross-sectional areas. Two stages of fuzzy variables are used to classify a patient into one of two levels, NORMAL or ABNORMAL. Further analysis reinforces the diagnosis by applying the adverbs LIKELY or UNLIKELY, yielding one of four categories.
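One textbook route from LPC analysis to tube cross-sections goes through the reflection coefficients produced by the Levinson-Durbin recursion. The sketch below illustrates that route only; it is not taken from the paper, and the function names, the normalization `lip_area=1.0`, and the sign convention in the area ratio are assumptions (textbooks differ on both the sign and the direction in which sections are numbered).

```python
def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... from autocorrelation r[0..order].

    Returns (a, ks), where ks holds the reflection coefficients.
    """
    a = [1.0] + [0.0] * order
    err = r[0]  # prediction error energy of the order-0 model
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, ks


def tube_areas(ks, lip_area=1.0):
    """Relative areas of a lossless tube model, one extra section per coefficient.

    Assumed convention: next_area = area * (1 - k) / (1 + k).
    """
    areas = [lip_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


# Autocorrelation of a first-order process with correlation 0.5 (toy input).
a, ks = levinson_durbin([1.0, 0.5, 0.25], order=2)
areas = tube_areas(ks)
```

Only the ratios between neighbouring areas are meaningful here; absolute areas would need a physical reference such as a measured lip opening.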
This paper presents a simple and efficient time-domain technique for estimating an all-pole model on the mel-frequency scale (mel-LPC), and compares the recognition performance of the mel-LPC cepstrum with that of both the standard LPC mel-cepstrum and the MFCC (mel-frequency cepstral coefficients) using the Japanese dictation system Julius with a 20,000-word vocabulary. First, the optimal value of the frequency warping factor is examined in terms of monosyllable accuracy. With the optimal warping factors, the mel-LPC cepstrum attains word accuracies of 93.0% for male speakers and 93.1% for female speakers, which are 2.1% and 1.7% higher than those of the LPC mel-cepstrum, respectively. Furthermore, this performance is slightly superior to that of MFCC.
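The role of the frequency warping factor can be illustrated with the first-order all-pass (bilinear) substitution commonly used to approximate the mel scale. This is a sketch of that general mapping, not of the paper's time-domain estimator; the function name `warp_frequency` and the example value `alpha = 0.42` are assumptions chosen for illustration and are not the optimal factors reported above.

```python
import math


def warp_frequency(omega, alpha):
    """Warped angular frequency under z^-1 -> (z^-1 - alpha) / (1 - alpha z^-1).

    omega is in radians per sample (0..pi); alpha is the warping factor.
    """
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )


alpha = 0.42  # illustrative value only (assumption)
for omega in (0.0, 0.25 * math.pi, 0.5 * math.pi, math.pi):
    print(f"{omega:.3f} rad -> {warp_frequency(omega, alpha):.3f} rad")
```

For positive `alpha` the mapping stretches low frequencies and compresses high ones while leaving the endpoints 0 and π fixed, which is the qualitative behaviour a mel-like scale needs.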
This paper proposes a new method, using neural networks, of adapting phone HMMs to noisy speech. The neural networks are designed to map clean speech HMMs to noise-adapted HMMs, using noise HMMs and signal-to-noise ratios (SNRs) as inputs, and are trained to minimize the mean square error between the output HMMs and the target noise-adapted HMMs. In evaluation, the proposed method was used to recognize noisy broadcast-news speech in speaker-dependent and -independent modes. The trained networks were confirmed to be effective in recognizing new speakers under new noise and various SNR conditions.
ISBN: (print) 0780370805
We present the ARMA lattice model for speech synthesis. By adopting a pole-zero approach, this model overcomes the absence of zeros in the LPC model, which is based entirely on an all-pole AR representation of the vocal tract. As a result, more natural synthesized speech can be achieved with the ARMA approach, especially for speech signals that contain nasal, fricative, and plosive sounds. The structure of the ARMA lattice model is numerically stable, which is desirable for robust speech synthesis applications. Simulation results indicate that the quality of the synthesized speech for spoken sentences is encouraging, and that better results can be achieved by optimizing the parameters used.
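The difference between an all-pole (AR) and a pole-zero (ARMA) synthesis filter can be seen in a plain direct-form difference equation. The sketch below is deliberately not the lattice structure described in the abstract; it only shows how numerator coefficients add zeros to the response. The function name `arma_filter` and the toy coefficients are assumptions.

```python
def arma_filter(b, a, x):
    """Direct-form filter: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


impulse = [1.0, 0.0, 0.0, 0.0]

# All-pole (LPC-style) model: numerator is a constant, so there are no zeros.
all_pole = arma_filter([1.0], [1.0, -0.5], impulse)

# Pole-zero model: same pole, plus a zero at z = 1 from the numerator.
pole_zero = arma_filter([1.0, -1.0], [1.0, -0.5], impulse)
```

With the same denominator, the two impulse responses differ only because of the numerator, which is the extra freedom a pole-zero model uses to capture spectral dips such as those in nasal sounds.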
The ACELP method uses a multipulse structure to represent the excitation pulses of the residual signal. To reduce computational complexity, this paper proposes the maximum-take-precedence ACELP (MTP-ACELP) search method, which accepts a tolerable degradation in performance. Because the maximum of the target signal is compensated preferentially, the performance degradation is limited. Predicting the pulse locations reduces the computational complexity: the method not only reduces the number of possible pulse combinations in the search procedure but also avoids computing unneeded correlation functions before the search. Furthermore, the proposed method is compatible with any ACELP-type vocoder, e.g. the G.723.1, G.729, and GSM-EFR standards.
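The idea of letting the largest target samples decide where pulses go can be sketched in a heavily simplified form. The code below is not the MTP-ACELP algorithm from the paper and does not follow the pulse tables of any standard: it assumes a hypothetical layout in which track `t` holds positions `t, t + num_tracks, t + 2*num_tracks, ...`, places one signed pulse per track at the position with the largest absolute target value, and ignores the correlation terms that a full ACELP search would evaluate.

```python
def pick_pulses(target, num_tracks):
    """Pick one (position, sign) per interleaved track by largest |target|.

    Simplified illustration only: a real ACELP search also weighs the
    energy/correlation of the filtered codevector.
    """
    pulses = []
    for t in range(num_tracks):
        positions = range(t, len(target), num_tracks)
        best = max(positions, key=lambda n: abs(target[n]))
        sign = 1 if target[best] >= 0 else -1
        pulses.append((best, sign))
    return pulses


# Toy target signal (made-up values) split into two hypothetical tracks.
target = [0.1, -0.9, 0.3, 0.2, 0.8, 0.05, -0.4, 0.0]
pulses = pick_pulses(target, num_tracks=2)
```

Because each track is scanned once, the cost grows linearly with the subframe length rather than with the number of pulse combinations, which is the kind of saving a location-prediction step aims for.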