The validity of glottal inverse filtering (GIF) for obtaining a glottal flow waveform from a radiated pressure signal, in the presence and absence of source-filter interaction, was studied systematically. A driven vocal fold surface model of vocal fold vibration was used to generate source signals. A one-dimensional wave reflection algorithm was used to solve for acoustic pressures in the vocal tract. Several test signals were generated with and without source-filter interaction at various fundamental frequencies and vowels. Linear predictive coding (LPC), quasi-closed phase (QCP), and quadratic programming (QPR) based algorithms, along with supraglottal impulse response, were used to inverse filter the radiated pressure signals to obtain the glottal flow pulses. The accuracy of each algorithm was tested for its recovery of maximum flow declination rate (MFDR), peak glottal flow, open phase ripple factor, closed phase ripple factor, and mean squared error. The algorithms were also tested for their absolute relative errors of the normalized amplitude quotient, the quasi-open quotient, and the harmonic richness factor. The results indicated that the mean squared error decreased as the source-filter interaction level increased, suggesting that the inverse filtering algorithms perform better in the presence of source-filter interaction. All glottal inverse filtering algorithms predicted the open phase ripple factor better than the closed phase ripple factor of a glottal flow waveform, irrespective of the source-filter interaction level. Major prediction errors occurred in the estimation of the closed phase ripple factor, MFDR, peak glottal flow, normalized amplitude quotient, and quasi-open quotient. Feedback-related nonlinearity (source-filter interaction) affected the recovered signal primarily when f(o) was well below the first formant frequency of a vowel. The prediction error increased when f(o) was close to the first formant frequency due to the difficulty of estimating th
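The abstract does not spell out how its measures are computed, so the following is only a minimal sketch under common textbook definitions, which are assumptions here: MFDR is taken as the magnitude of the most negative peak of the flow derivative, and the normalized amplitude quotient (NAQ) as the peak-to-peak flow amplitude divided by the product of MFDR and the period. The function names and the synthetic pulse are hypothetical and are not taken from the paper.

```python
import math

def flow_measures(flow, fs, f0):
    """Return (mfdr, peak_flow, naq) for one cycle of glottal flow samples."""
    # First difference scaled by the sampling rate approximates d(flow)/dt.
    deriv = [(b - a) * fs for a, b in zip(flow, flow[1:])]
    mfdr = -min(deriv)                # magnitude of the steepest decline
    peak_flow = max(flow)
    f_ac = max(flow) - min(flow)      # peak-to-peak (AC) flow amplitude
    period = 1.0 / f0
    naq = f_ac / (mfdr * period)      # assumed NAQ definition
    return mfdr, peak_flow, naq

def absolute_relative_error(estimate, reference):
    """|estimate - reference| / |reference|, as used to score a recovered measure."""
    return abs(estimate - reference) / abs(reference)

# Hypothetical test pulse: sin^2 open phase followed by a flat closed phase.
fs, f0 = 16000, 100
n_cycle = fs // f0
open_len = 96
flow = [math.sin(math.pi * i / open_len) ** 2 if i < open_len else 0.0
        for i in range(n_cycle)]

mfdr, peak_flow, naq = flow_measures(flow, fs, f0)
print(round(mfdr, 1), round(peak_flow, 3), round(naq, 4))
```

In an evaluation like the one described, the same function would be applied to both the reference flow and the inverse-filtered estimate, and `absolute_relative_error` would compare the two values.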
Traditional pitch-excited linear predictive coding (LPC) vocoders use a fully parametric model to efficiently encode the important information in human speech. These vocoders can produce intelligible speech at low data rates (800-2400 b/s), but they often sound synthetic and generate annoying artifacts such as buzzes, thumps, and tonal noises. These problems increase dramatically if acoustic background noise is present at the speech input. This paper presents a new mixed excitation LPC vocoder model that preserves the low bit rate of a fully parametric model but adds more free parameters to the excitation signal so that the synthesizer can mimic more characteristics of natural human speech. The new model also eliminates the traditional requirement for a binary voicing decision so that the vocoder performs well even in the presence of acoustic background noise. A 2400-b/s LPC vocoder based on this model has been developed and implemented in simulations and in a real-time system. Formal subjective testing of this coder confirms that it produces natural-sounding speech even in a difficult noise environment. In fact, diagnostic acceptability measure (DAM) test scores show that the performance of the 2400-b/s mixed excitation LPC vocoder is close to that of the government standard 4800-b/s CELP coder.
Of great importance to the success of the articulatory approach to speech coding is the use of a good distortion measure between a given speech signal and the entries in a stored codebook of impulse responses and corresponding vocal-tract shapes (articulatory codebook). One promising distortion measure is the weighted cepstral distortion. Since the impulse responses in the articulatory codebook do not include glottal characteristics, we derive optimal weighting functions (cepstral lifters) to reduce the influence of a varying glottal source on the cepstral distortion measure. This is done by examining the ensemble of cepstral coefficients of speech produced by an articulatory speech synthesizer that also includes a vocal-cord model. The obtained cepstral lifters are optimal for the given ensemble of cepstral coefficients and for given constraints on the weighting function. They are different for cepstral coefficients derived from the power spectrum (FFT cepstra) and for those derived from LPC coefficients (LPC cepstra). The performances of the obtained cepstral lifters are compared in an articulatory codebook search.
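To illustrate how a cepstral lifter changes the outcome of a codebook search, here is a minimal sketch that assumes the weighted cepstral distortion is a weighted sum of squared coefficient differences. The lifter values and the toy codebook are made up for illustration; they are not the optimal lifters derived in the paper.

```python
def weighted_cepstral_distortion(c_a, c_b, lifter):
    """Sum over coefficients of lifter[n] * (c_a[n] - c_b[n])**2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(lifter, c_a, c_b))

def best_codebook_entry(target, codebook, lifter):
    """Index of the codebook entry with the smallest weighted distortion."""
    distortions = [weighted_cepstral_distortion(target, entry, lifter)
                   for entry in codebook]
    return min(range(len(codebook)), key=distortions.__getitem__)

# Hypothetical cepstral vectors (three coefficients each).
target = [1.0, 0.5, 0.2]
codebook = [
    [0.0, 0.5, 0.2],   # differs from the target only in the first coefficient
    [1.0, 0.5, 0.9],   # differs from the target only in the third coefficient
]

flat_lifter = [1.0, 1.0, 1.0]
# Made-up lifter that de-emphasizes the first coefficient, standing in for a
# weighting meant to suppress source-related variation.
shaped_lifter = [0.1, 1.0, 4.0]

print(best_codebook_entry(target, codebook, flat_lifter))
print(best_codebook_entry(target, codebook, shaped_lifter))
```

With the flat lifter the second entry wins, while the shaped lifter selects the first, which shows why the choice of weighting matters for the search.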
This paper presents novel techniques for source-controlled variable-rate wideband speech coding. These techniques have been used in the variable-rate multimode wideband (VMR-WB) speech codec recently selected by the Third-Generation Partnership Project 2 (3GPP2) for wideband (WB) speech telephony, streaming, and multimedia messaging services in the cdma2000 third-generation wireless system. The codec utilizes efficient coding modes optimized for different classes of speech signal, including generic coding based on AMR-WB for transients and onsets, voiced coding optimized for stable voiced signals, unvoiced coding optimized for unvoiced segments, and comfort noise generation for inactive segments. Several innovations enable very good performance at average bit rates below 8 kb/s for active speech coding. The article presents an overview of the codec and describes in detail some of the codec's novel features: a robust pitch tracking algorithm, coding-mode-dependent prediction of linear prediction (LP) filter quantization, and novel frame erasure concealment techniques, including supplementary information for reconstructing lost onsets and improving decoder convergence. Selected results from the Selection and Characterization tests of the codec illustrate its performance.
This paper considers the design and implementation of a linear predictive (LPC) speech synthesizer using a general purpose floating-point 32-bit digital signal processing (DSP) chip for Mandarin Chinese. The hardware synthesizer board consists of a plug-in card which can be installed in a personal computer (PC). The software consists of a hand-optimized DSP assembly programme which can be downloaded from the host PC to the synthesizer RAM for 'real-time' speech synthesis. Computer programmes in Pascal have also been developed to extract the ten reflection coefficients, pitch and gain values for the LPC synthesizer. A database of LPC parameters from the 56 Mandarin phonemes and the four standard pitch contours corresponding to the four lexical tones of monosyllabic Mandarin has also been created for the synthesizer.
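The paper's synthesizer runs as DSP assembly, which is not reproduced here. As a rough illustration of how reflection coefficients, pitch, and gain can drive an LPC synthesizer, the sketch below implements a standard all-pole lattice filter in Python. The sign convention, the function name `lattice_synthesize`, and the example parameter values are assumptions.

```python
def lattice_synthesize(excitation, k, gain=1.0):
    """All-pole lattice synthesis filter driven by reflection coefficients k.

    Per sample, for m = M..1:
        f_{m-1}[n] = f_m[n] - k_m * b_{m-1}[n-1]
        b_m[n]     = k_m * f_{m-1}[n] + b_{m-1}[n-1]
    The output is f_0[n]. Sign conventions vary; this one is assumed.
    """
    order = len(k)
    b_prev = [0.0] * (order + 1)       # b_prev[m] holds b_m[n-1]
    out = []
    for e in excitation:
        f = gain * e
        b = [0.0] * (order + 1)
        for m in range(order, 0, -1):
            f = f - k[m - 1] * b_prev[m - 1]
            b[m] = k[m - 1] * f + b_prev[m - 1]
        b[0] = f
        out.append(f)
        b_prev = b
    return out

def pulse_train(n_samples, pitch_period):
    """Voiced excitation: a unit impulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

# Sanity check with one coefficient: impulse response of 1 / (1 + 0.5 z^-1).
impulse_response = lattice_synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
print(impulse_response)

# Hypothetical frame: three made-up reflection coefficients, 80-sample pitch period.
frame = lattice_synthesize(pulse_train(160, 80), [0.5, -0.3, 0.2], gain=0.8)
```

A synthesizer like the one described would use ten coefficients per frame and update them, together with pitch and gain, from the stored parameter database.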
The detection of glottal closure instants has been a necessary step in several applications of speech processing, such as voice source analysis, speech prosody manipulation and speech synthesis. This paper presents a new algorithm for glottal closure detection that compares favorably with other methods available today in terms of robustness and computational efficiency. In this paper, we propose to use the singular value decomposition (SVD) approach to detect the instants of glottal closure from the speech signal. The proposed SVD method amounts to calculating the Frobenius norms of signal matrices and therefore is computationally efficient. Moreover, it produces well-defined and reliable peaks that indicate the instants of glottal closure. Finally, with the introduction of the total linear least squares technique, two proposed methods are reinvestigated and unified into the SVD framework.
Several vector quantization approaches to the problem of text-dependent speaker verification are described. In each of these approaches, a source codebook is designed to represent a particular speaker saying a particular utterance. Later, this same utterance is spoken by a speaker to be verified and is encoded in the source codebook representing the speaker whose identity was claimed. The speaker is accepted if the verification utterance's quantization distortion is less than a prespecified speaker-specific threshold. The best approach achieved a 0.7 percent false acceptance rate and a 0.6 percent false rejection rate on a speaker population comprising 16 admissible speakers and 111 casual imposters. The approaches are described, and detailed experimental results are presented and discussed.
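The accept/reject rule described above can be sketched as follows. This assumes squared Euclidean distance and a per-frame average distortion, neither of which the abstract specifies; the codebook, feature vectors, and threshold are toy values chosen for illustration.

```python
def nearest_distortion(frame, codebook):
    """Squared Euclidean distance from a frame to its nearest codeword."""
    return min(sum((x - y) ** 2 for x, y in zip(frame, code))
               for code in codebook)

def average_distortion(frames, codebook):
    """Mean quantization distortion of an utterance encoded with a codebook."""
    return sum(nearest_distortion(f, codebook) for f in frames) / len(frames)

def verify(frames, claimed_codebook, threshold):
    """Accept the identity claim if the distortion is below the threshold."""
    return average_distortion(frames, claimed_codebook) < threshold

# Hypothetical two-codeword codebook for the claimed speaker and utterance.
claimed_codebook = [[0.0, 0.0], [1.0, 1.0]]
speaker_threshold = 0.1            # assumed speaker-specific threshold

genuine_frames = [[0.1, 0.0], [0.9, 1.0]]
imposter_frames = [[0.5, 0.5], [2.0, 2.0]]

print(verify(genuine_frames, claimed_codebook, speaker_threshold))
print(verify(imposter_frames, claimed_codebook, speaker_threshold))
```

Raising the threshold trades false rejections for false acceptances, which is why the abstract reports both rates.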
A large class of nonstationary signals, containing speech signals, but not restricted to them, can be represented by time-varying models, the coefficients of which are finite linear combinations of known time functions. Such models have been found useful for speech recognition and speech synthesis, but they suffer in this last application from a lack of stability. A time-varying autoregressive (AR) model, into which the time-dependency is coded through log-area ratios to ensure stability, is described. Two algorithms for the estimation of these time-varying log-area ratios are proposed; the first one is an approximation using a lattice filter, while the second one minimizes a least-squares criterion. Their performance is evaluated by a set of simulations. An example of a speech signal modeled with these time-varying log-area ratios shows the usefulness of this approach for speech synthesis and recognition.
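Why log-area ratios help with stability can be seen in a short sketch. Assuming the common definition g = log((1 + k) / (1 - k)), the inverse mapping is k = tanh(g / 2), which keeps every reflection coefficient inside (-1, 1) however the log-area ratio evolves over time. The polynomial basis and the weights below are hypothetical, not values from the paper.

```python
import math

def lar_to_reflection(g):
    """Invert g = log((1 + k) / (1 - k)); the result satisfies |k| < 1."""
    return math.tanh(g / 2.0)

def reflection_to_lar(k):
    """Log-area ratio of a reflection coefficient with |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def time_varying_reflection(weights, basis, t):
    """Reflection coefficient at time t when the log-area ratio is a
    finite linear combination of known time functions."""
    g = sum(w * f(t) for w, f in zip(weights, basis))
    return lar_to_reflection(g)

# Hypothetical basis of known time functions: 1, t, t^2 on t in [0, 1].
basis = [lambda t: 1.0, lambda t: t, lambda t: t * t]
weights = [0.5, -3.0, 8.0]         # made-up expansion coefficients

track = [time_varying_reflection(weights, basis, n / 10.0) for n in range(11)]
print([round(k, 3) for k in track])
```

Expanding the predictor coefficients directly in the same basis offers no such guarantee, which is the stability problem the abstract refers to. In floating point, very large log-area ratios saturate at exactly 1.0, so a practical implementation would clip them.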
The Code Excited Linear Predictive (CELP) technique has the potential for producing high quality synthetic speech at bit rates as low as 4.8 kb/s. Most of the complexity in CELP coders comes from the search used to select an optimal excitation sequence from a codebook of stochastic vectors. This paper describes three fast search methods. The key idea is to inverse filter the actual speech by the formant and pitch filters to produce a residual error sequence (RES). The residual error is used to identify a neighborhood, or subset, of codes for further processing. The first method, called Dynamic Nearest Neighborhood (DNN), attempts to dynamically construct a neighborhood of the 6 codes of maximum correlation with the residual error. The second method, called Nearest Fixed Neighborhood (NFN), clusters the codebook into a fixed number of cells; the code search is then performed on the codes of the cell nearest to the RES. These two methods reduce the search effort by a factor of 8 to 20. The third method combines the advantages of the first two to reduce the number of operations by a factor of 40 to 50. The performance of these techniques and some of their ramifications will also be addressed.
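The idea of preselecting codes from the residual can be sketched as below. This is a loose illustration in the spirit of the correlation-based neighborhood, not the paper's DNN or NFN procedure: the scoring rule (squared correlation over energy), the one-pole synthesis filter, and the toy codebook are all assumptions.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def all_pole_filter(x, a):
    """y[n] = x[n] - sum_k a[k] * y[n-1-k] (simple synthesis filter)."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

def match_score(target, candidate):
    """Squared correlation over energy: larger means a better gain-scaled fit."""
    energy = dot(candidate, candidate)
    return dot(target, candidate) ** 2 / energy if energy > 0 else 0.0

def fast_search(speech, residual, codebook, a, n_keep):
    """Shortlist codes by correlation with the residual, then run the costly
    filtered comparison only on the shortlist."""
    ranked = sorted(range(len(codebook)),
                    key=lambda i: match_score(residual, codebook[i]),
                    reverse=True)
    shortlist = ranked[:n_keep]
    best = max(shortlist,
               key=lambda i: match_score(speech, all_pole_filter(codebook[i], a)))
    return best, shortlist

# Hypothetical 4-entry codebook and a one-pole synthesis filter.
codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 1, -1], [1, 1, 1, 1]]
a = [-0.5]
residual = [2.0, -2.0, 2.0, -2.0]          # a scaled copy of codebook[2]
speech = all_pole_filter(residual, a)

best, shortlist = fast_search(speech, residual, codebook, a, n_keep=2)
print(best, shortlist)
```

Only the shortlisted codes are passed through the synthesis filter, which is where the saving over an exhaustive search comes from.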