In this paper, we propose a Bayesian minimum mean squared error approach for the joint estimation of the short-term predictor parameters of speech and noise from the noisy observation. We use trained codebooks of speech and noise linear predictive coefficients to model the a priori information required by the Bayesian scheme. In contrast to current Bayesian estimation approaches that consider the excitation variances as part of the a priori information, in the proposed method they are computed online for each short-time segment, based on the observation at hand. Consequently, the method performs well in nonstationary noise conditions. The resulting estimates of the speech and noise spectra can be used in a Wiener filter or any state-of-the-art speech enhancement system. We develop both memoryless (using information from the current frame alone) and memory-based (using information from the current and previous frames) estimators. Estimation of functions of the short-term predictor parameters is also addressed, in particular one that leads to the minimum mean squared error estimate of the clean speech signal. Experiments indicate that the scheme proposed in this paper performs significantly better than competing methods.
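To illustrate how estimated speech and noise spectra can feed a Wiener filter, here is a minimal sketch. It assumes each spectrum is the all-pole (autoregressive) spectrum sigma^2 / |A(e^{jw})|^2 implied by a set of linear predictive coefficients and an excitation variance. The function names, the coefficient values, and the sign convention A(z) = 1 + sum_i a_i z^{-i} are illustrative assumptions, not taken from the paper, and the codebook search and Bayesian averaging are not shown.

```python
import cmath
import math


def ar_power_spectrum(lpc, excitation_variance, num_bins):
    """All-pole power spectrum sigma^2 / |A(e^{jw})|^2 on num_bins points in [0, pi).

    Assumed convention: A(z) = 1 + sum_i lpc[i] * z^-(i+1).
    """
    spectrum = []
    for k in range(num_bins):
        omega = math.pi * k / num_bins
        a_of_omega = 1.0 + sum(
            a * cmath.exp(-1j * omega * (i + 1)) for i, a in enumerate(lpc)
        )
        spectrum.append(excitation_variance / abs(a_of_omega) ** 2)
    return spectrum


def wiener_gain(speech_spectrum, noise_spectrum):
    """Per-bin Wiener gain P_s / (P_s + P_n)."""
    return [ps / (ps + pn) for ps, pn in zip(speech_spectrum, noise_spectrum)]


# Hypothetical example: low-pass "speech" model in white "noise".
speech = ar_power_spectrum([-0.9], 1.0, 8)
noise = ar_power_spectrum([], 1.0, 8)
gain = wiener_gain(speech, noise)
```

The gain is close to one where the speech model dominates and falls toward zero where the noise model dominates, which is why accurate per-frame excitation variances matter in nonstationary noise.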
Automatic speaker identification has become a challenging research problem due to its wide variety of applications. Neural networks and audio-visual identification systems can be very powerful, but they have limitations related to the number of speakers. The performance drops gradually as more and more users are registered with the system. This paper proposes a scalable algorithm for real-time text-independent speaker identification based on vowel recognition. Vowel formants are unique across different speakers and reflect the vocal tract information of a particular speaker. The contribution of this paper is the design of a scalable system based on vowel formant filters and a scoring scheme for classification of an unseen instance. Mel-Frequency Cepstral Coefficients (MFCC) and linear predictive coding (LPC) have both been analysed for comparison to extract vowel formants by windowing the given signal. All formants are filtered by known formant frequencies to separate the vowel formants for further processing. The formant frequencies of each speaker are collected during the training phase. A test signal is processed in the same way to find its vowel formants, which are compared with the saved vowel formants to identify the speaker of the current signal. A score-based scheme assigns the current signal to the speaker with the most matching formants. This model requires less than 100 bytes of data to be saved for each speaker to be identified, and can identify the speaker within a second. Tests conducted on multiple databases show that this score-based scheme outperforms the back propagation neural network and Gaussian mixture models. In general, the longer the speech files, the more significant the improvements in accuracy.
This paper introduces noncausal all-pole models that are capable of efficiently capturing both the magnitude and phase information of voiced speech. It is shown that noncausal all-pole filter models are better able to match both magnitude and phase information and are particularly appropriate for voiced speech due to the nature of the glottal excitation. By modeling speech in the frequency domain, the standard difficulties that occur when using noncausal all-pole filters are avoided. Several algorithms for determining the model parameters based on frequency-domain information and the masking effects of the ear are described. Our work suggests that high-quality voiced speech can be produced using a 14th-order noncausal all-pole model.
This correspondence presents an experimental system that uses an energy-tracking operator and a related energy separation algorithm to automatically find speech formants and amplitude/frequency modulations in voiced speech segments. Initial estimates of formant center frequencies are provided by either LPC or morphological spectral peak picking. These estimates are then shown to be improved by a combination of bandpass filtering and iterative application of energy separation.
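As a rough illustration of an energy-tracking operator and energy separation, the sketch below applies the discrete Teager-Kaiser operator, psi[x(n)] = x(n)^2 - x(n-1) x(n+1), and then separates amplitude and frequency. The abstract does not say which separation variant is used; the DESA-2 formulas are assumed here, and the function names are hypothetical. The bandpass filtering and iteration steps described above are not shown.

```python
import math


def teager_energy(x):
    """Teager-Kaiser energy for interior samples: x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def desa2(x):
    """Estimate instantaneous frequency (rad/sample) and amplitude envelope.

    Assumed variant (DESA-2):
      freq = 0.5 * acos(1 - psi[y] / (2 * psi[x])),  y(n) = x(n+1) - x(n-1)
      amp  = 2 * psi[x] / sqrt(psi[y])
    Samples where either energy is not positive are skipped.
    """
    psi_x = teager_energy(x)  # psi_x[i] belongs to x index i + 1
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_y = teager_energy(y)  # psi_y[j] belongs to x index j + 2
    freqs, amps = [], []
    for j, energy_y in enumerate(psi_y):
        energy_x = psi_x[j + 1]
        if energy_x <= 0.0 or energy_y <= 0.0:
            continue
        arg = max(-1.0, min(1.0, 1.0 - energy_y / (2.0 * energy_x)))
        freqs.append(0.5 * math.acos(arg))
        amps.append(2.0 * energy_x / math.sqrt(energy_y))
    return freqs, amps


# Hypothetical test tone: amplitude 0.5, frequency 0.3 rad/sample.
tone = [0.5 * math.cos(0.3 * n + 0.1) for n in range(50)]
freqs, amps = desa2(tone)
```

For a pure sinusoid the estimates are exact up to rounding. On real speech the operator is meaningful only for a single resonance at a time, which is why each formant band is isolated by bandpass filtering first.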
A new modeling technique for voiced speech is introduced. Salient features are detailed modeling of speech waveforms and the use of improved parameter estimation techniques. The ideas of pitch-synchronous analysis are extended to make two subintervals synchronous with regions of approximately closed and approximately open glottis. Two LPC models are used in each pitch period, and the model parameters are changed at estimated times of transition from open-to-closed and closed-to-open glottis. The excitation is provided by changing initial conditions at these transition instants. Experiments with real, connected speech indicate that the speech waveforms can be accurately represented using the analysis-synthesis approach presented here.
Staggered synthetic aperture radar (SAR) is an innovative SAR acquisition concept which exploits digital beam-forming (DBF) in elevation to form multiple receive beams and continuous variation of the pulse repetition interval (PRI) to achieve high-resolution imaging of a wide continuous swath. Staggered SAR requires a higher azimuth oversampling than a SAR with constant PRI, which results in an increased volume of data. In this article, we investigate the use of linear predictive coding, which exploits the correlation properties exhibited by the nonuniform azimuth raw data stream. With this approach, the prediction of each sample is calculated onboard as a linear combination of a set of previous samples. The resulting prediction error is then quantized and downlinked (instead of the original value), which allows for a reduction of the signal entropy and, in turn, of the onboard data rate achievable for a given target performance. In addition, the a priori knowledge of the gap positions can be exploited to dynamically adapt the bit rate allocation and the prediction order to further improve the performance. Simulations of the proposed dynamic predictive block-adaptive quantization (DP-BAQ) are carried out considering a Tandem-L-like staggered SAR system for different orders of prediction and target scenarios, demonstrating that a significant data reduction can be achieved with a modest increase of the system complexity.
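The idea of predicting each sample from previous ones and transmitting only the quantized prediction error can be sketched generically as follows. This is not the DP-BAQ scheme itself: it assumes real-valued, uniformly spaced samples, fixed prediction weights, and a plain uniform quantizer, and all names and values are hypothetical. The encoder predicts from reconstructed samples so that the decoder can form identical predictions.

```python
import math


def predict(history, weights):
    """Linear combination of past samples; weights[0] applies to the most recent."""
    return sum(w * history[-1 - i] for i, w in enumerate(weights) if i < len(history))


def encode(samples, weights, step):
    """Return the quantized prediction-error indices and the encoder's reconstruction."""
    indices, reconstructed = [], []
    for sample in samples:
        prediction = predict(reconstructed, weights)
        index = round((sample - prediction) / step)  # uniform quantizer
        indices.append(index)
        reconstructed.append(prediction + index * step)
    return indices, reconstructed


def decode(indices, weights, step):
    """Rebuild the samples from the error indices alone."""
    reconstructed = []
    for index in indices:
        reconstructed.append(predict(reconstructed, weights) + index * step)
    return reconstructed


# Hypothetical correlated signal, first-order predictor, quantizer step 0.1.
signal = [math.sin(0.2 * n) for n in range(40)]
indices, encoder_view = encode(signal, [0.9], 0.1)
decoded = decode(indices, [0.9], 0.1)
```

Because the signal is correlated, the error indices span a much smaller range than the samples would if quantized directly with the same step, which is the source of the data-rate reduction. The reconstruction error stays within half a quantizer step.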
This paper presents a new speech coding model targeted at bit rates above 4 kbit/s, referred to as multiband code-excited linear prediction (MBCELP). The analysis and synthesis of speech are accomplished in the time domain by comparing the original to the synthetic speech while a perceptual criterion is used. A conventional short-term linear predictive filter is employed as the synthesis filter; the excitation signal is modelled as a linear combination of a long-term predictive excitation, periodic multiband excitations and a noise-like excitation; no voiced/unvoiced decision is required. The periodic multiband excitation is produced by convolving a periodic impulse sequence with a sinc function corresponding to a frequency band; the noise-like excitation is represented by a codebook. We estimate a pitch which is appropriate not only to the long-term predictive filter but also to the periodic multiband excitations and to the 'pitch' prefilter in the decoder. Several CELP vocoders are developed as a reference to test the properties of the MBCELP vocoder. Listening tests clearly indicate that this vocoder reconstructs very high quality speech without 'buzziness' or 'hoarseness' for both clean and noisy speech. A 4.8 kbit/s MBCELP vocoder is shown as an example. Its perceptual quality is virtually identical to that of the original 8 kbit/s CELP vocoder and the improved 7.2 kbit/s CELP vocoder. Since fewer subframes are used for the MBCELP vocoders, their complexity is not greater than that of conventional CELP vocoders with the same type of codebook. Many techniques used to simplify CELP coding can also be adopted for MBCELP coding.
We have devised a high-quality frequency-domain audio coder, based on a state-of-the-art monaural wide-band coder, for use in low-delay and low-bit-rate conditions. The coder efficiently represents frequency spectral envelopes of the target signals with low computational complexity using optimally prepared non-negative sparse matrices. The experimental results reveal that this representation has positive effects on the objective and subjective quality of the coder, resulting in quality comparable to that of 3GPP Extended Adaptive Multi-Rate WideBand (AMR-WB+) at the same bit rate, although AMR-WB+ permits a delay more than four times longer than that of the proposed coder. Consequently, this coder is suitable for applications in mobile communications, which require low delay and low complexity.
One-multiplier realizations for certain recently reported FIR lossless lattice structures are investigated. The multiplier extraction approach is used to show that there does not exist a real one-multiplier realization whereas it is possible to get complex one-multiplier realizations. This is unlike the situation in conventional linear-prediction FIR lattice structures, where real one-multiplier realizations are possible.
In this correspondence, we review the backward-filtering algorithm and give a compact proof of its validity using matrix notation. We also review the relation between backward filtering and off-line perceptual weighting in sparse-codebook CELP, and show how a combined on-line/off-line parallel weighting algorithm can be used to reduce the search complexity of an overlapped sparse codebook by 30% to 50%.
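The identity behind backward filtering can be sketched in a few lines. Assuming the usual CELP setting in which H is the lower-triangular convolution matrix of a truncated impulse response h, the correlation x^T (H c) between a target x and a filtered codevector c equals (H^T x)^T c, and H^T x can be computed once per frame by time-reversing x, filtering, and reversing again. The names and numbers below are hypothetical, and the perceptual weighting and sparse-codebook details are not shown.

```python
def filter_truncated(h, c):
    """Zero-state filtering truncated to len(c): y[n] = sum_k h[k] * c[n - k]."""
    return [
        sum(h[k] * c[n - k] for k in range(min(n, len(h) - 1) + 1))
        for n in range(len(c))
    ]


def backward_filter(h, x):
    """Compute d = H^T x by reversing x, filtering with h, and reversing the result."""
    return filter_truncated(h, x[::-1])[::-1]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical impulse response, target vector, and sparse codevector.
h = [1.0, 0.5, 0.25]
x = [1.0, -2.0, 0.5, 3.0, -1.0]
c = [0.0, 1.0, 0.0, 0.0, -1.0]

direct = dot(x, filter_truncated(h, c))  # filter the codevector, then correlate
d = backward_filter(h, x)                # one-time cost per target
via_backward = dot(d, c)                 # only a dot product per codevector
```

With a sparse codevector the final dot product touches only the nonzero entries, so the per-codevector cost drops from a full filtering operation to a handful of additions.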