In recent years there has been growing interest in nonlinear speech models. Several published works report better performance from nonlinear techniques, but little attention has been paid to implementing nonlinear models in real applications. This work studies the behaviour of a nonlinear predictive model based on neural nets in a speech waveform coder. Our novel scheme improves SEGSNR by between 1 and 2 dB for adaptive quantization ranging from 2 to 5 bits.
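The SEGSNR figure quoted above is a segmental signal-to-noise ratio: the SNR is computed frame by frame and then averaged in dB. The following is a minimal sketch of that measure, not the authors' evaluation code. The function name, the frame length of 160 samples, and the small `eps` guard are assumptions chosen for illustration.

```python
import numpy as np

def segmental_snr(reference, decoded, frame_len=160, eps=1e-12):
    """Mean per-frame SNR in dB between a reference and a decoded signal.

    frame_len and eps are illustrative assumptions, not values from the paper.
    """
    reference = np.asarray(reference, dtype=float)
    decoded = np.asarray(decoded, dtype=float)
    n_frames = min(len(reference), len(decoded)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    snrs = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(reference[seg] ** 2)
        noise_energy = np.sum((reference[seg] - decoded[seg]) ** 2)
        # eps keeps the ratio finite for silent or perfectly coded frames.
        snrs.append(10.0 * np.log10((signal_energy + eps) / (noise_energy + eps)))
    return float(np.mean(snrs))
```

Practical implementations often clamp each frame's SNR to a fixed range before averaging so that silent frames do not dominate; that step is omitted here.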
This paper describes a multi-rate codec family developed as a potential candidate for the GSM adaptive multi-rate (AMR) codec standard. The codec family consists of the GSM enhanced full rate (EFR) codec and lower bit-rate extensions thereof. It comprises several codecs, i.e., modes that have different bit-rate partitionings between source coding and error protection. All the source codecs use the same ACELP method (algebraic code excited linear predictive coding) that is also used in the GSM EFR codec. The codec operates at gross bit-rates of 22.8 kbit/s in the GSM full rate (FR) channel and 11.4 kbit/s in the GSM half rate (HR) channel. In the full rate channel, the codec provides improved error robustness over the GSM EFR codec. It extends wireline quality (equal to or better than G.726-32 ADPCM) to poor channel error conditions with low C/I-ratios of 7 dB or even below. When operated in the half rate channel, the codec provides improved channel capacity while still providing wireline quality at high C/I-ratios above 16-19 dB.
This paper provides a formal framework for using the third-order statistics (TOS) of speech signals and presents a new method for estimating the pitch and making voicing decisions using the 3rd-order cumulant of the LPC residual. Analytical expressions for the horizontal slice of the 3rd-order cumulant as well as the kurtosis of voiced speech are derived using the McAulay sinusoidal model (McAulay et al., 1986). The derivations demonstrate that the skewness of voiced speech is sufficiently distinct from that of Gaussian noise and can be used to aid in detecting voicing. It is also shown that the 3rd-order cumulant slice has distinct characteristics in terms of periodicity, phase and harmonic content and is a reliable candidate for estimating the pitch. Actual speech data is used to verify the derivations, and experimental results using Gaussian and street noise demonstrate the performance in noisy conditions.
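As a rough illustration of the quantities named in this abstract, the sketch below estimates a one-dimensional slice of the third-order cumulant and the skewness of a signal from samples. It is not the authors' method. It assumes that the "horizontal slice" means fixing the second lag at zero, C3(tau, 0) = E[x(n)^2 x(n + tau)] for a zero-mean signal, and both function names are hypothetical. In the paper the input would be the LPC residual; here any real-valued sequence is accepted.

```python
import numpy as np

def third_order_cumulant_slice(x, max_lag):
    """Estimate C3(tau, 0) = E[x(n)^2 * x(n + tau)] for tau = 0..max_lag.

    Assumes the slice with the second lag fixed at zero; the mean is removed
    first so that the third-order moment equals the third-order cumulant.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the signal length")
    return np.array(
        [np.mean(x[: n - tau] ** 2 * x[tau:]) for tau in range(max_lag + 1)]
    )

def skewness(x):
    """Normalized third central moment; close to zero for Gaussian noise."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.mean(x ** 3) / np.mean(x ** 2) ** 1.5)
```

A pitch estimator along these lines would look for the lag at which the slice peaks periodically, and a voicing detector would compare the skewness against a threshold; both decisions are left out because the abstract does not specify them.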
In CELP, the use of codebooks whose entries have only a few non-zero samples provides high speech quality and facilitates fast computation. With decreasing bit-rate, the intervals between the pulses increase, and the quality of the reconstructed signal begins to suffer from a particular type of artifact, which is strongest for noise-like segments. In this paper we describe experiments which show that the perceived artifacts are mainly concentrated at frequencies above 3 kHz, which is consistent with our understanding of auditory theory. Our analysis leads to simple strategies to eliminate the artifacts, even at lower bit rates. We describe both a non-adaptive and an adaptive post-processing method to remove the artifacts. The methods are demonstrated to be efficient when used in the ACELP algorithm. A closed-loop method for ACELP is also described.
We believe that the glide part of a vowel sequence carries the phonetic quality of the preceding and following vowels. Therefore, actively using glide parts may be effective for developing a speech recognition system. The trajectory of a glide in the feature space of a speech recognition system has the form of a curve. In this paper, we propose transforming the trajectories of glides into straight lines in a low-dimensional space, and we show the effectiveness of our new recognition method using a nonlinear transformation of a vowel sequence.
This paper examines a new method for coding high quality digital audio signals based on a combination of linear predictive coding (LPC) and the discrete wavelet transform (DWT). In this method, a linear predictor is first used to model each audio frame. Then, the prediction error is analyzed using the DWT. The LPC coefficients and DWT coefficients are quantized using a novel bit allocation scheme which minimizes the overall quantization error with respect to the masking threshold. The proposed coder is capable of delivering near-transparent audio signal quality at encoding bit rates of around 90-96 kb/s. Objective and subjective results suggest that the proposed coder operating at 90-96 kb/s has a performance comparable to that of the MPEG layer II codec operating at 128 kb/s.
Commonly used robust speaker verification systems are based on time-varying autoregressive spectral estimation (AR) combined with hidden Markov modeling (HMM) or dynamic time warping (DTW). Exhaustive optimization of these methods in the past has culminated in quite reliable verification schemes. It seems unlikely, though, that further significant improvements can readily be obtained along the same path. Unlike time-varying AR modeling, which focuses on the global spectral structure of an utterance, we introduce a new method that focuses on the local time-varying spectral structure of individual pitch periods. Additionally, a pattern classification method using singular value decomposition (SVD) is employed. The new method by itself does not deliver better results than commonly used global methods; however, it is shown that an acceptance/rejection decision derived from both global and local analysis greatly improves the performance of the verification system.
This paper proposes a method of noise reduction by paired microphones as a front-end processor for speech recognition systems. This method estimates the noise using a subtractive microphone array and subtracts it from the noisy speech signal using spectral subtraction (SS). Since this method estimates the noise analytically and frame by frame, the estimate does not depend on the acoustic properties of the noise. Therefore, this method can also reduce non-stationary noise, for example the sudden noise of a door closing, which cannot be reduced by other SS methods. The results of computer simulations and experiments in a real environment show that this method can reduce LPC log spectral envelope distortions.
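The spectral subtraction step mentioned above can be sketched for a single frame as follows. This is a generic magnitude-domain version under stated assumptions, not the paper's implementation: the per-frame noise magnitude spectrum is simply passed in (in the paper it would come from the subtractive microphone array), and the function name and the spectral `floor` parameter are hypothetical.

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_magnitude, floor=0.01):
    """Subtract a noise magnitude spectrum from one frame of noisy speech.

    noise_magnitude must have len(noisy_frame) // 2 + 1 bins (rfft layout).
    floor is an assumed lower bound, as a fraction of the noisy magnitude,
    that keeps the subtracted magnitude from going negative.
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    spectrum = np.fft.rfft(noisy_frame)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Subtract the noise estimate, then clamp to the spectral floor.
    cleaned = np.maximum(magnitude - noise_magnitude, floor * magnitude)
    # Reuse the noisy phase, as is common in basic spectral subtraction.
    return np.fft.irfft(cleaned * np.exp(1j * phase), n=len(noisy_frame))
```

Because the noise estimate is an argument rather than a running average, it can change on every frame, which is the property the abstract relies on for non-stationary noise.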
Author:
B. ZhangJ3
Department of Electronic Engineering City University of Hong Kong Kowloon Hong Kong China
This paper addresses the problem of recognizing a target voice when it is corrupted by a co-channel interfering voice. First, the F0 contour of the target voice is robustly extracted by using the revised highest likely common fundamental algorithm proposed by Screenivas. By using this contour, the harmonic peaks of the target voice are extracted. The harmonic peaks carry the formant information of a vowel and can be used as a front-end feature in a speech recognizer. Moreover, the harmonic peaks of the target voice change little even in the presence of an interfering voice. A recognizer aimed at recognizing the Mandarin finals was developed, based on the harmonic peaks method as well as the conventional LPC cepstral coefficients method. A comparison of the results shows that the harmonic peaks method performs better in the presence of a co-channel interfering voice.