This paper provides a formal framework for using the third-order statistics (TOS) of speech signals and presents a new method for estimating the pitch and making voicing decisions using the 3rd-order cumulant of the LPC residual. Analytical expressions for the horizontal slice of the 3rd-order cumulant as well as the skewness of voiced speech are derived using the McAulay sinusoidal model (McAulay et al., 1986). The derivations demonstrate that the skewness of voiced speech is sufficiently distinct from that of Gaussian noise to aid in detecting voicing. It is also shown that the 3rd-order cumulant slice has distinct characteristics in terms of periodicity, phase and harmonic content, which make it a reliable candidate for estimating the pitch. Actual speech data are used to verify the derivations, and experimental results with Gaussian and street noise demonstrate the performance in noisy conditions.
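As a rough illustration of the quantities named in this abstract, the sketch below estimates a slice of the third-order cumulant and the skewness of a toy periodic signal. It is not the paper's method: it works on the raw signal rather than the LPC residual, it assumes the "horizontal slice" means c3(tau, 0) = E[x(n)^2 x(n + tau)], and the function names and the peak-picking rule are hypothetical.

```python
import math

def third_order_slice(x, max_lag):
    """Estimate c3(tau, 0) = E[x(n)^2 * x(n + tau)] for tau = 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]  # cumulant estimates assume a zero-mean signal
    return [
        sum(z[i] * z[i] * z[i + tau] for i in range(n - tau)) / n
        for tau in range(max_lag + 1)
    ]

def skewness(x):
    """Normalized third moment: c3(0, 0) / variance^(3/2)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return third_order_slice(x, 0)[0] / var ** 1.5

def estimate_period(x, min_lag, max_lag):
    """Pick the lag in [min_lag, max_lag] where the slice is largest (assumed rule)."""
    c3 = third_order_slice(x, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda tau: c3[tau])

# Toy "voiced" signal: a fundamental with a period of 40 samples plus its
# second harmonic, phase-locked so that the third-order statistics are non-zero.
PERIOD = 40
signal = [
    math.cos(2 * math.pi * n / PERIOD) + 0.5 * math.cos(4 * math.pi * n / PERIOD)
    for n in range(800)
]

print("skewness:", skewness(signal))
print("estimated period:", estimate_period(signal, 20, 60))
```

The toy signal is clearly skewed, whereas Gaussian noise has zero skewness in expectation, which is the contrast a voicing decision of this kind would rely on.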
This paper describes a multi-rate codec family developed as a potential candidate for the GSM adaptive multi-rate (AMR) codec standard. The codec family consists of the GSM enhanced full rate (EFR) codec and lower bit-rate extensions thereof. Its members, i.e., modes, differ in how the bit rate is partitioned between source coding and error protection. All the source codecs use the same ACELP (algebraic code excited linear prediction) method that is also used in the GSM EFR codec. The codec operates at gross bit-rates of 22.8 kbit/s in the GSM full rate (FR) channel and 11.4 kbit/s in the GSM half rate (HR) channel. In the full rate channel, the codec provides improved error robustness over the GSM EFR codec. It extends wireline quality (equal to or better than G.726-32 ADPCM) to poor channel error conditions with low C/I-ratios of 7 dB or even below. When operated in the half rate channel, the codec provides improved channel capacity while still providing wireline quality at high C/I-ratios above 16-19 dB.
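The partitioning between source coding and error protection can be made concrete with a small arithmetic sketch. The gross rates are taken from the abstract and 12.2 kbit/s is the source rate of the GSM EFR codec; the 20 ms frame length for the candidate codec and the 7.4 kbit/s lower mode are assumptions used only for illustration, and the helper names are hypothetical.

```python
FRAME_MS = 20  # assumed frame length; GSM speech frames are 20 ms long

def bits_per_frame(kbit_per_s):
    """Convert a rate in kbit/s to whole bits per frame (kbit/s * ms = bits)."""
    return round(kbit_per_s * FRAME_MS)

def partition(gross_kbps, source_kbps):
    """Split the gross bits of one frame into source and error-protection bits."""
    gross = bits_per_frame(gross_kbps)
    source = bits_per_frame(source_kbps)
    if source > gross:
        raise ValueError("source rate exceeds the gross channel rate")
    return {"source_bits": source, "protection_bits": gross - source}

# Full rate channel, EFR source rate: 244 source bits and 212 protection bits.
print(partition(22.8, 12.2))
# Full rate channel with a hypothetical 7.4 kbit/s mode: more bits for protection.
print(partition(22.8, 7.4))
```

Lowering the source rate within a fixed gross rate frees bits for channel coding, which is how a lower mode can trade clean-channel quality for robustness at low C/I.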
In CELP, the use of codebooks whose entries have only a few non-zero samples provides high speech quality and facilitates fast computation. With decreasing bit-rate, the intervals between the pulses increase, and the quality of the reconstructed signal begins to suffer from a particular type of artifact, which is strongest for noise-like segments. In this paper we describe experiments which show that the perceived artifacts are mainly concentrated at frequencies above 3 kHz, which is consistent with our understanding of auditory theory. Our analysis leads to simple strategies to eliminate the artifacts, even at lower bit rates. We describe both a non-adaptive and an adaptive post-processing method to remove the artifacts. The methods are demonstrated to be efficient when used in the ACELP algorithm. A closed-loop method for ACELP is also described.
This paper examines a new method for coding high quality digital audio signals based on a combination of linear predictive coding (LPC) and the discrete wavelet transform (DWT). In this method, a linear predictor is first used to model each audio frame. Then, the prediction error is analyzed using the DWT. The LPC coefficients and DWT coefficients are quantized using a novel bit allocation scheme which minimizes the overall quantization error with respect to the masking threshold. The proposed coder is capable of delivering near-transparent audio signal quality at encoding bit rates of around 90-96 kb/s. Objective and subjective results suggest that the proposed coder operating at 90-96 kb/s has a performance comparable to that of the MPEG layer II codec operating at 128 kb/s.
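The two analysis stages named in this abstract, linear prediction followed by a wavelet transform of the prediction error, can be sketched as follows. This is a minimal toy under stated assumptions: a first-order predictor and a one-level Haar transform stand in for the unspecified predictor order and wavelet, the quantization and masking-based bit allocation are omitted, and all function names are hypothetical.

```python
import math

def first_order_predictor(frame):
    """Least-squares coefficient a for the prediction x[n] ~ a * x[n-1]."""
    r0 = sum(v * v for v in frame)
    r1 = sum(frame[i] * frame[i - 1] for i in range(1, len(frame)))
    return r1 / r0 if r0 else 0.0

def prediction_error(frame, a):
    """Residual e[n] = x[n] - a * x[n-1]; the first sample is passed through."""
    return [frame[0]] + [frame[i] - a * frame[i - 1] for i in range(1, len(frame))]

def haar_dwt(x):
    """One level of the orthonormal Haar DWT (expects an even-length input)."""
    s = math.sqrt(0.5)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert haar_dwt, interleaving the reconstructed sample pairs."""
    s = math.sqrt(0.5)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) * s, (a - d) * s]
    return out

frame = [math.sin(0.1 * n) for n in range(64)]  # toy, strongly correlated frame
a = first_order_predictor(frame)
error = prediction_error(frame, a)
approx, detail = haar_dwt(error)

print("predictor coefficient:", a)
print("signal energy:", sum(v * v for v in frame))
print("residual energy:", sum(v * v for v in error))
```

The residual carries far less energy than the frame itself, so the wavelet coefficients that would be quantized are small; a real coder would then allocate bits to them against a masking threshold.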
Commonly used robust speaker verification systems are based on time-varying autoregressive spectral estimation (AR) combined with hidden Markov modeling (HMM) or dynamic time warping (DTW). An exhaustive optimization of these methods in the past has culminated in quite reliable verification schemes. It seems unlikely, though, that further significant improvements are readily obtained along the same path. Unlike time-varying AR-modeling, which focuses on the global spectral structure of an utterance, we introduce a new method that focuses on the local time-varying spectral structure of individual pitch periods. Additionally, a pattern classification method using singular value decomposition (SVD) is employed. The new method by itself does not deliver better results than commonly used global methods; however, it is shown that an acceptance/rejection decision derived from both global and local analysis greatly improves the performance of the verification system.
Authors:
B. ZhangJ3
Department of Electronic Engineering City University of Hong Kong Kowloon Hong Kong China
This paper addresses the problem of recognizing a target voice when it is corrupted by a co-channel interfering voice. First, the F0 contour of the target voice is robustly extracted by using the revised highest likely common fundamental algorithm proposed by Screenivas. Using this contour, the harmonic peaks of the target voice are extracted. The harmonic peaks carry the formant information of a vowel and can be used as a front-end feature in a speech recognizer. Moreover, the harmonic peaks of the target voice change little even in the presence of an interfering voice. A recognizer for the Mandarin finals was developed, based on the harmonic peaks method as well as on the conventional LPC cepstral coefficients method. A comparison of the two methods shows that the harmonic peaks method performs better in the presence of a co-channel interfering voice.
A systematic method to design components of a 7-16 connector type precision short-open-load-thru (SOLT) calibration kit for the wireless industry using the High Frequency Structure Simulator (HFSS) at microwave frequencies is presented. To achieve a better than 60 dB accuracy in the nominal optimized value of each individual feature, only one of the 5° segments of the coaxial structure was used to model the equivalent 2D longitudinal cross-section. The validation process for this simplification, together with the test results on the components using a reference TRL/LRL calibration kit for the microwave measurements, is demonstrated. A better than 1.025 VSWR for the 7 mm to 7-16 adapters up to 7.5 GHz is reported.
A method for the stabilization of stationary and time-varying autoregressive models is presented. The method is based on the hyperstability constrained LS-problem with nonlinear constraints. The problems are solved iteratively with a Gauss-Newton type algorithm that sequentially linearizes the constraints. The proposed method is applied to simulated data in the stationary case and to real EEG data in the time-varying case.
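For context on what "stable" means for an autoregressive model, the sketch below tests whether all poles of 1/A(z) lie inside the unit circle by using the classical step-down (backward Levinson) recursion. It is only a stability check, not the constrained Gauss-Newton stabilization described in the abstract; the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p and the function name are assumptions.

```python
def is_stable(a):
    """Return True if A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p has all its roots
    strictly inside the unit circle.

    The step-down recursion peels off one reflection coefficient per pass;
    the model is stable exactly when every reflection coefficient has
    magnitude below one.
    """
    coeffs = list(a)
    while coeffs:
        k = coeffs[-1]  # reflection coefficient of the current model order
        if abs(k) >= 1.0:
            return False
        m = len(coeffs)
        denom = 1.0 - k * k
        # Coefficients of the model that is one order lower.
        coeffs = [(coeffs[i] - k * coeffs[m - 2 - i]) / denom for i in range(m - 1)]
    return True

print(is_stable([-1.5, 0.7]))  # complex pole pair with radius sqrt(0.7)
print(is_stable([-2.5, 1.0]))  # poles at 2 and 0.5
```

For a time-varying model the same check could be applied to the coefficient vector at each time instant.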
A content-based video indexing method is presented that aims at temporally indexing a video sequence according to the actual speaker. This is achieved by the integration of audio and visual information. Audio analysis leads to the extraction of a speaker identity label versus time diagram. Visual analysis includes scene cut detection, face shot determination, mouth region extraction and tracking and finally talking face shot determination. Results from both sources are combined to improve speaker dependent video indexing. Such a task enables flexible video retrieval or browsing in cases where queries according to speaker identities are imposed. Speaker recognition errors are reduced to 2%.