Voice biometrics is an economical method of machine-based person authentication, made practical by low-cost, high-powered computers. In this paper, we investigate the problem of spectral resolution in female speech for speaker identification. A speaker recognition system is then presented to compare the relative performance of LP-based features (LPC and LPCC) and filterbank-based features such as Mel-frequency cepstral coefficients (MFCC) for the identification of female speakers. Results are shown for a database collected from 15 female speakers in Bengali.
This paper describes a speaker-independent isolated word recognition algorithm for telephone voice and its recognition performance. The recognition algorithm consists of two processes: dynamic time warping and statistical word discrimination. In the first process, input speech is compared with each word template using the dynamic time warping technique. Multiple word templates are used to deal with speech variations among speakers, where each word template is represented by a sequence of phoneme-like templates. To attain high recognition ability, a new technique for generating word templates is proposed. In the second process, statistical word discrimination is carried out for word candidates that have relatively low reliability in the first process. Discrimination functions are calculated based on statistics of transition tendencies of speech characteristics between adjacent frames, and the final word decision is made. The system was trained using utterances from 1305 speakers and tested with utterances from 259 speakers. An average recognition rate of 96.5% was obtained for a 16-word Japanese vocabulary set.
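The first process in the abstract above, matching input speech against multiple templates per word with dynamic time warping, can be illustrated with a minimal sketch. This is a textbook DTW with unconstrained symmetric steps and Euclidean frame distance, not the paper's algorithm: the function names `dtw_distance` and `recognize`, the step pattern, and the dictionary layout of `templates` are all assumptions made for illustration, and the phoneme-like template representation and the statistical discrimination stage are not modeled.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of equal-dimension feature vectors.

    Assumption: basic step pattern (insert, delete, match) with Euclidean
    frame distance; the paper's actual local constraints are not specified here.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose closest template has the lowest DTW cost.

    `templates` is a hypothetical dict mapping each word label to a list of
    template sequences (several templates per word to cover speaker variation).
    """
    return min(
        templates,
        key=lambda word: min(dtw_distance(utterance, t) for t in templates[word]),
    )
```

In a two-stage system like the one described, a second classifier would only be consulted when the best and second-best DTW costs are close; that reliability check is left out of this sketch.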
An excitation modeling approach based on a multitap adaptive or fixed codebook is proposed for low-delay encoding of speech signals. The algorithm can produce high-quality speech at bit rates of 8 kbit/s, but it still falls short of the toll quality of LD-CELP (low-delay code excited linear prediction) at 16 kbit/s. A simplified vector quantization search procedure was incorporated to reduce the complexity with little subjective degradation. The performance results indicated a need for several types of efficient excitation modeling, depending on the nature of the input signal.
The line spectrum pairs (LSP) provide an efficient representation of the synthesis filter used in linear predictive coding of speech. In this paper, the best distance measure for vector quantization is first sought through objective studies over the same training sequence. Fast VQ algorithms for the LSP parameters are then compared in terms of complexity, using the Euclidean distance measure. The well-known ordering property of LSP parameters is exploited to improve the efficiency of the minimum-distortion encoder for VQ in terms of the norm associated with its distance. As conventional full search is too complex for practical implementation, the originality of this work lies in using the norm to limit the size of the area that contains the nearest neighbor of an input vector to be quantized. This method results in a substantial reduction in search complexity with only a minor degradation in terms of average spectral distortion.
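The idea of using a norm to bound the region that can contain the nearest neighbor can be sketched with a generic norm-ordered search. This is an assumption about the general family of techniques, not the paper's method: it relies only on the reverse triangle inequality, | ||x|| - ||c|| | <= ||x - c||, so any codeword whose norm differs from the input's norm by at least the best distance found so far cannot be closer. The names `build_codebook_index` and `nearest_codeword` are hypothetical, and the LSP ordering property itself is not used here.

```python
import bisect
import math


def build_codebook_index(codebook):
    """Sort codewords by Euclidean norm; returns (sorted norms, original indices)."""
    entries = sorted((math.hypot(*c), i) for i, c in enumerate(codebook))
    return [n for n, _ in entries], [i for _, i in entries]


def nearest_codeword(x, codebook, norms, order):
    """Exact nearest neighbor under Euclidean distance with norm-based pruning."""
    x_norm = math.hypot(*x)
    start = bisect.bisect_left(norms, x_norm)
    best_idx, best_dist = -1, float("inf")
    lo, hi = start - 1, start
    # Expand outward from the codeword closest in norm. A side is closed as
    # soon as its norm gap reaches best_dist, because gaps only grow further out.
    while lo >= 0 or hi < len(norms):
        if hi < len(norms) and norms[hi] - x_norm < best_dist:
            d = math.dist(x, codebook[order[hi]])
            if d < best_dist:
                best_idx, best_dist = order[hi], d
            hi += 1
        else:
            hi = len(norms)
        if lo >= 0 and x_norm - norms[lo] < best_dist:
            d = math.dist(x, codebook[order[lo]])
            if d < best_dist:
                best_idx, best_dist = order[lo], d
            lo -= 1
        else:
            lo = -1
    return best_idx, best_dist
```

Unlike the method in the abstract, which accepts a minor increase in spectral distortion, this sketch returns the same codeword as a full search; the saving comes only from skipping distance computations.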
This paper contributes insight into the sources of variability in vowel formant estimation, a major analytic activity in sociophonetics, by reviewing the outcomes of two simulations that manipulated the settings used for linear predictive coding (LPC)-based vowel formant estimation. Simulation 1 explores the range of frequency differences obtained when minor adjustments are made to LPC settings, and measurement time points around the settings used by trained analysts, in order to determine the range of variability that should be expected in sociophonetic vowel studies. Simulation 2 examines the variability that emerges when LPC settings are varied combinatorially around constant default settings, rather than settings set by trained analysts. The impacts of different LPC settings are discussed as a way of demonstrating the inherent properties of LPC-based formant estimation. This work suggests that differences more fine-grained than about 10 Hz in F1 and 15-20 Hz in F2 are within the range of LPC-based formant estimation variability.
ISBN: (print) 9781424481835
In conventional automatic speech recognition systems, linguistic information in the speech signal is usually acquired from short-time frames of about 10-30 ms. In this paper we propose two novel methods for extracting the long-term information of the speech signal. Both methods are based on "sub-band FDLP", which divides a long-time frame of the signal into several sub-bands. Using the MFCC algorithm, we are able to represent the long-term temporal features of each sub-band. Our results show that the proposed methods could improve the recognition ratio by 1.73%. The proposed methods were evaluated using the FarsDat database, and their robustness was tested under different noise conditions.
The current front-end for distributed speech recognition (DSR) systems provided by the European Telecommunications Standards Institute (ETSI) is mainly based on state-of-the-art MFCC features. The method proposed in this paper aims to improve the performance of the present ETSI DSR-XAFE (XAFE: extended Audio Front-End). For this purpose, two sets of acoustical features, namely formant-like features and MFCC features, are integrated under the multi-stream framework to form a feature vector that is more robust against additive noise. It is shown that, for noisy speech, combining cepstral coefficients with the main spectral peaks (also known as formant-like features) under the multi-stream framework leads to a significant improvement in word recognition accuracy relative to that obtained with MFCCs alone.
ISBN: (print) 9781479975921
Lag windowing has long been used in the autocorrelation method of linear predictive (LP) analysis to prevent possible instability of the synthesis filter built from the resulting coefficients. We have investigated the lag-window shape in terms of the trade-off between stability and coding efficiency. On the basis of these investigations, we have devised an adaptive selection scheme in which the window shape selected depends on the periodicity of the signal. This scheme has proven effective in general for LP analysis, enhancing the coding efficiency in both the time and frequency domains. It has thus been included in the speech and audio coding schemes of the newly established 3GPP EVS codec standard.
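For readers unfamiliar with lag windowing, the basic (non-adaptive) form can be sketched as follows: the autocorrelation sequence is multiplied by a decaying window before the Levinson-Durbin recursion, which smooths the implied spectrum and keeps the reflection coefficients inside the unit interval. The Gaussian window shape and the 60 Hz and 8 kHz defaults below are common textbook choices used here as assumptions; the periodicity-dependent window selection described in the abstract is not reproduced, and all function names are hypothetical.

```python
import math


def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one analysis frame."""
    return [
        sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
        for k in range(order + 1)
    ]


def gaussian_lag_window(order, f0_hz=60.0, fs_hz=8000.0):
    """w[k] = exp(-0.5 * (2*pi*f0*k/fs)^2); f0 and fs are assumed defaults."""
    return [
        math.exp(-0.5 * (2.0 * math.pi * f0_hz * k / fs_hz) ** 2)
        for k in range(order + 1)
    ]


def levinson_durbin(r):
    """Solve the LP normal equations for A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p.

    Returns (a, reflection coefficients, residual energy). Assumes r[0] > 0,
    so an all-zero (silent) frame must be handled by the caller.
    """
    order = len(r) - 1
    a = [1.0] + [0.0] * order
    err = r[0]
    refl = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        refl.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, refl, err


def lp_analysis(frame, order=10):
    """Autocorrelation-method LP analysis with a fixed Gaussian lag window."""
    r = autocorrelation(frame, order)
    w = gaussian_lag_window(order)
    return levinson_durbin([rk * wk for rk, wk in zip(r, w)])
```

The synthesis filter 1/A(z) is stable when every reflection coefficient has magnitude below 1, so checking `all(abs(k) < 1 for k in refl)` is a simple way to see the effect of a wider or narrower window on a given frame.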
This book is the first in-depth unified presentation of the important area of linear prediction in speech processing. It covers linear prediction from detailed theoretical considerations through practical applications, including Fortran program implementations of important algorithms. Linear Prediction Formulations, Speech Synthesis Structures, Spectral Analysis, Formant and Fundamental Frequency Estimation, Computational Considerations, and Vocoders are presented with emphasis on interrelating the two most widely used forms (the autocorrelation method and the covariance method). Because of the depth of presentation from theoretical derivations through computer programs, the material should be applicable to a wide range of backgrounds. The book is written mainly for those interested in acoustical speech processing, although certain portions will be of interest to readers with other backgrounds in speech research and digital signal processing.