A high-performance speech processing integrated circuit (SPIC) based on linear predictive coding (LPC) techniques is presented. Both system and technological aspects of the SPIC design are covered in detail. The SPIC synthesizer chip will normally be used in a three-chip minimum system configuration comprising the synthesizer, a microcomputer, and an external vocabulary ROM. The speech quality can be tailored to the user's requirements by varying the bit rate between the vocabulary ROM and the microcomputer from 1.1 to 8.5 kbit/s. Among the specific features of the SPIC are pitch-synchronous synthesis, speech-parameter interpolation, silence, and a power-down mode. Moreover, the digital filter output is interpolated at a high sampling rate (32 kHz) to avoid the need for off-chip filtering. An 8-bit PCM output (A-law) and a 16-bit linear-coded output are provided. The SPIC can be delivered in two different bonding configurations, one for small systems (the three-chip system) and one for larger system configurations.
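As a rough illustration of what an A-law PCM output involves, the sketch below implements the continuous A-law companding curve with the conventional constant A = 87.6. It is not the SPIC's circuit and not a bit-exact G.711 encoder: the function names and the simple scaling to a signed 8-bit range are assumptions made for this example.

```python
import math

A = 87.6  # conventional A-law constant; an assumption, not taken from the abstract
_DENOM = 1.0 + math.log(A)


def alaw_compress(x):
    """Map a linear sample in [-1, 1] onto the continuous A-law curve."""
    ax = abs(x)
    if ax > 1.0:
        raise ValueError("sample must lie in [-1, 1]")
    if ax < 1.0 / A:
        y = A * ax / _DENOM                      # linear segment near zero
    else:
        y = (1.0 + math.log(A * ax)) / _DENOM    # logarithmic segment
    return math.copysign(y, x)


def alaw_expand(y):
    """Inverse of alaw_compress."""
    ay = abs(y)
    if ay < 1.0 / _DENOM:
        x = ay * _DENOM / A
    else:
        x = math.exp(ay * _DENOM - 1.0) / A
    return math.copysign(x, y)


def to_8bit(y):
    """Hypothetical quantizer: scale a companded value to a signed 8-bit code."""
    return int(round(y * 127))


if __name__ == "__main__":
    for sample in (0.001, 0.01, 0.1, 0.5, 1.0):
        code = to_8bit(alaw_compress(sample))
        restored = alaw_expand(code / 127)
        print(f"{sample:6.3f} -> code {code:4d} -> {restored:.4f}")
```

The logarithmic segment spends more of the 8-bit code range on small amplitudes, which is why a companded 8-bit output can sit alongside a 16-bit linear one.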
A subjective evaluation of seven pitch detectors has been carried out using synthetic speech. The evaluation is intended to complement the objective performance evaluation of the same pitch detection algorithms in the investigation of Rabiner et al. [1]. In the earlier study, each of the seven algorithms was evaluated on the basis of its performance with respect to four different types of errors. The standard of comparison was a semiautomatically determined pitch contour of each utterance in the experimental corpus. In the present study, the quality of LPC (linear predictive coding) analyzed and synthesized speech was evaluated. The pitch contour used in the synthesis was obtained either from one of the seven pitch detectors or from the semiautomatic pitch analysis. Using a computer-controlled sort board, an experiment was run in which each of eight listeners was asked to rank the nine versions of each utterance (the natural version was included to provide a stable anchor point). Results are presented on the overall preference for each pitch detector. In addition, subject preference as a function of the pitch range of the speaker and the transmission environment used in the recording is discussed. The present results are compared to those obtained in the earlier objective performance study.
We present an efficient algorithm for determining the fundamental frequency of speech and making the voiced/unvoiced (V/UV) decision. The pitch extractor uses the cross-correlation average magnitude difference function (AMDF) waveform obtained from the linear prediction residual signal. The decision logic used in pitch extraction is simple and reliable: the periodicity and null depth of the AMDF waveforms, together with the average residual energy and the past pitch information, determine both the fundamental frequency and the V/UV decision. Computer simulation of the algorithm yielded accurate results, even for phonemes on which pitch extraction is difficult.
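The idea of reading pitch from the deepest AMDF null can be sketched in a few lines. This is a simplified illustration, not the authors' algorithm: it applies a plain AMDF directly to the input frame rather than to the LPC residual, and it ignores residual energy and past pitch information. The function names, the search range of 60 to 400 Hz, and the `null_ratio` threshold are all assumptions chosen for the example.

```python
import math


def amdf(frame, min_lag, max_lag):
    """Average magnitude difference for each lag in min_lag..max_lag."""
    n = len(frame)
    values = []
    for lag in range(min_lag, max_lag + 1):
        count = n - lag
        total = sum(abs(frame[i] - frame[i + lag]) for i in range(count))
        values.append(total / count)
    return values


def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=400.0, null_ratio=0.3):
    """Return the estimated pitch in Hz, or None if the frame looks unvoiced.

    The frame is declared voiced only when the deepest AMDF null falls
    below null_ratio times the mean AMDF value (an assumed threshold).
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    if len(frame) <= max_lag:
        raise ValueError("frame too short for the requested pitch range")
    d = amdf(frame, min_lag, max_lag)
    mean_d = sum(d) / len(d)
    best = min(range(len(d)), key=d.__getitem__)
    if mean_d == 0.0 or d[best] > null_ratio * mean_d:
        return None
    return sample_rate / (min_lag + best)


if __name__ == "__main__":
    rate = 8000
    tone = [math.sin(2 * math.pi * 100 * i / rate) for i in range(400)]
    print(estimate_pitch(tone, rate))
```

A deep null relative to the mean indicates strong periodicity, so the null depth serves as both the pitch cue and a crude voicing cue in this sketch.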
A statistical decision approach to the recognition of connected digits is described in this paper. The method can be either speaker dependent (i.e., each new speaker must first train the system on representative digit strings before he can successfully use the system) or speaker independent. Multiple repetitions of each digit (spoken in connected strings) are used in the training sequence. Repetitions of the same digit are combined by linearly warping the individual reference patterns to the speakers' average length for the digit. Statistics of the mean and covariance of the recognition parameters between repetitions of the same digit are computed and are used in the recognition phase of the system. Once a spoken digit string has been segmented, the recognition of each digit within the string is achieved using a distance measure based on an expanded form of the principle of minimum residual error. In cases where a great deal of coarticulation can be anticipated between adjacent digits (i.e., between digits bounded by voiced regions), a second distance metric is employed. This metric includes both the effects of the analysis estimation error and the effects of coarticulation. The analysis parameters used in this system are the linear prediction coefficients of a 10-pole LPC analysis. For stability purposes, the LPC coefficients are converted to parcor (reflection) coefficients prior to the linear warping, and the warped parcor coefficients are then converted back to LPC coefficients for recognition purposes. The recognition system was tested on six speakers in the speaker-dependent mode, with recognition accuracies from 97 to 100 percent. It was also tested with 10 new speakers in the speaker-independent mode, with a digit recognition accuracy of 95 percent.
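The conversion between LPC and parcor coefficients around the warping step can be sketched with the standard step-down and step-up recursions. This is a minimal illustration under stated assumptions, not the paper's implementation: the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, the function names, and the use of simple linear interpolation between neighbouring frames for the warp are all choices made for this example.

```python
def lpc_to_reflection(a):
    """Step-down recursion: [a_1..a_p] -> reflection coefficients [k_1..k_p]."""
    cur = list(a)
    ks = [0.0] * len(cur)
    for m in range(len(cur), 0, -1):
        k = cur[m - 1]
        if abs(k) >= 1.0:
            raise ValueError("filter is not stable")
        ks[m - 1] = k
        # Recover the order m-1 predictor from the order m predictor.
        cur = [(cur[i] - k * cur[m - 2 - i]) / (1.0 - k * k) for i in range(m - 1)]
    return ks


def reflection_to_lpc(ks):
    """Step-up recursion: reflection coefficients [k_1..k_p] -> [a_1..a_p]."""
    cur = []
    for m, k in enumerate(ks, start=1):
        cur = [cur[i] + k * cur[m - 2 - i] for i in range(m - 1)] + [k]
    return cur


def warp_linear(frames, target_len):
    """Linearly resample a sequence of coefficient vectors to target_len frames."""
    if target_len == 1 or len(frames) == 1:
        return [list(frames[0]) for _ in range(target_len)]
    scale = (len(frames) - 1) / (target_len - 1)
    out = []
    for t in range(target_len):
        pos = t * scale
        lo = min(int(pos), len(frames) - 2)
        frac = pos - lo
        out.append([(1.0 - frac) * x + frac * y
                    for x, y in zip(frames[lo], frames[lo + 1])])
    return out


if __name__ == "__main__":
    lpc_frames = [reflection_to_lpc([0.5, -0.3]), reflection_to_lpc([-0.2, 0.6])]
    parcor = [lpc_to_reflection(f) for f in lpc_frames]
    warped = warp_linear(parcor, 4)
    print([reflection_to_lpc(k) for k in warped])
```

Interpolating in the reflection domain keeps every warped coefficient inside (-1, 1) whenever the inputs are, so each warped frame still corresponds to a stable all-pole filter. Interpolating the LPC coefficients directly gives no such guarantee.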
Automatic speech emotion recognition (ASER) from source speech signals is a challenging task, since recognition accuracy depends heavily on the speech features extracted for emotion classification. The pre-processing and classification phases also play a key role in improving the accuracy of an ASER system. This paper therefore proposes a deep learning convolutional neural network (DLCNN)-based ASER model, hereafter denoted ASERNet. Speech denoising is performed with spectral subtraction (SS), and deep features are extracted by combining linear predictive coding (LPC) with Mel-frequency cepstrum coefficients (MFCCs). Finally, the DLCNN classifies the emotion of the speech from the extracted LPC-MFCC features. The simulation results demonstrate the superior performance of the proposed ASERNet model in terms of accuracy, precision, recall, and F1-score compared to state-of-the-art ASER approaches.
Voice processing has made considerable progress in the last 10 years. Interaction with computer systems using spoken language is becoming common in consumer products, office systems and telecommunications applications. The article focuses on speech technology for computer systems. We briefly review voice technology, its current status, and specific aspects of automatic speech recognition, speech synthesis and applications.
This paper introduces the G.723.1 speech coder and analyzes its technology and features. We recommend optimizing the running time of the G.723.1 coder and propose improvements to several modules with high computational complexity, such as the pitch estimation module and the adaptive and fixed codebook search modules.