Techniques for coding voiced speech at very low bit rates are investigated, and a new algorithm designed to produce high-quality speech with low complexity is proposed. The algorithm encodes and transmits partial representative waveforms (RW's), from which the complete speech waveforms are reconstructed by a method called forward-backward waveform prediction (FBWP). The RW is encoded at 20-30 ms intervals with a low-complexity approach that takes into account the special initial conditions of the short- and long-term filters. The basic idea of FBWP is essentially consistent with that of the PWI algorithm, which was reported to be capable of producing high-quality voiced speech at bit rates between 3.0 and 4.0 kb/s. Implementing FBWP in the time domain makes fast computation possible, while high-quality speech can be obtained at a bit rate of about 3 kb/s. As in the PWI method, the proposed algorithm may be combined with an LP-based speech coder that uses a noise-like excitation to reproduce unvoiced speech.
We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components to NWC, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models have a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can also scale down to cover lower bitrates, for which it employs a linear predictive coding (LPC) module as its first autoencoder. The hybrid design integrates LPC and NWC by redefining LPC's quantization as a differentiable process, so that the system can be trained end to end. The decoder of the proposed system uses either one NWC (0.12 million parameters) in the low to medium bitrate range (12 to 20 kbps) or two NWCs at the high bitrate (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly lower than that of other neural speech coders, such as a WaveNet-based vocoder. For wide-band speech coding quality, our system yields comparable or superior performance to AMR-WB and Opus on TIMIT test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near-transparent performance.
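The residual coding idea behind CMRL can be illustrated with a toy cascade. The sketch below is not the authors' system: it assumes plain uniform scalar quantizers as stand-ins for the NWC autoencoder modules, and the names `quantize`, `residual_cascade`, and `reconstruct` are hypothetical. It only shows how each stage codes what the preceding stages failed to reconstruct, so that decoding with more stages lowers the error.

```python
def quantize(x, step):
    """Uniform scalar quantizer: round each sample to the nearest multiple of step.
    (Assumed stand-in for one autoencoding module.)"""
    return [step * round(v / step) for v in x]


def residual_cascade(signal, steps):
    """Code `signal` with a cascade of quantizers, each one coding the
    residual left by the stages before it. Returns one output per stage."""
    residual = list(signal)
    stages = []
    for step in steps:
        approx = quantize(residual, step)
        stages.append(approx)
        # What this stage could not represent is passed on to the next stage.
        residual = [r - a for r, a in zip(residual, approx)]
    return stages


def reconstruct(stages, n_stages):
    """Sum the first n_stages stage outputs, as a decoder using fewer
    modules (a lower bitrate) would."""
    return [sum(vals) for vals in zip(*stages[:n_stages])]


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


signal = [0.83, -0.41, 0.27, 0.05, -0.96, 0.62]
stages = residual_cascade(signal, steps=[0.5, 0.1, 0.02])
errors = [max_error(signal, reconstruct(stages, k)) for k in (1, 2, 3)]
```

With these assumed step sizes, the worst-case error after stage k is bounded by half of that stage's step, which is why adding modules moves the reconstruction toward transparency.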
In this paper, the split Levinson algorithm is used to develop an efficient algorithm to compute the line spectrum pairs (LSP) in linear predictive coding (LPC) of speech. We propose two new real functions defined from the reciprocal and antireciprocal parts of the predictor polynomials obtained from the split Levinson algorithm. These functions are shown to obey three-term recurrence relations. Thus, the LSP parameters are directly available from the eigenvalues of tridiagonal matrices, the entries of which are computed from only one version of the split Levinson algorithm. Compared with existing methods, this algorithm has lower complexity.
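For orientation, the reciprocal and antireciprocal polynomials mentioned above can be formed directly from the predictor coefficients, and their unit-circle zeros give the line spectral frequencies. The sketch below is the conventional grid-search-and-bisection approach, not the split Levinson or tridiagonal eigenvalue method of the paper; the function name `lsf_from_lpc` and the grid and iteration defaults are assumptions made for illustration.

```python
import math


def lsf_from_lpc(a, grid=512, iters=40):
    """Line spectral frequencies (radians, ascending) of the predictor
    polynomial A(z) = a[0] + a[1] z^-1 + ... + a[p] z^-p with a[0] == 1."""
    p = len(a) - 1
    ext = list(a) + [0.0]
    rev = ext[::-1]
    sym = [x + y for x, y in zip(ext, rev)]    # reciprocal part P(z)
    anti = [x - y for x, y in zip(ext, rev)]   # antireciprocal part Q(z)
    half = (p + 1) / 2.0

    # On the unit circle, P and Q reduce to real functions of the frequency w
    # after removing the common linear-phase factor.
    def f_sym(w):
        return sum(c * math.cos(w * (half - k)) for k, c in enumerate(sym))

    def f_anti(w):
        return sum(c * math.sin(w * (half - k)) for k, c in enumerate(anti))

    def zeros(f):
        found = []
        # Grid points lie strictly inside (0, pi), which skips the trivial
        # roots at z = 1 and z = -1.
        pts = [math.pi * (i + 0.5) / grid for i in range(grid)]
        for lo, hi in zip(pts, pts[1:]):
            lo_neg = f(lo) < 0
            if lo_neg == (f(hi) < 0):
                continue  # no sign change in this cell
            for _ in range(iters):  # bisection refinement
                mid = 0.5 * (lo + hi)
                if (f(mid) < 0) == lo_neg:
                    lo = mid
                else:
                    hi = mid
            found.append(0.5 * (lo + hi))
        return found

    return sorted(zeros(f_sym) + zeros(f_anti))


# Second-order example: A(z) = 1 - 0.9 z^-1 + 0.64 z^-2 (poles at radius 0.8).
lsf = lsf_from_lpc([1.0, -0.9, 0.64])
```

For this second-order case the two frequencies have closed forms, arccos(0.63) from P(z) and arccos(0.27) from Q(z), which makes the routine easy to check by hand.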
This paper presents a high-quality, low-bit-rate, portable Internet-phone system. The system is a mixed implementation of software and hardware. The hardware includes a portable box that can be plugged into a conventional parallel port. The box has three major parts: the speech compression unit, the host interface, and the speakerphone module. A low-cost non-delicate speech coprocessor is embedded to handle the heavy work of speech coding, a CPLD device controls the host access timing, and a 16-bit PCM codec and an audio amplifier with acoustic echo cancellation are used to optimize the speakerphone module. The experimental coding rate is 8.5 kbps. At this rate, popular modems can offer full-duplex speech in real time. Modern applications of this system target digital simultaneous voice and data (DSVD), such as voice chat in network games and video conferencing.
Currently, most state-of-the-art speaker recognition systems use short-term cepstral feature extraction to parameterize speech signals. In this paper, we propose new auditory features for speaker recognition, based on the Caelen auditory model, which simulates the external, middle, and inner parts of the ear, and on a Gammatone filter. The features are called Caelen Auditory Model Gammatone Cepstral Coefficients (CAMGTCC). The proposed features are evaluated on the TIMIT and NIST 2008 corpora. Speech coding distortion is represented by the Adaptive Multi-Rate Wideband (AMR-WB) codec, and noisy conditions use various noises taken from NOISEX-92 at several SNR levels. The speaker recognition system uses GMM-UBM and i-vector-GPLDA modelling. The experimental results demonstrate that the proposed feature extraction method performs better than Gammatone Cepstral Coefficients (GTCC) and Mel Frequency Cepstral Coefficients (MFCC) features. For speech coding distortion, the proposed features improve robustness to codec-degraded speech at different bit rates. In addition, when the test speech signals are corrupted with noise at SNRs ranging from 0 dB to 15 dB, we observe that CAMGTCC achieves an overall relative equal error rate (EER) reduction of 10.88% to 6.8% compared to the baselines.
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind features with a convolutional neural network for encoding, which leaves temporal redundancies within the encoded features. This article introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques.
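A rough picture of how a Gumbel-Softmax relaxation can make codeword selection differentiable is sketched below. It is not taken from the TF-Codec implementation: it assumes that the distance-to-soft mapping is a softmax over negative squared distances, and the names `gumbel_softmax_assign` and `soft_quantize`, the temperature `tau`, and the toy codebook are all hypothetical.

```python
import math
import random


def gumbel_softmax_assign(x, codebook, tau=0.5, rng=None):
    """Soft assignment of vector x to codewords.

    Logits are negative squared distances (an assumed distance-to-soft
    mapping). If `rng` is given, Gumbel noise is added before the
    temperature-scaled softmax, as in the Gumbel-Softmax relaxation."""
    logits = [-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    if rng is not None:
        # Gumbel(0, 1) sample: -log(-log(u)) with u drawn from (0, 1).
        logits = [l - math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
                  for l in logits]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp((l - peak) / tau) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(x, codebook, tau=0.5, rng=None):
    """Weighted average of codewords under the soft assignment."""
    w = gumbel_softmax_assign(x, codebook, tau, rng)
    return [sum(wk * c[d] for wk, c in zip(w, codebook)) for d in range(len(x))]


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
x = [0.9, 1.1]
weights = gumbel_softmax_assign(x, codebook, tau=0.1)           # no noise
noisy = gumbel_softmax_assign(x, codebook, rng=random.Random(0))
x_hat = soft_quantize(x, codebook, tau=0.1)
```

As `tau` shrinks, the noise-free assignment approaches a one-hot choice of the nearest codeword, while larger temperatures keep the weights soft enough for gradients to reach every codeword.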
This paper addresses the design, implementation and evaluation of efficient low bit-rate speech coding algorithms based on an improved sinusoidal model. A series of algorithms were developed for speech classification and pitch frequency determination, modeling of sinusoidal amplitudes and phases, and frame interpolation. An improved paradigm for sinusoidal phase coding is presented, where short-time sinusoidal phases are modeled using a combination of linear prediction, spectral sampling, linear phase alignment and all-pass phase error correction components. A class-dependent split vector quantization scheme is used to encode the sinusoidal amplitudes. The masking properties of the human auditory system are effectively exploited in the algorithms. The algorithms were successfully integrated into a 2.4 kbps sinusoidal coder. The performance of the 2.4 kbps coder was evaluated in terms of informal subjective tests such as the mean opinion score (MOS) and the diagnostic rhyme test (DRT), as well as some perceptually motivated objective distortion measures. Performance analysis on a large speech database indicates considerable improvement in short-time signal matching in both the time and the spectral domains. In addition, the subjective quality of the reproduced speech is considerably improved. (C) 2001 Elsevier Science B.V. All rights reserved.
This paper presents a tree-searched multi-stage vector quantization scheme for LPC parameters which achieves spectral distortion lower than 1 dB with low complexity and good robustness using rates as low as 22 bits/frame. The M-L search is used, and it is shown to achieve performance close to that of the optimal search for a relatively small M. A new joint codebook design strategy for multi-stage VQ is presented which improves convergence speed and the VQ performance measures. The best performance/complexity trade-offs are obtained with relatively small codebooks cascaded in a 3-4 stage configuration. It is shown experimentally that as the number of stages is increased beyond the optimal performance/complexity trade-off, the quantizer robustness and outlier performance can be improved at the expense of a slight increase in rate. Results for LAR and LSP parameters are presented. A training technique that reduces outliers at the expense of a slight degradation in average performance is introduced. The robustness across different languages, input spectral shapings, and in the presence of independent random channel errors is studied. Experimental results show that tree-searched multi-stage VQ significantly outperforms the split codebook approach.
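The M-L tree search can be sketched in a few lines: at every stage, the M partial paths with the lowest residual error survive and are expanded by all codewords of the next stage. The code below is a minimal illustration under assumed toy codebooks, using plain squared error rather than spectral distortion; the function name `ml_search` is hypothetical and the joint codebook design from the paper is not shown.

```python
def ml_search(x, stage_codebooks, m):
    """M-L tree search for multi-stage VQ.

    Keeps the `m` best partial paths after each stage. Returns the chosen
    index per stage and the squared error of the final reconstruction."""
    # Each path is (squared error, chosen indices, remaining residual).
    paths = [(sum(v * v for v in x), [], list(x))]
    for codebook in stage_codebooks:
        expanded = []
        for _, indices, residual in paths:
            for j, codeword in enumerate(codebook):
                new_res = [r - c for r, c in zip(residual, codeword)]
                err = sum(v * v for v in new_res)
                expanded.append((err, indices + [j], new_res))
        expanded.sort(key=lambda path: path[0])
        paths = expanded[:m]  # prune to the m survivors
    best_err, best_indices, _ = paths[0]
    return best_indices, best_err


# Toy two-stage quantizer (assumed values) where a greedy search is misled.
stage_codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],    # stage 1
    [[0.5, 0.0], [-0.1, 0.0]],   # stage 2
]
x = [0.55, 0.0]
greedy_idx, greedy_err = ml_search(x, stage_codebooks, m=1)
tree_idx, tree_err = ml_search(x, stage_codebooks, m=2)
```

In this example the greedy search (M = 1) commits to the closer first-stage codeword and ends with a larger error, while M = 2 keeps the alternative alive and finds the combination an exhaustive search would choose.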
In memoryless quantization, neither the encoder nor the decoder has memory, and quantization noise shaping is not used. We show that, by constraining the parameter dynamics during quantization at the encoder, the performance of speech coders can be enhanced significantly without adding to the delay. The proposed method retains the advantages of memoryless quantization, including channel-error robustness.
An algorithm is proposed to reduce the complexity and memory requirement of code-excited linear prediction (CELP) speech coding. The new algorithm is based on designing a special codebook such that each codeword is orthogonal to its shifted versions. With this orthogonality property, the algorithm significantly reduces the codeword search complexity of CELP coding. In addition, by rearranging the codewords, only 12.5% of the conventional codebook storage is required. Both segmental SNR and informal listening showed that the performance of the algorithm is equivalent to that of the original CELP algorithm.