This paper is a camel.* Short of a book-length text (which, incidentally, one of us (NSJ) is writing), our task to discuss the field of digital speech coding is well-nigh impossible. Nevertheless, we have attempted to...
During the development of modern communication technology, wideband speech coding has provided high-fidelity speech transmission, but its high bandwidth requirements limit its application in resource-constrained environments, so narrowband speech coding still holds research value. However, traditional narrowband low-bit-rate speech coding methods usually cannot generate satisfactory speech quality. To address this issue, this paper proposes a narrowband low-bit-rate speech coding architecture called PMVQCodec, with the following major improvements. First, we design a predictive multi-level vector quantization (PMVQ) technique, which employs a predictor to capture the correlations between latent frame vectors and combines it with multi-level vector quantization to enhance quantization efficiency. We also introduce a full-band feature extractor to reduce computational complexity. In our experiments, both subjective and objective evaluations demonstrate the effectiveness of the proposed PMVQCodec architecture. The proposed method achieves higher-quality reconstructed speech than Encodec and HiFiCodec at 1.2 kbps and 2.4 kbps, and even outperforms LyraV2 at 6 kbps.
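The general idea of combining a predictor with multi-level vector quantization can be illustrated with a small sketch. This is not the PMVQCodec implementation, which the abstract does not specify. The first-order predictor with coefficient `alpha`, the fixed toy codebooks, and all function names are assumptions made for illustration only. Each frame is predicted from the previous reconstruction, and the prediction residual is quantized by a cascade of codebooks, each stage coding what the previous stage left over.

```python
import math

def nearest(codebook, vec):
    """Return (index, codeword) of the codeword closest to vec (squared Euclidean distance)."""
    best = min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))
    return best, codebook[best]

def encode(frames, codebooks, alpha=0.5):
    """Quantize each frame's prediction residual with a cascade of codebooks.

    alpha is a hypothetical first-order prediction coefficient (assumption).
    """
    prev = [0.0] * len(frames[0])      # last reconstruction, also known to the decoder
    all_indices = []
    for frame in frames:
        pred = [alpha * p for p in prev]
        residual = [x - p for x, p in zip(frame, pred)]
        recon = list(pred)
        indices = []
        for codebook in codebooks:     # each level quantizes the remaining residual
            i, codeword = nearest(codebook, residual)
            indices.append(i)
            residual = [r - c for r, c in zip(residual, codeword)]
            recon = [q + c for q, c in zip(recon, codeword)]
        all_indices.append(indices)
        prev = recon                   # predict from the reconstruction, not the input
    return all_indices

def decode(all_indices, codebooks, dim, alpha=0.5):
    """Rebuild frames by adding the selected codewords to the same prediction."""
    prev = [0.0] * dim
    frames = []
    for indices in all_indices:
        recon = [alpha * p for p in prev]
        for codebook, i in zip(codebooks, indices):
            recon = [q + c for q, c in zip(recon, codebook[i])]
        frames.append(recon)
        prev = recon
    return frames

def bitrate_bps(codebooks, frames_per_second):
    """Bits per second when every level sends one fixed-length index per frame."""
    return frames_per_second * sum(math.log2(len(cb)) for cb in codebooks)
```

With two toy codebooks of four codewords each, every frame costs 2 + 2 = 4 bits, so a hypothetical rate of 50 frames per second gives `bitrate_bps(codebooks, 50) == 200.0`. Predicting from the previous reconstruction rather than the previous input keeps the encoder and decoder states identical.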
ISBN: 9798350368741 (digital), 9798350368758 (print)
The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation, comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM that can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-speech significantly outperforms single-scale baseline systems on naturalness and speaker similarity in zero-shot TTS. The analysis of multi-scale coding demonstrates the effectiveness of CoFi-Codec in learning multi-scale discrete speech representations while keeping high-quality speech reconstruction. The coarse-to-fine multi-scale generation, especially for the stack-of-scale approach, is also validated as a crucial approach in pursuing a high-quality neural codec language model for TTS.
ISBN: 9798350368741 (digital), 9798350368758 (print)
Balancing low-bitrate coding with speech quality is currently a highly debated topic in the research community. At very low bitrates, existing methods often fail to maintain speech naturalness, intelligibility, and personalization. To address this issue, we introduce an ultra-low-bitrate semantic speech coding approach, termed Semantic Speech Coding (SSC). Specifically, a multi-level feature extraction and compression mechanism sequentially extracts and compresses speech features at different levels, ensuring speech quality at ultra-low bitrates. A semantic vector quantization codec fuses spectral and pitch features to extract essential semantic information, achieving more efficient compression while enhancing intelligibility and naturalness. A low-data-overhead speaker feature encoder captures time-invariant speaker characteristics, enabling personalized speech synthesis without additional data overhead, so that the synthesized speech retains personalization and naturalness. A diffusion loss mechanism employs a conditional diffusion model to progressively restore details, mitigating the detail loss typically seen in conventional codecs and further enhancing the naturalness and realism of the synthesized speech. We achieved significant improvements in speech quality at an ultra-low bitrate of 106 bps, which approaches the theoretical upper limit of information rate.
ISBN: 9798350368741 (digital), 9798350368758 (print)
As the tenth part of the third-generation AVS standard series, dedicated to real-time speech coding, AVS3P10 is a recent standard completed by the Audio Video coding Standards Workgroup of China (AVS). Combining state-of-the-art deep generative networks with signal processing methods, AVS3P10 aims to define a new generation of neural speech codecs with high quality at low bitrates, enabling excellent experiences even at 5.9 kbps, with excellent error resilience. Moreover, it provides wideband and super-wideband coding modes and supports extension to stereo coding. Both subjective listening tests and objective measurements demonstrate the merit of AVS3P10. In particular, a lightweight model with only 880k parameters is incorporated to keep AVS3P10 practical in terms of computational efficiency. Overall, AVS3P10 demonstrates the maturity of neural speech coding, with broad application prospects in real-time communication.
This paper reviews the main algorithms for speech coding at low and very low bit rates, from 50 bps to 4000 bps. The HSX technique for coding at 1200 bps and a new segmental method with automatically derived units for very-low-bit-rate coding are then presented in detail.
The function of a speech coding algorithm is to convert an analogue speech signal into a digital form for efficient transmission over a digital path, or efficient storage on a digital storage medium, and to perform the complementary function of converting a received digital signal back to analogue form. The article reviews those speech coding techniques which are already being extensively used in telecommunications applications. As well as explaining the basic principles employed by these speech coding algorithms to achieve efficient digital encoding, examples of telecommunications services which use these algorithms are presented.
A type of speech coding for asynchronous transfer mode (ATM) is described. Cell processing, which improves service quality, is taken into account. Missing-cell recovery methods are discussed, and the distinctive features of missing-cell recovery methods used with low-bit-rate coding are examined. An example of the speech quality obtained using speech coding techniques in ATM networks is described. The performance levels for increasing cell loss are compared for various speech coding methods, in combination with methods for dividing coded speech signals into cells and discarding cells. Representative feasible network applications of coding technologies are considered.
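To make the notions of dividing a coded signal into cells and recovering missing cells concrete, here is a minimal simulation sketch. It does not reproduce the methods of the paper, which the abstract does not detail. It assumes one byte per sample, so that a 48-octet ATM cell payload carries 48 samples, and it uses the simplest concealment strategy: repeat the last correctly received cell, or insert silence if none has arrived yet. The function names are hypothetical.

```python
def packetize(samples, cell_size=48):
    """Split a sample sequence into consecutive cells of at most cell_size samples."""
    return [samples[i:i + cell_size] for i in range(0, len(samples), cell_size)]

def conceal(cells, lost, cell_size=48):
    """Rebuild the stream, replacing each lost cell with the last good cell.

    lost is a set of indices of cells dropped by the network (simulation input).
    Before any cell has been received, a lost cell is replaced by silence (zeros).
    """
    out = []
    last_good = [0] * cell_size
    for index, cell in enumerate(cells):
        if index in lost:
            # Simulation shortcut: reuse the lost cell's length to keep timing aligned.
            out.extend(last_good[:len(cell)])
        else:
            out.extend(cell)
            last_good = cell
    return out
```

For example, with `cell_size=4`, the samples `0..9` form three cells. Losing the second cell yields `[0, 1, 2, 3, 0, 1, 2, 3, 8, 9]`, where the first cell has been repeated in place of the missing one.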
In this paper we propose a nonlinear predictive speech encoder based on an adaptive combiner with a neural net that weighs the prediction of several nonlinear predictors. Thus, we exploit the advantages of data fusion on a nonlinear prediction scheme, where it appears in a more natural way than for linear predictors. Experimental results reveal that this scheme outperforms the fixed combination (with mean, median, etc. operators) up to 1.5 dB in SEGSNR. (C) 2005 Elsevier B.V. All rights reserved.
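For readers unfamiliar with the baseline and metric mentioned above, the sketch below shows a fixed combiner (mean or median of several predictor outputs) and a common definition of segmental SNR: the average over frames of 10·log10(signal energy / error energy). The frame length, the rule for skipping degenerate frames, and the function names are assumptions. The adaptive neural-network combiner proposed in the paper is not reproduced here.

```python
import math
from statistics import mean, median

def combine(predictions, op=median):
    """Fuse the per-sample outputs of several predictors with a fixed operator.

    predictions is a list of equal-length sequences, one per predictor.
    """
    return [op(samples) for samples in zip(*predictions)]

def segsnr_db(reference, estimate, frame_len=4):
    """Segmental SNR in dB: mean of per-frame 10*log10(signal energy / error energy)."""
    snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        est = estimate[start:start + frame_len]
        signal = sum(r * r for r in ref)
        error = sum((r - e) ** 2 for r, e in zip(ref, est))
        if signal > 0 and error > 0:   # skip silent or perfectly predicted frames
            snrs.append(10 * math.log10(signal / error))
    return sum(snrs) / len(snrs) if snrs else float("inf")
```

A usage example: `combine([p1, p2, p3], op=mean)` averages three predictor outputs sample by sample, and `segsnr_db(speech, combined)` scores the result against the original signal. The median is less sensitive than the mean to a single predictor producing an outlier.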
The past decade has witnessed substantial progress towards the application of low-rate speech coders to civilian and military communications as well as computer-related voice applications. Central to this progress has been the development of new speech coders capable of producing high-quality speech at low data rates. Most of these coders incorporate mechanisms to represent the spectral properties of speech, provide for speech waveform matching, and "optimize" the coder's performance for the human ear. A number of these coders have already been adopted in national and international cellular telephony standards. The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications. Although the emphasis is on the new low-rate coders, we attempt to provide a comprehensive survey by covering some of the traditional methodologies as well. We feel that this approach will not only point out key references but will also provide valuable background to the beginner. The paper starts with a historical perspective and continues with a brief discussion on the speech properties and performance measures. We then proceed with descriptions of waveform coders, sinusoidal transform coders, linear predictive vocoders, and analysis-by-synthesis linear predictive coders. Finally, we present concluding remarks followed by a discussion of opportunities for future research.