Search results - Inner Mongolia University Library

Neural Speech Coding for Real-Time Communications Using Constant Bitrate Scalar Quantization

IEEE Journal of Selected Topics in Signal Processing, 2024, vol. 18, no. 8, pp. 1462-1476

Authors: Brendel, Andreas; Pia, Nicola; Gupta, Kishan; Behringer, Lyonel; Fuchs, Guillaume; Multrus, Markus
Affiliations: Fraunhofer Inst Integrated Circuits IIS, Erlangen; Fraunhofer IIS, D-91058 Erlangen, Germany

Neural audio coding has emerged as a vivid research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder is learned. This allows for efficient transmission of the input audio signal. The learned discrete representation of neural codecs is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ) and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose and analyze simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters or codebook storage thereby simplifying the training of neural audio codecs. For real-time speech communication applications, these neural codecs are required to operate at low complexity, low latency and at low bitrates. We address those challenges by proposing a new causal network architecture that is based on SQ and a Short-Time Fourier Transform (STFT) representation. The proposed method performs particularly well in the very low complexity and low bitrate regime.
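The abstract does not spell out the quantizer itself, so the following is only a minimal sketch of generic bounded uniform scalar quantization, not the authors' method. It assumes the encoder output has already been projected to a few dimensions; the function names, the `tanh` bounding, the number of levels and the frame rate are all illustrative assumptions. It shows why such a scheme needs no codebook and yields a constant bitrate: every frame produces the same number of fixed-width indices.

```python
import math

def scalar_quantize(latent, levels):
    """Map each (already projected) latent value to an integer index in [0, levels - 1]."""
    indices = []
    for x in latent:
        bounded = math.tanh(x)  # assumption: squash to (-1, 1) before rounding
        indices.append(round((bounded + 1) / 2 * (levels - 1)))
    return indices

def dequantize(indices, levels):
    """Map indices back to uniformly spaced values in [-1, 1]."""
    return [i / (levels - 1) * 2 - 1 for i in indices]

def constant_bitrate(dims, levels, frames_per_second):
    """Bits per second when every index is sent with a fixed-length code."""
    bits_per_index = math.ceil(math.log2(levels))
    return dims * bits_per_index * frames_per_second

# Hypothetical 4-dimensional latent frame, 5 levels per dimension.
frame = [0.3, -1.2, 2.5, 0.0]
indices = scalar_quantize(frame, levels=5)
values = dequantize(indices, levels=5)
rate = constant_bitrate(dims=4, levels=5, frames_per_second=50)
```

In a trainable codec the rounding step would need a gradient workaround during training; that part is outside the scope of this sketch.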

Keywords: Codecs; Bit rate; Training; Speech coding; Audio coding; Quantization (signal); Complexity theory; Vectors; Real-time systems; Representation learning; Discrete representation learning; low complexity; neural speech coding; quantization; real-time

Scalable and Efficient Neural Speech Coding: A Hybrid Design

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, vol. 30, pp. 12-25

Authors: Zhen, Kai; Sung, Jongmo; Lee, Mi Suk; Beack, Seungkwon; Kim, Minje
Affiliations: Indiana Univ, Dept Comp Sci, Bloomington, IN 47408, USA; Indiana Univ, Cognit Sci Program, Bloomington, IN 47408, USA; Elect & Telecommun Res Inst, Daejeon 34129, South Korea; Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN 47408, USA

We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components to NWC, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models are with a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs linear predictive coding (LPC) module as its first autoencoder. The hybrid design integrates LPC and NWC by redefining LPC's quantization as a differentiable process, making the system training an end-to-end manner. The decoder of proposed system is with either one NWC (0.12 million parameters) in low to medium bitrate ranges (12 to 20 kbps) or two NWCs in the high bitrate (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly reduced from that of other neural speech coders, such as a WaveNet-based vocoder. For wide-band speech coding quality, our system yields comparable or superior performance to AMR-WB and Opus on TIMIT test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near transparent performance.
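The residual coding idea in this abstract, where each module codes whatever error its predecessors left behind, can be illustrated with a toy cascade. This is a sketch under strong assumptions: each "module" here is just a uniform quantizer with a hypothetical step size standing in for a trained NWC autoencoder, and the names `make_stage` and `cascade_code` are invented for illustration.

```python
def make_stage(step):
    """Return a toy coding module: a uniform quantizer with the given step size."""
    def code(signal):
        return [round(x / step) * step for x in signal]
    return code

def cascade_code(signal, stages):
    """Run the stages in order; each one codes the residual left by all earlier stages."""
    reconstruction = [0.0] * len(signal)
    residual = list(signal)
    for stage in stages:
        coded = stage(residual)
        # The decoder simply sums the outputs of all stages.
        reconstruction = [r + c for r, c in zip(reconstruction, coded)]
        residual = [s - r for s, r in zip(signal, reconstruction)]
    return reconstruction

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

signal = [0.83, -0.41, 0.27, -0.96]
one_stage = cascade_code(signal, [make_stage(0.5)])
two_stages = cascade_code(signal, [make_stage(0.5), make_stage(0.1)])
```

Adding a second stage spends more bits and shrinks the reconstruction error, which mirrors how the abstract describes using one module at lower bitrates and two at the highest.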

Keywords: Speech coding; Bit rate; Encoding; Decoding; Vocoders; Complexity theory; Speech codecs; neural speech coding; waveform coding; representation learning; model complexity

A Hybrid DFSMN and Mamba Architecture for Low Bitrate Neural Speech Coding

14th International Symposium on Chinese Spoken Language Processing

Authors: Zhao, Yuhao; Jia, Maoshen; Ru, Jiawei; Tai, Junqi
Affiliations: Beijing Univ Technol, Sch Informat Sci & Technol, Beijing, Peoples R China; Beijing Univ Technol, Beijing Dublin Int Coll, Beijing, Peoples R China

ISBN: (print) 9798331516833; 9798331516826

In this paper, we proposed a novel low bitrate neural speech codec based on sequence modeling networks. The proposed method consists of a convolution-based encoder and decoder, a DFSMN-Mamba module, and a vector quantizer. In the proposed method, a DFSMN-Mamba module is designed by combining Deep Feedforward Sequential Memory Network (DFSMN) with selective state space model Mamba, which is used to model the input features in parallel in both time and frequency dimensions. An adversarial loss is used to train the entire codec framework, which enables compression of speech waveforms into compact discrete representations at low bitrates. Experimental results show that the proposed method achieves better performance than the baseline in both subjective and objective evaluation.

Keywords: neural speech coding; Mamba; DFSMN

A Dual-path Conformer-based Network for Neural Speech Coding

14th International Symposium on Chinese Spoken Language Processing

Authors: Ru, Jiawei; Jia, Maoshen; Zhao, Yuhao; Tao, Liang
Affiliations: Beijing Univ Technol, Sch Informat Sci & Technol, Beijing, Peoples R China

ISBN: (print) 9798331516833; 9798331516826

In this paper, we propose a neural speech coding method based on the dual-path conformer, which mainly consists of three steps: (1) the encoding and decoding of the time-frequency spectrum are performed by a structure that combines the CNN and the dual-path conformer, (2) residual vector quantization is employed to quantize the output features of encoder and form a compact discrete representation, and (3) multi-period and multi-scale discriminators are used to improve the perceptual quality of speech during adversarial training. Experimental results, from both subjective and objective evaluations, demonstrate that the proposed codec outperforms the state-of-the-art neural codec AudioDEC and the leading conventional codec Opus in terms of performance.
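Step (2) of this abstract relies on residual vector quantization. The sketch below shows the general technique only, not this paper's configuration: the two hand-written codebooks are hypothetical stand-ins for learned ones, and the function names are illustrative. Each stage picks the nearest codeword to the current residual, subtracts it, and passes the remainder on; the decoder sums the selected codewords.

```python
def nearest(codebook, vector):
    """Index of the codeword with the smallest squared Euclidean distance to `vector`."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )

def rvq_encode(vector, codebooks):
    """Return one codeword index per stage, each chosen on the remaining residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords of all stages."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Hypothetical codebooks: a coarse stage followed by a finer one (4 codewords each).
CODEBOOKS = [
    [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]],
    [[0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]],
]

indices = rvq_encode([0.8, -1.3], CODEBOOKS)
approx = rvq_decode(indices, CODEBOOKS)
```

With two codebooks of four entries each, this toy vector costs 2 + 2 = 4 bits; dropping the second index still decodes to a coarser approximation.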

Keywords: neural speech coding; conformer

Disentangled Feature Learning for Real-Time Neural Speech Coding

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Jiang, Xue; Peng, Xiulian; Zhang, Yuan; Lu, Yan
Affiliations: Communication University of China, Beijing, China; Microsoft Research Asia, Beijing, China

ISBN: (print) 9781728163277

Recently end-to-end neural audio/speech coding has shown its great potential to outperform traditional signal analysis based audio codecs. This is mostly achieved by following the VQ-VAE paradigm where blind features are learned, vector-quantized and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, more global-like speaker identity and local content features are learned with disentanglement to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among different features but also provides the flexibility to do audio editing in embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency and we find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models with far less parameters and low latency, showing the potential of our neural coding framework. © 2023 IEEE.

Keywords: disentangled feature learning; neural speech coding; real-time communications

NESC: Robust Neural End-2-End Speech Coding with GANs

Interspeech Conference

Authors: Pia, Nicola; Gupta, Kishan; Korse, Srikanth; Multrus, Markus; Fuchs, Guillaume
Affiliations: Fraunhofer IIS Erlangen, Erlangen, Germany

Neural networks have proven to be a formidable tool to tackle the problem of speech coding at very low bit rates. However, the design of a neural coder that can be operated robustly under real-world conditions remains a major challenge. Therefore, we present Neural End-2-End Speech Codec (NESC), a robust, scalable end-to-end neural speech codec for high-quality wideband speech coding at 3 kbps. The encoder uses a new architecture configuration, which relies on our proposed Dual-PathConvRNN (DPCRNN) layer, while the decoder architecture is based on our previous work Streamwise-StyleMelGAN. Our subjective listening tests on clean and noisy speech show that NESC is particularly robust to unseen conditions and signal perturbations.

Keywords: neural speech coding; Generative Adversarial Network; residual quantization

AVS3P10 Standard for Real-time Speech Coding

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Xiao, Wei; Dou, Weibei; Wang, Wenlong; Yi, Gaoxiong; Li, Jingxin; Shang, Shidong
Affiliations: Tencent Ethereal Audio Lab, Tencent, Shenzhen, China; Department of Electronic Engineering, Tsinghua University, Beijing, China; Tencent Ethereal Audio Lab, Tencent, Beijing, China; China Electronics Standardization Institute, Beijing, China

ISBN: (print) 9798350368741

As the tenth part of the third-generation AVS standard series for real-time speech coding, AVS3P10 is the recent standard completed in the Audio Video Coding Standards Workgroup of China (AVS). Combining the state-of-the-art deep generative networks and signal processing methods, AVS3P10 targets defining new generation neural speech codecs with high quality at low bitrates, enabling excellent experiences even when the bitrate is at 5.9 kbps with excellent error resilience. Moreover, it provides wideband and super wideband coding modes, and it supports the extension of stereo coding. Both subjective listening test and objective measurement prove the merit of AVS3P10. Especially, a lightweight model with only 880k parameters is incorporated to maintain the practicality of AVS3P10 in computational efficiency. Conclusively, AVS3P10 demonstrates the maturity of neural speech coding with broad application perspectives in real-time communication. © 2025 IEEE.

Keywords: AVS; neural speech coding; RTC

DRED: Deep REDundancy Coding of Speech Using a Rate-Distortion-Optimized Variational Autoencoder

IEEE Journal of Selected Topics in Signal Processing, 2024, vol. 18, no. 8, pp. 1441-1447

Authors: Valin, Jean-Marc; Buthe, Jan; Mustafa, Ahmed; Klingbeil, Michael
Affiliations: Xiph Org Fdn, Jaffrey, NH 03452, USA; Amazon Web Serv, Palo Alto, CA 94303, USA

Despite recent advancements in packet loss concealment (PLC) using deep learning techniques, packet loss remains a significant challenge in real-time speech communication. Redundancy has been used in the past to recover the missing information during losses. However, conventional redundancy techniques are limited in the maximum loss duration they can cover and are often unsuitable for burst packet loss. We propose a new approach based on a rate-distortion-optimized variational autoencoder (RDO-VAE), allowing us to optimize a deep speech compression algorithm for the task of encoding large amounts of redundancy at very low bitrate. The proposed Deep REDundancy (DRED) algorithm can transmit up to 50x redundancy using less than 32 kb/s. Results show that DRED outperforms the existing Opus codec redundancy. We also demonstrate its benefits when operating in the context of WebRTC.
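The limitation mentioned here, that redundancy only covers losses up to a maximum duration, can be seen in a toy model of redundant packetization. This sketch does not implement DRED or its RDO-VAE; frames are opaque strings, redundancy is a plain copy rather than a low-bitrate encoding, and `packetize`, `receive` and `depth` are hypothetical names chosen for illustration.

```python
def packetize(frames, depth):
    """Packet n carries frame n plus copies of up to `depth` preceding frames."""
    packets = []
    for n, frame in enumerate(frames):
        start = max(0, n - depth)
        redundancy = {k: frames[k] for k in range(start, n)}
        packets.append({"seq": n, "frame": frame, "redundancy": redundancy})
    return packets

def receive(packets, lost, total):
    """Rebuild the frame sequence from arrived packets; None marks an unrecoverable frame."""
    out = [None] * total
    for packet in packets:
        if packet["seq"] in lost:
            continue  # this packet never arrived
        out[packet["seq"]] = packet["frame"]
        # Fill earlier gaps from the redundancy carried by this packet.
        for k, frame in packet["redundancy"].items():
            if out[k] is None:
                out[k] = frame
    return out

frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
packets = packetize(frames, depth=2)
short_burst = receive(packets, lost={2, 3}, total=6)    # burst of 2, covered by packet 4
long_burst = receive(packets, lost={1, 2, 3}, total=6)  # burst of 3 exceeds depth 2
```

A burst no longer than `depth` is fully repaired by the first packet after it, while a longer burst leaves its earliest frames missing. Covering long bursts therefore requires deep redundancy, which is only affordable if each redundant frame costs very few bits.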

Keywords: Audio redundancy; neural speech coding; variational autoencoder

PERSONALIZED NEURAL SPEECH CODEC

49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Jang, Inseon; Yang, Haici; Lim, Wootaek; Beack, Seungkwon; Kim, Minje
Affiliations: Elect & Telecommun Res Inst, Daejeon 34129, South Korea; Indiana Univ, Dept Intelligent Syst Engn, Bloomington, IN 47408, USA; Univ Illinois Urbana-Champaign, Dept Comp Sci, Champaign, IL 61801, USA; Indiana Univ, Indiana, PA, USA

ISBN: (print) 9798350344868; 9798350344851

In this paper, we propose a personalized neural speech codec, envisioning that personalization can reduce the model complexity or improve perceptual speech quality. Despite the common usage of speech codecs where only a single talker is involved on each side of the communication, personalizing a codec for the specific user has rarely been explored in the literature. First, we assume speakers can be grouped into smaller subsets based on their perceptual similarity. Then, we also postulate that a group-specific codec can focus on the group's speech characteristics to improve its perceptual quality and computational efficiency. To this end, we first develop a Siamese network that learns the speaker embeddings from the Librispeech dataset, which are then grouped into underlying speaker clusters. Finally, we retrain the LPCNet-based speech codec baselines on each of the speaker clusters. Subjective listening tests show that the proposed personalization scheme introduces model compression while maintaining speech quality. In other words, with the same model complexity, personalized codecs produce better speech quality.

Keywords: speech coding; neural speech coding; personalization; model compression

LOW BITRATE LOSS RESILIENCE SCHEME FOR A SPEECH ENHANCING NEURAL CODEC

49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Kolundzija, Mihailo; Kavalekalam, Mathew; Balic, Ivana; Mao, Michelle; Casas, Raul
Affiliations: Cisco Syst, San Jose, CA 95134, USA

ISBN: (print) 9798350344868; 9798350344851

Deep neural networks have proven their efficacy in encoding high-quality speech and audio at remarkably low bitrates, while also demonstrating superior performance in audio packet loss concealment (PLC) compared to traditional methods. Although ultra low-bitrate speech and audio codecs may appear less practical for real-time voice communication over the Internet due to packetization overhead, they present a promising solution for ensuring uninterrupted voice communication under adverse network conditions. In this paper, we use a neural speech codec designed end-to-end, encompassing a versatile set of features ranging from efficient low-bitrate speech coding and decoding to advanced functionalities such as noise removal, dereverberation, and packet loss concealment. For this codec, we present a long low-bitrate redundancy mechanism for recovering from extended packet loss bursts. We furthermore introduce a memory-efficient entropy coding scheme specifically designed for low-bitrate redundant audio packets. Finally, we demonstrate the effectiveness of the said codec, together with the memory- and bitrate-efficient redundancy, at coping with adverse acoustic and network conditions.

Keywords: neural speech coding; audio loss resilience
