ISBN: 9789464593617; 9798331519773 (print)
This work addresses the problem of latent space quantization in neural audio coding. A covariance analysis of the latent space is performed on several pre-trained audio coding models (Lyra V2, EnCodec, AudioDec). It is proposed to truncate the latent space dimension using a fixed linear transform. The Karhunen-Loeve transform (KLT) is applied to learned residual vector quantization (RVQ) codebooks. The proposed method is applied in a backward-compatible way to EnCodec, and we show that quantization complexity and codebook storage are reduced (by 43.4%), with no noticeable difference in subjective AB tests.
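The idea of truncating a latent space with a fixed linear transform can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the synthetic latents, the stand-in codebook, the helper names (`fit_klt`, `truncate`, `reconstruct`) and the choice of `k = 4` are all assumptions made for illustration. The sketch estimates the latent covariance, takes its eigenvectors as the KLT basis, and projects codebook entries onto the leading `k` components.

```python
import numpy as np

def fit_klt(latents):
    """Estimate the KLT basis (covariance eigenvectors, largest variance first)."""
    mean = latents.mean(axis=0)
    cov = np.cov(latents - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return mean, eigvecs[:, order], eigvals[order]

def truncate(x, mean, basis, k):
    """Project vectors onto the k leading KLT components."""
    return (x - mean) @ basis[:, :k]

def reconstruct(y, mean, basis):
    """Map truncated coefficients back to the original latent dimension."""
    k = y.shape[-1]
    return y @ basis[:, :k].T + mean

# Hypothetical data: 16-dimensional latents whose energy lies mostly in 4 directions.
rng = np.random.default_rng(0)
mixing = rng.standard_normal((4, 16))
latents = rng.standard_normal((2000, 4)) @ mixing
latents += 0.01 * rng.standard_normal((2000, 16))

mean, basis, eigvals = fit_klt(latents)

codebook = latents[:32]                      # stand-in for a learned codebook
codebook_k = truncate(codebook, mean, basis, k=4)
approx = reconstruct(codebook_k, mean, basis)
rel_err = np.linalg.norm(approx - codebook) / np.linalg.norm(codebook)
print(codebook_k.shape, rel_err)
```

In this toy setting the codebook shrinks from 16 to 4 values per entry, and nearest-neighbour search runs in the reduced dimension; how much can be truncated in a real codec depends on the measured covariance spectrum.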
This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantiz...
Neural audio coding has recently shown very promising results in the literature, largely outperforming traditional codecs, but limited attention has been paid to its error resilience. Neural codecs trained considering only source coding tend to be extremely sensitive to channel noise, especially in wireless channels with high error rates. In this paper, we investigate how to improve the error resilience of neural audio codecs against the packet losses that often occur during real-time communications. We propose a feature-domain packet loss concealment algorithm (FD-PLC) for real-time neural speech coding. Specifically, we introduce a self-attention-based module on the received latent features to recover lost frames in the feature domain before the decoder. A hybrid segment-level and frame-level frequency-domain discriminator is employed to guide the network to focus on both the generative quality of lost frames and the continuity with neighbouring frames. Experimental results on several error patterns show that the proposed scheme can achieve better robustness compared with the corresponding error-free and error-resilient baselines. We also show that feature-domain concealment is superior to its waveform-domain counterpart applied as post-processing.
Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so a different model needs to be trained for each target bitrate. This increases the memory footprint at both the sender and the receiver side, and transcoding is often needed to support multiple receivers. In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement. In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and the quality improves progressively as more bits become available. The proposed CSVQ scheme can be flexibly applied to any neural audio coding network with a mirrored auto-encoder structure to achieve bitrate scalability. Subjective results show that the proposed scheme outperforms the classical residual VQ (RVQ) with scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps and Lyra at 3 kbps, and it provides a graceful quality boost as the bitrate increases.
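For context, the classical residual VQ baseline mentioned above already offers a simple form of scalability: each stage quantizes the residual left by the previous ones, so a decoder holding only the first few indices still produces a coarse reconstruction. The sketch below illustrates that baseline, not CSVQ itself. The toy codebooks, their sizes, and the function names `rvq_encode` and `rvq_decode` are assumptions; each toy codebook includes a zero vector so that adding a stage can never increase the error.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage codes the remaining residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))          # nearest codeword to the residual
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords; a prefix of the indices gives a coarse result."""
    out = np.zeros(codebooks[0].shape[1])
    for idx, cb in zip(indices, codebooks):
        out += cb[idx]
    return out

# Hypothetical toy codebooks: 4 stages, 64 entries each, 8 dimensions,
# with a smaller scale at each stage and a zero vector as entry 0.
rng = np.random.default_rng(1)
codebooks = []
for stage in range(4):
    cb = rng.standard_normal((64, 8)) * (0.5 ** stage)
    cb[0] = 0.0
    codebooks.append(cb)

x = rng.standard_normal(8)
indices = rvq_encode(x, codebooks)

# Reconstruction error when only the first n stages of the bitstream arrive.
errors = [
    float(np.linalg.norm(x - rvq_decode(indices[:n], codebooks)))
    for n in range(len(codebooks) + 1)
]
print(errors)
```

Dropping trailing stages lowers the bitrate without retraining, which is the scalability property that CSVQ is compared against.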
This paper presents speech quality results to characterize the state of the art and technological advance of recent neural audio codecs targeting low bitrates. Audio quality was evaluated in one clean speech experiment (in French). Degradation Mean Opinion Score (DMOS) results are reported and discussed for neural audio codecs (LPCNet, Lyra V2, EnCodec, AudioCraft, AudioDec, Descript Audio Codec); traditional codecs (Opus, EVS) are also included as performance yardsticks. We also discuss observed codec complexity to complement subjective test results.
ISBN: 9781665405409 (print)
Deep-learning-based methods have shown their advantages over traditional ones in audio coding, but limited attention has been paid to real-time communications (RTC). This paper proposes TFNet, an end-to-end neural speech codec with low latency for RTC. It follows an encoder-temporal filtering-decoder paradigm that has seldom been investigated in audio coding. An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies. Furthermore, with end-to-end optimization, TFNet is jointly optimized with speech enhancement and packet loss concealment, yielding a one-for-all network for three tasks. Both subjective and objective results demonstrate the efficiency of the proposed TFNet.