The 3GPP Immersive Voice and audio Services (IVAS) codec enables mobile spatial communication through coding of the metadata-assisted spatial audio (MASA) format. The MASA format is a new parametric audio format desig...
The DRA (Dynamic Resolution Adaptation) audio coding standard has been shown to use a transient-localized MDCT to suppress pre-echo artifacts effectively and statistical allocation of codebooks to improve the compression efficiency of Huffman coding. Its quantizers and Huffman codebooks are designed so that a 24-bit signal path is provided throughout the codec, allowing high audio quality to be delivered when the bit rate suffices. Although simple, it delivers state-of-the-art compression efficiency, as shown by five rounds of ITU-R BS.1116-compliant subjective listening tests.
ISBN (digital): 9798350368741
ISBN (print): 9798350368758
This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer, and decoder. At inference time, the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time that a CFM-based approach has been applied to general audio coding, enabling scalable, simple, and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline that allows complexity to be traded off against quality. This enables real-time coding on a CPU while maintaining high perceptual quality.
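The complexity/quality trade-off of ODE-based decoding can be illustrated with a small sketch. It assumes that the number of solver steps is the tuning knob, which the abstract does not state explicitly, and it replaces the learned, conditioned vector field with a hypothetical closed-form stand-in (`toy_vector_field`) so that the solver error can be measured against an exact solution. It is not FlowMAC's decoder.

```python
import math

def toy_vector_field(x, t, target, rate=3.0):
    """Hypothetical stand-in for a learned vector field v(x, t | condition).
    It is time-independent and pulls x toward `target` at a fixed rate."""
    return [rate * (g - xi) for xi, g in zip(x, target)]

def euler_integrate(x0, target, num_steps, rate=3.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    h = 1.0 / num_steps
    for step in range(num_steps):
        t = step * h
        v = toy_vector_field(x, t, target, rate)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x

def exact_solution(x0, target, rate=3.0):
    """Closed-form endpoint of the toy ODE, used only to measure solver error."""
    decay = math.exp(-rate)
    return [g + (xi - g) * decay for xi, g in zip(x0, target)]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

noise = [1.0, -2.0, 0.5]     # starting sample (stands in for Gaussian noise)
target = [0.2, 0.4, -0.1]    # stands in for the conditioning-dependent output
reference = exact_solution(noise, target)

# Fewer steps cost fewer vector-field evaluations but track the flow less closely.
coarse_error = max_abs_error(euler_integrate(noise, target, 4), reference)
fine_error = max_abs_error(euler_integrate(noise, target, 32), reference)
```

In this toy setting each Euler step costs one evaluation of the vector field, so the step count scales decoding cost linearly while the integration error shrinks as steps are added.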
ISBN (digital): 9798350368741
ISBN (print): 9798350368758
Audio and speech coding lack unified evaluation and open-source testing. Many candidate systems were evaluated on proprietary, non-reproducible, or small datasets, and machine learning-based codecs are often tested on data with a distribution similar to that of their training data, which makes comparisons unfair to digital signal processing-based codecs that usually work well on unseen data. This paper presents a full-band audio and speech coding quality benchmark with more variable content types, including traditional open test vectors. An example use case of audio coding quality assessment is presented with the open-source Opus, 3GPP’s EVS, and ETSI’s recent LC3 with LC3+, used in Bluetooth LE Audio profiles. In addition, quality variations of emotional speech encoding at 16 kbps are shown. The proposed open-source benchmark contributes to the democratization of audio and speech coding and is available at https://***/JozefColdenhoff/OpenACE.
ISBN (digital): 9798350368741
ISBN (print): 9798350368758
In the history of audio and acoustic signal processing, perceptual audio coding stands out as a success story through its ubiquitous deployment in virtually all digital media devices, such as computers, tablets, mobile phones, set-top boxes, and digital radios. From a technology perspective, perceptual audio coding has undergone tremendous development from the first very basic perceptually driven coders (including the popular mp3 format) to today’s full-blown integrated coding/rendering systems. This paper provides a historical overview of this research journey by pinpointing the pivotal development steps in the evolution of perceptual audio coding. Finally, it provides thoughts about future directions in this area.
ISBN (digital): 9798350368741
ISBN (print): 9798350368758
Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of the rate-distortion tradeoff, particularly for simple input audio such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame. Furthermore, we propose a gradient estimation method for the non-differentiable masking operation that transforms the importance map into the binary importance mask, improving model training via a straight-through estimator. We demonstrate that the proposed training framework achieves superior results compared to the baseline method and shows further improvement when applied to the current state-of-the-art codec. Audio samples are available at: https://***/***/
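The idea of spending fewer codebooks on simple frames can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the VRVQ model: the codebooks are random rather than learned, the mapping from importance to mask (`importance_to_mask`, a hard prefix mask) is hypothetical, and only the forward pass is shown, so the straight-through gradient estimator is omitted. Each codebook contains a zero codeword so that adding a stage can never increase the residual.

```python
import math
import random

def make_codebooks(num_codebooks, codebook_size, dim, seed=0):
    """Random stand-in codebooks (a real codec learns them). Index 0 is the zero vector."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_codebooks):
        scale = 0.5 ** stage  # later stages model smaller residuals
        book = [[0.0] * dim]
        book += [[rng.gauss(0.0, scale) for _ in range(dim)]
                 for _ in range(codebook_size - 1)]
        books.append(book)
    return books

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def importance_to_mask(importance, num_codebooks):
    """Hypothetical hard mapping: importance in [0, 1] -> prefix mask over codebooks."""
    n_used = min(num_codebooks, max(1, math.ceil(importance * num_codebooks)))
    return [1 if i < n_used else 0 for i in range(num_codebooks)]

def rvq_encode(frame, codebooks, mask):
    """Greedy residual quantization using only the codebooks enabled by the prefix mask."""
    residual = list(frame)
    indices = []
    for book, keep in zip(codebooks, mask):
        if not keep:
            break
        idx = min(range(len(book)), key=lambda j: sq_dist(residual, book[j]))
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return indices

def rvq_decode(indices, codebooks, dim):
    """Sum the selected codewords of the stages that were actually transmitted."""
    out = [0.0] * dim
    for book, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, book[idx])]
    return out

codebooks = make_codebooks(num_codebooks=8, codebook_size=16, dim=4)
frame = [0.8, -0.3, 0.1, 0.5]
errors = []
for importance in (0.125, 0.5, 1.0):
    mask = importance_to_mask(importance, len(codebooks))
    indices = rvq_encode(frame, codebooks, mask)
    errors.append(sq_dist(frame, rvq_decode(indices, codebooks, len(frame))))
```

A silent (all-zero) frame is reconstructed exactly by a single stage here, which is the case where a fixed codebook count wastes bits.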
Twin VQ (transform-domain weighted interleave vector quantization) is a method that encodes wideband acoustic signals at a low bit rate. It is a transform coding scheme whose basic structure transforms the input signal to the frequency domain by MDCT; vector quantization is applied after flattening. The method has characteristic features such as weighted interleave vector quantization, normalization of the frequency characteristics by the linearly predicted spectrum, and interframe prediction in the frequency domain. High performance is achieved especially at lower bit rates. Another feature is robustness against errors, since adaptive bit assignment is not applied. (C) 1998 Scripta Technica.
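The two ingredients named above, interleaving and weighted vector quantization, can be sketched as follows. This is an illustrative sketch only: the function names, the stride-based interleaving rule, and the toy codebook and weights are assumptions made for the example, not details taken from the paper.

```python
def interleave(coeffs, num_subvectors):
    """Split a coefficient vector so that subvector j takes entries j, j+M, j+2M, ...
    Each subvector then samples the whole spectrum instead of one band."""
    return [coeffs[j::num_subvectors] for j in range(num_subvectors)]

def deinterleave(subvectors):
    """Inverse of interleave: put every entry back at its original position."""
    m = len(subvectors)
    total = sum(len(sub) for sub in subvectors)
    out = [0.0] * total
    for j, sub in enumerate(subvectors):
        out[j::m] = sub
    return out

def weighted_nearest(target, weights, codebook):
    """Index of the codeword minimizing the weighted squared error.
    The weights stand in for a perceptual weighting of each coefficient."""
    def cost(codeword):
        return sum(w * (t - c) ** 2 for w, t, c in zip(weights, target, codeword))
    return min(range(len(codebook)), key=lambda i: cost(codebook[i]))

coeffs = [float(i) for i in range(10)]   # stands in for flattened MDCT coefficients
subvectors = interleave(coeffs, 3)
restored = deinterleave(subvectors)

toy_codebook = [[1.0, 1.0], [0.0, 0.0]]
pick_first_dim = weighted_nearest([1.0, 0.0], [1.0, 0.01], toy_codebook)
pick_second_dim = weighted_nearest([1.0, 0.0], [0.01, 1.0], toy_codebook)
```

The same target selects different codewords depending on the weights, which is how a weighting can steer quantization error toward less important coefficients.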
Xiong and Malvar recently introduced a nonuniform modulated complex lapped transform (NMCLT) with good time localization and controllable frequency resolution by using an oversampled nonuniform filter bank to generate its real and imaginary components. In this paper, we first show that oversampling in the NMCLT is not necessary in theory but is a by-product of fast implementation in practice. We also point out that the amount of oversampling, which can be flexibly controlled, depends on the application. We then describe in detail the implementation of the inverse transform, which was not addressed clearly by Xiong and Malvar. We present the first applications of the NMCLT to audio coding and image denoising. A scalable audio coder has been implemented by controlling the amount of oversampling and exploiting redundancy among the NMCLT coefficients via predictive coding. Experimental results show that the audio coder reduces pre-echoes and improves the sound quality of audio clips with transient sounds. A simple denoising algorithm based on the NMCLT has also been devised to provide images with better visual quality than those obtained with wavelet-based soft thresholding.
Perceptual audio coding schemes typically apply the modified discrete cosine transform (MDCT) with different lengths and windows, and utilize signal-adaptive switching between these on a per-frame basis for best subjective performance. In previous papers, the authors demonstrated that further quality gains can be achieved for some input signals using additional transform kernels such as the modified discrete sine transform (MDST) or greater inter-transform overlap by means of a modified extended lapped transform (MELT). This work discusses the algorithmic procedures and codec modifications necessary to combine all of the above features (transform length, window shape, transform kernel, and overlap ratio switching) into a flexible input-adaptive coding system. It is shown that, due to full time-domain aliasing cancelation, this system supports perfect signal reconstruction in the absence of quantization and, thanks to fast realizations of all transforms, increases the codec complexity only negligibly. The results of a 5.1 multichannel listening test are also reported.
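Time-domain aliasing cancelation can be checked numerically for the baseline case. The sketch below covers only a fixed-length MDCT with a sine window and 50% overlap; it does not implement the MDST or MELT kernels or any of the switching described above. The direct O(N²) sums are chosen for readability, not speed, and all function names are illustrative.

```python
import math
import random

def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block, window):
    """Windowed MDCT: 2N input samples -> N coefficients (direct sum)."""
    N = len(block) // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, window):
    """Windowed inverse MDCT: N coefficients -> 2N time-aliased samples."""
    N = len(coeffs)
    return [window[n] * (2.0 / N)
            * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                  for k in range(N))
            for n in range(2 * N)]

def mdct_roundtrip(signal, N):
    """Analyze and resynthesize with hop size N; overlap-add cancels the aliasing."""
    window = sine_window(N)
    tail = N + (-len(signal)) % N            # pad so every sample lies in two frames
    padded = [0.0] * N + list(signal) + [0.0] * tail
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        block = padded[start:start + 2 * N]
        restored = imdct(mdct(block, window), window)
        for n, value in enumerate(restored):
            output[start + n] += value
    return output[N:N + len(signal)]

random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(37)]
reconstructed = mdct_roundtrip(signal, N=8)
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

Without quantization, `max_error` stays at the level of floating-point rounding, although each individual frame output is aliased before the overlap-add.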