ISBN (digital): 9798350392258
ISBN (print): 9798350392265
In this paper, we propose MDCTCodec, an efficient lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of audio as input, encoding it into a continuous latent code which is then discretized by a residual vector quantizer (RVQ). Subsequently, the decoder decodes the MDCT spectrum from the quantized latent code and reconstructs audio via inverse MDCT. During the training phase, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to discriminate the natural or decoded MDCT spectrum for adversarial training. Experimental results confirm that, in scenarios with high sampling rates and low bitrates, the MDCTCodec exhibited high decoded audio quality, improved training and generation efficiency, and compact model size compared to baseline codecs. Specifically, the MDCTCodec achieved a ViSQOL score of 4.18 at a sampling rate of 48 kHz and a bitrate of 6 kbps on the public VCTK corpus.
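The MDCT and inverse MDCT that bracket the codec can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frame size, the sine window, the zero padding, and the function names `mdct` and `imdct` are all assumptions chosen for illustration, and the neural encoder, RVQ, and decoder are left out entirely. The sketch only shows that a windowed MDCT followed by an inverse MDCT with overlap-add recovers the waveform.

```python
import numpy as np

def _basis(N):
    # MDCT cosine basis, shape (N, 2N): N coefficients per 2N-sample frame.
    n = np.arange(2 * N)[None, :]
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

def _sine_window(N):
    # Sine window; satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1.
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def mdct(x, N):
    """Windowed MDCT with hop size N (assumed even). Returns shape (frames, N)."""
    x = np.asarray(x, dtype=float)
    if x.size % N != 0:
        raise ValueError("signal length must be a multiple of N in this sketch")
    # Pad one hop of zeros on each side so every sample lies in two frames.
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
    n_frames = padded.size // N - 1
    frames = np.stack([padded[i * N:i * N + 2 * N] for i in range(n_frames)])
    return (frames * _sine_window(N)) @ _basis(N).T

def imdct(X, N):
    """Inverse MDCT with synthesis window and overlap-add."""
    # The 2/N scale compensates for applying the window at analysis and synthesis.
    frames = (2.0 / N) * (X @ _basis(N)) * _sine_window(N)
    out = np.zeros((X.shape[0] + 1) * N)
    for i, frame in enumerate(frames):
        out[i * N:i * N + 2 * N] += frame  # time-domain aliasing cancels here
    return out[N:-N]  # drop the padding added in mdct()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(64)
    spectrum = mdct(signal, 16)          # real-valued, critically sampled
    restored = imdct(spectrum, 16)
    print(spectrum.shape, np.max(np.abs(restored - signal)))
```

Because the MDCT is real-valued and critically sampled, a codec built on it handles a single real spectrum per frame rather than separate magnitude and phase, which is one plausible reading of why the abstract emphasizes efficiency.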
Compared with existing speech recognition technology, speech recognition based on a spiking neural network offers higher robustness, lower cost, lower power consumption, and a stronger biological basis. Spiking neural networks depend heavily on spiking signals as input, so an audio coding method for spiking neural networks is necessary. The audio coding circuit in this paper is based on the principle of the human cochlea and operates entirely on analog signals, which gives audio coding a faster response and higher robustness. Drawing on the principle of the human cochlea and on experience with existing audio coding methods, this paper designs a new spike coding circuit for audio signals. After the audio signal is input into the circuit, the channels for different frequencies each output spiking signals.
ISBN (print): 9781479988518
Traditional audio codecs based on real-valued transforms utilize separate and largely independent algorithmic schemes for parametric coding of noise-like or high-frequency spectral components as well as channel pairs. It is shown that in the frequency-domain part of coders such as Extended HE-AAC, these schemes can be unified into a single algorithmic block located at the core of the modified discrete cosine transform path, enabling greater flexibility like semi-parametric coding and large savings in codec delay and complexity. This paper focuses on the stereo coding aspect of this block and demonstrates that, by using specially chosen spectral configurations when deriving the parametric side-information in the encoder, perceptual artifacts can be reduced and the spatial processing in the decoder can remain real-valued. Listening tests confirm the benefit of our proposal at intermediate bit-rates.
The domain of spatial audio comprises methods for capturing, processing, and reproducing audio content that contains spatial information. Data-based methods are those that operate directly on the spatial information carried by audio signals. This is in contrast to model-based methods, which impose spatial information from, for example, metadata like the intended position of a source onto signals that are otherwise free of spatial information. Signal processing has traditionally been at the core of spatial audio systems, and it continues to play a very important role. The emergence of deep learning in many closely related fields has put the focus on the potential of learning-based approaches for the development of data-based spatial audio applications. This article reviews the most important application domains of data-based spatial audio, including well-established methods that employ conventional signal processing, while paying special attention to the most recent achievements that make use of machine learning. Our review is organized according to the topology of the spatial audio pipeline, which consists of capture, processing/manipulation, and reproduction. The literature on the three stages of the pipeline is discussed, as well as the literature on the spatial audio representations that are used to transmit the content between them, highlighting the key references and elaborating on the underlying concepts. We reflect on the literature by juxtaposing the prerequisites that made machine learning successful in domains other than spatial audio with those that are found in the domain of spatial audio today. Based on this, we identify routes that may facilitate future advancement.
Synthetic human speech signals have become very easy to generate given modern text-to-speech methods. When these signals are shared on social media, they are often compressed using the Advanced Audio Coding (AAC) standard. Our goal is to study whether a small set of coding metadata contained in the AAC compressed bit stream is sufficient to detect synthetic speech. This would avoid decompressing the speech signals before analysis. We call our proposed method AAC Synthetic Speech Detection (ASSD). ASSD extracts information from the AAC compressed bit stream without decompressing the speech signal and analyzes that information using a transformer neural network. In our experiments, we compressed the ASVspoof2019 dataset according to the AAC standard using different data rates. We compared the performance of ASSD to a time-domain-based and a spectrogram-based synthetic speech detection method. We evaluated ASSD on approximately 71k compressed speech signals. The results show that our proposed method typically requires only 1000 bits per speech block/frame from the AAC compressed bit stream to detect synthetic speech. This is much lower than other reported methods. Our method also had a detection accuracy 9.7 percentage points higher than existing methods.
Efficient coding of speech and audio in a distributed system requires that quantization errors across nodes are uncorrelated. Yet, with conventional methods at low bitrates, quantization levels become increasingly sparse, which does not correspond to the distribution of the input signal and, importantly, also reduces coding efficiency in a distributed system. We have recently proposed a distributed speech and audio codec design, which applies quantization in a randomized domain such that quantization errors are randomly rotated in the output domain. Similar to dithering, this ensures that quantization errors across nodes are uncorrelated and coding efficiency is retained. In this paper, we improve this approach by proposing faster randomization methods, with a computational complexity of O(N log N). The presented experiments demonstrate that the proposed randomizations yield uncorrelated signals, that perceptual quality is competitive, and that the complexity of the proposed methods is feasible for practical applications.
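The idea of quantizing in a randomized domain can be sketched as follows. The abstract does not say which fast randomizations the paper proposes, so this example swaps in one well-known O(N log N) construction as an assumption: random sign flips followed by a fast Walsh-Hadamard transform. The names `fwht` and `node_error`, the step size, and the per-node seeds are hypothetical.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(N log N); N must be a power of two."""
    y = np.array(x, dtype=float)
    n = y.size
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        blocks = y.reshape(-1, 2, h)               # each block: [first half, second half]
        a = blocks[:, 0, :] + blocks[:, 1, :]      # butterfly sum
        b = blocks[:, 0, :] - blocks[:, 1, :]      # butterfly difference
        y = np.stack([a, b], axis=1).reshape(n)
        h *= 2
    return y / np.sqrt(n)                          # with this scaling the transform is its own inverse

def node_error(x, seed, step):
    """Quantization error in the output domain for one node with its own randomization."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=x.size)   # node-specific random sign flips (assumed scheme)
    y = fwht(signs * x)                            # move to the randomized domain
    y_hat = np.round(y / step) * step              # uniform scalar quantizer
    x_hat = signs * fwht(y_hat)                    # undo the randomization
    return x_hat - x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)
    e1 = node_error(x, seed=1, step=1.0)
    e2 = node_error(x, seed=2, step=1.0)
    print("correlation between node errors:", np.corrcoef(e1, e2)[0, 1])
```

If both nodes instead quantized `x` directly with the same quantizer, their errors would be identical, so a second node would add no information. Giving each node its own seed is what decorrelates the errors in this sketch.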
Echo-hiding has been widely studied for audio watermarking. This study proposes a more secure echo-hiding method based on a modified pseudo-noise (PN) sequence and robust principal component analysis (RPCA). In the proposed method, RPCA is used to decompose the original audio signal into low-rank and sparse parts, and then a pair of opposite modified PN sequences is employed to embed watermarks. The modified PN sequence improves the robustness of watermark detection by providing additional correlation peaks. Meanwhile, thanks to RPCA and the opposite PN sequences, the security of the proposed method is improved, since watermarks cannot be detected from the whole signal even if the PN sequence is known, which is a clear improvement over previous PN-based echo-hiding methods. In the watermark detection process, the authors make use of the low-rank and sparse characteristics of the watermarked signal to detect watermarks from the low-rank and sparse parts, respectively. Building on this basic framework, they also propose a multi-bit embedding scheme, which doubles the embedding capacity compared with previous PN-based echo-hiding methods. The proposed method was evaluated with respect to inaudibility, security, and robustness. The experimental results verified the effectiveness of the proposed method.
In the early days, consumption of multimedia content related to audio signals was only possible in a stationary manner. The music player was located at home and required a physical drive. The alternative for an individual was to attend a live performance at a concert hall or host a private concert at home. In short, audio-visual experiences were reserved for a narrow group of recipients. Today, thanks to portable players, vision and sound are at last available to everyone. Thanks to multimedia streaming platforms, every music piece or video, e.g. from one's favourite artist or band, can be viewed anytime and anywhere. The background or status of an individual is no longer an issue. Each person who is connected to the global network can access the same resources. This paper focuses on the consumption of multimedia content using mobile devices. It describes a year-to-year user case study carried out between 2015 and 2019, and describes the development of current trends related to the expectations of modern users. The goal of this study is to aid policymakers, as well as providers, when it comes to designing and evaluating systems and services.
ISBN (print): 9781467369985
At low bitrates, next-generation audio coders apply waveform-preserving transform coding only to the perceptually most relevant parts of the signal. The resulting spectral gaps are filled in the decoder through techniques like Intelligent Gap Filling (IGF). IGF is currently being standardized in MPEG-H 3D Audio and also in 3GPP Enhanced Voice Services (EVS). In IGF processing, spectral tiles are copied from a spectral source location into a target location and subsequently adapted by parameter-steered post-processing to best match relevant properties of the original signal. Important properties include the spectral and temporal envelope. Since IGF operates on Modified Discrete Cosine Transform (MDCT) spectra of rather long time blocks, temporal envelope shaping is not trivial. In this paper, Temporal Tile Shaping (TTS) is presented. TTS is based on linear prediction in the MDCT domain for shaping the temporal structure of the gap-filling signal in the target tiles with sub-block granularity. A listening test demonstrates the advantage of the proposed method.
This study proposes improvements to the 22.2 multichannel (22.2 ch) sound broadcasting service. 22.2 ch sound is currently used in 8K satellite broadcasting in Japan. In this study, the audio system is migrated from channel-based audio to object-based audio. Object-based audio equips 22.2 ch sound with alternative and adaptive functionalities: the alternative functionality relates to dialogue controls such as multilingual services, while the adaptive functionality enables 22.2 ch sound to be adapted to the audio format of the playback equipment. Moving Picture Experts Group (MPEG)-H 3D Audio (3DA), which is the latest audio coding standard, is used as the audio coding scheme. A real-time encoder and decoder based on 3DA were developed to verify the practicability of the proposed system. The encoded audio data is packetized and transmitted by MPEG-H MPEG Media Transport (MMT) to be multiplexed with video data. A transmission experiment with 8K video was carried out, in which the proposed system was shown to operate as designed.