ISBN (print): 9798350344868; 9798350344851
Existing deep-learning-based speech enhancement methods mainly employ a data-driven approach, which leverages large amounts of data with a variety of noise types to remove noise from noisy signals. However, the high dependence on the data limits their generalization to unseen, complex noise in real-life environments. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noise. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by the acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach that aligns noisy frames with the generated speech tokens to improve robustness and scalability to different input lengths. Unlike other methods that use multiple stages to generate speech codes, we use a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test sets show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
ISBN (print): 9798350323726
In this paper, we propose a high-rate extension of the SoundStream codec which is able to generate almost transparent quality audio at 16 kbps for wideband speech signals. SoundStream shows reasonably good performance at low bit-rates (e.g. around 9 kbps), but its performance does not improve much when more bits are used for encoding the latent embeddings. Motivated by experimental results showing that neural audio codec performance is highly related to the characteristics of latent embeddings such as dimensionality, dependency, and probability density function shape, we propose a convolutional transformer architecture and an attention-based multi-scale latent decomposition method that significantly enhances codec performance when quantizing high-dimensional embeddings. Experimental results show the superiority of our proposed model over conventional approaches.
GAN vocoders are currently among the state-of-the-art methods for building high-quality neural waveform generative models. However, most of their architectures require dozens of billions of floating-point operations per ...