Search results - Inner Mongolia University Library

Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion

arXiv, 2024

Authors: Shi, Yu-Fei; Ai, Yang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China)

We participated in track 2 of the VoiceMOS Challenge 2024, which aimed to predict the mean opinion score (MOS) of singing samples. Our submission secured first place among all participating teams, excluding the official baseline. In this paper, we further improve our submission and propose a novel Pitch-and-Spectrum-aware Singing Quality Assessment (PS-SQA) method. PS-SQA is built on the self-supervised-learning (SSL) MOS predictor and incorporates singing pitch and spectral information, which are extracted using a pitch histogram and a non-quantized neural codec, respectively. Additionally, PS-SQA introduces a bias correction strategy to address prediction biases caused by low-resource training samples, and employs model fusion to further enhance prediction accuracy. Experimental results confirm that our proposed PS-SQA significantly outperforms all competing systems across all system-level metrics, demonstrating its strong singing quality assessment capabilities. © 2024, CC BY.

Keywords: Self-supervised learning

Designing and Evaluating Speech Emotion Recognition Systems: A Reality Check Case Study with IEMOCAP

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Antoniou, Nikolaos; Katsamanis, Athanasios; Giannakopoulos, Theodoros; Narayanan, Shrikanth (Behavioral Signal Technologies, Los Angeles, CA, United States; Athena Research Center, Institute for Language and Speech Processing, Athens, Greece; SAIL, University of Southern California, Los Angeles, CA, United States)

ISBN (print): 9781728163277

There is an imminent need for guidelines and standard test sets to allow direct and fair comparisons of speech emotion recognition (SER) systems. While resources such as the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database have emerged as widely adopted reference corpora for researchers to develop and test models for SER, published work reveals a wide range of assumptions and variety in its use that challenge reproducibility and generalization. Based on a critical review of the latest advances in SER using IEMOCAP as the use case, our work aims at two contributions: First, using an analysis of the recent literature, including assumptions made and metrics used therein, we provide a set of SER evaluation guidelines. Second, using recent publications with open-sourced implementations, we focus on reproducibility assessment in SER. © 2023 IEEE.

Keywords: Emotion recognition

Long-Form Speech Translation through Segmentation with Finite-State Decoding Constraints on Large Language Models

arXiv, 2023

Authors: McCarthy, Arya D.; Zhang, Hao; Kumar, Shankar; Stahlberg, Felix; Wu, Ke (Center for Language and Speech Processing, Johns Hopkins University, United States; Google Research)

One challenge in speech translation is that plenty of spoken content is long-form, but short units are necessary for obtaining high-quality translations. To address this mismatch, we adapt large language models (LLMs) to split long ASR transcripts into segments that can be independently translated so as to maximize the overall translation quality. We overcome the tendency of LLMs to hallucinate by incorporating finite-state constraints during decoding; these eliminate invalid outputs without requiring additional training. We discover that LLMs are adaptable to transcripts containing ASR errors through prompt-tuning or fine-tuning. Relative to a state-of-the-art automatic punctuation baseline, our best LLM improves the average BLEU by 2.9 points for English-German, English-Spanish, and English-Arabic TED talk translation in 9 test sets, just by improving segmentation. © 2023, CC BY.

Keywords: Decoding
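
The abstract of the record above says that finite-state constraints rule out invalid segmenter outputs at decoding time. The record does not give the automaton itself, so the following is only a minimal sketch under an assumed constraint: the output must reproduce the transcript tokens in order, and the only extra symbol allowed is a boundary marker that cannot be doubled. The names `BREAK`, `allowed_next`, `constrained_greedy`, and the `score` callback (a stand-in for an LLM's next-token score) are hypothetical.

```python
BREAK = "<br>"  # hypothetical segment-boundary marker


def allowed_next(source, pos, prev_was_break):
    """Return the tokens the constraint permits once `pos` source tokens are copied."""
    if pos == len(source):
        return set()  # every source token has been copied, so decoding must stop
    allowed = {source[pos]}  # copying the next source token is always valid
    if pos > 0 and not prev_was_break:
        allowed.add(BREAK)  # a boundary may follow a copied token, never another boundary
    return allowed


def constrained_greedy(source, score):
    """Greedy decoding that only ever picks among the permitted tokens.

    `score(prefix, token)` stands in for a model's score of `token` given `prefix`.
    """
    out, pos, prev_was_break = [], 0, False
    while True:
        options = allowed_next(source, pos, prev_was_break)
        if not options:
            return out
        # sorted() makes tie-breaking deterministic
        token = max(sorted(options), key=lambda t: score(out, t))
        out.append(token)
        prev_was_break = token == BREAK
        if not prev_was_break:
            pos += 1


def toy_score(prefix, token):
    """Toy scorer: prefers a boundary right after a full stop, copying otherwise."""
    if token == BREAK:
        return 1.0 if prefix and prefix[-1] == "." else 0.0
    return 0.5


segmented = constrained_greedy("a b . c d".split(), toy_score)
```

Because every step chooses only from `allowed_next`, removing the boundary markers from the output always gives back the input transcript, whatever the scorer prefers. Invalid outputs are excluded by construction rather than by extra training.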

Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

arXiv, 2024

Authors: Ai, Yang; Ling, Zhen-Hua (The National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China)

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms the iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks. Copyright © 2024, The Authors. All rights reserved.

Keywords: Group delay
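
The abstract of the record above names two ingredients that are easy to show numerically: a phase calculation that maps a pair of real and imaginary values to a wrapped phase, and an anti-wrapping function that is even, periodic, and monotonic. The sketch below is an illustration only, not the authors' implementation. It assumes the two-argument arctangent for the phase formula and one candidate anti-wrapping function, f(x) = |x - 2π·round(x / 2π)|, which has those three properties. The abstract does not state the exact formulas, and all function names are hypothetical.

```python
import numpy as np

TWO_PI = 2.0 * np.pi


def wrapped_phase(real_part, imag_part):
    """Phase from (pseudo) real and imaginary parts; arctan2 keeps values in [-pi, pi]."""
    return np.arctan2(imag_part, real_part)


def anti_wrap(x):
    """Assumed anti-wrapping function: even, 2*pi-periodic, increasing on [0, pi]."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - TWO_PI * np.round(x / TWO_PI))


def instantaneous_phase_loss(pred, target):
    """Mean anti-wrapped difference between predicted and reference phase."""
    return float(np.mean(anti_wrap(np.asarray(pred) - np.asarray(target))))


def group_delay_loss(pred, target):
    """Same idea applied to phase differences along the frequency axis (axis 0)."""
    return float(np.mean(anti_wrap(np.diff(pred, axis=0) - np.diff(target, axis=0))))


def angular_frequency_loss(pred, target):
    """Same idea applied to phase differences along the time axis (axis 1)."""
    return float(np.mean(anti_wrap(np.diff(pred, axis=1) - np.diff(target, axis=1))))


# Two phases that are numerically far apart but almost identical on the unit circle.
near_wrap_loss = instantaneous_phase_loss(np.array([3.1]), np.array([-3.1]))
```

Here a plain absolute error between 3.1 and -3.1 would be 6.2, while the anti-wrapped error is about 0.083. This is the error expansion the abstract refers to.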

Towards robust one-shot voice conversion with cycle phonetic posteriorgrams and multi-scale speaker representations

24th International Congress on Acoustics, ICA 2022

Authors: Chen, Yannian; Liu, Lijuan; Hu, Yajun; Ling, Zhenhua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; IFLYTEK Research, IFLYTEK Co. Ltd., China)

One-shot voice conversion (VC) aims to convert the voice across arbitrary speakers, including those unseen during training, with only one reference utterance from the target speaker. It is still a challenging task, as both the content and speaker representations estimated from speech are required to be reliable. In this paper, we propose a novel method which combines phonetic posteriorgrams (PPGs) and multi-scale speaker representations to achieve robust one-shot VC. PPGs are extracted by a pretrained automatic speech recognition (ASR) model and contain robust linguistic information. Cycle PPGs, which are generated from a cycle conversion process, are used for training to eliminate the influence of residual speaker information in PPGs. Furthermore, multi-scale speaker representations composed of global and local ones are utilized. Global speaker representations are modeled by an advanced speaker embedding network which integrates squeeze-excitation blocks and attentive statistics pooling to get utterance-level vectors. In order to extract time-varying and content-dependent local speaker representations, an attention mechanism is adopted to select the most suitable features depending on each content frame, which is expected to refine the coarse speaker information given by utterance-level speaker representations. Experimental results showed that the proposed method outperformed baseline methods on one-shot VC. © 2022 Proceedings of the International Congress on Acoustics. All Rights Reserved.

Keywords: Speech recognition

APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding

arXiv, 2024

Authors: Ai, Yang; Jiang, Xiao-Hang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China)

This paper introduces a novel neural audio codec targeting high waveform sampling rates and low bitrates named APCodec, which seamlessly integrates the strengths of parametric codecs and waveform codecs. The APCodec revolutionizes the process of audio encoding and decoding by concurrently handling the amplitude and phase spectra as audio parametric characteristics like parametric codecs. It is composed of an encoder and a decoder with the modified ConvNeXt v2 network as the backbone, connected by a quantizer based on the residual vector quantization (RVQ) mechanism. The encoder compresses the audio amplitude and phase spectra in parallel, amalgamating them into a continuous latent code at a reduced temporal resolution. This code is subsequently quantized by the quantizer. Ultimately, the decoder reconstructs the audio amplitude and phase spectra in parallel, and the decoded waveform is obtained by inverse short-time Fourier transform. To ensure the fidelity of decoded audio like waveform codecs, spectral-level loss, quantization loss, and generative adversarial network (GAN) based loss are collectively employed for training the APCodec. To support low-latency streamable inference, we employ feed-forward layers and causal deconvolutional layers in APCodec, incorporating a knowledge distillation training strategy to enhance the quality of decoded audio. Experimental results confirm that our proposed APCodec can encode 48 kHz audio at a bitrate of just 6 kbps, with no significant degradation in the quality of the decoded audio. At the same bitrate, our proposed APCodec also demonstrates superior decoded audio quality and faster generation speed compared to well-known codecs, such as Encodec, AudioDec and DAC. Copyright © 2024, The Authors. All rights reserved.

Keywords: Vector quantization
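
The quantizer in the record above is described as residual vector quantization (RVQ). As a rough illustration of that general mechanism, and not of APCodec's actual quantizer, the sketch below quantizes one latent vector in stages: each stage picks the nearest codeword to whatever the previous stages left unexplained. The function name `rvq_encode` and the tiny hand-written codebooks are hypothetical. In a trained codec the codebooks are learned.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize vector `x` with a list of codebooks, one per RVQ stage.

    Each codebook has shape (num_codewords, dim). Returns the chosen index per
    stage and the reconstruction, which is the sum of the chosen codewords.
    """
    residual = np.asarray(x, dtype=float).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        # Squared Euclidean distance from the current residual to every codeword.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += codebook[idx]
        residual -= codebook[idx]  # later stages only see what is still unexplained
    return indices, quantized


# Two toy stages: a coarse codebook followed by a finer one.
toy_codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25]]),
]
toy_indices, toy_quantized = rvq_encode(np.array([1.2, 0.8]), toy_codebooks)
```

Only the per-stage indices need to be transmitted, so adding stages raises the bitrate in exchange for a smaller reconstruction error.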

Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction

arXiv, 2024

Authors: Lu, Ye-Xin; Ai, Yang; Du, Hui-Peng; Ling, Zhen-Hua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China)

Speech bandwidth extension (BWE) refers to widening the frequency bandwidth range of speech signals, making the speech sound brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, where the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components from the source narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level. Experimental results demonstrate that our proposed AP-BWE achieves state-of-the-art performance in terms of speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, due to the all-convolutional architecture and all-frame-level operations, the proposed AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1 times faster than real-time on a single CPU. Notably, to our knowledge, AP-BWE is the first to achieve the direct extension of the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods. Copyright © 2024, The Authors. All rights reserved.

Keywords: Discriminators

Aligning Noisy-Clean Speech Pairs at Feature and Embedding Levels for Learning Noise-Invariant Speaker Representations

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zuoliang Li; Yang Ai; Jie Zhang; Shengyu Peng; Yu Guan; Bin Gu; Wu Guo (The National Engineering Research Center for Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China (USTC), Hefei, China)

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

In this paper, we propose a noise-invariant speaker representation learning (SRL) approach by aligning noisy-clean speech pairs at both the feature and embedding levels for model training. Specifically, we first construct noisy-clean pairs using data augmentation during training. The noisy features are then processed by a Conformer-based enhancement module. The feature-level alignment is achieved by minimizing the mean squared error between the enhanced and original clean data. At the embedding level, we introduce a supervised contrastive learning loss with a noise-adaptive margin to simultaneously enhance the intra-speaker compactness and the inter-speaker separability and to better adapt to different noise levels, in combination with the Barlow Twins self-supervised loss to align the noisy-clean data pairs and reduce noise redundancy in the embedding space. Finally, these loss components are integrated with the conventional classification loss to train the SRL network. Experimental results on various VoxCeleb1 test sets synthesized with noise sources demonstrate the effectiveness of the proposed method.

Keywords: Training; Representation learning; Noise; Redundancy; Contrastive learning; Feature extraction; Data augmentation; Noise measurement; Speech processing; Noise level
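
Two of the loss terms named in the abstract of the record above have standard forms that can be sketched in a few lines: the mean squared error used for feature-level alignment, and the Barlow Twins loss used to align noisy and clean embeddings while reducing redundancy. The code below follows the commonly used Barlow Twins formulation (cross-correlation matrix pushed toward the identity). The paper's weighting, the noise-adaptive contrastive term, and how the terms are combined are not given in the record, so `off_diag_weight` and the function names are assumptions.

```python
import numpy as np


def feature_alignment_mse(enhanced, clean):
    """Mean squared error between enhanced noisy features and clean features."""
    return float(np.mean((np.asarray(enhanced) - np.asarray(clean)) ** 2))


def barlow_twins_loss(z_noisy, z_clean, off_diag_weight=5e-3, eps=1e-8):
    """Barlow Twins loss for two batches of embeddings with shape (batch, dim).

    Diagonal entries of the cross-correlation matrix are pushed toward 1 (the
    two views agree); off-diagonal entries are pushed toward 0 (dimensions do
    not carry redundant information).
    """
    # Standardize every embedding dimension across the batch.
    za = (z_noisy - z_noisy.mean(axis=0)) / (z_noisy.std(axis=0) + eps)
    zb = (z_clean - z_clean.mean(axis=0)) / (z_clean.std(axis=0) + eps)
    corr = za.T @ zb / za.shape[0]  # (dim, dim) cross-correlation matrix
    diag = np.diag(corr)
    on_diag = np.sum((diag - 1.0) ** 2)
    off_diag = np.sum(corr ** 2) - np.sum(diag ** 2)
    return float(on_diag + off_diag_weight * off_diag)


rng = np.random.default_rng(0)
clean_emb = rng.normal(size=(256, 8))
aligned_loss = barlow_twins_loss(clean_emb, clean_emb)                     # identical views
unrelated_loss = barlow_twins_loss(rng.normal(size=(256, 8)), clean_emb)   # independent views
```

With identical views the diagonal term vanishes and only a small redundancy penalty remains, whereas independent views leave the diagonal far from 1. This is the behavior that lets the loss pull noisy and clean embeddings of the same utterance together.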

Recursive Feature Learning from Pre-Trained Models for Spoofing Speech Detection

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Yu Guan; Yang Ai; Zuoliang Li; Shengyu Peng; Wu Guo (National Engineering Research Center for Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China (USTC), Hefei, China)

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

It was recently revealed that using features extracted from pre-trained models can achieve much better performance than using conventional hand-crafted acoustic features for spoofing speech detection. In this paper, we therefore enhance the features from the pre-trained model based on recursive learning. Specifically, we modify the pre-trained model by feeding the features from the topmost transformer layer to the bottom layers recursively, and the recursive features obtained from the bottom layers are fused with those from the topmost layer. The fused features are then fed into the backend classifiers. Experiments are carried out on two benchmark datasets (i.e., ASVspoof 2019 LA and ASVspoof 2021 LA), which show the superiority of the proposed method over state-of-the-art systems.

Keywords: Voice activity detection; Representation learning; Linear regression; Signal processing; Benchmark testing; Feature extraction; Transformers; Excavation; Acoustics