Search Results - Inner Mongolia University Library

PNP-RKD: A Positive-Negative Pair based Relational Knowledge Distillation Method for Cross-Domain Speaker Verification

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Qing Gu, Yan Song, Nan Jiang, Pengfei Cai, Ian McLoughlin. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; ICT Cluster, Singapore Institute of Technology, Singapore

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

Existing deep embedding learning based speaker verification (SV) methods suffer from performance degradation under domain shift conditions. This can be alleviated through unsupervised domain adaptation (UDA) techniques. While UDA improves global statistical consistency across domains, discriminative information may be overlooked or misaligned in the process. To combat this, we propose PNP-RKD, a relational knowledge distillation method that utilizes positive and negative pairs from both the source and target domains within a multitask learning framework. Two auxiliary tasks are conducted separately in the source and target domains to support PNP-RKD. Embeddings are learned in a supervised fashion from the labeled source domain, providing a robust foundation of prior knowledge. For the unlabeled target domain, we apply contrastive learning based on swapped prediction, a key component that enhances noise robustness and improves the quality of learned prototypes. More importantly, it facilitates reliable sampling in PNP-RKD, thereby enhancing the alignment of discriminative knowledge across domains. Extensive experiments conducted on the NIST SRE16 and SRE18 datasets demonstrate the superior performance of the proposed PNP-RKD method, achieving EERs of 6.83% and 8.28%, respectively.

Keywords: Degradation; Prototypes; Contrastive learning; NIST; Signal processing; Multitasking; Acoustics; Noise robustness; Reliability; Speech processing

Exploring Language-Agnostic Speech Representations Using Domain Knowledge for Detecting Alzheimer’s Dementia

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zehra Shah, Shi-Ang Qi, Fei Wang, Mahtab Farrokh, Mashrura Tasnim, Eleni Stroulia, Russell Greiner, Manos Plitsis, Athanasios Katsamanis. Affiliations: Department of Computing Science, University of Alberta, Edmonton, Canada; Institute for Language and Speech Processing, Athena Research Center, Greece

We explore ways to use speech data to screen for indications of Alzheimer’s dementia (AD). In particular, we describe our approach to the ICASSP 2023 Signal Processing Grand Challenge, which involves extrapolating from models learned on English speech samples to Greek speech samples in order to determine which subjects have AD. By using acoustic and linguistic features, inspired by clinical research on AD, our top-performing classification model achieves 69% accuracy in distinguishing AD patients from healthy controls, and our regression model attains an RMSE of 4.8 for inferring cognitive testing scores. These outcomes underscore the potential of our explainable model for detecting cognitive decline in AD patients via speech, and its applicability in clinical settings.

Keywords: Signal processing; Linguistics; Acoustics; Speech processing; Alzheimer's disease; Testing

Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

arXiv, 2023

Authors: Ai, Yang; Lu, Ye-Xin; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction from the short-time Fourier transform (STFT). To tackle this issue, we present a novel long-frame-shift neural speech phase prediction (LFS-NSPP) method which enables precise prediction of long-frame-shift phase spectra from long-frame-shift log amplitude spectra. The proposed method consists of three stages: interpolation, prediction and decimation. The short-frame-shift log amplitude spectra are first constructed from long-frame-shift ones through frequency-by-frequency interpolation to enhance the spectral continuity, and then employed to predict short-frame-shift phase spectra using an NSPP model, thereby compensating for interpolation errors. Ultimately, the long-frame-shift phase spectra are obtained from short-frame-shift ones through frame-by-frame decimation. Experimental results show that the proposed LFS-NSPP method predicts long-frame-shift phase spectra with higher quality than the original NSPP model and other signal-processing-based phase estimation algorithms. Copyright © 2023, The Authors. All rights reserved.
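The three stages named in the abstract (interpolation, prediction, decimation) can be illustrated with a minimal sketch. This is not the authors' implementation: linear interpolation is an assumed choice (the abstract only says "frequency-by-frequency interpolation"), the NSPP model is replaced by a caller-supplied function, and all names (`interpolate_frames`, `decimate_frames`, `lfs_pipeline`, `zero_phase_predictor`) are hypothetical.

```python
from typing import Callable, List

Frame = List[float]  # one spectral frame: one value per frequency bin


def interpolate_frames(frames: List[Frame], factor: int) -> List[Frame]:
    """Insert factor-1 linearly interpolated frames between neighbours.

    Each frequency bin is interpolated independently along time.
    Linear interpolation is an assumption made for this sketch.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []
    out: List[Frame] = []
    for cur, nxt in zip(frames, frames[1:]):
        for step in range(factor):
            w = step / factor
            out.append([(1.0 - w) * a + w * b for a, b in zip(cur, nxt)])
    out.append(list(frames[-1]))
    return out


def decimate_frames(frames: List[Frame], factor: int) -> List[Frame]:
    """Keep every factor-th frame (frame-by-frame decimation)."""
    return [list(f) for f in frames[::factor]]


def lfs_pipeline(
    log_amp: List[Frame],
    factor: int,
    predict_phase: Callable[[List[Frame]], List[Frame]],
) -> List[Frame]:
    """Interpolate, predict at the short frame shift, then decimate."""
    short_amp = interpolate_frames(log_amp, factor)
    short_phase = predict_phase(short_amp)  # stand-in for the NSPP model
    return decimate_frames(short_phase, factor)


def zero_phase_predictor(short_amp: List[Frame]) -> List[Frame]:
    """Hypothetical placeholder predictor: returns all-zero phase."""
    return [[0.0] * len(frame) for frame in short_amp]
```

With a decimation factor equal to the interpolation factor, the frames kept after decimation sit at exactly the original long-frame-shift positions, which is why the pipeline returns one phase frame per input amplitude frame.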

Keywords: Error compensation

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

arXiv, 2023

Authors: Zheng, Rui-Chen; Ai, Yang; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization performance on unseen speakers and in the presence of unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most. Copyright © 2023, The Authors. All rights reserved.

Keywords: Ultrasonics

Language-Independent Prosody-Enhanced Speech Representations For Multilingual Speech Synthesis

IEEE Spoken Language Technology Workshop

Authors: Chang Liu, Zhen-Hua Ling, Ya-Jun Hu. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; iFLYTEK Co. Ltd., China

ISBN: (digital) 9798350392258

ISBN: (print) 9798350392265

This paper proposes language-independent prosody-enhanced speech representations to improve the naturalness of speech synthesis for target languages that lack prosodic labels. To build text-to-speech (TTS) systems for low-resource languages, recent studies have employed the representations extracted from self-supervised learning (SSL) speech models, such as wav2vec 2.0, as intermediate representations in TTS models. However, they have generally focused only on the linguistic and phonetic information in SSL representations, disregarding the prosodic information. This paper investigates the prosodic information contained in the multilingual wav2vec 2.0 model through layer-wise probing tests utilizing acoustic prosodic features and prosodic labels. Furthermore, we propose a language-independent prosody enhancement approach to improve the prosodic properties of SSL models. The proposed method introduces a prosodic label prediction loss to fine-tune the wav2vec 2.0 model with multilingual prosody-annotated corpora. From the fine-tuned wav2vec 2.0 model, the language-independent prosody-enhanced speech representations are extracted and serve as intermediate representations of our acoustic model in the downstream TTS task. The experimental results on six target languages demonstrate that our proposed prosody-enhanced speech representations outperform the original wav2vec 2.0 representations without enhancement.

Keywords: Correlation; Conferences; Self-supervised learning; Speech enhancement; Predictive models; Phonetics; Transformers; Acoustics; Multilingual; Text to speech

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

arXiv, 2023

Authors: Lu, Ye-Xin; Ai, Yang; Ling, Zhen-Hua. Affiliation: The National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder aims to encode the input distorted magnitude and phase spectra into time-frequency representations, which are further fed into time-frequency Transformers for alternately capturing time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that our proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase by explicit phase estimation, elevating the perceptual quality of enhanced speech. Remarkably, for the speech denoising task, the proposed MP-SENet yields a PESQ of 3.60 on the VoiceBank+DEMAND dataset and 3.62 on the DNS challenge dataset. Copyright © 2023, The Authors. All rights reserved.
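The "wrapping characteristics of the phase" mentioned above can be made concrete with a small worked example. The sketch below is generic signal-processing arithmetic, not code or a loss function taken from the paper; the function names `wrap_to_principal` and `wrapped_phase_error` are hypothetical. It shows why comparing two wrapped phase values by plain subtraction can overstate their distance.

```python
import math


def wrap_to_principal(angle: float) -> float:
    """Map an angle in radians to its principal value in [-pi, pi]."""
    return angle - 2.0 * math.pi * round(angle / (2.0 * math.pi))


def wrapped_phase_error(predicted: float, target: float) -> float:
    """Absolute phase difference measured on the circle.

    A plain |predicted - target| treats 3.1 rad and -3.1 rad as far apart,
    although they are close neighbours on the unit circle.
    """
    return abs(wrap_to_principal(predicted - target))


naive = abs(3.1 - (-3.1))                  # plain subtraction: 6.2 rad
on_circle = wrapped_phase_error(3.1, -3.1)  # equals 2*pi - 6.2 rad
```

Here the naive difference is 6.2 rad, while the distance on the circle is 2π − 6.2, a little over 0.08 rad.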

Keywords: Speech intelligibility

Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

arXiv, 2024

Authors: Cai, Pengfei; Song, Yan; Jiang, Nan; Gu, Qing; McLoughlin, Ian. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; ICT Cluster, Singapore Institute of Technology, Singapore

A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique. © 2024, CC BY.
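The abstract's point that binary cross-entropy gives "independent loss contributions from different prototypes" can be illustrated with a minimal sketch. This is not the PMAM implementation: the logits, targets, and function names (`bce_per_prototype`, `softmax_cross_entropy`) are hypothetical, and the softmax cross-entropy here only stands in for the softmax-style normalization that InfoNCE relies on.

```python
import math
from typing import List


def bce_per_prototype(logits: List[float], targets: List[float]) -> List[float]:
    """Binary cross-entropy computed separately for each prototype.

    Uses the numerically stable form max(z, 0) - z*t + log(1 + exp(-|z|)).
    Each term depends only on its own logit and target.
    """
    return [
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    ]


def softmax_cross_entropy(logits: List[float], targets: List[float]) -> float:
    """Cross-entropy over a softmax, which couples all logits together."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_norm) for z, t in zip(logits, targets))


# A frame whose pseudo label marks prototypes 0 and 2 as active at once.
targets = [1.0, 0.0, 1.0]
before = bce_per_prototype([2.0, -1.0, 0.5], targets)
after = bce_per_prototype([2.0, -1.0, 3.0], targets)  # only logit 2 changed
```

Changing the logit of prototype 2 leaves the BCE terms for prototypes 0 and 1 untouched, whereas under a softmax the same change shifts the normalizer and therefore the loss assigned to every prototype.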

Keywords: Self-supervised learning

Adapted Multimodal Bert with Layer-Wise Fusion for Sentiment Analysis

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Odysseas S. Chlapanis, Georgios Paraskevopoulos, Alexandros Potamianos. Affiliations: National Technical University of Athens, Athens, Greece; Athena Research Center, Institute for Language and Speech Processing, Athens, Greece

Multimodal learning pipelines have benefited from the success of pretrained language models. However, this comes at the cost of increased model parameters. In this work, we propose Adapted Multimodal BERT (AMB), a BERT-based architecture for multimodal tasks that uses a combination of adapter modules and intermediate fusion layers. The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. During the adaptation process the pretrained language model parameters remain frozen, allowing for fast, parameter-efficient training. In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise. Our experiments on sentiment analysis with CMU-MOSEI show that AMB outperforms the current state-of-the-art across metrics, with a 3.4% relative reduction in the resulting error and a 2.1% relative improvement in 7-class classification accuracy.

Keywords: Training; Adaptation models; Sentiment analysis; Costs; Bit error rate; Signal processing; Transformers

An investigation of phrase break prediction in an End-to-End TTS system

arXiv, 2023

Author: Vadapalli, Anandaswarup. Affiliation: Speech Processing Lab, Language Technologies Research Center, International Institute of Information Technology, Telangana, Hyderabad 500032, India

Purpose: This work explores the use of external phrase break prediction models to enhance listener comprehension in End-to-End Text-to-Speech (TTS) systems. Methods: The effectiveness of these models is evaluated based on listener preferences in subjective tests. Two approaches are explored: (1) a bidirectional LSTM model with task-specific embeddings trained from scratch, and (2) a pre-trained BERT model fine-tuned on phrase break prediction. Both models are trained on a multi-speaker English corpus to predict phrase break locations in text. The End-to-End TTS system used comprises a Tacotron2 model with Dynamic Convolutional Attention for mel spectrogram prediction and a WaveRNN vocoder for waveform generation. Results: The listening tests show a clear preference for text synthesized with predicted phrase breaks over text synthesized without them. Conclusion: These results confirm the value of incorporating external phrasing models within End-to-End TTS to enhance listener comprehension. © 2023, CC BY.

Keywords: Subjective testing