Search Results - Inner Mongolia University Library

Self-supervised Prosody Learning at Phoneme-level with Momentum Contrast for Speech Synthesis

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zhao-Ci Liu; Ya-Jun Hu; Liping Chen; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China; iFLYTEK Research, iFLYTEK Co. Ltd., China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

This paper investigates leveraging large-scale speech data to enhance prosodic modeling in speech synthesis, and introduces a model named SP2MC which achieves self-supervised prosody learning at phoneme-level with momentum contrast. This model incorporates dual convolutional encoders for speech and linear predictive coding (LPC) residual inputs to generate phoneme-level embeddings, which are masked and processed by a Transformer model to produce prosody representations. Two supervision modules are employed to generate phoneme-level supervision from speech waveforms and residuals. Momentum contrast is utilized to manage negative sample selection in contrastive learning. Finally, the SP2MC representations are integrated into a FastSpeech2-based acoustic model for speech synthesis. Experimental results indicate that the naturalness of speech synthesized by the proposed method is significantly better than that of baselines.

Keywords: Convolutional codes; Speech coding; Convolution; Predictive models; Speech enhancement; Transformers; Acoustics; Data models; Linear predictive coding; Text to speech
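
The abstract of the SP2MC entry above says momentum contrast manages negative sample selection but gives no details. As a rough illustration only, the generic momentum-contrast idea can be sketched as a slowly updated key encoder plus a fixed-size queue of past keys that serve as negatives. The function name `momentum_update`, the coefficient `m`, the queue size, and the toy values are all assumptions for this sketch and are not taken from the paper.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter slowly toward the query encoder.

    Generic momentum-contrast rule: key <- m * key + (1 - m) * query.
    Parameters are plain floats here; a real model would use tensors.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# Fixed-size FIFO of past keys used as negatives (size is hypothetical).
queue = deque(maxlen=3)
for key in ["k1", "k2", "k3", "k4"]:
    queue.append(key)  # the oldest key is dropped once the queue is full

# Toy parameters: the key encoder moves 10% of the way to the query encoder.
updated = momentum_update([1.0, 0.0], [0.0, 1.0], m=0.9)
print(list(queue), updated)
```

A large `m` keeps the keys in the queue consistent with each other, since the encoder that produced them changes only slightly between steps.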

WIDER & CLOSER: Mixture of Short-channel Distillers for Zero-shot Cross-lingual Named Entity Recognition

2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Authors: Ma, Jun-Yu; Chen, Beiduo; Gu, Jia-Chen; Ling, Zhen-Hua; Guo, Wu; Liu, Quan; Chen, Zhigang; Liu, Cong. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; State Key Laboratory of Cognitive Intelligence, China; iFLYTEK Research, Hefei, China; Jilin Kexun Information Technology Co. Ltd, China

Zero-shot cross-lingual named entity recognition (NER) aims at transferring knowledge from annotated and rich-resource data in source languages to unlabeled and lean-resource data in target languages. Existing mainstream methods based on the teacher-student distillation framework ignore the rich and complementary information lying in the intermediate layers of pre-trained language models, and domain-invariant information is easily lost during transfer. In this study, a mixture of short-channel distillers (MSD) method is proposed to fully interact the rich hierarchical information in the teacher model and to transfer knowledge to the student model sufficiently and efficiently. Concretely, a multi-channel distillation framework is designed for sufficient information transfer by aggregating multiple distillers as a mixture. Besides, an unsupervised method adopting parallel domain adaptation is proposed to shorten the channels between the teacher and student models to preserve domain-invariant features. Experiments on four datasets across nine languages demonstrate that the proposed method achieves new state-of-the-art performance on zero-shot cross-lingual NER and shows great generalization and compatibility across languages and fields. © 2022 Association for Computational Linguistics.

Keywords: Distillation

END-TO-END LYRICS RECOGNITION WITH SELF-SUPERVISED LEARNING

arXiv, 2022

Authors: Zhang, Xiangyu; Li, Shuyue Stella; He, Zhanhong; Togneri, Roberto; Garcia, Leibny Paola. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States; Department of Computer Science, University of Western Australia, Australia

Lyrics recognition is an important task in music processing. Despite traditional algorithms such as the hybrid HMM-TDNN model achieving good performance, studies on applying end-to-end models and self-supervised learning (SSL) are limited. In this paper, we first establish an end-to-end baseline for lyrics recognition and then explore the performance of SSL models on lyrics recognition task. We evaluate a variety of upstream SSL models with different training methods (masked reconstruction, masked prediction, autoregressive reconstruction, and contrastive learning). Our end-to-end self-supervised models, evaluated on the DAMP music dataset, outperform the previous state-of-the-art (SOTA) system by 5.23% for the dev set and 2.4% for the test set even without a language model trained by a large corpus. Moreover, we investigate the effect of background music on the performance of self-supervised learning models and conclude that the SSL models cannot extract features efficiently in the presence of background music. Finally, we study the out-of-domain generalization ability of the SSL features considering that those models were not trained on music datasets. Copyright © 2022, The Authors. All rights reserved.

Keywords: Supervised learning

Deep CLAS: Deep Contextual Listen, Attend and Spell

arXiv, 2024

Authors: Wang, Mengzhi; Xiong, Shifu; Wan, Genshun; Chen, Hang; Gao, Jianqing; Dai, Lirong. Affiliations: iFLYTEK Research, iFLYTEK Co. Ltd., Hefei 230088, China; National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraint, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to deeply utilize contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of bias attention is also enriched to improve the accuracy of the bias attention score. To get fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with a conformer. Furthermore, the bias attention score is directly utilized to correct the model’s output probability distribution. Additionally, a prefix tree is employed to prevent interference from irrelevant information. Experiments use the public AISHELL-1 dataset. Compared to CLAS baselines, deep CLAS obtains a 65.78% relative recall and a 53.49% relative F1-score increase in the named entity recognition scene. © 2024, CC BY.

Keywords: Encoding (symbols)
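
The Deep CLAS abstract above mentions a prefix tree that keeps irrelevant context from interfering, without saying how it is built or queried. One plausible reading, sketched here purely as an assumption, is a character-level trie over the bias phrases that returns only the characters able to extend what has been decoded so far. The helper names `build_trie` and `next_chars`, the `<end>` marker, and the example phrases are hypothetical and do not come from the paper.

```python
def build_trie(phrases):
    """Build a character-level prefix tree from a list of bias phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["<end>"] = {}  # marks the end of a complete phrase
    return root

def next_chars(trie, prefix):
    """Return the characters that can follow `prefix` in any stored phrase."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return set()  # prefix matches no bias phrase
        node = node[ch]
    return {ch for ch in node if ch != "<end>"}

trie = build_trie(["hefei", "helsinki", "harbin"])
candidates = next_chars(trie, "he")  # only "f" and "l" remain relevant
print(sorted(candidates))
```

Restricting attention to `candidates` at each step is one way a decoder could ignore bias phrases that no longer match the partial output.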

Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis

arXiv, 2022

Authors: Peng, Yukun; Ling, Zhenhua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

This paper presents a method of decoupled pronunciation and prosody modeling to improve the performance of meta-learning-based multilingual speech synthesis. The baseline meta-learning synthesis method adopts a single text encoder with a parameter generator conditioned on language embeddings and a single decoder to predict mel-spectrograms for all languages. In contrast, our proposed method designs a two-stream model structure that contains two encoders and two decoders for pronunciation and prosody modeling, respectively, considering that the pronunciation knowledge and the prosody knowledge should be shared in different ways among languages. In our experiments, our proposed method effectively improved the intelligibility and naturalness of multilingual speech synthesis compared with the baseline meta-learning synthesis method. Copyright © 2022, The Authors. All rights reserved.

Keywords: Speech synthesis

NEURAL SPEECH PHASE PREDICTION BASED ON PARALLEL ESTIMATION ARCHITECTURE AND ANTI-WRAPPING LOSSES

arXiv, 2022

Authors: Ai, Yang; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. Experimental results show that our proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods in terms of both reconstructed speech quality and generation speed. Copyright © 2022, The Authors. All rights reserved.

Keywords: Group delay
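
The phase-prediction abstract above describes two ideas that a small numeric example may make concrete: computing a principal-value phase from real and imaginary parts, and an anti-wrapping function that stops two nearly identical phases on opposite sides of the ±π boundary from registering a large error. The abstract does not give either formula, so the sketch below assumes the two-argument arctangent for the first and, for the second, one common form, |x - 2π·round(x / 2π)|. The function names and test values are illustrative.

```python
import math

def phase_from_parts(real: float, imag: float) -> float:
    """Principal-value phase in (-pi, pi] from real and imaginary parts."""
    return math.atan2(imag, real)

def anti_wrap(x: float) -> float:
    """Assumed anti-wrapping form: distance from x to the nearest multiple of 2*pi."""
    return abs(x - 2.0 * math.pi * round(x / (2.0 * math.pi)))

# Two phases that are almost equal as angles but sit on opposite
# sides of the wrapping boundary at +/- pi.
predicted = phase_from_parts(-1.0, -1e-9)  # just above -pi
natural = phase_from_parts(-1.0, 1e-9)     # just below +pi

raw_error = abs(predicted - natural)             # close to 2*pi
wrapped_error = anti_wrap(predicted - natural)   # close to 0
print(raw_error, wrapped_error)
```

The raw difference is close to 2π even though the two angles nearly coincide, which is the error expansion the abstract refers to; the anti-wrapped difference is close to zero.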

Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xiao-Min Zeng; Yan Song; Zhu Zhuo; Yu Zhou; Yu-Hong Li; Hui Xue; Li-Rong Dai; Ian McLoughlin. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; Alibaba Group, China; ICT Cluster, Singapore Institute of Technology, Singapore

In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE, together with the original normal samples, is used for supervised contrastive representation learning in a multi-task framework. Besides cross-entropy loss between classes, contrastive loss is used to separate PAE output and original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism of the PAE model. Furthermore, GeCo combines generative and contrastive learning, from which we aim to yield more effective and informative representations compared to existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods.

Keywords: Representation learning; Self-supervised learning; Signal processing; Predictive models; Multitasking; Robustness; Acoustics

JOINT GENERATIVE-CONTRASTIVE REPRESENTATION LEARNING FOR ANOMALOUS SOUND DETECTION

arXiv, 2023

Authors: Zeng, Xiao-Min; Song, Yan; Zhuo, Zhu; Zhou, Yu; Li, Yu-Hong; Xue, Hui; Dai, Li-Rong; McLoughlin, Ian. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; ICT Cluster, Singapore Institute of Technology, Singapore; Alibaba Group, China

Keywords: Germanium alloys

Can Automated Speech Recognition Errors Provide Valuable Clues for Alzheimer’s Disease Detection?

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Yin-Long Liu; Rui Feng; Ye-Xin Lu; Jia-Xin Chen; Yang Ai; Jia-Hong Yuan; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China; Interdisciplinary Research Center for Linguistic Sciences, University of Science and Technology of China, Hefei, P. R. China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Recent advances in automatic speech recognition (ASR) technology have boosted the viability of fully automated Alzheimer’s disease (AD) detection via ASR transcripts. However, there is a lack of understanding of how ASR errors affect the performance of AD detection. This paper addresses that gap. First, we fine-tune 18 ASR models on three datasets from DementiaBank, generating 36 ASR transcripts on the ADReSS dataset (18 from original and 18 from fine-tuned ASR models). We then employ two AD detection methods using either ASR or manual transcripts: fine-tuning four large language models (LLMs) and fusing LLMs with pre-trained language models (PLMs). The results show that certain ASR transcripts outperform manual transcripts, suggesting that ASR errors provide valuable clues for AD detection. Finally, we conduct an interpretability study, including linguistic and SHapley Additive exPlanations (SHAP) analyses. This study reveals that greater word distribution differences between AD and healthy control (HC) groups in ASR transcripts may be linked to these valuable clues. This paper highlights the potential of ASR as a powerful tool for developing fully automated AD detection systems.

Keywords: Systematics; Additives; Large language models; Manuals; Signal processing; Linguistics; Acoustics; Alzheimer's disease; Speech processing; Automatic speech recognition