Search Results - Inner Mongolia University Library

arXiv, 2024

Authors: Du, Hui-Peng; Lu, Ye-Xin; Ai, Yang; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

This paper proposes a novel neural denoising vocoder that can generate clean speech waveforms from noisy mel-spectrograms. The proposed neural denoising vocoder consists of two components, i.e., a spectrum predictor and an enhancement module. The spectrum predictor first predicts the noisy amplitude and phase spectra from the input noisy mel-spectrogram, and subsequently the enhancement module recovers the clean amplitude and phase spectra from the noisy ones. Finally, clean speech waveforms are reconstructed through the inverse short-time Fourier transform (iSTFT). All operations are performed in the frame-level spectral domain, with the APNet vocoder and MP-SENet speech enhancement model used as the backbones for the two components, respectively. Experimental results demonstrate that our proposed neural denoising vocoder achieves state-of-the-art performance compared to existing neural vocoders on the VoiceBank+DEMAND dataset. Additionally, despite the lack of phase information and partial amplitude information in the input mel-spectrogram, the proposed neural denoising vocoder still achieves performance comparable to several advanced speech enhancement methods. Copyright © 2024, The Authors. All rights reserved.
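The final reconstruction step described in the abstract above (amplitude and phase spectra back to a waveform) can be illustrated on a single frame. This is only a minimal sketch under stated assumptions: it uses a naive DFT from the Python standard library rather than a real iSTFT with windowing and overlap-add, and the helper names (`dft`, `inverse_dft`, `frame_from_amplitude_and_phase`) are hypothetical and not taken from the paper or from APNet/MP-SENet.

```python
import cmath
import math


def dft(frame):
    """Naive forward DFT of one real-valued frame (O(N^2), illustration only)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Naive inverse DFT; keeps the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def frame_from_amplitude_and_phase(amplitude, phase):
    """Combine amplitude and phase into a complex spectrum, then invert it."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return inverse_dft(spectrum)


# Toy 16-sample frame standing in for one speech frame (assumed data).
frame = [
    math.sin(2 * math.pi * 3 * t / 16) + 0.5 * math.cos(2 * math.pi * 5 * t / 16)
    for t in range(16)
]
spectrum = dft(frame)
amplitude = [abs(x) for x in spectrum]        # what an amplitude predictor would output
phase = [cmath.phase(x) for x in spectrum]    # what a phase predictor would output

reconstructed = frame_from_amplitude_and_phase(amplitude, phase)
max_error = max(abs(a - b) for a, b in zip(frame, reconstructed))
```

When both spectra are exact, the frame is recovered up to floating-point error, which is why the quality of the predicted clean amplitude and phase determines the quality of the output waveform.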

Keywords: Spectrographs


Refining Self-Supervised Learnt Speech Representation using Brain Activations


arXiv, 2024

Authors: Li, Hengyu; Mei, Kangdi; Liu, Zhaoci; Ai, Yang; Chen, Liping; Zhang, Jie; Ling, Zhenhua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

It has been shown in the literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with human brain activations during speech perception, and that fine-tuning speech representation models on downstream tasks can further improve the similarity. However, it remains unclear whether this similarity can be used to optimize the pre-trained speech models. In this work, we therefore propose to use the brain activations recorded by fMRI to refine the often-used wav2vec2.0 model by aligning model representations toward human neural responses. Experimental results on SUPERB reveal that this operation is beneficial for several downstream tasks, e.g., speaker verification, automatic speech recognition, and intent classification. One can then consider the proposed method as a new alternative to improve self-supervised speech models. Copyright © 2024, The Authors. All rights reserved.

Keywords: Chemical activation


PITCH-AND-SPECTRUM-AWARE SINGING QUALITY ASSESSMENT WITH BIAS CORRECTION AND MODEL FUSION


arXiv, 2024

Authors: Shi, Yu-Fei; Ai, Yang; Lu, Ye-Xin; Du, Hui-Peng; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

We participated in track 2 of the VoiceMOS Challenge 2024, which aimed to predict the mean opinion score (MOS) of singing samples. Our submission secured first place among all participating teams, excluding the official baseline. In this paper, we further improve our submission and propose a novel Pitch-and-Spectrum-aware Singing Quality Assessment (PS-SQA) method. The PS-SQA is designed based on the self-supervised-learning (SSL) MOS predictor, incorporating singing pitch and spectral information, which are extracted using a pitch histogram and a non-quantized neural codec, respectively. Additionally, the PS-SQA introduces a bias correction strategy to address prediction biases caused by low-resource training samples, and employs model fusion technology to further enhance prediction accuracy. Experimental results show that our proposed PS-SQA significantly outperforms all competing systems across all system-level metrics, confirming its strong singing quality assessment capabilities. © 2024, CC BY.
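To make the "pitch histogram" feature mentioned above more concrete, here is a minimal sketch of one common way to build such a histogram. The abstract does not specify the binning, so everything here is an assumption for illustration: F0 values in Hz are mapped to MIDI semitone numbers, unvoiced frames are marked with 0, and the function name `pitch_histogram` and its range parameters are hypothetical.

```python
import math


def pitch_histogram(f0_hz, low_midi=36, high_midi=84):
    """Normalized histogram of voiced F0 values over semitone (MIDI) bins.

    Assumption: unvoiced frames are encoded as F0 <= 0 and are skipped.
    """
    counts = [0] * (high_midi - low_midi + 1)
    for f0 in f0_hz:
        if f0 <= 0:
            continue
        # Standard Hz-to-MIDI conversion: A4 = 440 Hz = MIDI note 69.
        midi = round(69 + 12 * math.log2(f0 / 440.0))
        if low_midi <= midi <= high_midi:
            counts[midi - low_midi] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy F0 track (assumed data): two frames near A3, two near A4, two unvoiced.
f0_track = [0.0, 220.0, 220.0, 440.0, 0.0, 441.0]
hist = pitch_histogram(f0_track)
```

A fixed-length vector like `hist` can then be concatenated with or fed alongside other features in a MOS predictor; how PS-SQA actually does this is not stated in the abstract.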

Keywords: Self-supervised learning


SQ-Whisper: Speaker-Querying Based Whisper Model for Target-Speaker ASR


IEEE Transactions on Audio, Speech and Language Processing, 2024, vol. 33, pp. 175-185

Authors: Pengcheng Guo; Xuankai Chang; Hang Lv; Shinji Watanabe; Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Carnegie Mellon University, Pittsburgh, PA, USA

Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities to a wide range of downstream tasks. However, a limitation arises from their exclusive handling of single-speaker speech input, making them ineffective in recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we delve into the adaptation of speech foundation models to eliminate interfering speakers from overlapping speech and perform target-speaker automatic speech recognition (TS-ASR). Initially, we utilize the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose an innovative model termed Speaker-Querying Whisper (SQ-Whisper), which employs a set number of trainable queries to capture speaker prompts from overlapping speech based on target-speaker enrollment. These prompts serve to steer the model in extracting speaker-specific features and accurately recognizing target-speaker transcriptions. Experimental results demonstrate that our approach effectively adapts the pre-trained speech foundation model to TS-ASR. Compared with the robust TS-HuBERT model, the proposed SQ-Whisper significantly improves performance, yielding up to 15% and 10% relative reductions in word error rates (WERs) on the Libri2Mix and WSJ0-2Mix datasets, respectively. With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix Test set and 4.4% on the WSJ0-2Mix Test set. Furthermore, we evaluate our model on the real-world AMI meeting dataset, which shows consistent improvement over other adaptation methods.
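The results above are reported as word error rates (WERs) and relative WER reductions. As a minimal sketch of how these two quantities are usually computed, the code below uses word-level edit distance; the function names and the example sentences and scores are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Relative WER reduction of a system with respect to a baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical scores: going from 20% to 17% WER is a 15% relative reduction.
reduction = relative_reduction(0.20, 0.17)
```

Note that a relative reduction is measured against the baseline's WER, so it is a different number from the absolute difference in percentage points.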

Keywords: Adaptation models; Feature extraction; Data models; Transformers; Training; Speech processing; Vectors; Predictive models; Decoding; Target recognition


Effective Integration of Text Diffusion and Pre-Trained Language Models with Linguistic Easy-First Schedule


Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024

Authors: Ou, Yimin; Jian, Ping. Affiliations: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing Institute of Technology, Beijing, China

ISBN: (print) 9782493814104

Diffusion models have become a powerful generative modeling paradigm, achieving great success in continuous data patterns. However, the discrete nature of text data results in compatibility issues between continuous diffusion models (CDMs) and pre-trained language models (PLMs). That is, the performance of diffusion models even degrades when combined with PLMs. To alleviate this issue, we propose to utilize a pre-trained decoder to convert the denoised embedding vectors into natural language instead of using the widely used rounding operation. In this way, CDMs can be more effectively combined with PLMs. Additionally, existing noise schedules in text diffusion models do not take into account the linguistic differences among tokens, which violates the easy-first policy for text generation. We therefore propose a linguistic easy-first schedule that incorporates a measure of word importance, conforming to easy-first-generation linguistic features and bringing about improved generation quality. Experimental results on the E2E dataset and five controllable tasks show that our approach can combine the merits of CDMs and PLMs, significantly outperforming other diffusion-based models. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

Keywords: Diffusion


Improving Implicit Discourse Relation Recognition with Semantics Confrontation


Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024

Authors: Cai, Mingyang; Yang, Zhen; Jian, Ping. Affiliations: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing Institute of Technology, Beijing, China

ISBN: (print) 9782493814104

Implicit Discourse Relation Recognition (IDRR), which infers discourse logical relations without explicit connectives, is one of the most challenging tasks in natural language processing (NLP). Recently, pre-trained language models (PLMs) have yielded impressive results across numerous NLP tasks, but their performance remains unsatisfactory in IDRR. We argue that prior studies have not fully harnessed the potential of PLMs, thereby resulting in a mixture of logical semantics, which determine the logical relations between discourse arguments, and general semantics, which encapsulate the non-logical contextual aspects (detailed in Sec. 1). Such a mixture would inevitably compromise the logical reasoning ability of PLMs. Therefore, we propose a novel method that trains the PLMs through two semantics enhancers to implicitly differentiate logical and general semantics, ultimately achieving logical semantics enhancement. Due to the characteristics of PLMs in word representation learning, these two semantics enhancers will inherently confront each other, facilitating an augmentation of logical semantics by disentangling them from general semantics. The experimental results on the PDTB 2.0 dataset show that the confrontation approach exceeds our baseline by 3.81% F1 score, and the effectiveness of the semantics confrontation method is validated by comprehensive ablation experiments. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

Keywords: Semantics


Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder


IEEE International Conference on Multimedia and Expo Workshops (ICMEW)

Authors: He Wang; Pengcheng Guo; Xucheng Wan; Huan Zhou; Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; IT Innovation and Research Center, Huawei Technologies

ISBN: (digital) 9798350379815

ISBN: (print) 9798350379822

Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches only use a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multiple encoders. Specifically, we first introduce a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face and propose an Enhanced ResNet3D visual front-end (VFE) to extract lip features at different scales. For the multi-encoder, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance and fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Finally, our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.

Keywords: Visualization; Error analysis; Text recognition; Lips; System performance; Perturbation methods; Speech recognition


PQLM - Multilingual Decentralized Portable Quantum Language Model


48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Li, Shuyue Stella; Zhang, Xiangyu; Zhou, Shu; Shu, Hongchao; Liang, Ruixing; Liu, Hexin; Garcia, Leibny Paola. Affiliations: Hong Kong University of Science and Technology, Department of Physics, Hong Kong; Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore; Johns Hopkins University, Center for Language and Speech Processing, United States; Johns Hopkins University, Human Language Technology Center of Excellence, United States

ISBN: (print) 9781728163277

With careful manipulation, malicious agents can reverse engineer private information encoded in pre-trained language models. Security concerns motivate the development of quantum pre-training. In this work, we propose a highly portable quantum language model (PQLM) that can easily transmit information to downstream tasks on classical machines. The framework consists of a cloud PQLM built with random Variational Quantum Classifiers (VQC) and local models for downstream applications. We demonstrate the ad hoc portability of the quantum model by extracting only the word embeddings and effectively applying them to downstream tasks on classical machines. Our PQLM exhibits comparable performance to its classical counterpart on both intrinsic evaluation (loss, perplexity) and extrinsic evaluation (multilingual sentiment analysis accuracy) metrics. We also perform ablation studies on the factors affecting PQLM performance to analyze model stability. Our work establishes a theoretical foundation for a portable quantum pre-trained language model that could be trained on private data and made available for public use with privacy protection guarantees. © 2023 IEEE.

Keywords: Federated Learning; Language Modeling; Model Portability; Quantum Machine Learning


Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification


IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Authors: Li Zhang; Jiyao Liu; Lei Xie. Affiliation: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University (NPU), Xi'an, China

ISBN: (digital) 9798350386226

ISBN: (print) 9798350386233

The scarcity of speaker-annotated far-field speech presents a significant challenge in developing high-performance far-field speaker verification (SV) systems. While data augmentation using large-scale near-field speech has been a common strategy to address this limitation, the mismatch in acoustic environments between near-field and far-field speech significantly hinders the improvement of far-field SV effectiveness. In this paper, we propose an adaptive speech augmentation approach leveraging NaturalSpeech3, a pre-trained foundation text-to-speech (TTS) model, to convert near-field speech into far-field speech by incorporating far-field acoustic ambient noise for data augmentation. Specifically, we utilize FACodec from NaturalSpeech3 to decompose the speech waveform into distinct embedding subspaces (content, prosody, speaker, and residual, i.e., acoustic details) and reconstruct the speech waveform from these disentangled representations. In our method, the prosody, content, and residual embeddings of far-field speech are combined with speaker embeddings from near-field speech to generate augmented pseudo far-field speech that maintains the speaker identity from the out-domain near-field speech while preserving the acoustic environment of the in-domain far-field speech. This approach not only serves as an effective strategy for augmenting training data for far-field speaker verification but also extends to cross-data augmentation for enrollment and test speech in evaluation trials. In augmentation of enrollment and test utterances, the method mitigates performance degradation caused by discrepancies in text content or environmental noise between enrollment and test data. This data augmentation method, which preserves the acoustic environment of the in-domain far-field data, qualifies as an adaptive augmentation method. Experimental results on FFSVC demonstrate that the adaptive data augmentation method significantly outperforms traditional approaches, such a

Keywords: Degradation; Adaptive systems; Working environment noise; Noise; Training data; Data augmentation; Acoustics; Data models; Text to speech; Reverberation


Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks


arXiv, 2024

Authors: Ai, Yang; Ling, Zhen-Hua. Affiliation: The National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms the iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks. Copyright © 2024, The Authors. All rights reserved.
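The two ideas in the abstract above, a phase calculation formula that keeps predictions in the principal value interval and an anti-wrapping function with parity, periodicity and monotonicity, can be sketched in a few lines. This is a hedged illustration only: the abstract does not give the exact formulas, so the two-argument arctangent and the function `anti_wrapping` below are assumptions chosen because they have the stated properties, and all names are hypothetical.

```python
import math

TWO_PI = 2 * math.pi


def phase_from_pseudo_parts(real, imag):
    """Map two unconstrained values (standing in for the outputs of the two
    parallel layers) to a phase in the principal interval via atan2."""
    return math.atan2(imag, real)


def anti_wrapping(x):
    """Distance from x to the nearest multiple of 2*pi.

    Even in x, periodic with period 2*pi, and monotonically increasing on [0, pi].
    """
    return abs(x - TWO_PI * round(x / TWO_PI))


def instantaneous_phase_loss(predicted, natural):
    """Mean anti-wrapped error between predicted and natural wrapped phases."""
    return sum(anti_wrapping(p - n) for p, n in zip(predicted, natural)) / len(predicted)


# Two phases on either side of the wrap point are close on the unit circle,
# although their raw difference is about 6.2 radians.
raw_difference = 3.1 - (-3.1)
wrapped_error = anti_wrapping(raw_difference)   # about 0.083 radians
loss = instantaneous_phase_loss([3.1, 0.5], [-3.1, 0.4])
phase = phase_from_pseudo_parts(-1.0, 0.0)      # pi, the edge of the principal interval
```

Without the anti-wrapping step, a loss on the raw difference would heavily penalize predictions that are nearly correct but fall on the other side of the wrap point, which is the error expansion issue the abstract refers to.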

Keywords: Group delay

