Search Results - Inner Mongolia University Library

International Symposium on Chinese Spoken Language Processing

Authors: Rui Feng, Yin-Long Liu, Zhen-Hua Ling, Jia-Hong Yuan. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China; Interdisciplinary Research Center for Linguistic Sciences, University of Science and Technology of China, Hefei, P. R. China

ISBN: (digital) 9798331516826

ISBN: (print) 9798331516833

Speech fundamental frequency (F0) extraction is one of the most important tasks in speech signal processing. This paper explores the feasibility of using deep learning for speech fundamental frequency extraction. Our approach, Wav2f0, combines the Wav2vec 2.0 model with fully connected and LSTM layers, leveraging the pretrained representations learned by Wav2vec 2.0. We conduct training and evaluation on a Vietnamese tone production corpus, which contains parallel recordings of electroglottograph (EGG) and microphone signals. Wav2f0 outperforms Praat in pitch extraction accuracy on the corpus, especially in scenarios where Praat fails to estimate pitch. The Gross Pitch Error (GPE) of Wav2f0 is 7.1%, representing a more than 50% error reduction compared to Praat's 15.5%.
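The "more than 50% error reduction" in the abstract above can be checked from its two GPE figures. The sketch below is illustrative only and is not the authors' evaluation code. The abstract does not define GPE, so the function assumes the commonly used convention: the share of frames voiced in both tracks whose estimate deviates from the reference by more than 20%. The function names, the 20% tolerance, the encoding of unvoiced frames as 0.0, and the toy F0 values are all assumptions.

```python
from typing import Sequence


def gross_pitch_error(reference_f0: Sequence[float],
                      estimated_f0: Sequence[float],
                      tolerance: float = 0.2) -> float:
    """Share of frames voiced in both tracks whose estimate is off by more than `tolerance`.

    Assumption of this sketch: unvoiced frames are encoded as 0.0.
    """
    if len(reference_f0) != len(estimated_f0):
        raise ValueError("F0 tracks must have the same number of frames")
    # Keep only frames that both the reference and the estimate mark as voiced.
    voiced = [(r, e) for r, e in zip(reference_f0, estimated_f0) if r > 0 and e > 0]
    if not voiced:
        raise ValueError("no frames are voiced in both tracks")
    gross = sum(1 for r, e in voiced if abs(e - r) / r > tolerance)
    return gross / len(voiced)


def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Toy F0 tracks in Hz (made-up values, not taken from the corpus).
toy_reference = [100.0, 200.0, 0.0, 150.0]
toy_estimate = [105.0, 260.0, 120.0, 150.0]
toy_gpe = gross_pitch_error(toy_reference, toy_estimate)

# GPE figures reported in the abstract, in percent.
praat_gpe = 15.5
wav2f0_gpe = 7.1
reduction = relative_error_reduction(praat_gpe, wav2f0_gpe)
print(f"{reduction:.1%}")  # prints 54.2%
```

With the reported figures, (15.5 - 7.1) / 15.5 is about 0.542, which is consistent with the claim of a reduction above 50%.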

Keywords: Deep learning; Training; Measurement; Production; Feature extraction; Recording; Multilingual; Speech processing; Long short term memory; Microphones


WIDER & CLOSER: Mixture of Short-channel Distillers for Zero-shot Cross-lingual Named Entity Recognition


2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Authors: Ma, Jun-Yu; Chen, Beiduo; Gu, Jia-Chen; Ling, Zhen-Hua; Guo, Wu; Liu, Quan; Chen, Zhigang; Liu, Cong. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; State Key Laboratory of Cognitive Intelligence, China; iFLYTEK Research, Hefei, China; Jilin Kexun Information Technology Co. Ltd, China

Zero-shot cross-lingual named entity recognition (NER) aims at transferring knowledge from annotated and rich-resource data in source languages to unlabeled and lean-resource data in target languages. Existing mainstream methods based on the teacher-student distillation framework ignore the rich and complementary information lying in the intermediate layers of pre-trained language models, and domain-invariant information is easily lost during transfer. In this study, a mixture of short-channel distillers (MSD) method is proposed to fully interact the rich hierarchical information in the teacher model and to transfer knowledge to the student model sufficiently and efficiently. Concretely, a multi-channel distillation framework is designed for sufficient information transfer by aggregating multiple distillers as a mixture. Besides, an unsupervised method adopting parallel domain adaptation is proposed to shorten the channels between the teacher and student models to preserve domain-invariant features. Experiments on four datasets across nine languages demonstrate that the proposed method achieves new state-of-the-art performance on zero-shot cross-lingual NER and shows great generalization and compatibility across languages and fields. © 2022 Association for Computational Linguistics.

Keywords: Distillation


Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features


International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Ziqian Ning, Qicong Xie, Pengcheng Zhu, Zhichao Wang, Liumeng Xue, Jixun Yao, Lei Xie, Mengxiao Bi. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Fuxi AI Lab, NetEase Inc., Hangzhou, China

Voice conversion for highly expressive speech is challenging. Current approaches struggle with the balance between speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that leverages advantages from both the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder to form a content extractor to learn linguistic and para-linguistic features respectively, where BNFs come from a robust pre-trained ASR model and the perturbed wave becomes speaker-irrelevant after signal perturbation. We further fuse the linguistic and para-linguistic features through an attention mechanism, where speaker-dependent prosody features are used as the attention query, which results from a prosody encoder with target speaker embedding and normalized pitch and energy of source speech as input. Finally, the decoder consumes the integrated features and the speaker-dependent prosody feature to generate the converted speech. Experiments show that Expressive-VC is superior to several popular systems, achieving both high expressiveness captured from the source speech and high speaker similarity with the target speaker; meanwhile intelligibility is well maintained.

Keywords: Fuses; Perturbation methods; Linguistics; Signal processing; Feature extraction; Acoustics; Decoding


HiGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS


IEEE Workshop on Automatic Speech Recognition and Understanding

Authors: Dake Guo, Xinfa Zhu, Liumeng Xue, Tao Li, Yuanjun Lv, Yuepeng Jiang, Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China

Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with highly dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture the prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech. Speech samples: https://***/HiGNN-TTS/


Entity Linking in the Job Market Domain


arXiv, 2024

Authors: Zhang, Mike; van der Goot, Rob; Plank, Barbara. Affiliations: Department of Computer Science, IT University of Copenhagen, Denmark; Pioneer Centre for Artificial Intelligence, Copenhagen, Denmark; MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich, Germany

In natural language processing, entity linking (EL) has centered around Wikipedia, but remains underexplored for the job market domain. Disambiguating skill mentions can help us gain insight into labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, the bi-encoder BLINK (Wu et al., 2020) and the autoregressive model GENRE (Cao et al., 2021), on a synthetically generated mention–skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@k). Copyright © 2024, The Authors. All rights reserved.
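The loose metric named in the abstract above, accuracy@k, counts a mention as correctly linked when the gold label appears anywhere in the model's top-k ranked candidates. The following minimal sketch illustrates that general metric and is not the authors' evaluation script. Treating k = 1 as the strict setting is an assumption, and the labels such as `skill_a` are hypothetical placeholders rather than real ESCO identifiers.

```python
from typing import Sequence


def accuracy_at_k(ranked_predictions: Sequence[Sequence[str]],
                  gold_labels: Sequence[str],
                  k: int) -> float:
    """Fraction of mentions whose gold label is among the top-k ranked candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need exactly one ranked candidate list per gold label")
    if not gold_labels:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(1 for candidates, gold in zip(ranked_predictions, gold_labels)
               if gold in candidates[:k])
    return hits / len(gold_labels)


# Hypothetical ranked candidates for three skill mentions, best candidate first.
predictions = [
    ["skill_a", "skill_b", "skill_c"],  # gold is ranked first
    ["skill_d", "skill_e", "skill_f"],  # gold is ranked second
    ["skill_g", "skill_h", "skill_i"],  # gold is not among the candidates
]
gold = ["skill_a", "skill_e", "skill_z"]

strict_accuracy = accuracy_at_k(predictions, gold, k=1)  # 1 of 3 mentions
loose_accuracy = accuracy_at_k(predictions, gold, k=3)   # 2 of 3 mentions
```

A model that often ranks the right skill second or third scores low at k = 1 but high at larger k, which is one way the two evaluations can rank the same pair of models differently.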

Keywords: Taxonomies


Distance-Based Weight Transfer for Fine-Tuning From Near-Field to Far-Field Speaker Verification


International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Li Zhang, Qing Wang, Hongji Wang, Yue Li, Wei Rao, Yannan Wang, Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University (NPU), Xi’an, China; Tencent Ethereal Audio Lab, Tencent Corporation, Shenzhen, China

The scarcity of labeled far-field speech is a constraint for training superior far-field speaker verification systems. In general, fine-tuning a model pre-trained on large-scale near-field speech with a small amount of far-field speech substantially outperforms training from scratch. However, vanilla fine-tuning suffers from two limitations: catastrophic forgetting and overfitting. In this paper, we propose a weight transfer regularization (WTR) loss to constrain the distance of the weights between the pre-trained model and the fine-tuned model. With the WTR loss, the fine-tuning process takes advantage of the previously acquired discriminative ability from the large-scale near-field speech and avoids catastrophic forgetting. Meanwhile, the analysis based on the PAC-Bayes generalization theory indicates that the WTR loss gives the fine-tuned model a tighter generalization bound, thus mitigating the overfitting problem. Moreover, three different norm distances for weight transfer are explored: L1-norm distance, L2-norm distance, and Max-norm distance. We evaluate the effectiveness of the WTR loss on VoxCeleb (pre-trained) and FFSVC (fine-tuned) datasets. Experimental results show that the distance-based weight transfer fine-tuning strategy significantly outperforms vanilla fine-tuning and other competitive domain adaptation methods.
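The three distances named in the abstract above can be sketched on flattened weight vectors. This is a hedged illustration of the general idea, not the paper's formulation: the abstract does not say whether distances are computed per layer, whether the L2 term is squared, or how the penalty is weighted. The function names, the coefficient `alpha`, and the toy weights are therefore assumptions.

```python
import math
from typing import Sequence


def weight_distance(pretrained: Sequence[float],
                    finetuned: Sequence[float],
                    norm: str = "l2") -> float:
    """Distance between two flattened weight vectors under the chosen norm."""
    if len(pretrained) != len(finetuned) or not pretrained:
        raise ValueError("weight vectors must be non-empty and of equal length")
    diffs = [abs(p - f) for p, f in zip(pretrained, finetuned)]
    if norm == "l1":
        return sum(diffs)                            # sum of absolute differences
    if norm == "l2":
        return math.sqrt(sum(d * d for d in diffs))  # Euclidean distance
    if norm == "max":
        return max(diffs)                            # largest single difference
    raise ValueError(f"unknown norm: {norm!r}")


def regularized_loss(task_loss: float,
                     pretrained: Sequence[float],
                     finetuned: Sequence[float],
                     alpha: float = 0.1,
                     norm: str = "l2") -> float:
    """Task loss plus a weighted penalty for drifting away from the pre-trained weights.

    `alpha` is a hypothetical trade-off coefficient, not a value from the paper.
    """
    return task_loss + alpha * weight_distance(pretrained, finetuned, norm)


# Toy weights (made-up values).
w_pretrained = [1.0, -2.0, 0.5]
w_finetuned = [1.5, -1.0, 0.5]

l1_distance = weight_distance(w_pretrained, w_finetuned, "l1")    # 1.5
l2_distance = weight_distance(w_pretrained, w_finetuned, "l2")    # sqrt(1.25)
max_distance = weight_distance(w_pretrained, w_finetuned, "max")  # 1.0
total_loss = regularized_loss(2.0, w_pretrained, w_finetuned, alpha=0.1, norm="l1")
```

The penalty grows as the fine-tuned weights move away from the pre-trained ones, which is how a term of this kind discourages forgetting what was learned from near-field speech.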

Keywords: Training; Analytical models; Adaptation models; Signal processing; Acoustics; Speech processing; Tuning


Few-Shot Keyword Spotting from Mixed Speech


arXiv, 2024

Authors: Yuan, Junming; Shi, Ying; Li, LanTian; Wang, Dong; Hamdulla, Askar. Affiliations: School of Computer Science and Technology, Xinjiang University, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China; Center for Speech and Language Technologies, BNRist, Tsinghua University, China; School of Computer Science and Technology, Harbin Institute of Technology, China

Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited training samples. A commonly used approach is the pre-training and fine-tuning framework. While effective in clean conditions, this approach struggles with mixed keyword spotting, that is, simultaneously detecting multiple keywords blended in an utterance, which is crucial in real-world applications. Previous research has proposed a Mix-Training (MT) approach to solve the problem; however, it has never been tested in the few-shot scenario. In this paper, we investigate the possibility of using MT and other relevant methods to solve the two practical challenges together: few-shot and mixed speech. Experiments conducted on the LibriSpeech and Google Speech Command corpora demonstrate that MT is highly effective on this task when employed in either the pre-training phase or the fine-tuning phase. Moreover, combining SSL-based large-scale pre-training (HuBERT) and MT fine-tuning yields very strong results in all the test conditions. © 2024, CC BY-NC-SA.


EDSep: An Effective Diffusion-Based Method for Speech Source Separation


arXiv, 2025

Authors: Dong, Jinwei; Wang, Xinsheng; Mao, Qirong. Affiliations: School of Computer Science and Communication Engineering, Jiangsu University, China; Jiangsu Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agriculture Applications, China; Provincial Key Laboratory of Computational Intelligence and New Technologies in Low-Altitude Digital Agriculture, Zhenjiang, China; Audio Speech and Language Processing Group, School of Computer Science, Northwestern Polytechnical University, Xi’an, China

Generative models have attracted considerable attention for speech separation tasks, and among these, diffusion-based methods are being explored. Despite the notable success of diffusion techniques in generation tasks, their adaptation to speech separation has encountered challenges, notably slow convergence and suboptimal separation outcomes. To address these issues and enhance the efficacy of diffusion-based speech separation, we introduce EDSep, a novel single-channel method grounded in score matching via stochastic differential equation (SDE). This method enhances generative modeling for speech source separation by optimizing training and sampling efficiency. Specifically, a novel denoiser function is proposed to approximate data distributions, which obtains ideal denoiser outputs. Additionally, a stochastic sampler is carefully designed to resolve the reverse SDE during the sampling process, gradually separating speech from mixtures. Extensive experiments on databases such as WSJ0-2mix, LRS2-2mix, and VoxCeleb2-2mix demonstrate our proposed method’s superior performance over existing diffusion and discriminative models, validating its efficacy. Copyright © 2025, The Authors. All rights reserved.

Keywords: Stochastic systems


A Semantics-Aware Normalizing Flow Model for Anomaly Detection


IEEE International Conference on Multimedia and Expo (ICME)

Authors: Wei Ma, Shiyong Lan, Weikang Huang, Wenwu Wang, Hongyu Yang, Yitong Ma, Yongjie Ma. Affiliations: College of Computer Science, Sichuan University, China; National Key Laboratory of Fundamental Science on Synthetic Vision, China; Center for Vision Speech and Signal Processing, University of Surrey, UK

Anomaly detection in computer vision aims to detect outliers from input image data. Examples include texture defect detection and semantic discrepancy detection. However, existing methods are limited in detecting both types of anomalies, especially the latter. In this work, we propose a novel semantics-aware normalizing flow model to address the above challenges. First, we employ the semantic features extracted from a backbone network as the initial input of the normalizing flow model, which learns the mapping from the normal data to a normal distribution according to semantic attributes, thus enhancing the discrimination of semantic anomaly detection. Second, we design a new feature fusion module in the normalizing flow model to integrate texture features and semantic features, which can substantially improve the fitting of the distribution function to the input data, thus achieving improved performance for the detection of both types of anomalies. Extensive experiments on five well-known datasets for semantic anomaly detection show that the proposed method outperforms the state-of-the-art baselines. The code will be available at https://***/SYLan2019/SANF-AD.


Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection


International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xiao-Min Zeng, Yan Song, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, Li-Rong Dai, Ian McLoughlin. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; Alibaba Group, China; ICT Cluster, Singapore Institute of Technology, Singapore

In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE, together with the original normal samples, is used for supervised contrastive representation learning in a multi-task framework. Besides the cross-entropy loss between classes, a contrastive loss is used to separate PAE output and original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism of the PAE model. Furthermore, GeCo combines generative and contrastive learning, aiming to yield more effective and informative representations than existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods.

Keywords: Representation learning; Self-supervised learning; Signal processing; Predictive models; Multitasking; Robustness; Acoustics

