Annotation tools are the starting point for creating natural language processing (NLP) datasets. There is a wide variety of tools available; setting up these tools is, however, a hindrance. We propose EEVEE, an annotatio...
Compared to other clinical screening techniques, speech-and-language-based automated Alzheimer’s disease (AD) detection methods are characterized by their non-invasiveness, cost-effectiveness, and convenience. Previo...
In this paper, we present a novel general-purpose audio representation learning method named Dual-Path Masked AutoEncoder (DPMAE) for the anomalous sound detection (ASD) task. Existing methods mainly focus on frame-level generative methods or clip-level discriminative methods, which generally ignore the local information where anomalies are usually found more easily. Moreover, they apply multiple systems to a single ASD task, which limits generalizability. To tackle this, our method extracts patch-level features through self-supervised representation learning, learning a unified audio representation that generalizes well and models the local information that is beneficial for detecting anomalies under domain shifts; it further optimizes the informativeness of clip-level representations during fine-tuning. Concretely, the input spectrograms are randomly split into two patch-level subsets, which are fed into DPMAE to predict each other. Meanwhile, the output of one path is also taken as the prediction target of the other path, providing regularization from a self-distillation perspective. In the fine-tuning stage, a linear classifier is applied to the features produced by the encoder to obtain a more compact representation of normal sound. Experiments on the DCASE 2022 Challenge Task 2 development dataset show the effectiveness of our method.
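A minimal PyTorch sketch of the dual-path idea described in this abstract, assuming a plain Transformer encoder/decoder over flattened spectrogram patches; the layer sizes, the mask-token decoding scheme, and the exact form of the self-distillation term are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathMAE(nn.Module):
    """Toy dual-path masked autoencoder over spectrogram patches."""
    def __init__(self, n_patches=64, patch_dim=256, d_model=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, patch_dim)

    def run_path(self, x, visible):
        # encode only the visible patch subset, then decode the full sequence
        b, n, _ = x.shape
        z = self.encoder(x[:, visible] + self.pos[:, visible])
        full = self.mask_token.expand(b, n, -1).clone()
        full[:, visible] = z
        return self.head(self.decoder(full + self.pos))

    def forward(self, patches):
        # randomly split the patch sequence into two disjoint subsets
        x = self.embed(patches)
        n = patches.size(1)
        perm = torch.randperm(n)
        sub_a, sub_b = perm[: n // 2], perm[n // 2:]
        rec_a = self.run_path(x, sub_a)   # path A sees subset A only
        rec_b = self.run_path(x, sub_b)   # path B sees subset B only
        # each path is supervised on the patches it did not see ...
        loss_rec = F.mse_loss(rec_a[:, sub_b], patches[:, sub_b]) + \
                   F.mse_loss(rec_b[:, sub_a], patches[:, sub_a])
        # ... and regularized towards the other path's (detached) output,
        # a self-distillation-style term
        loss_sd = F.mse_loss(rec_a, rec_b.detach()) + F.mse_loss(rec_b, rec_a.detach())
        return loss_rec + loss_sd
```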
Language models (LMs) have recently shown superior performances in various speech generation tasks, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information is advantageous for speech enhancement tasks. In light of this, we propose SELM, a novel speech enhancement paradigm that integrates discrete tokens and leverages language models. SELM comprises three stages: encoding, modeling, and decoding. We transform continuous waveform signals into discrete tokens using pre-trained self-supervised learning (SSL) models and a k-means tokenizer. Language models then capture comprehensive contextual information within these tokens. Finally, a de-tokenizer and HiFi-GAN restore them into enhanced speech. Experimental results demonstrate that SELM achieves comparable performance in objective metrics and superior subjective perception results. Our demos are available online.
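The three-stage encode-model-decode flow can be summarized with a short sketch. Here the SSL feature extractor, token language model, de-tokenizer, and HiFi-GAN-style vocoder are treated as opaque callables (hypothetical interfaces); only the nearest-centroid k-means tokenizer is spelled out, and none of this is SELM's released code.

```python
import torch

def encode_to_tokens(ssl_features: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Quantize frame-level SSL features (T, D) into discrete k-means token ids (T,)."""
    dists = torch.cdist(ssl_features, centroids)   # (T, K) distances to the K centroids
    return dists.argmin(dim=-1)                    # nearest-centroid token ids

def enhance(noisy_ssl_feats, centroids, token_lm, detokenizer, vocoder):
    noisy_tokens = encode_to_tokens(noisy_ssl_feats, centroids)  # 1) encoding
    clean_tokens = token_lm(noisy_tokens)                        # 2) LM predicts clean tokens
    clean_feats = detokenizer(clean_tokens)                      # 3) tokens -> continuous features
    return vocoder(clean_feats)                                  #    features -> enhanced waveform
```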
Large language models with a transformer-based encoder/decoder architecture, such as T5 (Raffel et al., 2023), have become standard platforms for supervised tasks. To bring these technologies to the clinical domain, re...
Background sound is an informative form of art that is helpful in providing a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research on VC, mainly focusing on clean voices, pays little attention to VC with background sound. The critical problems for preserving background sound in VC are the inevitable speech distortion introduced by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework via multitask learning which sequentially cascades a source separation (SS) module, a bottleneck feature extraction module and a VC module. Specifically, the source separation task explicitly considers critical phase information and limits the distortion caused by the imperfect separation process. The source separation task, the typical VC task and the unified task share a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving comparable quality and speaker similarity to the VC models trained with clean data.
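One possible joint-training step for such a cascade is sketched below, written for a self-reconstruction setting (target speaker equals source speaker); the module interfaces, loss weights, and the exact reconstruction targets are assumptions, not the paper's specification.

```python
import torch.nn.functional as F

def multitask_step(mix_wav, clean_voice, background, target_spk,
                   ss_model, bnf_extractor, vc_model, w=(1.0, 1.0, 1.0)):
    # SS task: separate voice and background from the noisy mixture
    est_voice, est_bg = ss_model(mix_wav)
    loss_ss = F.l1_loss(est_voice, clean_voice) + F.l1_loss(est_bg, background)

    # VC task: convert the separated voice via speaker-independent bottleneck features
    bnf = bnf_extractor(est_voice)
    conv_voice = vc_model(bnf, target_spk)
    loss_vc = F.l1_loss(conv_voice, clean_voice)   # self-reconstruction target (assumed)

    # Unified task: re-mixed converted voice should match the original mixture
    loss_unified = F.l1_loss(conv_voice + est_bg, mix_wav)  # illustrative only

    return w[0] * loss_ss + w[1] * loss_vc + w[2] * loss_unified
```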
Automatic speaker verification (ASV) faces domain shift caused by the mismatch of intrinsic and extrinsic factors, such as recording device and speaking style, in real-world applications, which leads to severe performance degradation. Since single-speaker multi-condition (SSMC) data is difficult to collect in practice, existing domain adaptation methods can hardly ensure the feature consistency of the same class but different domains. To this end, we propose a cross-domain data generation method to obtain a domain-invariant ASV system. Inspired by the voice conversion (VC) task, a StarGAN-based generative model first learns cross-domain mappings from SSMC data, and then generates missing domain data for all speakers, thus increasing the intra-class diversity of the training set. Considering the difference between the ASV and VC tasks, we renovate the corresponding training objectives and network structure to make the adaptation task-specific. Evaluations achieve a relative performance improvement of about 5-8% over the baseline in terms of minDCF and EER, outperforming the CNSRC winner’s system of the equivalent scale.
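The data-generation step might look roughly like the following sketch, where a StarGAN-style generator conditioned on a one-hot domain code converts an existing utterance of each speaker into every domain that speaker is missing; the generator interface and data layout are hypothetical.

```python
import torch

def augment_missing_domains(features_by_spk_dom, domains, generator):
    """features_by_spk_dom: dict mapping (speaker, domain) -> list of feature tensors."""
    augmented = dict(features_by_spk_dom)
    speakers = {spk for spk, _ in features_by_spk_dom}
    for spk in speakers:
        present = {d for s, d in features_by_spk_dom if s == spk}
        for missing in set(domains) - present:
            src_dom = next(iter(present))              # any domain the speaker already has
            dom_code = torch.eye(len(domains))[domains.index(missing)]
            for feat in features_by_spk_dom[(spk, src_dom)]:
                # convert this utterance into the missing domain
                fake = generator(feat.unsqueeze(0), dom_code.unsqueeze(0)).squeeze(0)
                augmented.setdefault((spk, missing), []).append(fake)
    return augmented
```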
Lip-to-speech (Lip2speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under zero-shot conditions, because extra speaker embeddings need to be extracted from natural reference speech and are unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2speech synthesis method, in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to improve the ability of face-based speaker embeddings (FSE) for voice control. Extensive experiments verify the effectiveness of the proposed method, whose synthetic utterances are more natural and better match the personality of the input video than those of the compared methods. To the best of our knowledge, this paper makes the first attempt at zero-shot personalized Lip2speech synthesis with a face image rather than reference audio to control voice characteristics.
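A compact sketch of the zero-shot inference path suggested by this abstract, where a face-based speaker embedding replaces the reference-speech embedding; all module names (face_encoder, content_encoder, decoder, vocoder) are placeholders rather than the paper's components.

```python
import torch

@torch.no_grad()
def zero_shot_lip2speech(face_image, lip_frames,
                         face_encoder, content_encoder, decoder, vocoder):
    spk_emb = face_encoder(face_image)      # face-based speaker embedding (FSE)
    content = content_encoder(lip_frames)   # linguistic content from the silent video
    mel = decoder(content, spk_emb)         # speaker-conditioned acoustic features
    return vocoder(mel)                     # waveform synthesis
```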
This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. Experimental results show that our proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods, in terms of both reconstructed speech quality and generation speed.
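One possible PyTorch reading of the parallel estimation architecture and the anti-wrapping losses, assuming atan2 as the phase-calculation formula and finite differences along frequency and time for the group-delay and instantaneous-angular-frequency terms; the layer shapes are illustrative.

```python
import math
import torch
import torch.nn as nn

def anti_wrap(x):
    # map a phase error onto its principal value: |x - 2*pi*round(x / (2*pi))|
    return torch.abs(x - 2 * math.pi * torch.round(x / (2 * math.pi)))

class ParallelPhaseHead(nn.Module):
    """Two parallel linear conv layers followed by a phase-calculation formula."""
    def __init__(self, channels):
        super().__init__()
        self.real = nn.Conv1d(channels, channels, kernel_size=1)
        self.imag = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, h):                  # h: (B, F, T) hidden features
        r, i = self.real(h), self.imag(h)
        return torch.atan2(i, r)           # wrapped phase, restricted to (-pi, pi]

def anti_wrapping_losses(pred, target):    # wrapped phase spectra, shape (B, F, T)
    ip = anti_wrap(pred - target).mean()                                          # instantaneous phase error
    gd = anti_wrap(torch.diff(pred, dim=1) - torch.diff(target, dim=1)).mean()    # group delay error (freq. difference)
    iaf = anti_wrap(torch.diff(pred, dim=2) - torch.diff(target, dim=2)).mean()   # inst. angular frequency error (time difference)
    return ip + gd + iaf
```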