Search Results - Inner Mongolia University Library

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zihan Zhang, Shimin Zhang, Mingshuai Liu, Yanhong Leng, Zhe Han, Li Chen, Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; ByteDance, China

This paper describes a Two-step Band-split Neural Network (TBNN) approach for full-band acoustic echo cancellation. Specifically, after linear filtering, we split the full-band signal into wide-band (16 kHz) and high-band (16-48 kHz) components for residual echo removal with lower modeling difficulty. The wide-band signal is processed by an updated gated convolutional recurrent network (GCRN) with a U² encoder, while the high-band signal is processed by a lower-complexity high-band post-filter network. Our approach, submitted to the ICASSP 2023 AEC Challenge, achieved an overall mean opinion score (MOS) of 4.344 and a word accuracy (WAcc) ratio of 0.795, placing 2nd (tied) in the non-personalized track.

Keywords: Maximum likelihood detection; Echo cancellers; Convolution; Neural networks; Logic gates; Acoustics; Complexity theory

Delivering Speaking Style in Low-Resource Voice Conversion with Multi-Factor Constraints

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zhichao Wang, Xinsheng Wang, Lei Xie, Yuanzhe Chen, Qiao Tian, Yuping Wang. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Speech Audio & Music Intelligence (SAMI), ByteDance

Conveying the linguistic content and maintaining the source speech’s speaking style, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource situation, where only limited utterances from the target speaker are accessible, existing VC methods struggle to meet this requirement and to capture the target speaker’s timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, a speaker timbre constraint generated by a clustering method is newly proposed to guide target speaker timbre learning in different stages. Meanwhile, to prevent over-fitting to the target speaker’s limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. In addition, a simulation mode is introduced to simulate the inference process and alleviate the mismatch between training and inference. Extensive experiments performed on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC.

Keywords: Training; Clustering methods; Linguistics; Signal processing; Speech; Data models; Acoustics

VATEX2020: PLSTM framework for video captioning

2022 International Conference on Machine Learning and Data Engineering, ICMLDE 2022

Authors: Singh, Alok; Singh, Salam Michael; Meetei, Loitongbam Sanayai; Das, Ringki; Singh, Thoudam Doren; Bandyopadhyay, Sivaji. Affiliations: Department of Computer Science and Engineering, National Institute of Technology, Assam, Silchar, India; Center for Natural Language Processing, National Institute of Technology, Assam, Silchar, India

Captioning a video involves condensing the video's information into text, which can be useful in video sentiment analysis, video-guided machine translation (VMT), visual question-answering and humanitarian aid. This paper discusses the architecture of the pLSTM framework employed for the VATEX-2020 video captioning challenge. A sequential method is used in which a 3D convolutional neural network (C3D), pretrained on the Sports-1M dataset, encodes the visual features. In the decoding phase, the input captions and visual features are processed separately in Long Short-Term Memory (LSTM) networks. An element-wise product is performed on the outputs of both LSTMs to get the final output. On the publicly available and private test data sets, our model achieves BLEU-4 scores of 0.20 and 0.22, respectively. © 2023 The Authors. Published by Elsevier B.V.

Keywords: Sentiment analysis

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Microsoft Corporation

Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from data scarcity because corpora pairing source-language speech with target-language speech are very rare. To address this issue, we propose Speech2S, a model that is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on the Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S improves by about 5 BLEU points over encoder-only pre-training models, and achieves competitive or even better performance than existing state-of-the-art models.

Keywords: Analytical models; Speech enhancement; Signal processing; Data models; Acoustics; Data mining; Task analysis

Clever Hans Effect Found in Automatic Detection of Alzheimer’s Disease through Speech

arXiv, 2024

Authors: Liu, Yin-Long; Feng, Rui; Yuan, Jiahong; Ling, Zhen-Hua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; Interdisciplinary Research Center for Linguistic Sciences, University of Science and Technology of China, Hefei, China

We uncover an underlying bias in the audio recordings produced from the picture description task of the Pitt corpus, the largest publicly accessible database for Alzheimer’s Disease (AD) detection research. Even when using only the silent segments of these audio recordings, we achieve nearly 100% accuracy in AD detection. However, applying the same methods to other datasets and to preprocessed Pitt recordings yields typical levels (approximately 80%) of AD detection accuracy. These results demonstrate a Clever Hans effect in AD detection on the Pitt corpus. Our findings emphasize the importance of remaining vigilant about inherent biases in datasets used for training deep learning models, and highlight the need for a better understanding of the models’ performance. Copyright © 2024, The Authors. All rights reserved.

Keywords: Audio recordings

Distinguishable Speaker Anonymization Based on Formant and Fundamental Frequency Scaling

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Jixun Yao, Qing Wang, Yi Lei, Pengcheng Guo, Lei Xie, Namin Wang, Jie Liu. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Huawei Cloud

Speech data on the Internet are proliferating exponentially because of the emergence of social media, and the sharing of such personal data raises obvious security and privacy concerns. One way to mitigate these concerns is to conceal speaker identities before sharing speech data, also referred to as speaker anonymization. In our previous work, we developed an automatic speaker verification (ASV)-model-free anonymization framework to protect speaker privacy while preserving speech intelligibility. Although the framework ranked first in the VoicePrivacy 2022 Challenge, the anonymization was imperfect, since the speaker distinguishability of the anonymized speech deteriorated. To address this issue, in this paper, we directly model the formant distribution and fundamental frequency (F0) to represent speaker identity and anonymize the source speech by uniformly scaling the formants and F0. Directly scaling the formants and F0 prevents the degradation in speaker distinguishability that is caused by introducing other speakers. The experimental results demonstrate that our proposed framework improves speaker distinguishability and significantly outperforms our previous framework in voice distinctiveness. Furthermore, our proposed method can trade off privacy against utility by using different scaling factors.

Keywords: Data privacy; Privacy; Social networking (online); Signal processing; Linguistics; Information filtering; Internet

An Exploration of Task-Decoupling on Two-Stage Neural Post Filter for Real-Time Personalized Acoustic Echo Cancellation

IEEE Workshop on Automatic Speech Recognition and Understanding

Authors: Zihan Zhang, Jiayao Sun, Xianjun Xia, Ziqian Wang, Xiaopeng Yan, Yijian Xiao, Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; ByteDance, China

Deep learning based techniques have been popularly adopted in acoustic echo cancellation (AEC). Utilization of speaker representation has extended the frontier of AEC, thus attracting many researchers’ interest in personalized acoustic echo cancellation (PAEC). Meanwhile, task-decoupling strategies are widely adopted in speech enhancement. To further explore the task-decoupling approach, we propose to use a two-stage task-decoupling post-filter (TDPF) in PAEC. Furthermore, a multi-scale local-global speaker representation is applied to improve speaker extraction in PAEC. Experimental results indicate that the task-decoupling model can yield better performance than a single joint network. The optimal approach is to decouple the echo cancellation from noise and interference speech suppression. Based on the task-decoupling sequence, optimal training strategies for the two-stage model are explored afterwards.

Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo language

arXiv, 2025

Authors: Abu, Turi; Shi, Ying; Zheng, Thomas Fang; Wang, Dong. Affiliations: Center for Speech and Language Technologies, BNRist, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the underrepresented Oromo language. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://***/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing. © 2025, CC BY.

Keywords: Speech recognition
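
The WER figures quoted in this abstract are conventionally computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal sketch of that standard definition follows; the function name and the sample sentences are hypothetical and are not taken from the Sagalee paper or its tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.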

Perturbation-Restrained Sequential Model Editing

arXiv, 2024

Authors: Ma, Jun-Yu; Wang, Hong; Xu, Hao-Xiang; Ling, Zhen-Hua; Gu, Jia-Chen. Affiliations: University of Science and Technology of China, China; National Engineering Research Center of Speech and Language Information Processing, China; University of California, Los Angeles, United States

Model editing is an emerging field that focuses on updating the knowledge embedded within large language models (LLMs) without extensive retraining. However, current model editing methods significantly compromise the general abilities of LLMs as the number of edits increases, and this trade-off poses a substantial challenge to the continual learning of LLMs. In this paper, we first show theoretically that the factor affecting the general abilities in sequential model editing is the condition number of the edited matrix. The condition number of a matrix represents its numerical sensitivity, and can therefore be used to indicate the extent to which the original knowledge associations stored in LLMs are perturbed after editing. Subsequently, statistical findings demonstrate that the value of this factor becomes larger as the number of edits increases, thereby exacerbating the deterioration of general abilities. To this end, a framework termed Perturbation Restraint on Upper bouNd for Editing (PRUNE) is proposed, which applies condition number restraints in sequential editing. These restraints can lower the upper bound on perturbation to edited models, thus preserving the general abilities. Systematically, we conduct experiments employing three editing methods on three LLMs across four downstream tasks. The results show that PRUNE can preserve general abilities while maintaining the editing performance effectively in sequential model editing. The code is available at https://***/mjy1111/PRUNE. Copyright © 2024, The Authors. All rights reserved.

Keywords: Matrix algebra
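
As a rough illustration of the quantity this abstract refers to, the 2-norm condition number of a matrix is the ratio of its largest to its smallest singular value. The sketch below computes it in closed form for a 2×2 matrix only. It is not the PRUNE method, and the function name and example matrices are assumptions made for illustration.

```python
import math


def condition_number_2x2(matrix):
    """2-norm condition number (sigma_max / sigma_min) of a 2x2 matrix.

    The squared singular values are the eigenvalues of A^T A, whose trace is
    the squared Frobenius norm of A and whose determinant is det(A)^2.
    """
    (a, b), (c, d) = matrix
    frob_sq = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = math.sqrt(max(frob_sq * frob_sq - 4.0 * det * det, 0.0))
    sigma_max = math.sqrt((frob_sq + disc) / 2.0)
    sigma_min = math.sqrt(max((frob_sq - disc) / 2.0, 0.0))
    if sigma_min == 0.0:
        return math.inf  # singular matrix: unbounded sensitivity
    return sigma_max / sigma_min


if __name__ == "__main__":
    # Hypothetical matrices, for illustration only.
    print(condition_number_2x2([[1.0, 0.0], [0.0, 1.0]]))   # well conditioned
    print(condition_number_2x2([[10.0, 0.0], [0.0, 0.1]]))  # poorly conditioned
```

A larger value means that small changes to the input can be amplified more strongly, which is the sense of "numerical sensitivity" used in the abstract.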

Incident Task Sequence for Service Priority using Cosine Similarity

1st International Conference on Technology Innovation and Its Applications, ICTIIA 2022

Authors: Boonprapapan, Teratam; Horata, Punyaphol; Seresangtakul, Pusadee. Affiliations: Natural Language and Speech Processing Laboratory, College of Computing, Khon Kaen University, Department of Computer Science, Khon Kaen 40002, Thailand; Advanced Smart Computing Laboratory, College of Computing, Khon Kaen University, Department of Computer Science, Khon Kaen 40002, Thailand

ISBN (digital): 9781665488266

ISBN (print): 9781665488266

This article details a procedure for classifying service cases by priority level based on the service level agreement (SLA) between an organization and its customer. The main motivation was the accuracy of classifying the importance of internal service work, since many service evaluators remain confused about the tiering of service cases. Accurate service case classification models are therefore needed to simplify the classification process. The service cases comprised four levels: serious, critical, moderate, and low. We employed natural language processing (NLP) to develop a more efficient service priority level for the organization. We implemented term frequency-inverse document frequency (TF-IDF) weighting and cosine similarity, measuring the degree of similarity between terms within each service case. The model consisted of four processes: data collection, preprocessing, TF-IDF calculation, and similarity and scoring calculation. The model improved the accuracy of the classification process and produced better results on the test sets, with efficiency measured by cosine similarity. Lastly, our research covered 5,790 service cases and achieved an accuracy of 70.14% through the combination of TF-IDF and cosine similarity. © 2022 IEEE.

Keywords: Natural language processing systems
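
The TF-IDF and cosine-similarity steps named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothed IDF formula, the whitespace tokenizer, the nearest-labeled-case scoring rule, and every function name and sample case are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: an assumption, the abstract does not give the weighting.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_priority(new_case, labeled_cases):
    """Assign the priority of the most similar labeled case (assumed rule)."""
    texts = [text for text, _ in labeled_cases] + [new_case]
    vectors = tfidf_vectors(texts)
    query = vectors[-1]
    scores = [cosine_similarity(query, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labeled_cases[best][1], scores[best]


if __name__ == "__main__":
    # Hypothetical service cases, for illustration only.
    cases = [
        ("server down production outage", "critical"),
        ("password reset request", "low"),
    ]
    print(predict_priority("production server outage", cases))
```

Sparse dictionaries keep the sketch dependency-free; a production system would more likely use a vectorizer from an established library.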
