Search Results - Inner Mongolia University Library

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Qian Wang, Jia-Chen Gu, Zhen-Hua Ling. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China

Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
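
The R@1 figures quoted above are Recall@K scores with K = 1: the fraction of queries whose correct item is ranked within the top K by similarity. A minimal sketch, assuming a square similarity matrix in which query i matches candidate i (a simplification, since real benchmarks may pair one clip with several captions); the helper name recall_at_k and the toy values are hypothetical and not taken from the paper:

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose matching candidate ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j;
    the ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidates by descending similarity and keep the top k indices.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (illustrative values only).
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct candidate ranked first
    [0.3, 0.2, 0.8],  # query 1: correct candidate ranked last
    [0.1, 0.4, 0.7],  # query 2: correct candidate ranked first
]
r1 = recall_at_k(sim, k=1)
r3 = recall_at_k(sim, k=3)
```

With these toy values, two of the three queries are retrieved at rank 1, and all three fall within the top 3.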

Considering Temporal Connection between Turns for Conversational Speech Synthesis

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Kangdi Mei, Zhaoci Liu, Huipeng Du, Hengyu Li, Yang Ai, Liping Chen, Zhenhua Ling. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China

Conversational speech synthesis aims to synthesize the speech of an individual speaker based on the conversation history. However, most studies in conversational speech synthesis focus only on the synthesis performance of the current speaker's turn and neglect the temporal relationship between the turns of interlocutors. Therefore, we consider the temporal connection between turns for conversational speech synthesis, which is crucial for the naturalness and coherence of conversations. Specifically, this paper formulates a task in which there is no overlap between turns and only one history turn is considered. To complete this task, an acoustic model is proposed which leverages multi-modal (text and speech) information from the previous turn to predict the acoustic features of not only the current turn but also the inter-turn gap. The model is designed based on MQTTS and incorporates the global acoustic representation and BERT-based local semantic representation of the previous turn when predicting the acoustic features of each frame. Experimental results demonstrate that, with the introduction of global acoustic information and local semantic information, our model achieves better performance on the temporal connection between turns and the quality of synthetic speech. Audio samples can be found at https://***/icassp2024.

Privacy-Preserving Blockchain-Based Solutions in the Internet of Things

6th EAI International Conference on Science and Technologies for Smart Cities, SmartCity 2020

Authors: Zapoglou, Nikolaos; Patsakos, Ioannis; Drosatos, George; Rantos, Konstantinos. Department of Computer Science, International Hellenic University, Kavala, Greece; Institute for Language and Speech Processing, Athena Research Center, Xanthi, Greece

ISBN: (digital) 9783030760632

ISBN: (print) 9783030760625

The Internet of Things (IoT) is a promising, relatively new technology that develops "smart" networks with a variety of uses and applications (e.g., smart cities, smart homes and autonomous cars). Although the diversity of protocols, technologies and devices that make up the IoT adds value and utility, it also creates major privacy issues that malicious entities can exploit for their own benefit or even to violate the privacy of IoT users. The special features of blockchain technology, such as immutability, transparency, accessibility, autonomy and decentralisation, have led academia and industry to search for further uses of it beyond the financial applications (e.g., Bitcoin) to which it was initially applied. This paper is a survey of the existing literature on blockchain-based privacy-preserving solutions that have been proposed specifically for the IoT to address personal data protection and preserve user privacy. © 2021, ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering.

Keywords: Internet of things

An Exploratory Approach to the Corpus Filtering Shared Task WMT20

5th Conference on Machine Translation, WMT 2020

Authors: Kejriwal, Ankur; Koehn, Philipp. Department of Computer Science, Johns Hopkins University, United States; Center for Language and Speech Processing, Johns Hopkins University, United States

ISBN: (print) 9781948087810

This document describes an exploratory look into the Parallel Corpus Filtering Shared Task in WMT20. We submitted scores for both Pashto-English and Khmer-English systems, combining multiple techniques such as monolingual language model scores, length-based filters, language ID filters with confidence, and norm of embeddings. © 2020 Association for Computational Linguistics
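
Of the techniques listed above, the length-based filter is the simplest to illustrate. A minimal sketch, assuming whitespace tokenization and hypothetical thresholds (max_ratio, min_len, max_len); the filters and thresholds actually used in the submission are not given in this record:

```python
def keep_pair(src, tgt, max_ratio=2.0, min_len=1, max_len=100):
    """Return True if a sentence pair passes a simple length-based filter."""
    n_src = len(src.split())
    n_tgt = len(tgt.split())
    # Reject empty or overly long sentences on either side.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Reject pairs whose token counts differ by more than max_ratio.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


pairs = [
    ("a b c d", "w x y z"),  # balanced lengths: kept
    ("a", "v w x y z"),      # length ratio 5.0: dropped
    ("", "x y"),             # empty source: dropped
]
kept = [p for p in pairs if keep_pair(*p)]
```

In a scoring setup such as the shared task, a filter like this would typically be combined with the other signals rather than used alone.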

Keywords: Computational linguistics

HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Bingshen Mu, Kun Wei, Qijie Shao, Yong Xu, Lei Xie. Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Tencent AI Lab, Shenzhen, China

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

Recent advancements in integrating large language models (LLMs) with automatic speech recognition (ASR) have achieved remarkable performance in general domains. While supervised fine-tuning (SFT) of all model parameters is often employed to adapt pre-trained LLM-based ASR models to specific domains, it imposes high computational costs and notably reduces their performance in general domains. In this paper, we propose HDMoLE, a novel parameter-efficient multi-domain fine-tuning method for adapting pre-trained LLM-based ASR models to multi-accent domains without catastrophic forgetting. HDMoLE leverages hierarchical routing and dynamic thresholds, combines low-rank adaptation (LoRA) with the mixture of experts (MoE), and can be generalized to any linear layer. Hierarchical routing establishes a clear correspondence between LoRA experts and accent domains, improving cross-domain collaboration among the LoRA experts. Unlike the static Top-K strategy for activating LoRA experts, dynamic thresholds can adaptively activate varying numbers of LoRA experts at each MoE layer. Experiments on the multi-accent and standard Mandarin datasets demonstrate the efficacy of HDMoLE. Applying HDMoLE to the projector module of an LLM-based ASR model achieves performance similar to full fine-tuning in the target multi-accent domains, while using only 9.6% of the trainable parameters required for full fine-tuning and incurring minimal degradation in the source general domain.
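
The contrast between static Top-K activation and threshold-based activation can be sketched as follows. This illustrates the general idea only, assuming softmax router weights and a fixed, hypothetical threshold value; it is not the HDMoLE implementation, and the function names are invented for this example:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_experts(weights, k):
    """Static strategy: always activate exactly k experts."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k])


def threshold_experts(weights, threshold):
    """Threshold strategy: activate every expert whose weight exceeds the threshold."""
    return [i for i, w in enumerate(weights) if w > threshold]


# Two hypothetical routing situations over four experts.
peaked = softmax([4.0, 0.0, 0.0, 0.0])  # one expert clearly dominates
flat = softmax([1.0, 1.0, 1.0, 0.0])    # three experts are equally relevant

active_peaked = threshold_experts(peaked, 0.2)
active_flat = threshold_experts(flat, 0.2)
```

With a threshold, the peaked input activates a single expert and the flat input activates three, whereas top_k_experts returns exactly k experts in both cases.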

Keywords: Training; Degradation; Adaptation models; Computational modeling; Large language models; Collaboration; Signal processing; Routing; Speech processing; Standards

The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen. Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China

This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. Specifically, the weighted prediction error (WPE) and guided source separation (GSS) techniques are first used to reduce reverberation and generate clean signals for each single speaker. Then, we explore the effectiveness of Branchformer and E-Branchformer based ASR systems. To make better use of the visual modality, a cross-attention based multi-modal fusion module is proposed, which explicitly learns the contextual relationship between different modalities. Experiments show that our system achieves a concatenated minimum-permutation character error rate (cpCER) of 28.13% and 31.21% on the Dev and Eval sets, respectively, and obtains second place in the challenge.

Keywords: Visualization; Source separation; Error analysis; Speech recognition; Data processing; Data models; Reverberation

Is ChatGPT a Good Multi-Party Conversation Solver?

arXiv, 2023

Authors: Tan, Chao-Hong; Gu, Jia-Chen; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

Large language models (LLMs) have emerged as influential instruments within the realm of natural language processing; nevertheless, their capacity to handle multi-party conversations (MPCs), a scenario marked by the presence of multiple interlocutors involved in intricate information exchanges, remains uncharted. In this paper, we delve into the potential of generative LLMs such as ChatGPT and GPT-4 within the context of MPCs. An empirical analysis is conducted to assess the zero-shot learning capabilities of ChatGPT and GPT-4 by subjecting them to evaluation across three MPC datasets that encompass five representative tasks. The findings reveal that ChatGPT's performance on a number of evaluated MPC tasks leaves much to be desired, whilst GPT-4's results portend a promising future. Additionally, we endeavor to bolster performance through the incorporation of MPC structures, encompassing both speaker and addressee architecture. This study provides an exhaustive evaluation and analysis of applying generative LLMs to MPCs, casting a light upon the conception and creation of increasingly effective and robust MPC agents. Concurrently, this work underscores the challenges implicit in the utilization of LLMs for MPCs, such as deciphering graphical information flows and generating stylistically consistent responses. Copyright © 2023, The Authors. All rights reserved.

Keywords: Zero-shot learning

Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Pengfei Cai, Yan Song, Nan Jiang, Qing Gu, Ian McLoughlin. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; ICT Cluster, Singapore Institute of Technology, Singapore

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and their performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique.
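
The point about independent loss contributions can be illustrated numerically. A minimal sketch, assuming a single frame with three prototype logits and toy values; softmax cross-entropy stands in here for the InfoNCE-style objective, and none of the names or numbers come from the paper:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets):
    """Binary cross-entropy averaged over prototypes.

    Each term depends on a single logit, so prototypes do not compete.
    """
    terms = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        terms.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(terms) / len(terms)


def softmax_ce_loss(logits, target_index):
    """Softmax cross-entropy: prototypes share one unit of probability mass."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target_index]


# A frame where prototypes 0 and 1 are both active (overlapping events).
logits = [5.0, 5.0, -5.0]
multi_hot = [1, 1, 0]

bce = bce_loss(logits, multi_hot)
ce = softmax_ce_loss(logits, 0)
```

Here the binary cross-entropy is close to zero because both active prototypes can be confidently predicted at once, while the softmax loss cannot fall below log 2 because the two active prototypes must split the probability mass.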

Keywords: Representation learning; Event detection; Computational modeling; Prototypes; Signal processing algorithms; Self-supervised learning; Signal processing; Transformers; Data models; Speech processing

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

arXiv, 2023

Authors: Lu, Ye-Xin; Ai, Yang; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

This paper proposes MP-SENet, a novel Speech Enhancement Network which directly denoises Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by convolution-augmented transformers. The encoder aims to encode time-frequency representations from the input noisy magnitude and phase spectra. The decoder is composed of a parallel magnitude mask decoder and phase decoder, directly recovering clean magnitude spectra and clean-wrapped phase spectra by incorporating learnable sigmoid activation and parallel phase estimation architecture, respectively. Multi-level losses defined on magnitude spectra, phase spectra, short-time complex spectra, and time-domain waveforms are used to train the MP-SENet model jointly. Experimental results show that our proposed MP-SENet achieves a PESQ of 3.50 on the public VoiceBank+DEMAND dataset and outperforms existing advanced speech enhancement methods. Copyright © 2023, The Authors. All rights reserved.
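
The magnitude and wrapped phase spectra mentioned above are the polar form of the complex short-time spectrum. A minimal sketch of that decomposition using only the Python standard library, assuming a toy list of complex bins; the function names are hypothetical, and this shows the representation only, not the MP-SENet model:

```python
import cmath
import math


def to_mag_phase(spectrum):
    """Split complex bins into magnitudes and wrapped phases in [-pi, pi]."""
    mags = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]
    return mags, phases


def from_mag_phase(mags, phases):
    """Recombine magnitudes and phases into complex bins."""
    return [cmath.rect(m, p) for m, p in zip(mags, phases)]


def wrap(phase):
    """Wrap an arbitrary angle to the principal interval [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi


# Two toy time-frequency bins (illustrative values only).
spectrum = [complex(3, 4), complex(0, -2)]
mags, phases = to_mag_phase(spectrum)
rebuilt = from_mag_phase(mags, phases)
wrapped = wrap(1.5 * math.pi)
```

Because phase is only defined up to multiples of 2*pi, a predicted angle such as 1.5*pi is equivalent to its wrapped value of -0.5*pi.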

Keywords: Speech enhancement

SPEECH RECONSTRUCTION FROM SILENT TONGUE AND LIP ARTICULATION BY PSEUDO TARGET GENERATION AND DOMAIN ADVERSARIAL TRAINING

arXiv, 2023

Authors: Zheng, Rui-Chen; Ai, Yang; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in silent speaking mode compared to the baseline TaLNet model. When using an automatic speech recognition (ASR) model to measure intelligibility, the word error rate (WER) of our proposed method decreases by over 15% compared to the baseline. In addition, our proposed method also outperforms the baseline on the intelligibility of the speech reconstructed in vocalized articulating mode, reducing the WER by approximately 10%. Copyright © 2023, The Authors. All rights reserved.
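
The word error rate used above as an intelligibility measure is the word-level edit distance between an ASR transcript and the reference text, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and a non-empty reference; the function name and example sentences are hypothetical:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In an evaluation like the one described, the hypothesis would be the ASR transcript of the reconstructed speech and the reference would be the prompt text.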

Keywords: Iterative methods
