Search Results - Inner Mongolia University Library

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Hu, Hang-Rui; Song, Yan; Zhang, Jian-Tao; Dai, Li-Rong; McLoughlin, Ian; Zhuo, Zhu; Zhou, Yu; Li, Yu-Hong; Xue, Hui. Affiliations: University of Science and Technology of China, National Engineering Research Center for Speech and Language Information Processing, Hefei, China; Alibaba Group, China

ISBN: (print) 9781728163277

Automatic speaker verification (ASV) faces domain shift caused by the mismatch of intrinsic and extrinsic factors, such as recording device and speaking style, in real-world applications, which leads to severe performance degradation. Since single-speaker multi-condition (SSMC) data is difficult to collect in practice, existing domain adaptation methods are hard to ensure the feature consistency of the same class but different domains. To this end, we propose a cross-domain data generation method to obtain a domain-invariant ASV system. Inspired by voice conversion (VC) task, a StarGAN based generative model first learns cross-domain mappings from SSMC data, and then generates missing domain data for all speakers, thus increasing the intra-class diversity of the training set. Considering the difference between ASV and VC task, we renovate the corresponding training objectives and network structure to make the adaptation task-specific. Evaluations on achieve a relative performance improvement of about 5-8% over the baseline in terms of minDCF and EER, outperforming the CNSRC winner's system of the equivalent scale. © 2023 IEEE.

Keywords: Data Augmentation; Domain Adaptation; Speaker Verification; StarGAN

Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models

arXiv, 2024

Authors: Sharma, Nikhil; Murray, Kenton; Xiao, Ziang. Affiliations: Johns Hopkins University, United States; Center for Speech and Language Processing, United States; Human Language Technology Center of Excellence, United States

Although the multilingual capability of LLMs offers new opportunities to overcome the language barrier, do these capabilities translate into real-life scenarios where linguistic divide and knowledge conflicts between multilingual sources are known occurrences? In this paper, we studied LLM’s linguistic preference in a cross-language RAG-based information search setting. We found that LLMs displayed systemic bias towards information in the same language as the query language in both document retrieval and answer generation. Furthermore, in scenarios where no information is in the language of the query, LLMs prefer documents in high-resource languages during generation, potentially reinforcing the dominant views. Such bias exists for both factual and opinion-based queries. Our results highlight the linguistic divide within multilingual LLMs in information search systems. The seemingly beneficial multilingual capability of LLMs may backfire on information parity by reinforcing language-specific information cocoons or filter bubbles further marginalizing low-resource views. Copyright © 2024, The Authors. All rights reserved.

Keywords: Structured Query Language

Joint Energy-Based Model for Robust Speech Classification System Against Dirty-Label Backdoor Poisoning Attacks

2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

Authors: Sustek, Martin; Joshi, Sonal; Li, Henry; Thebaud, Thomas; Villalba, Jesus; Khudanpur, Sanjeev; Dehak, Najim. Affiliations: Johns Hopkins University, Center for Language and Speech Processing, Baltimore, MD, United States; Brno University of Technology, Faculty of Information Technology, Czech Republic

ISBN: (print) 9798350306897

Our novel technique utilizes a Joint Energy-based Model (JEM) that integrates both discriminative and generative approaches to increase resistance against dirty-label backdoor attacks. Our approach is especially effective when the trigger is short or hardly perceivable. We simulate the attack on the Speech Commands Dataset consisting of 1s audio clips. During training, we use JEM to model a view of the input implemented by a randomly selected 610ms window. During inference, we combine all (40) possible views utilizing a generative part of JEM. The resulting system has slightly decreased accuracy but significantly increased resistance shown in multiple scenarios. Interestingly, replacing JEM with a standard discriminative model (Disc) provides increased resistance with a lesser effect compared to JEM but maintains accuracy. We introduce an extension motivated by semi-supervised training that further improves JEM but not Disc. JEM can also benefit from Gaussian noise during evaluation. © 2023 IEEE.

Keywords: Gaussian noise (electronic)
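
The abstract above mentions 1 s clips, a 610 ms window, and 40 possible views. One reading that reproduces that count is a sliding window with a 10 ms hop; the hop size is an assumption inferred from the numbers, not something the abstract states. The sketch below enumerates the window offsets under that assumption. The helper names are hypothetical, and `average_scores` is only a plain-average stand-in, since the abstract does not say how the generative part of JEM combines the views.

```python
def window_starts_ms(clip_ms, window_ms, hop_ms):
    """Return the start offset (in ms) of every window that fits inside the clip."""
    if window_ms > clip_ms:
        return []
    return list(range(0, clip_ms - window_ms + 1, hop_ms))


def average_scores(per_view_scores):
    """Average per-class scores over views (a stand-in, not the paper's rule)."""
    n_views = len(per_view_scores)
    n_classes = len(per_view_scores[0])
    return [sum(view[c] for view in per_view_scores) / n_views
            for c in range(n_classes)]


# Assumed 10 ms hop: offsets 0, 10, ..., 390 ms give 40 windows of 610 ms.
starts = window_starts_ms(clip_ms=1000, window_ms=610, hop_ms=10)
print(len(starts), starts[0], starts[-1])
```

With these assumed values the last window starts at 390 ms and ends exactly at the 1000 ms clip boundary.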

Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zili Huang; Desh Raj; Paola García; Sanjeev Khudanpur. Affiliations: Center for Language and Speech Processing and Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, USA

Self-supervised learning (SSL) methods which learn representations of data without explicit supervision have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often have degraded performance for multi-talker scenarios — possibly due to the domain mismatch — which severely limits their use for such applications. In this paper, we investigate the adaptation of upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach, which aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative WER improvements of 9.1% and 42.1% over strong baselines for the segmented and unsegmented cases, respectively. We also demonstrate the effectiveness of our models for real conversational mixtures through experiments on the AMI dataset. Our code and models are open-sourced on https://***/HuangZiliAndy/SSL_for_multitalker.

Keywords: Adaptation models; Codes; Aggregates; Self-supervised learning; Signal processing; Acoustics; Task analysis

TEAR: A Cross-Modal Pre-Trained Text Encoder Enhanced by Acoustic Representations for Speech Synthesis

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 1117-1128

Authors: Shiming Wang; Yang Ai; Liping Chen; Yajun Hu; Zhenhua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

Text encoders play an important role in text-to-speech (TTS) by analyzing text input and converting it into linguistic representations. In order to generate expressive speech from text, pre-training text encoders on large amounts of data has recently become a solution to generate richer and more effective linguistic representations. However, existing pre-trained text encoders only use the self-supervised target on the text data, without considering the relationship between text and speech modalities during the pre-training stage. In this paper, we propose TEAR, a cross-modal pre-trained Text Encoder enhanced by Acoustic Representations for TTS. In addition to conventional text pre-training, TEAR incorporates speech pre-training to extract semantic and prosody-related acoustic representations from speech. Then, TEAR introduces a novel cross-modal pre-training task for the text encoder, termed acoustics-aware joint prediction. This task leverages the acoustic representations generated by the preceding speech pre-training, enabling the linguistic representation to perceive and comprehend prosody during the encoding process. In our implementation, TEAR was pre-trained on 130 million unlabeled Chinese and English sentences, as well as 740,000 Chinese text-speech pairs. The results of the downstream TTS experiments on three expressive TTS datasets indicate that the proposed TEAR can encode more effective and comprehensive linguistic representations compared to the text-only pre-trained encoders, leading to the generation of more natural speech.

Keywords: Acoustics; Linguistics; Encoding; Bidirectional control; Training; Data models; Predictive models; Speech enhancement; Transformers; Context modeling

PQLM - Multilingual Decentralized Portable Quantum Language Model

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Li, Shuyue Stella; Zhang, Xiangyu; Zhou, Shu; Shu, Hongchao; Liang, Ruixing; Liu, Hexin; Garcia, Leibny Paola. Affiliations: Hong Kong University of Science and Technology, Department of Physics, Hong Kong; Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore; Johns Hopkins University, Center for Language and Speech Processing, United States; Johns Hopkins University, Human Language Technology Center of Excellence, United States

ISBN: (print) 9781728163277

With careful manipulation, malicious agents can reverse engineer private information encoded in pre-trained language models. Security concerns motivate the development of quantum pre-training. In this work, we propose a highly portable quantum language model (PQLM) that can easily transmit information to downstream tasks on classical machines. The framework consists of a cloud PQLM built with random Variational Quantum Classifiers (VQC) and local models for downstream applications. We demonstrate the ad hoc portability of the quantum model by extracting only the word embeddings and effectively applying them to downstream tasks on classical machines. Our PQLM exhibits comparable performance to its classical counterpart on both intrinsic evaluation (loss, perplexity) and extrinsic evaluation (multilingual sentiment analysis accuracy) metrics. We also perform ablation studies on the factors affecting PQLM performance to analyze model stability. Our work establishes a theoretical foundation for a portable quantum pre-trained language model that could be trained on private data and made available for public use with privacy protection guarantees. © 2023 IEEE.

Keywords: Federated Learning; Language Modeling; Model Portability; Quantum Machine Learning

ERVQ: Enhanced Residual Vector Quantization With Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 2539-2550

Authors: Rui-Chen Zheng; Hui-Peng Du; Xiao-Hang Jiang; Yang Ai; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

Current neural audio codecs typically use residual vector quantization (RVQ) to discretize audio signals. However, they often experience codebook collapse, which reduces the effective codebook size and leads to suboptimal performance. To address this problem, we propose Enhanced Residual Vector Quantization (ERVQ), a novel enhancement strategy for the RVQ framework in neural audio codecs. ERVQ mitigates codebook collapse and boosts codec performance through both intra- and inter-codebook optimization. Intra-codebook optimization incorporates an online clustering strategy and a code balancing loss to ensure balanced and efficient codebook utilization. Inter-codebook optimization improves the diversity of quantized features by minimizing the similarity between successive quantizations. Our experiments show that ERVQ significantly enhances audio codec performance across different models, sampling rates, and bitrates, achieving superior quality and generalization capabilities. It also achieves 100% codebook utilization on one of the most advanced neural audio codecs. Further experiments indicate that audio codecs improved by the ERVQ strategy can improve unified speech-and-text large language models (LLMs). Specifically, there is a notable improvement in the naturalness of generated speech in downstream zero-shot text-to-speech tasks. Audio samples are available on the project page.

Keywords: Codecs; Codes; Optimization; Vectors; Bit rate; Vector quantization; Training; Speech coding; Image reconstruction; Text to speech
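
For readers unfamiliar with the baseline that the abstract above builds on, the following is a minimal sketch of plain residual vector quantization and of a codebook-utilization measure. It is not the ERVQ method: the online clustering, code balancing loss, and inter-codebook similarity term are not shown. The toy codebooks and all function names are illustrative assumptions.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage quantizes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def utilization(all_indices, codebook_size, stage):
    """Fraction of codewords in one stage that are selected at least once."""
    used = {indices[stage] for indices in all_indices}
    return len(used) / codebook_size


# Toy two-stage codebooks (hypothetical values chosen for illustration).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],  # fine stage
]
codes = rvq_encode([1.25, 1.0], codebooks)
print(codes, rvq_decode(codes, codebooks))
```

Codebook collapse, in these terms, corresponds to `utilization` staying well below 1.0 over a large set of encoded vectors, because most codewords are never selected.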

Towards High-Quality and Efficient Speech Bandwidth Extension With Parallel Amplitude and Phase Prediction

IEEE Transactions on Audio, Speech and Language Processing, 2024, vol. 33, pp. 236-250

Authors: Ye-Xin Lu; Yang Ai; Hui-Peng Du; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

Speech bandwidth extension (BWE) refers to widening the frequency bandwidth range of speech signals, enhancing the speech quality towards brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, where the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components from the source narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level, respectively. Experimental results demonstrate that our proposed AP-BWE achieves state-of-the-art performance in terms of speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, due to the all-convolutional architecture and all-frame-level operations, the proposed AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1 times faster than real-time on a single CPU. Notably, to our knowledge, AP-BWE is the first to achieve the direct extension of the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods.

Keywords: Wideband; Narrowband; Speech enhancement; Speech processing; Hidden Markov models; Real-time systems; Speech coding; Predictive models; Generators; Statistical analysis

Recovering document annotations for sentence-level bitext

arXiv, 2024

Authors: Wicks, Rachel; Post, Matt; Koehn, Philipp. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, United States; Center of Language and Speech Processing, Johns Hopkins University, United States; Microsoft, United States

Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three (ParaCrawl, News Commentary, and Europarl) large datasets in German, French, Spanish, Italian, Polish, and Portuguese (paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with analysis to show that this method prefers context-consistent translations rather than those that may have been sentence-level machine translated. Last we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, PARADOCS, and resulting models as a resource to the community. Copyright © 2024, The Authors. All rights reserved.

Keywords: Large datasets

Finding Spoken Identifications: Using GPT-4 Annotation For An Efficient And Fast Dataset Creation Pipeline

Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024

Authors: Jahan, Maliha; Wang, Helin; Thebaud, Thomas; Sun, Yinglun; Le, Giang; Fagyal, Zsuzsanna; Scharenborg, Odette; Hasegawa-Johnson, Mark; Moro-Velazquez, Laureano; Dehak, Najim. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, United States; University of Illinois Urbana-Champaign, Champaign, IL, United States; Multimedia Computing Group, Delft University of Technology, Netherlands

ISBN: (print) 9782493814104

The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI's GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering) and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4's performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4's tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4's performance. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

Keywords: Pipelines
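
The abstract above reports a miss rate, precision, and recall without defining them. The sketch below uses the standard definitions from counts of true positives, false positives, and false negatives, and assumes that "miss rate" means the false-negative rate (1 minus recall); the paper may define it differently. The counts in the example are made up for illustration and are not the paper's data.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def miss_rate(tp, fn):
    """Share of actual positives that were missed (assumed: 1 - recall)."""
    return fn / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts, not taken from the paper.
tp, fp, fn = 8, 2, 2
print(precision(tp, fp), recall(tp, fn), miss_rate(tp, fn))
```

Under these definitions a tagger can reach high recall with low precision by over-predicting, which matches the trade-off the abstract describes.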
