Search Results - Inner Mongolia University Library

Why Does Zero-Shot Cross-Lingual Generation Fail? An Explanation and a Solution

arXiv, 2023

Authors: Li, Tianjian; Murray, Kenton. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States.

Zero-shot cross-lingual transfer is when a multilingual model is trained to perform a task in one language and then is applied to another language. Although the zero-shot cross-lingual transfer approach has achieved success in various classification tasks (Wu and Dredze, 2019), its performance on natural language generation tasks falls short in quality (Ronnqvist et al., 2019; Vu et al., 2022) and sometimes outputs an incorrect language (Xue et al., 2021). In our study, we show that the fine-tuning process learns language invariant representations, which is beneficial for classification tasks but harmful for generation tasks. Motivated by this, we propose a simple method to regularize the model from learning language invariant representations and a method to select model checkpoints without a development set in the target language, both resulting in better generation quality. Experiments on three semantically diverse generation tasks show that our method reduces the accidental translation problem by 68% and improves the ROUGE-L score (Lin, 2004) by 1.5 on average. © 2023, CC BY.

Keywords: Zero-shot learning
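The abstract above reports gains in ROUGE-L, which is based on the longest common subsequence (LCS) between a reference and a generated text. The following is a minimal sketch of that idea, assuming whitespace tokenization and an equally weighted F-measure; the function names `lcs_length` and `rouge_l_f1` are illustrative and are not taken from the paper or from any particular evaluation toolkit.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 over LCS-based precision and recall.

    Assumption: whitespace tokens and equal weighting of precision
    and recall; published implementations may differ in both.
    """
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("the cat sat", "the dog sat")` shares an LCS of two tokens with both texts of length three, so precision, recall, and the resulting score are all 2/3.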

Building Keyword Search System from End-To-End ASR Systems

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Ruizhe Huang; Matthew Wiesner; Leibny Paola Garcia-Perera; Dan Povey; Jan Trmal; Sanjeev Khudanpur. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, USA; Human Language Technology Center of Excellence, Johns Hopkins University, USA; Xiaomi Corporation, Beijing, China.

Keyword search (KWS) systems are commonly built on top of existing automatic speech recognition (ASR) systems. However, end-to-end (E2E) ASR models are not naturally equipped with word-level timing information or confidence. Existing methods for re-purposing E2E ASR systems for KWS are largely heuristic or model-specific. In this paper, we describe a general KWS pipeline, applicable to any ASR model that generates N-best lists. We extract timing information using either external word-aligners, or time-preserving weighted finite-state transducer-based decoders. We show that our light-weight, ASR-agnostic approach for confidence estimation based on N-best lists outperforms other commonly used heuristics, such as using the decoder’s softmax probability, and even a more complicated dedicated confidence estimation model (CEM). Finally, we compare our performance to hybrid ASR models, extensively evaluating the impact of word-level timing, confidence, and recall on KWS performance. Our KWS pipeline is available online, suitable for evaluating the aforementioned ASR components as downstream tasks.

Keywords: Measurement; Pipelines; Keyword search; Estimation; Timing; Decoding; Error correction
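The abstract above mentions confidence estimation based on N-best lists without giving details. One generic way to turn an N-best list into a word confidence is to normalize the hypothesis scores into a distribution and sum the mass of the hypotheses that contain the word. The sketch below illustrates only that generic idea; it is an assumption for illustration, not the method from the paper, and the name `keyword_confidence` and the `(text, log_score)` input layout are hypothetical.

```python
import math


def keyword_confidence(nbest, keyword):
    """Posterior-style confidence that `keyword` was spoken.

    nbest: list of (hypothesis_text, log_score) pairs (hypothetical layout).
    Scores are normalized with a softmax over the list; the confidence is
    the total normalized weight of hypotheses containing the keyword.
    Only single-word keywords are handled in this sketch.
    """
    if not nbest:
        return 0.0
    max_log = max(score for _, score in nbest)  # subtract max for stability
    weights = [math.exp(score - max_log) for _, score in nbest]
    total = sum(weights)
    hit = sum(
        w for (text, _), w in zip(nbest, weights) if keyword in text.split()
    )
    return hit / total
```

With two equally scored hypotheses, `"play some jazz"` and `"play sum jazz"`, the word `jazz` receives confidence 1.0 and `some` receives 0.5, since it appears in only one of the two hypotheses.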

Privacy Versus Emotion Preservation Trade-Offs in Emotion-Preserving Speaker Anonymization

IEEE Spoken Language Technology Workshop

Authors: Zexin Cai; Henry Li Xinyuan; Ashi Garg; Leibny Paola García-Perera; Kevin Duh; Sanjeev Khudanpur; Nicholas Andrews; Matthew Wiesner. Affiliation: Human Language Technology Center of Excellence, Johns Hopkins University.

ISBN (digital): 9798350392258

ISBN (print): 9798350392265

Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its utility, including linguistic and paralinguistic aspects. However, anonymizing speech while maintaining emotional state remains challenging. We explore this problem in the context of the VoicePrivacy 2024 challenge. Specifically, we developed various speaker anonymization pipelines and found that approaches excel either at anonymization or at preserving emotional state, but not both simultaneously. Achieving both would require an in-domain emotion recognizer. Additionally, we found that it is feasible to train a semi-effective speaker verification system using only emotion representations, demonstrating the challenge of separating these two modalities.

Keywords: Emotion recognition; Privacy; Differential privacy; Conferences; Pipelines; Speech recognition; Linguistics; Information filtering; Information integrity

SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

arXiv, 2023

Authors: Raj, Desh; Povey, Daniel; Khudanpur, Sanjeev. Affiliations: The Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States; Xiaomi Corp., Beijing, China; The Center for Language and Speech Processing, Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, United States.

The Streaming Unmixing and Recognition Transducer (SURT) model was proposed recently as an end-to-end approach for continuous, streaming, multi-talker speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage and omission related errors; (ii) it is computationally expensive, due to which it has not seen adoption in academia; and (iii) it has only been evaluated on synthetic mixtures. In this work, we propose several modifications to the original SURT which are carefully designed to fix the above limitations. In particular, we (i) change the unmixing module to a mask estimator that uses dual-path modeling, (ii) use a streaming zipformer encoder and a stateless decoder for the transducer, (iii) perform mixture simulation using force-aligned subsegments, (iv) pre-train the transducer on single-speaker data, (v) use auxiliary objectives in the form of masking loss and encoder CTC loss, and (vi) perform domain adaptation for far-field recognition. We show that our modifications allow SURT 2.0 to outperform its predecessor in terms of multi-talker ASR results, while being efficient enough to train with academic resources. We conduct our evaluations on 3 publicly available meeting benchmarks - LibriCSS, AMI, and ICSI, where our best model achieves WERs of 16.9%, 44.6% and 32.2%, respectively, on far-field unsegmented recordings. We release training recipes and pre-trained models: https://***/view/surt2. © 2023, CC BY.

Keywords: Transducers
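The abstract above reports results as word error rates (WER). As a reminder of what that number measures, here is a minimal sketch of plain single-reference WER: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. It assumes whitespace tokenization and ignores the speaker or channel assignment that multi-talker scoring typically has to handle, so it should not be read as the evaluation protocol used in the paper; `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For the reference `"a b c d"` and the hypothesis `"a x c"`, one substitution and one deletion give a WER of 2/4 = 0.5. Because insertions count as errors, WER can exceed 100%.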

Identifying Context-Dependent Translations for Evaluation Set Production

arXiv, 2023

Authors: Wicks, Rachel; Post, Matt. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, United States; Center of Language and Speech Processing, Johns Hopkins University, United States; Microsoft, United States.

A major impediment to the transition to context-aware machine translation is the absence of good evaluation metrics and test sets. Sentences that require context to be translated correctly are rare in test sets, reducing the utility of standard corpus-level metrics such as COMET or BLEU. On the other hand, datasets that annotate such sentences are also rare, small in scale, and available for only a few languages. To address this, we modernize, generalize, and extend previous annotation pipelines to produce CTXPRO, a tool that identifies subsets of parallel documents containing sentences that require context to correctly translate five phenomena: gender, formality, and animacy for pronouns, verb phrase ellipsis, and ambiguous noun inflections. The input to the pipeline is a set of handcrafted, per-language, linguistically-informed rules that select contextual sentence pairs using coreference, part-of-speech, and morphological features provided by state-of-the-art tools. We apply this pipeline to seven language pairs (EN into and out-of DE, ES, FR, IT, PL, PT, and RU) and two datasets (OpenSubtitles and WMT test sets), and validate its performance using both overlap with previous work and its ability to discriminate a contextual MT system from a sentence-based one. We release the CTXPRO pipeline and data as open source. Copyright © 2023, The Authors. All rights reserved.

Keywords: Pipelines

Finding Spoken Identifications: Using GPT-4 Annotation For An Efficient And Fast Dataset Creation Pipeline

Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024

Authors: Jahan, Maliha; Wang, Helin; Thebaud, Thomas; Sun, Yinglun; Le, Giang; Fagyal, Zsuzsanna; Scharenborg, Odette; Hasegawa-Johnson, Mark; Moro-Velazquez, Laureano; Dehak, Najim. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, United States; University of Illinois Urbana-Champaign, Champaign, IL, United States; Multimedia Computing Group, Delft University of Technology, Netherlands.

ISBN (print): 9782493814104

The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI's GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering) and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4's performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4's tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4's performance. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

Keywords: Pipelines
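The abstract above quotes a miss rate, precision, and recall for GPT-4's annotations against human ground truth. The sketch below shows how these quantities relate when computed from counts of true positives, false positives, and false negatives. It assumes the common definition of miss rate as the fraction of truly relevant items that were not retrieved (one minus recall); the paper may define it differently, and the counts in the usage note are made up for illustration.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, miss_rate) from raw counts.

    tp: relevant items correctly flagged
    fp: irrelevant items flagged by mistake
    fn: relevant items that were missed
    Assumption: miss rate = fn / (tp + fn), i.e. 1 - recall.
    """
    flagged = tp + fp
    relevant = tp + fn
    precision = tp / flagged if flagged else 0.0
    recall = tp / relevant if relevant else 0.0
    miss_rate = fn / relevant if relevant else 0.0
    return precision, recall, miss_rate
```

With hypothetical counts `tp=45, fp=55, fn=5`, precision is 0.45, recall is 0.9, and the miss rate is 0.1, which illustrates the kind of high-recall, low-precision trade-off the abstract describes for tagging.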

Noise-robust Speech Separation with Fast Generative Correction

arXiv, 2024

Authors: Wang, Helin; Villalba, Jesús; Moro-Velazquez, Laureano; Hai, Jiarui; Thebaud, Thomas; Dehak, Najim. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States; Laboratory for Computational Auditory Perception, Johns Hopkins University, United States.

Speech separation, the task of isolating multiple speech sources from a mixed audio signal, remains challenging in noisy environments. In this paper, we propose a generative correction method to enhance the output of a discriminative separator. By leveraging a generative corrector based on a diffusion model, we refine the separation process for single-channel mixture speech by removing noises and perceptually unnatural distortions. Furthermore, we optimize the generative model using a predictive loss to streamline the diffusion model’s reverse process into a single step and rectify any associated errors by the reverse process. Our method achieves state-of-the-art performance on the in-domain Libri2Mix noisy dataset, and out-of-domain WSJ with a variety of noises, improving SI-SNR by 22-35% relative to SepFormer, demonstrating robustness and strong generalization capabilities. © 2024, CC0.

Keywords: Signal to noise ratio
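The abstract above measures separation quality with SI-SNR (scale-invariant signal-to-noise ratio). A minimal sketch of the commonly used definition follows: both signals are made zero-mean, the estimate is projected onto the target, and the energy of that projection is compared with the energy of the residual. The pure-Python list representation, the `eps` stabilizer, and the name `si_snr` are assumptions made for this illustration, not details from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample lists."""
    n = len(target)
    if n == 0 or len(estimate) != n:
        raise ValueError("signals must be non-empty and of equal length")
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target direction.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]
    # Everything not explained by the target counts as noise.
    noise = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection) + eps
    noise_energy = sum(x * x for x in noise) + eps
    return 10 * math.log10(signal_energy / noise_energy)
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the "scale-invariant" property in the name.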

Self-FiLM: Conditioning GANs with self-supervised representations for bandwidth extension based speaker recognition

arXiv, 2023

Authors: Kataria, Saurabh; Villalba, Jesús; Moro-Velázquez, Laureano; Thebaud, Thomas; Dehak, Najim. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, United States; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States.

Speech super-resolution/Bandwidth Extension (BWE) can improve downstream tasks like Automatic Speaker Verification (ASV). We introduce a simple novel technique called Self-FiLM to inject self-supervision into existing BWE models via Feature-wise Linear Modulation. We hypothesize that such information captures domain/environment information, which can give zero-shot generalization. Self-FiLM Conditional GAN (CGAN) gives 18% relative improvement in Equal Error Rate and 8.5% in minimum Decision Cost Function using state-of-the-art ASV system on SRE21 test. We further improve performance by 1) deep feature loss from time-domain models and 2) re-training of data2vec 2.0 models on naturalistic wideband (VoxCeleb) and telephone data (SRE Superset etc.). Lastly, we integrate self-supervision with CycleGAN to present a completely unsupervised solution that matches the semi-supervised performance. Copyright © 2023, The Authors. All rights reserved.

Keywords: Cost functions
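The abstract above builds on Feature-wise Linear Modulation (FiLM), in which each feature channel is scaled and shifted by parameters derived from a conditioning signal. The sketch below shows only the modulation step, `gamma * x + beta` per channel, on plain Python lists. In a real model the `gamma` and `beta` vectors would be predicted by a learned network from the conditioning input (here, presumably the self-supervised representation); that network is omitted, and the function name `film` and the data layout are assumptions for illustration.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation to a sequence of frames.

    features: list of frames, each a list with one value per channel
    gamma, beta: per-channel scale and shift (in practice, outputs of a
                 conditioning network, which this sketch does not model)
    """
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    modulated = []
    for frame in features:
        if len(frame) != len(gamma):
            raise ValueError("frame width must match gamma/beta length")
        modulated.append([g * x + b for x, g, b in zip(frame, gamma, beta)])
    return modulated
```

For instance, with `gamma = [2.0, 0.5]` and `beta = [0.0, 1.0]`, the frame `[1.0, 2.0]` becomes `[2.0, 2.0]`: the first channel is doubled, and the second is halved and then shifted by one.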

GenVC: Self-Supervised Zero-Shot Voice Conversion

arXiv, 2025

Authors: Cai, Zexin; Xinyuan, Henry Li; Garg, Ashi; García-Perera, Leibny Paola; Duh, Kevin; Khudanpur, Sanjeev; Wiesner, Matthew; Andrews, Nicholas. Affiliation: Human Language Technology Center of Excellence, Johns Hopkins University, United States.

Zero-shot voice conversion has recently made substantial progress, but many models still depend on external supervised systems to disentangle speaker identity and linguistic content. Furthermore, current methods often use parallel conversion, where the converted speech inherits the source utterance’s temporal structure, restricting speaker similarity and privacy. To overcome these limitations, we introduce GenVC, a generative zero-shot voice conversion model. GenVC learns to disentangle linguistic content and speaker style in a self-supervised manner, eliminating the need for external models and enabling efficient training on large, unlabeled datasets. Experimental results show that GenVC achieves state-of-the-art speaker similarity while maintaining naturalness competitive with leading approaches. Its autoregressive generation also allows the converted speech to deviate from the source utterance’s temporal structure. This feature makes GenVC highly effective for voice anonymization, as it minimizes the preservation of source prosody and speaker characteristics, enhancing privacy protection. Copyright © 2025, The Authors. All rights reserved.

Keywords: Anonymity