We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented an...
This paper describes CMU’s submission to the IWSLT 2023 simultaneous speech translation shared task for translating English speech to both German text and speech in a streaming fashion. We first build offline speech-...
Self-supervised learning (SSL) methods which learn representations of data without explicit supervision have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these mo...
The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track. For PLAID we included a variety of models and training techniques – the English model released with ColBERT v2,...
ISBN (digital): 9798350368741
ISBN (print): 9798350368758
We describe our submission to the 2024 VoicePrivacy Attacker Challenge. We propose three main categories of methods to improve ASV performance against anonymized speech: improvements to the underlying classifier, alternative distance metrics when computing ASV scores, and kNN-VC normalization. By simultaneously employing one or more of these methods, we were able to achieve a significant reduction in EER against all of the submitted anonymization systems in the VoicePrivacy Challenge.
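The abstract above evaluates attacks by the reduction in equal error rate (EER) of ASV scores. As a minimal illustrative sketch (not the authors' system), ASV trial scoring with cosine similarity between speaker embeddings, and EER as the operating point where false-accept and false-reject rates meet, can be written as follows; the function names and the simple threshold sweep are assumptions for illustration:

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    # Cosine similarity between two speaker embeddings (a common ASV score).
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def equal_error_rate(target_scores, nontarget_scores):
    # Sweep candidate thresholds over all observed scores and return the
    # EER at the point where false-accept and false-reject rates are closest.
    scores = np.concatenate([target_scores, nontarget_scores])
    best_gap, eer = 1.0, 1.0
    for t in np.sort(scores):
        far = np.mean(nontarget_scores >= t)  # impostor trials accepted
        frr = np.mean(target_scores < t)      # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)
```

With well-separated genuine and impostor scores the EER is 0; anonymization pushes the two score distributions together, raising the EER, and the attack methods described above aim to pull them apart again.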
Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech whi...
Self-supervised methods such as Contrastive Predictive Coding (CPC) have greatly improved the quality of unsupervised representations. These representations significantly reduce the amount of labeled data needed f...
Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
With careful manipulation, malicious agents can reverse engineer private information encoded in pre-trained language models. Security concerns motivate the development of quantum pre-training. In this work, we propose a highly portable quantum language model (PQLM) that can easily transmit information to downstream tasks on classical machines. The framework consists of a cloud PQLM built with random Variational Quantum Classifiers (VQC) and local models for downstream applications. We demonstrate the ad hoc portability of the quantum model by extracting only the word embeddings and effectively applying them to downstream tasks on classical machines. Our PQLM exhibits comparable performance to its classical counterpart on both intrinsic evaluation (loss, perplexity) and extrinsic evaluation (multilingual sentiment analysis accuracy) metrics. We also perform ablation studies on the factors affecting PQLM performance to analyze model stability. Our work establishes a theoretical foundation for a portable quantum pre-trained language model that could be trained on private data and made available for public use with privacy protection guarantees.
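The portability claim above rests on exporting only the word embeddings from the cloud PQLM and reusing them in a classical downstream model. A minimal sketch of that transfer step, with entirely hypothetical embedding values and a toy nearest-centroid sentiment classifier standing in for the paper's downstream models:

```python
import numpy as np

# Hypothetical embeddings exported from a cloud PQLM, keyed by token.
# In the paper these would be derived from the trained VQC; the values
# here are made up for illustration.
pqlm_embeddings = {
    "good":  np.array([0.9, 0.1]),
    "great": np.array([0.8, 0.2]),
    "bad":   np.array([0.1, 0.9]),
    "awful": np.array([0.2, 0.8]),
}

def embed_sentence(tokens):
    # Mean-pool the exported embeddings; tokens without an embedding are skipped.
    vecs = [pqlm_embeddings[t] for t in tokens if t in pqlm_embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def nearest_centroid_sentiment(tokens, pos_centroid, neg_centroid):
    # Classical downstream classifier: label by the nearer class centroid.
    v = embed_sentence(tokens)
    if np.linalg.norm(v - pos_centroid) < np.linalg.norm(v - neg_centroid):
        return "positive"
    return "negative"
```

The point of the sketch is that once the embeddings are exported as plain arrays, everything downstream runs on a classical machine with no quantum hardware in the loop.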
It has been shown in the literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with human brain activations during speech perception, and that fine-tuning speech representation m...