Search results - Inner Mongolia University Library

Learning speaker embedding from text-to-speech

arXiv, 2020

Authors: Cho, Jaejin; Zelasko, Piotr; Villalba, Jesús; Watanabe, Shinji; Dehak, Najim. Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

Zero-shot multi-speaker text-to-speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesize that the embeddings will contain minimal phonetic information since the TTS decoder will obtain that information from the textual input. TTS reconstruction can also be combined with speaker classification to enhance these embeddings further. Once trained, the speaker encoder computes representations for the speaker verification task, while the rest of the TTS blocks are discarded. We investigated training TTS from either manual or ASR-generated transcripts. The latter allows us to train embeddings on datasets without manual transcripts. We compared ASR transcripts and Kaldi phone alignments as TTS inputs, showing that the latter performed better due to their finer resolution. Unsupervised TTS embeddings improved EER by 2.06% absolute over i-vectors on the LibriTTS dataset. TTS with speaker classification loss improved EER by 0.28% and 0.73% absolute over a model using only speaker classification loss on LibriTTS and VoxCeleb1, respectively. Copyright © 2020, The Authors. All rights reserved.

Keywords: Embeddings
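
The abstract of "Learning speaker embedding from text-to-speech" reports its results as equal error rate (EER). As a rough illustration of that metric, the sketch below sweeps a decision threshold over trial scores and returns the operating point where the false-acceptance and false-rejection rates are closest. The function name `equal_error_rate` and the toy scores are assumptions made for this example; this is not the scoring tool used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of same-speaker trials (higher = more similar).
    nontarget_scores: scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for th in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores (hypothetical): one target and one non-target trial are misranked.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%}")
```

An "absolute" improvement, as quoted in the abstract, is the plain difference between two such percentages, not a ratio.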

Single channel far field feature enhancement for speaker verification in the wild

arXiv, 2020

Authors: Nidadavolu, Phani Sankar; Kataria, Saurabh; Perera, Paola Garcia; Villalba, Jesus; Dehak, Najim. Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

We investigated an enhancement approach and a domain adaptation approach to make speaker verification systems robust to perturbations of far-field speech. In the enhancement approach, using paired (parallel) reverberant-clean speech, we trained a supervised Generative Adversarial Network (GAN) along with a feature mapping loss. For the domain adaptation approach, we trained a Cycle Consistent Generative Adversarial Network (CycleGAN), which maps features from the far-field domain to the speaker embedding training domain. This was trained on unpaired data in an unsupervised manner. Both networks, termed Supervised Enhancement Network (SEN) and Domain Adaptation Network (DAN) respectively, were trained with multi-task objectives in the (filter-bank) feature domain. On a simulated test setup, we first note the benefit of using feature mapping (FM) loss along with adversarial loss in SEN. Then, we tested both supervised and unsupervised approaches on several real noisy datasets. We observed relative improvements ranging from 2% to 31% in terms of DCF. Using three training schemes, we also establish the effectiveness of the novel DAN approach. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech enhancement

Wake Word Detection with Streaming Transformers

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Yiming Wang; Hang Lv; Daniel Povey; Lei Xie; Sanjeev Khudanpur. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA; School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Xiaomi Corporation, Beijing, China; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks thanks to their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and its quadratic time and space complexity. In this paper, we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking ahead to the next chunk, gradient stopping, different positional embedding methods, and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolutional network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.

Keywords: Tensors; Convolution; System performance; Conferences; Neural networks; Acoustics; Complexity theory
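
To make the chunk-wise streaming idea in "Wake Word Detection with Streaming Transformers" more concrete, the sketch below builds a boolean attention mask in which each frame attends only to its own chunk, the previous chunk, and (optionally) the next chunk. The function name `chunk_attention_mask`, the one-chunk left context, and the toy sizes are assumptions for illustration; the paper's actual variants (gradient stopping, positional embeddings, same-layer dependency) are not modeled here.

```python
def chunk_attention_mask(num_frames, chunk_size, look_ahead=True):
    """Return mask[i][j] == True when query frame i may attend to key frame j.

    Assumption: each chunk sees one chunk of left context, plus one chunk of
    right context when look_ahead is True.
    """
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size
        first = chunk_i - 1                            # previous chunk
        last = chunk_i + 1 if look_ahead else chunk_i  # next chunk if allowed
        mask.append([first <= (j // chunk_size) <= last
                     for j in range(num_frames)])
    return mask


mask = chunk_attention_mask(num_frames=6, chunk_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because every row allows at most three chunks of keys regardless of the sequence length, the number of attended positions grows linearly with the number of frames, which is the property the abstract highlights.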

Creating Multimedia Summaries Using Tweets and Videos

arXiv, 2022

Authors: Andy, Anietie; Liu, Siyi; Ippolito, Daphne; Kriz, Reno; Callison-Burch, Chris; Wijaya, Derry. Affiliations: Penn Medicine, University of Pennsylvania, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States; Boston University, United States

While popular televised events such as presidential debates or TV shows are airing, people provide commentary on them in real time. In this paper, we propose a simple yet effective approach to combine social media commentary and videos to create a multimedia summary of televised events. Our approach identifies scenes from these events based on spikes in mentions of the people involved, and automatically selects tweets and video frames from the time period of each spike that talk about and show the people being discussed. © 2022, CC BY.

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

arXiv, 2020

Authors: Arora, Ashish; Raj, Desh; Subramanian, Aswin Shanmugam; Li, Ke; Ben-Yair, Bar; Maciejewski, Matthew; Zelasko, Piotr; García, Paola; Watanabe, Shinji; Khudanpur, Sanjeev. Affiliations: Center for Language and Speech Processing & Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, United States

This paper summarizes the JHU team’s efforts in tracks 1 and 2 of the CHiME-6 challenge for distant multi-microphone conversational speech diarization and recognition in everyday home environments. We explore multi-array processing techniques at each stage of the pipeline, such as multi-array guided source separation (GSS) for enhancement and acoustic model training data, posterior fusion for speech activity detection, PLDA score fusion for diarization, and lattice combination for automatic speech recognition (ASR). We also report results with different acoustic model architectures, and integrate other techniques such as online multi-channel weighted prediction error (WPE) dereverberation and variational Bayes-hidden Markov model (VB-HMM) based overlap assignment to deal with reverberation and overlapping speakers, respectively. As a result of these efforts, our ASR systems achieve word error rates of 40.5% and 67.5% on tracks 1 and 2, respectively, on the evaluation set. This is an improvement of 10.8% and 10.4% absolute over the challenge baselines for the respective tracks. Copyright © 2020, The Authors. All rights reserved.

Keywords: Microphones

Alzheimer’s Together with Mild Cognitive Impairment Screening Using Polar Transformation of Middle Zone of Fundus Images Based Deep Learning

Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)

Authors: G. Luengnaruemitchai; W. Kaewmahanin; A. Munthuli; P. Phienphanich; S. Puangarom; S. Sangchocanonta; S. Jariyakosol; P. Hirunwiwatkul; C. Tantibundhit. Affiliations: Center of Excellence in Intelligent Informatics, Speech and Language Technology, and Service Innovation (CILS), Faculty of Engineering, Thammasat School of Engineering, Thammasat University, Bangkok, Thailand; CILS and International School, Bangkok, Thailand; Department of Ophthalmology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand

Alzheimer’s disease (AD) and Mild Cognitive Impairment (MCI) are an increasingly major health problem among the elderly. However, current clinical methods of Alzheimer’s detection are expensive and difficult to access, making detection inconvenient and unsuitable for developing countries such as Thailand. Thus, we developed a method of screening for AD together with MCI by fine-tuning a pre-trained Densely Connected Convolutional Network (DenseNet-121) model using the middle zone of polar-transformed fundus images. The polar transformation of the middle zone of the fundus is a key factor that helps the model extract features more effectively and enhances the model accuracy. The dataset was divided into 2 groups: normal and abnormal (AD and MCI). This method can classify between normal and abnormal patients with 96% accuracy, 99% sensitivity, 90% specificity, 95% precision, and 97% F1 score. The parts of both MCI and AD input images that most impact the classification score, as visualized by Grad-CAM++, are concentrated in the superior and inferior retinal quadrants. Clinical relevance: The parts of both MCI and AD input images that have the most impact on the classification score (visualized by Grad-CAM++) are the superior and inferior retinal quadrants. Polar transformation of the middle zone of retinal fundus images is a key factor that enhances the classification accuracy.

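
The fundus-screening abstract above relies on a polar transformation of an annular "middle zone" of the image. The sketch below shows one generic way to unwrap such a ring into a rectangular grid using nearest-neighbor sampling. The function name `polar_transform`, the radii, the image center, and the toy image are all assumptions for illustration; the abstract does not specify how the middle zone is defined or how the authors implemented the transform.

```python
import math


def polar_transform(image, center, r_inner, r_outer, n_radii, n_angles):
    """Unwrap the ring r_inner <= r <= r_outer around `center` into a grid.

    image: 2-D list of pixel values indexed as image[y][x].
    Output row k corresponds to one radius, output column a to one angle.
    Samples that fall outside the image are filled with 0.
    """
    cy, cx = center
    height, width = len(image), len(image[0])
    out = []
    for k in range(n_radii):
        r = r_inner + (r_outer - r_inner) * k / max(n_radii - 1, 1)
        row = []
        for a in range(n_angles):
            theta = 2 * math.pi * a / n_angles
            y = int(round(cy + r * math.sin(theta)))
            x = int(round(cx + r * math.cos(theta)))
            inside = 0 <= y < height and 0 <= x < width
            row.append(image[y][x] if inside else 0)
        out.append(row)
    return out


# Toy 5x5 "image" whose pixel value encodes its position as 10*y + x.
toy = [[10 * y + x for x in range(5)] for y in range(5)]
ring = polar_transform(toy, center=(2, 2), r_inner=1, r_outer=2,
                       n_radii=2, n_angles=4)
print(ring)
```

A real pipeline would more likely use bilinear interpolation and many more angular samples; nearest-neighbor sampling is used here only to keep the example short.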
MegaWika: Millions of reports and their sources across 50 diverse languages

arXiv, 2023

Authors: Barham, Samuel; Weller, Orion; Yuan, Michelle; Murray, Kenton; Yarmohammadi, Mahsa; Jiang, Zhengping; Vashishtha, Siddharth; Martin, Alexander; Liu, Anqi; White, Aaron Steven; Boyd-Graber, Jordan; Van Durme, Benjamin. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, United States; Johns Hopkins University, United States; University of Maryland, College Park, United States; University of Rochester, United States; Amazon UMD, United States

To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval. © 2023, CC BY-SA.

Keywords: Semantics

Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

arXiv, 2022

Authors: Cho, Jaejin; Villalba, Jesús; Moro-Velazquez, Laureano; Dehak, Najim. Affiliations: Johns Hopkins University, Baltimore, MD 21218, United States; The Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, United States

In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning of utterance-level speech representations can be used in speech applications that require a discriminative representation of attributes that stay consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be used as utterance-level representations with pooling, but the models are usually large. There are also self-supervised learning techniques that learn utterance-level representations directly. One of the most successful is the contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, without labels, this cannot ensure that all the negative samples belong to classes different from the anchor class. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to an x-vector trained in a supervised manner. When transferred to downstream tasks (speaker verification, speech emotion recognition, and Alzheimer's disease detection), DINO outperformed the x-vector. We studied the influence of several aspects of transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpassed fine-tuning everything at once. Using shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying that speech segments of at least a certain length are required for good performance in each application. Augmentation was helpful in speech emotion recognition. Copyright © 2022, The Authors. All rights reserved.

Keywords: Distillation
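
The entry above adapts DINO, which trains a student network to match a teacher's output distribution without negative samples. The sketch below shows the core per-sample loss as it is commonly described for DINO: a cross-entropy between a centered, sharpened teacher distribution and the student distribution. The function names, the temperature values, and the toy logits are assumptions for illustration; the teacher's moving-average update and the multi-crop strategy are omitted, and nothing here is taken from the paper's own code.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              t_student=0.1, t_teacher=0.04):
    """Cross-entropy from the centered, sharpened teacher to the student.

    center: running mean of teacher logits, subtracted to discourage collapse.
    The temperatures are illustrative values, not settings from the paper.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = softmax(centered, t_teacher)
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


center = [0.0, 0.0, 0.0]
agree = dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
disagree = dino_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], center)
print(agree < disagree)
```

Only positive pairs (two views of the same utterance fed to student and teacher) enter this loss, which is why no negative sampling is needed.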

Speaker diarization using two-pass leave-one-out Gaussian PLDA clustering of DNN embeddings

arXiv, 2021

Authors: Karra, Kiran; McCree, Alan. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

Many modern systems for speaker diarization, such as the recently-developed VBx approach, rely on clustering of DNN speaker embeddings followed by resegmentation. Two problems with this approach are that the DNN is not directly optimized for this task, and the parameters need significant retuning for different applications. We have recently presented progress in this direction with a Leave-One-Out Gaussian PLDA (LGP) clustering algorithm and an approach to training the DNN such that embeddings directly optimize performance of this scoring method. This paper presents a new two-pass version of this system, where the second pass uses finer time resolution to significantly improve overall performance. For the Callhome corpus, we achieve the first published error rate below 4% without any task-dependent parameter tuning. We also show significant progress towards a robust single solution for multiple diarization tasks. Copyright © 2021, The Authors. All rights reserved.

Keywords: Embeddings