Search Results - Inner Mongolia University Library

Wake Word Detection with Streaming Transformers

IEEE International Conference on Acoustics, Speech and Signal Processing

Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur; Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA; School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Xiaomi Corporation, Beijing, China; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and its quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking ahead to the next chunk, gradient stopping, different positional embedding methods and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolutional network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.

Keywords: Tensors; Convolution; System performance; Conferences; Neural networks; Acoustics; Complexity theory
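
As an illustration of the chunk-wise attention idea mentioned in the abstract above, the following is a minimal sketch of an attention mask in which each frame sees its own chunk, a fixed number of previous chunks, and the next chunk. The function name, the parameters, and the choice of one look-back chunk are assumptions made for illustration and are not taken from the paper. Because every frame attends to a bounded number of keys, the cost grows linearly with the sequence length.

```python
def chunk_attention_mask(num_frames, chunk_size, look_ahead_chunks=1, look_back_chunks=1):
    """Build a chunk-wise attention mask (hypothetical helper, not the paper's code).

    mask[q][k] is True when query frame q may attend to key frame k.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        # First visible frame: start of the earliest look-back chunk.
        lo = (q_chunk - look_back_chunks) * chunk_size
        # One past the last visible frame: end of the latest look-ahead chunk.
        hi = (q_chunk + look_ahead_chunks + 1) * chunk_size
        mask.append([lo <= k < hi for k in range(num_frames)])
    return mask


# Example: 8 frames in chunks of 2, with one chunk of left and right context.
mask = chunk_attention_mask(num_frames=8, chunk_size=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Each row contains at most `(look_back_chunks + 1 + look_ahead_chunks) * chunk_size` visible keys, regardless of `num_frames`.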

Eye movement patterns are similar during accurate multiple-target tracking

International Conference on Cognitive Infocommunications (CogInfoCom)

Authors: Kamyar Bagha, Shiva Kamkar, Hamid Abrishami Moghaddam, Lauri Oksama, Jie Li, Jukka Hyönä; Computer Engineering Department, Khatam University, Tehran, Iran; Machine Vision and Medical Image Processing (MVMIP) Laboratory, Faculty of Electrical Engineering, K.N.Toosi University of Technology, Tehran, Iran; Center for International Scientific Studies and Collaboration (CISSC), Tehran, Iran; Department of Psychology and Speech-Language Pathology, University of Turku, Turku, Finland; Center for Cognition and Brain Disorders, Hangzhou Normal University, Hangzhou, China

ISBN: (digital) 9798350378245

ISBN: (print) 9798350378252

Understanding how the brain works is a basis of cognitive info-communication. To this aim, we focus on multiple target tracking (MTT) as a key task that involves two important cognitive factors, attention and memory. Humans track multiple objects in their daily life while facing various challenges including occlusion and set-size. Eye movement research has shown that there are within- and between-subject differences in scanpaths while performing MTT tasks. However, it is unclear whether there is a winning scan pattern that would lead to successful tracking of targets. To answer this question, we used dynamic time warping to compare the similarities between subjects’ scan patterns during an MTT task with different challenges. We studied the effect of set-size, occlusion, and trial response on the similarities. Then a mixed-effects analysis was applied to the output to measure whether the findings were statistically significant. Results demonstrated that scan patterns were more similar when the MTT task was performed correctly. This suggests that there is a common tracking strategy adopted by the viewers that leads to a correct response. Decoding this strategy has countless applications in fields including human-computer interaction, brain modeling and cognitive info-communication.

Keywords: Human computer interaction; Visualization; Target tracking; Accuracy; Predictive models; Communications technology; Time measurement; Cognition; Decoding; Cognitive science
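
The abstract above names dynamic time warping (DTW) as the tool for comparing scan patterns. The following is a minimal sketch of the textbook DTW recurrence applied to two scanpaths, assuming each scanpath is a list of (x, y) gaze positions and that the local cost is Euclidean distance. Both the data layout and the cost function are assumptions for illustration, since the record does not state how the authors configured DTW.

```python
import math


def dtw_distance(path_a, path_b):
    """Classic DTW cost between two sequences of (x, y) points (illustrative sketch)."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i points of A with the first j of B.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in A only
                cost[i][j - 1],      # advance in B only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Two hypothetical scanpaths that visit the same locations at different speeds.
fast = [(0, 0), (1, 0), (2, 0)]
slow = [(0, 0), (0, 0), (1, 0), (2, 0)]
print(dtw_distance(fast, slow))
```

A lower cost means more similar scanpaths. Because DTW allows one path to linger on a point, differences in timing alone do not increase the cost.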

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

arXiv, 2020

Authors: Arora, Ashish; Raj, Desh; Subramanian, Aswin Shanmugam; Li, Ke; Ben-Yair, Bar; Maciejewski, Matthew; Zelasko, Piotr; García, Paola; Watanabe, Shinji; Khudanpur, Sanjeev; Center for Language and Speech Processing & Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, United States

This paper summarizes the JHU team’s efforts in tracks 1 and 2 of the CHiME-6 challenge for distant multi-microphone conversational speech diarization and recognition in everyday home environments. We explore multi-array processing techniques at each stage of the pipeline, such as multi-array guided source separation (GSS) for enhancement and acoustic model training data, posterior fusion for speech activity detection, PLDA score fusion for diarization, and lattice combination for automatic speech recognition (ASR). We also report results with different acoustic model architectures, and integrate other techniques such as online multi-channel weighted prediction error (WPE) dereverberation and variational Bayes-hidden Markov model (VB-HMM) based overlap assignment to deal with reverberation and overlapping speakers, respectively. As a result of these efforts, our ASR systems achieve word error rates of 40.5% and 67.5% on tracks 1 and 2, respectively, on the evaluation set. This is an absolute improvement of 10.8% and 10.4% over the challenge baselines for the respective tracks. Copyright © 2020, The Authors. All rights reserved.

Keywords: Microphones

WIDER & CLOSER: Mixture of Short-channel Distillers for Zero-shot Cross-lingual Named Entity Recognition

arXiv, 2022

Authors: Ma, Jun-Yu; Chen, Beiduo; Gu, Jia-Chen; Ling, Zhen-Hua; Guo, Wu; Liu, Quan; Chen, Zhigang; Liu, Cong; National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; State Key Laboratory of Cognitive Intelligence, China; iFLYTEK Research, Hefei, China; Jilin Kexun Information Technology Co. Ltd, China

Zero-shot cross-lingual named entity recognition (NER) aims at transferring knowledge from annotated and rich-resource data in source languages to unlabeled and lean-resource data in target languages. Existing mainstream methods based on the teacher-student distillation framework ignore the rich and complementary information lying in the intermediate layers of pre-trained language models, and domain-invariant information is easily lost during transfer. In this study, a mixture of short-channel distillers (MSD) method is proposed to let the rich hierarchical information in the teacher model interact fully and to transfer knowledge to the student model sufficiently and efficiently. Concretely, a multi-channel distillation framework is designed for sufficient information transfer by aggregating multiple distillers as a mixture. Besides, an unsupervised method adopting parallel domain adaptation is proposed to shorten the channels between the teacher and student models to preserve domain-invariant features. Experiments on four datasets across nine languages demonstrate that the proposed method achieves new state-of-the-art performance on zero-shot cross-lingual NER and shows great generalization and compatibility across languages and fields. Copyright © 2022, The Authors. All rights reserved.

Keywords: Distillation

CopyPaste: An augmentation method for speech emotion recognition

arXiv, 2020

Authors: Pappagari, Raghavendra; Villalba, Jesús; Zelasko, Piotr; Moro-Velazquez, Laureano; Dehak, Najim; Center for Language and Speech Processing, United States; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictates a speaker’s overall perceived emotion in a recording, the concatenation of an emotional utterance (emotion E) and a neutral utterance can still be labeled with emotion E. We hypothesize that SER performance can be improved using these concatenated utterances in model training. To verify this, three CopyPaste schemes are tested on two deep learning models: one trained independently and another using transfer learning from an x-vector model, a speaker recognition model. We observed that all three CopyPaste schemes improve SER performance on all three datasets considered: MSP-Podcast, Crema-D, and IEMOCAP. Additionally, CopyPaste performs better than noise augmentation, and using them together improves the SER performance further. Our experiments on noisy test sets suggested that CopyPaste is effective even in noisy test conditions. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition
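
The labeling rule described in the abstract above (an emotional utterance concatenated with a neutral one keeps emotion E) can be sketched as follows. This is only an illustration of that single rule: the record does not describe the three CopyPaste schemes, so the function name, the `(samples, label)` data layout, and the random choice of concatenation order are all assumptions.

```python
import random


def copy_paste(emotional, neutral, rng=random):
    """Concatenate an emotional and a neutral utterance (hypothetical helper).

    Each utterance is a (samples, label) pair, where samples is a list of
    audio sample values. The result keeps the emotional utterance's label.
    """
    emo_samples, emo_label = emotional
    neu_samples, neu_label = neutral
    if neu_label != "neutral":
        raise ValueError("second utterance must be labeled 'neutral'")
    # Assumption: the order of the two segments is picked at random.
    if rng.random() < 0.5:
        samples = emo_samples + neu_samples
    else:
        samples = neu_samples + emo_samples
    return samples, emo_label


# Toy waveforms standing in for real audio.
angry = ([0.9, -0.8, 0.7], "angry")
calm = ([0.1, 0.0], "neutral")
augmented_samples, augmented_label = copy_paste(angry, calm, rng=random.Random(0))
print(len(augmented_samples), augmented_label)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs.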

OOV Recovery with Efficient 2nd Pass Decoding and Open-vocabulary Word-level RNNLM Rescoring for Hybrid ASR

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xiaohui Zhang, Daniel Povey, Sanjeev Khudanpur; Facebook AI, US; Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD, US

ISBN: (digital) 9781509066315

ISBN: (print) 9781509066322

In this paper, we investigate out-of-vocabulary (OOV) word recovery in hybrid automatic speech recognition (ASR) systems, with emphasis on dynamic vocabulary expansion for both Weighted Finite-State Transducer (WFST)-based decoding and word-level RNNLM rescoring. We first describe our OOV candidate generation method based on a hybrid lexical model (HLM) with phoneme-sequence constraints. Next, we introduce a framework for efficient second-pass OOV recovery with a dynamically expanded vocabulary, showing that, by calibrating OOV candidates' language model (LM) scores, it significantly improves OOV recovery and overall decoding performance compared to HLM-based first-pass decoding. Finally, we propose an open-vocabulary word-level recurrent neural network language model (RNNLM) rescoring framework, making it possible to rescore ASR hypotheses containing recovered OOVs using a single word-level RNNLM that was ignorant of OOVs when it was trained. By evaluating OOV recovery and overall decoding performance on Spanish/English ASR tasks, we show that the proposed OOV recovery pipeline has the potential to serve as an efficient open-vocabulary word-based ASR decoding framework, with minimal extra computation versus a standard WFST-based decoding and RNNLM rescoring pipeline.

Multi-class spectral clustering with overlaps for speaker diarization

arXiv, 2020

Authors: Raj, Desh; Huang, Zili; Khudanpur, Sanjeev; Center for Language and Speech Processing, United States; Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD 21218, United States

This paper describes a method for overlap-aware speaker diarization. Given an overlap detector and a speaker embedding extractor, our method performs spectral clustering of segments informed by the output of the overlap detector. This is achieved by transforming the discrete clustering problem into a convex optimization problem which is solved by eigen-decomposition. Thereafter, we discretize the solution by alternately using singular value decomposition and a modified version of non-maximal suppression which is constrained by the output of the overlap detector. Furthermore, we detail an HMM-DNN based overlap detector which performs frame-level classification and enforces duration constraints through HMM state transitions. Our method achieves a test diarization error rate (DER) of 24.0% on the mixed-headset setting of the AMI meeting corpus, which is a relative improvement of 15.2% over a strong agglomerative hierarchical clustering baseline, and compares favorably with other overlap-aware diarization methods. Further analysis on the LibriCSS data demonstrates the effectiveness of the proposed method in high overlap conditions. © 2020, CC-BY.

Keywords: Clustering algorithms

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

IEEE Spoken Language Technology Workshop

Authors: Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey; Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD; Institute for Natural Language Processing, University of Stuttgart, Germany; Microsoft Corp, Redmond, WA; Google Research, Cambridge, MA; University of Science and Technology of China, Hefei, China; Columbia University, NY

ISBN: (digital) 9781728170664

ISBN: (print) 9781728170671

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.

Keywords: Measurement; Error analysis; Conferences; Pipelines; Speech recognition; Task analysis; Automatic speech recognition

Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

arXiv, 2022

Authors: Żelasko, Piotr; Feng, Siyuan; Velázquez, Laureano Moro; Abavisani, Ali; Bhati, Saurabhchand; Scharenborg, Odette; Hasegawa-Johnson, Mark; Dehak, Najim; Center of Language and Speech Processing, The Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, United States; Human Language Technology Center of Excellence, The Johns Hopkins University, 810 Wyman Park Drive, Baltimore, MD 21218, United States; Multimedia Computing Group, Delft University of Technology, Van Mourik Broekmanweg 6, Delft 2628 XE, Netherlands; Department of Electrical and Computer Engineering, University of Illinois, 405 N Mathews, Urbana, IL 61801, United States

The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. While it has been shown that the pooling of resources from multiple languages is helpful, we have not yet seen a successful application of an ASR model to a language unseen during training. A crucial step in the adaptation of ASR from seen to unseen languages is the creation of the phone inventory of the unseen language. The ultimate goal of our work is to build the phone inventory of a language unseen during training in an unsupervised way without any knowledge about the language. In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way. To that end, we conducted mono-, multi-, and crosslingual experiments on a set of 13 phonetically diverse languages and several in-depth analyses. We found a number of universal phone tokens (IPA symbols) that are well-recognized cross-linguistically. Through a detailed analysis of results, we conclude that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery. © 2022, CC BY-NC-ND.

Keywords: Telephone sets