Search results - Inner Mongolia University Library

OOV Recovery with Efficient 2nd Pass Decoding and Open-vocabulary Word-level RNNLM Rescoring for Hybrid ASR

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xiaohui Zhang; Daniel Povey; Sanjeev Khudanpur. Affiliations: Facebook AI, US; Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD, US

ISBN: (digital) 9781509066315

ISBN: (print) 9781509066322

In this paper, we investigate out-of-vocabulary (OOV) word recovery in hybrid automatic speech recognition (ASR) systems, with emphasis on dynamic vocabulary expansion for both Weighted Finite State Transducer (WFST)-based decoding and word-level RNNLM rescoring. We first describe our OOV candidate generation method based on a hybrid lexical model (HLM) with phoneme-sequence constraints. Next, we introduce a framework for efficient second pass OOV recovery with a dynamically expanded vocabulary, showing that, by calibrating OOV candidates' language model (LM) scores, it significantly improves OOV recovery and overall decoding performance compared to HLM-based first pass decoding. Finally, we propose an open-vocabulary word-level recurrent neural network language model (RNNLM) re-scoring framework, making it possible to re-score ASR hypotheses containing recovered OOVs using a single word-level RNNLM that was ignorant of OOVs when it was trained. By evaluating OOV recovery and overall decoding performance on Spanish/English ASR tasks, we show the proposed OOV recovery pipeline has the potential of an efficient open-vocab word-based ASR decoding framework, with minimal extra computation versus a standard WFST-based decoding and RNNLM rescoring pipeline.

Multi-class spectral clustering with overlaps for speaker diarization

arXiv, 2020

Authors: Raj, Desh; Huang, Zili; Khudanpur, Sanjeev. Affiliations: Center for Language and Speech Processing, United States; Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD 21218, United States

This paper describes a method for overlap-aware speaker diarization. Given an overlap detector and a speaker embedding extractor, our method performs spectral clustering of segments informed by the output of the overlap detector. This is achieved by transforming the discrete clustering problem into a convex optimization problem which is solved by eigen-decomposition. Thereafter, we discretize the solution by alternately applying singular value decomposition and a modified version of non-maximal suppression which is constrained by the output of the overlap detector. Furthermore, we detail an HMM-DNN based overlap detector which performs frame-level classification and enforces duration constraints through HMM state transitions. Our method achieves a test diarization error rate (DER) of 24.0% on the mixed-headset setting of the AMI meeting corpus, which is a relative improvement of 15.2% over a strong agglomerative hierarchical clustering baseline, and compares favorably with other overlap-aware diarization methods. Further analysis on the LibriCSS data demonstrates the effectiveness of the proposed method in high overlap conditions. © 2020, CC-BY.

Keywords: Clustering algorithms

The attentional bias of gelotophobes towards emotion words containing the Chinese character for ‘laugh’: An eye-tracking approach

Current Psychology, 2023, vol. 42, no. 19, pp. 16330-16343

Authors: Lee, Yen-Lin; Chen, Hsueh-Chih; Chan, Yu-Chen. Affiliations: Institute of Learning Sciences and Technologies, National Tsing Hua University, Hsinchu, Taiwan; Department of Educational Psychology and Counseling, National Taiwan Normal University, Taipei, Taiwan; Institute for Research Excellence in Learning Sciences, National Taiwan Normal University, Taipei, Taiwan; Chinese Language and Technology Center, National Taiwan Normal University, Taipei, Taiwan; MOST AI Biomedical Research Center, Taipei, Taiwan; Department of Educational Psychology and Counseling, National Tsing Hua University, 101 Sec. 2 Kuang Fu Road, Hsinchu 30013, Taiwan; Cognitive and Human Affective Neuroscience Laboratory, CHAN Lab, NTHU, Hsinchu, Taiwan; Research Center for Education and Mind Sciences, NTHU, Hsinchu, Taiwan

Gelotophobes are typically characterized by the fear of laughter, social withdrawal, and humorlessness, possibly related to negative experiences of being laughed at in the past. The present study seeks to expand our understanding of gelotophobia through a relatively novel approach: using eye-tracking to investigate the attentional bias of gelotophobes and non-gelotophobes towards negative emotion words that do and do not contain the Chinese character for “laugh,” by comparing responses to negative ridicule words (RID), negative contempt words (CONT), positive pleasure words (PLE) and neutral words (NEU). Results of the start time of the first run of fixations showed that gelotophobes and non-gelotophobes both focused on negative words before other words. Gelotophobes’ attentional bias towards RID and CONT was greater than that of non-gelotophobes in first gaze duration, percentage of total viewing duration, total fixation count, and run count, suggesting that gelotophobes had greater difficulty in disengaging their attention from negative to neutral words. Non-gelotophobes’ attentional bias, however, towards negative ridicule neutral words (RID-NEU) and negative contempt neutral words (CONT-NEU) was greater than that of gelotophobes, suggesting that non-gelotophobes were more able to shift attention from negative to neutral words. Moreover, gelotophobes paid significantly more attention to RID than CONT, suggesting that gelotophobes displayed a longer and stronger attentional bias towards RID (containing the “laugh” character). Interestingly, there was no difference for PLE between gelotophobes and non-gelotophobes. The present study contributes to our understanding of the attentional bias of gelotophobes and non-gelotophobes towards emotion words. © 2019, Springer Science+Business Media, LLC, part of Springer Nature.

Keywords: Contempt words; Emotion; Eye movement; Humor; Laughter; Negative attentional bias; Ridicule words

Eye movement patterns are similar during accurate multiple-target tracking

International Conference on Cognitive Infocommunications (CogInfoCom)

Authors: Kamyar Bagha; Shiva Kamkar; Hamid Abrishami Moghaddam; Lauri Oksama; Jie Li; Jukka Hyönä. Affiliations: Computer Engineering Department, Khatam University, Tehran, Iran; Machine Vision and Medical Image Processing (MVMIP) Laboratory, Faculty of Electrical Engineering, K.N.Toosi University of Technology, Tehran, Iran; Center for International Scientific Studies and Collaboration (CISSC), Tehran, Iran; Department of Psychology and Speech-Language Pathology, University of Turku, Turku, Finland; Center for Cognition and Brain Disorders, Hangzhou Normal University, Hangzhou, China

ISBN: (digital) 9798350378245

ISBN: (print) 9798350378252

Understanding how the brain works is a foundation of cognitive info-communication. To this end, we focus on multiple target tracking (MTT) as a key task that involves two important cognitive factors, attention and memory. Humans track multiple objects in their daily life while facing various challenges, including occlusion and set-size. Eye movement research has shown that there are within- and between-subject differences in scanpaths while performing MTT tasks. However, it is unclear whether there is a winning scan pattern that leads to successful tracking of targets. To answer this question, we used dynamic time warping to compare the similarities between subjects’ scan patterns during an MTT task with different challenges. We studied the effect of set-size, occlusion, and trial response on the similarities. Then a mixed-effects analysis was applied to the output to measure whether the findings were statistically significant. Results demonstrated that scan patterns were more similar when the MTT task was performed correctly. This suggests that there is a common tracking strategy adopted by the viewers that leads to a correct response. Decoding this strategy has countless applications in fields including human-computer interaction, brain-modeling and cognitive info-communication.

Keywords: Human computer interaction; Visualization; Target tracking; Accuracy; Predictive models; Communications technology; Time measurement; Cognition; Decoding; Cognitive science
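
The abstract above compares scan patterns with dynamic time warping (DTW). As a rough illustration only, the following sketch computes the classic DTW distance between two gaze paths given as lists of (x, y) points. The function name `dtw_distance`, the Euclidean point cost, and the sample paths are assumptions made for this example and are not taken from the paper.

```python
import math


def dtw_distance(path_a, path_b):
    """Classic DTW distance between two sequences of (x, y) gaze points."""
    n, m = len(path_a), len(path_b)
    # cost[i][j]: minimal cumulative cost of aligning the first i points
    # of path_a with the first j points of path_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(path_a[i - 1], path_b[j - 1])  # Euclidean cost
            cost[i][j] = step + min(
                cost[i - 1][j],      # path_a advances alone
                cost[i][j - 1],      # path_b advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical gaze samples, for illustration only.
scan_1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
scan_2 = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # same path, slower start
scan_3 = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]              # same shape, shifted up

print(dtw_distance(scan_1, scan_2))
print(dtw_distance(scan_1, scan_3))
```

A smaller distance means more similar scan patterns. Because DTW allows one sequence to linger while the other advances, `scan_2` matches `scan_1` despite the repeated first sample, while the spatially shifted `scan_3` does not.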

MMMORRF: Multimodal Multilingual MOdularized Reciprocal Rank Fusion

arXiv, 2025

Authors: Samuel, Saron; Degenaro, Dan; Guallar-Blasco, Jimena; Sanders, Kate; Eisape, Oluwaseun; Spendlove, Tanner; Reddy, Arun; Martin, Alexander; Yates, Andrew; Yang, Eugene; Carpenter, Cameron; Etter, David; Kayi, Efsun; Wiesner, Matthew; Murray, Kenton; Kriz, Reno. Affiliations: Stanford University, Stanford, CA, United States; Georgetown University, Washington, DC, United States; University of California Berkeley, Berkeley, CA, United States; Johns Hopkins University, United States; Applied Physics Laboratory, Baltimore, MD, United States; Human Language Technology Center of Excellence, Baltimore, MD, United States

Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create a search system MMMORRF that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users’ information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities. © 2025, CC BY.

Keywords: Image coding
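
The abstract above fuses per-modality rankings with a weighted reciprocal rank fusion. The sketch below shows the generic weighted form only, in which each ranked list contributes `weight / (k + rank)` to a document's score. The modality names, the weights, and the constant `k = 60` are assumptions for illustration; the abstract does not specify how MMMORRF chooses its modality-aware weights.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists; each list adds weight / (k + rank) per document."""
    scores = defaultdict(float)
    for modality, ranked_ids in rankings.items():
        weight = weights.get(modality, 1.0)  # unlisted modalities get weight 1
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical per-modality rankings of video ids (illustrative only).
rankings = {
    "ocr_text": ["v1", "v2"],
    "speech_text": ["v2", "v3"],
}

equal = weighted_rrf(rankings, {"ocr_text": 1.0, "speech_text": 1.0})
speech_heavy = weighted_rrf(rankings, {"ocr_text": 1.0, "speech_text": 2.0})
print(equal)
print(speech_heavy)
```

With equal weights, `v1` outranks `v3`; doubling the speech weight moves `v3` above `v1`, which is the kind of effect a modality-aware weighting is meant to control.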

Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

arXiv, 2022

Authors: Żelasko, Piotr; Feng, Siyuan; Velázquez, Laureano Moro; Abavisani, Ali; Bhati, Saurabhchand; Scharenborg, Odette; Hasegawa-Johnson, Mark; Dehak, Najim. Affiliations: Center of Language and Speech Processing, The Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, United States; Human Language Technology Center of Excellence, The Johns Hopkins University, 810 Wyman Park Drive, Baltimore, MD 21218, United States; Multimedia Computing Group, Delft University of Technology, Van Mourik Broekmanweg 6, Delft 2628 XE, Netherlands; Department of Electrical and Computer Engineering, University of Illinois, 405 N Mathews, Urbana, IL 61801, United States

The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. While it has been shown that the pooling of resources from multiple languages is helpful, we have not yet seen a successful application of an ASR model to a language unseen during training. A crucial step in the adaptation of ASR from seen to unseen languages is the creation of the phone inventory of the unseen language. The ultimate goal of our work is to build the phone inventory of a language unseen during training in an unsupervised way without any knowledge about the language. In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way. To that end, we conducted mono-, multi-, and crosslingual experiments on a set of 13 phonetically diverse languages and several in-depth analyses. We found a number of universal phone tokens (IPA symbols) that are well-recognized cross-linguistically. Through a detailed analysis of results, we conclude that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery. © 2022, CC BY-NC-ND.

Keywords: Telephone sets

An asynchronous WFST-based decoder for automatic speech recognition

arXiv, 2021

Authors: Lv, Hang; Chen, Zhehuai; Xu, Hainan; Povey, Daniel; Xie, Lei; Khudanpur, Sanjeev. Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Center of Language and Speech Processing, United States; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States; Xiaomi Corporation, Beijing, China; SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with an on-the-fly composition decoder, which might induce significant computation overhead, the asynchronous dynamic decoder has a novel design with two fronts, one performing "exploration" and the other "backfill". The computation of the two fronts alternates in the decoding process, resulting in more effective pruning than standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder works notably faster than standard one-pass decoding with an on-the-fly composition decoder, and that the speedup becomes more pronounced as data complexity increases. Copyright © 2021, The Authors. All rights reserved.

Keywords: Decoding

Frustratingly easy noise-aware training of acoustic models

arXiv, 2020

Authors: Raj, Desh; Villalba, Jesús; Povey, Daniel; Khudanpur, Sanjeev. Affiliations: Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD 21218, United States; Xiaomi Corp., Beijing, China

Environmental noise and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-fold data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of acoustic models in hybrid ASR. Our noise vectors are obtained by combining the means of speech frames and silence frames in the utterance, where the speech/silence labels may be obtained from a GMM-HMM model trained for ASR alignments, such that no extra computation is required beyond averaging of feature vectors. We show through experiments on AMI and Aurora-4 that this simple adaptation technique can result in 6-7% relative WER improvement. We implement several embedding-based adaptation baselines proposed in the literature, and show that our method outperforms them on both datasets. Finally, we extend our method to the online ASR setting by using frame-level maximum likelihood for the mean estimation. © 2020, CC BY.

Keywords: Speech recognition
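
The abstract above builds an utterance-level noise vector from the means of the speech frames and the silence frames. The following is a minimal sketch of that idea, under the assumption that "combining" means concatenating the two mean vectors. The function names, the zero-vector fallback for an empty class, and the toy feature values are hypothetical, and the speech/silence labels are taken as given.

```python
def mean_vector(frames):
    """Per-dimension mean of a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def noise_vector(features, is_speech):
    """Concatenate the speech-frame mean and the silence-frame mean.

    features:  list of per-frame feature vectors for one utterance
    is_speech: list of booleans, True where the frame is labelled speech
    """
    dim = len(features[0])
    speech = [f for f, s in zip(features, is_speech) if s]
    silence = [f for f, s in zip(features, is_speech) if not s]
    # Assumption: fall back to zeros when one class has no frames.
    speech_mean = mean_vector(speech) if speech else [0.0] * dim
    silence_mean = mean_vector(silence) if silence else [0.0] * dim
    return speech_mean + silence_mean  # list concatenation


# Toy 2-dimensional features for a 4-frame utterance (illustrative only).
features = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [2.0, 2.0]]
is_speech = [True, True, False, False]
print(noise_vector(features, is_speech))
```

In a noise-aware training setup, a vector like this would be supplied alongside every frame of the utterance as an auxiliary input to the acoustic model; how it is injected is not specified in the abstract.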

Mixture of speaker-type PLDAs for children's speech diarization

arXiv, 2020

Authors: Xie, Jiamin; Sia, Suzanna; García, Paola; Povey, Daniel; Khudanpur, Sanjeev. Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, United States; Xiaomi Corp., Beijing, China

In diarization, the PLDA is typically used to model an inference structure which assumes the variation in speech segments is induced by various speakers. The speaker variation is then learned from the training data. However, human perception can differentiate speakers by age and gender, among other characteristics. In this paper, we investigate a speaker-type informed model that explicitly captures the known variation of speakers. We explore a mixture of three PLDA models, where each model represents an adult female, male, or child category. The weighting of each model is decided by the prior probability of its respective class, which we study. The evaluation is performed on a subset of the BabyTrain corpus. We examine the expected performance gain using the oracle speaker type labels, which yields an 11.7% DER reduction. We introduce a novel baby vocalization augmentation technique and then compare the mixture model to the single model. Our experimental result shows an effective 0.9% DER reduction obtained by adding vocalizations. We discover empirically that a balanced dataset is important to train the mixture PLDA model, which outperforms the single PLDA by 1.3% using the same training data and achieves a 35.8% DER. The same setup improves over a standard baseline by 2.8% DER. Index Terms: speaker diarization, children’s speech, transformer encoder, mixture of PLDAs. Copyright © 2020, The Authors. All rights reserved.

Keywords: Mixtures
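
The abstract above weights three speaker-type PLDA models by the prior probability of each class. One plausible way to read this, shown here purely as an assumed formulation and not as the authors' scoring rule, is a prior-weighted mixture of likelihoods: the same-speaker and different-speaker likelihoods are each mixed across classes before taking their log ratio. The function names, class labels, priors, and log-likelihood values are all hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mixture_llr(priors, log_same, log_diff):
    """Log-likelihood ratio under a prior-weighted mixture of models.

    priors:   class -> prior probability (positive, summing to 1)
    log_same: class -> log-likelihood of the pair under "same speaker"
    log_diff: class -> log-likelihood of the pair under "different speakers"
    """
    numerator = logsumexp([math.log(priors[c]) + log_same[c] for c in priors])
    denominator = logsumexp([math.log(priors[c]) + log_diff[c] for c in priors])
    return numerator - denominator


# Hypothetical priors and per-model log-likelihoods (illustrative only).
priors = {"female": 0.4, "male": 0.4, "child": 0.2}
log_same = {c: math.log(0.2) for c in priors}
log_diff = {c: math.log(0.1) for c in priors}
print(mixture_llr(priors, log_same, log_diff))
```

When every component model agrees, as in this toy input, the mixture ratio equals the single-model ratio; the priors only matter when the models disagree about a pair of segments.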