Search results - Inner Mongolia University Library

End-to-end multi-speaker speech recognition with transformer

arXiv, 2020

Authors: Chang, Xuankai; Zhang, Wangyou; Qian, Yanmin; Le Roux, Jonathan; Watanabe, Shinji

Affiliations: Center for Language and Speech Processing Johns Hopkins University United States MoE Key Lab of Artificial Intelligence & SpeechLab Department of Computer Science and Engineering Shanghai Jiao Tong University China United States

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down to 12.1% and 6.4% WER, under the anechoic condition in single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER. Copyright © 2020, The Authors. All rights reserved.

Keywords: Beamforming
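The abstract above reports each result as a relative WER reduction together with the resulting WER. As a rough reading aid, the sketch below back-computes the baseline WER those two numbers imply. It assumes the relative reduction is defined as (baseline - new) / baseline. The function name `implied_baseline_wer` is hypothetical, and the derived baselines are arithmetic consequences of the reported figures, not values quoted from the paper.

```python
def implied_baseline_wer(new_wer: float, relative_reduction: float) -> float:
    """Return the baseline WER (in percent) implied by a new WER and a
    relative reduction, both given in percent.

    Assumes relative_reduction = (baseline - new) / baseline * 100.
    """
    return new_wer / (1.0 - relative_reduction / 100.0)


# (condition, relative WER reduction %, resulting WER %) as stated in the abstract.
reported = [
    ("anechoic, single-channel", 40.9, 12.1),
    ("anechoic, multi-channel", 25.6, 6.4),
    ("reverberant, single-channel", 41.5, 16.5),
    ("reverberant, multi-channel", 13.8, 15.2),
]

for condition, reduction, wer in reported:
    baseline = implied_baseline_wer(wer, reduction)
    print(f"{condition}: implied baseline WER of about {baseline:.1f}%")
```

Because the reported figures are rounded, the implied baselines are approximate.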

Wake Word Detection with Alignment-Free Lattice-Free MMI

arXiv, 2020

Authors: Wang, Yiming; Lv, Hang; Povey, Daniel; Xie, Lei; Khudanpur, Sanjeev

Affiliations: Center for Language and Speech Processing Human Language Technology Center of Excellence Johns Hopkins University Baltimore, MD United States Xiaomi Inc. Beijing China ASLP@NPU School of Computer Science Northwestern Polytechnical University Xi’an China

Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing 50%–90% reduction in false rejection rates at prespecified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set. Copyright © 2020, The Authors. All rights reserved.

Keywords: Wakes

Using ASR methods for OCR

15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019

Authors: Arora, Ashish; Garcia, Paola; Watanabe, Shinji; Manohar, Vimal; Shao, Yiwen; Khudanpur, Sanjeev; Chang, Chun Chieh; Rekabdar, Babak; Babaali, Bagher; Povey, Daniel; Etter, David; Raj, Desh; Hadian, Hossein; Trmal, Jan

Affiliations: Center for Language and Speech Processing Johns Hopkins University Baltimore United States Human Language Technology Center of Excellence Johns Hopkins University Baltimore United States Department of Computer Engineering Sharif University of Technology Iran School of Mathematics Statistics and Computer Sciences College of Science University of Tehran Iran

ISBN: (print) 9781728128610

Hybrid deep neural network hidden Markov models (DNN-HMM) have achieved impressive results on large vocabulary continuous speech recognition (LVCSR) tasks. However, recent DNN-HMM approaches have not been explored much for text recognition. Inspired by current work in automatic speech recognition (ASR) and machine translation, we present an open vocabulary sub-word text recognition system. The sub-word lexicon and sub-word language model (LM) help overcome the challenge of recognizing out-of-vocabulary (OOV) words, and a time delay neural network (TDNN) and convolutional neural network (CNN) based DNN-HMM optical model (OM) efficiently models the sequence dependency in the line image. We present results on 12 datasets with training data varying from 6k lines to 600k lines. The system is built for 8 languages, i.e., English, French, Arabic, Chinese, Farsi, Tamil, Russian, and Korean. We report competitive results on several commonly used handwritten and printed text datasets. © 2019 IEEE.

Keywords: Hidden Markov models

End-to-end far-field speech recognition with unified dereverberation and beamforming

arXiv, 2020

Authors: Zhang, Wangyou; Subramanian, Aswin Shanmugam; Chang, Xuankai; Watanabe, Shinji; Qian, Yanmin

Affiliations: MoE Key Lab of Artificial Intelligence & SpeechLab Department of Computer Science and Engineering AI Institute Shanghai Jiao Tong University Shanghai China Center for Language and Speech Processing Johns Hopkins University United States

Despite successful applications of end-to-end approaches in multi-channel speech recognition, the performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate the dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multi-source mask-based weighted prediction error (WPE) module is incorporated in the frontend for dereverberation. Second, another novel frontend architecture is proposed, which extends the weighted power minimization distortionless response (WPD) convolutional beamformer to perform simultaneous separation and dereverberation. We derive a new formulation from the original WPD, which can handle multi-source input, and replace eigenvalue decomposition with the matrix inverse operation to make the back-propagation algorithm more stable. The above two architectures are optimized in a fully end-to-end manner, only using the speech recognition criterion. Experiments on both spatialized wsj1-2mix corpus and REVERB show that our proposed model outperformed the conventional methods in reverberant scenarios. Copyright © 2020, The Authors. All rights reserved.

Keywords: Beamforming

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

arXiv, 2020

Authors: Shi, Jing; Chang, Xuankai; Guo, Pengcheng; Watanabe, Shinji; Fujita, Yusuke; Xu, Jiaming; Xu, Bo; Xie, Lei

Affiliations: Center for Language and Speech Processing Johns Hopkins University Beijing China ASLP@NPU School of Computer Science Northwestern Polytechnical University Xi’an China Hitachi Ltd. Research & Development Group

Neural sequence-to-sequence models are well established for applications which can be cast as mapping a single input sequence into a single output sequence. In this work, we focus on one-to-many sequence transduction problems, such as extracting multiple sequential sources from a mixture sequence. We extend the standard sequence-to-sequence model to a conditional multi-sequence model, which explicitly models the relevance between multiple output sequences with the probabilistic chain rule. Based on this extension, our model can conditionally infer output sequences one-by-one by making use of both input and previously-estimated contextual output sequences. This model additionally has a simple and efficient stop criterion for the end of the transduction, making it able to infer the variable number of output sequences. We take speech data as a primary test field to evaluate our methods since the observed speech data is often composed of multiple sources due to the nature of the superposition principle of sound waves. Experiments on several different tasks including speech separation and multi-speaker speech recognition show that our conditional multi-sequence models lead to consistent improvements over the conventional non-conditional models. Copyright © 2020, The Authors. All rights reserved.

Keywords: Mapping
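The chain-rule factorization described in the abstract above can be written out as follows. This is one plausible formalization only: the symbols $X$ (the input mixture), $S_i$ (the $i$-th output sequence) and $N$ (the number of output sequences) are notation assumed here, not taken from the paper, and the second line is a guess at what "non-conditional" means, namely that the outputs are treated as independent given the input.

```latex
% Conditional multi-sequence model: each output sequence is conditioned on
% the input and on all previously estimated output sequences.
p(S_1, \dots, S_N \mid X) = \prod_{i=1}^{N} p\left(S_i \mid X, S_1, \dots, S_{i-1}\right)

% Assumed non-conditional counterpart: outputs independent given the input.
p(S_1, \dots, S_N \mid X) \approx \prod_{i=1}^{N} p\left(S_i \mid X\right)
```

Under the first form, inference proceeds one sequence at a time, which is what lets a stop criterion decide when no further output sequence remains.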

Isolating Host Environment by Booting Android from OTG Devices

Chinese Journal of Electronics, 2018, vol. 27, no. 3, pp. 617-624

Authors: XUE Yuan; ZHANG Xiaosong; YU Xiao; ZHANG Yaoyuan; TAN Yu'an; LI Yuanzhang

Affiliations: School of Computer Science and Technology Beijing Institute of Technology Department of Computer Science and Technology Tangshan University Research Center of Massive Language Information Processing and Cloud Computing Application

With the integration of smartphones into daily life, end users store large amounts of sensitive information on Android devices. To protect this information, a method of multi-booting Android OS from an On-The-Go (OTG) device is proposed to meet the requirements of end users in different scenarios. The proposed method utilizes system domain isolation to guarantee the security of sensitive information on different Android *** difference with other solutions is that our proposed solution does not add additional components to Android OS, which keeps the overhead of the Android runtime under effective control. A prototype of the proposed method is implemented and deployed on a real Android device to evaluate its effectiveness, efficiency, and performance overhead. The experimental results show that the performance overhead is reasonable and that our method can effectively mitigate the risk of sensitive information leakage when booting different Android instances on the same Android device.

Keywords: Android security; Privacy protection; System domain isolation; Multi-booting

Machine learning-based longitudinal prediction for GJB2-related sensorineural hearing loss

Computers in Biology and Medicine, 2024, vol. 176, article 108597

Authors: Chen, Pey-Yu; Yang, Ta-Wei; Tseng, Yi-Shan; Tsai, Cheng-Yu; Yeh, Chiung-Szu; Lee, Yen-Hui; Lin, Pei-Hsuan; Lin, Ting-Chun; Wu, Yu-Jen; Yang, Ting-Hua; Chiang, Yu-Ting; Hsu, Jacob Shu-Jui; Hsu, Chuan-Jen; Chen, Pei-Lung; Chou, Chen-Fu; Wu, Chen-Chi

Affiliations: Department of Otolaryngology MacKay Memorial Hospital Taipei Taiwan Department of Audiology and Speech-Language Pathology Mackay Medical College New Taipei City Taiwan Department of Otolaryngology National Taiwan University Hospital Taipei Taiwan Graduate Institute of Networking and Multimedia National Taiwan University Taipei Taiwan Department of Computer Science & Information Engineering National Taiwan University Taipei Taiwan Graduate Institute of Medical Genomics and Proteomics National Taiwan University College of Medicine Taipei Taiwan Department of Otolaryngology National Taiwan University Biomedical Park Hospital Hsinchu County Taiwan Department of Otolaryngology National Taiwan University Hospital Hsin-Chu Branch Hsinchu City Taiwan Graduate Institute of Clinical Medicine College of Medicine National Taiwan University Taipei Taiwan Department of Otorhinolaryngology-Head and Neck Surgery Taichung Tzu Chi Hospital Buddhist Tzu Chi Medical Foundation Taichung Taiwan Hearing and Speech Center National Taiwan University Hospital Taipei Taiwan School of Medicine Tzu Chi University Hualien Taiwan Department of Medical Genetics National Taiwan University Hospital Taipei Taiwan Department of Medical Research National Taiwan University Hospital Hsin-Chu Branch Hsin-Chu Taiwan

Background: Recessive GJB2 variants, the most common genetic cause of hearing loss, may contribute to progressive sensorineural hearing loss (SNHL). The aim of this study is to build a realistic predictive model for GJB2-related SNHL using machine learning to enable personalized medical planning for timely intervention. Method: Patients with SNHL with confirmed biallelic GJB2 variants in a nationwide cohort between 2005 and 2022 were included. Different data preprocessing protocols and computational algorithms were combined to construct a prediction model. We randomly divided the dataset into training, validation, and test sets at a ratio of 72:8:20, and repeated this process ten times to obtain an average result. The performance of the models was evaluated using the mean absolute error (MAE), which refers to the discrepancy between the predicted and actual hearing thresholds. Results: We enrolled 449 patients with 2184 audiograms available for deep learning analysis. SNHL progression was identified in all models and was independent of age, sex, and genotype. The average hearing progression rate was 0.61 dB HL per year. The best MAE for linear regression, multilayer perceptron, long short-term memory, and attention model were 4.42, 4.38, 4.34, and 4.76 dB HL, respectively. The long short-term memory model performed best with an average MAE of 4.34 dB HL and acceptable accuracy for up to 4 years. Conclusions: We have developed a prognostic model that uses machine learning to approximate realistic hearing progression in GJB2-related SNHL, allowing for the design of individualized medical plans, such as recommending the optimal follow-up interval for this population. © 2024 Elsevier Ltd

Keywords: Audition
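The evaluation protocol in the abstract above (a random 72:8:20 split, repeated ten times, scored by mean absolute error) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the study's pipeline: the data are synthetic, the predictor is a placeholder that returns the mean of the training targets, the helper names are hypothetical, and the abstract does not say whether the split was made per patient or per audiogram.

```python
import random


def split_indices(n, seed, ratios=(72, 8, 20)):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def mean_absolute_error(predicted, actual):
    """Average absolute discrepancy between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def average_test_mae(targets, repeats=10):
    """Repeat the random split and average the test-set MAE over all repeats."""
    maes = []
    for seed in range(repeats):
        train, _val, test = split_indices(len(targets), seed)
        # Placeholder predictor (an assumption): the mean of the training targets.
        prediction = sum(targets[i] for i in train) / len(train)
        actual = [targets[i] for i in test]
        maes.append(mean_absolute_error([prediction] * len(actual), actual))
    return sum(maes) / len(maes)


# Synthetic hearing thresholds in dB HL, made up purely for illustration.
data_rng = random.Random(0)
synthetic_thresholds = [data_rng.gauss(50.0, 10.0) for _ in range(449)]

mae = average_test_mae(synthetic_thresholds)
print(f"average test MAE over 10 splits: {mae:.2f} dB HL")
```

The validation indices are returned but unused here. In a real pipeline they would guide model selection before the test set is touched.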

How phonotactics affect multilingual and zero-shot ASR performance

arXiv, 2020

Authors: Feng, Siyuan; Zelasko, Piotr; Moro-Velázquez, Laureano; Abavisani, Ali; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim

Affiliations: Multimedia Computing Group Delft University of Technology Delft Netherlands Center for Language and Speech Processing United States Human Language Technology Center of Excellence Johns Hopkins University Baltimore, MD United States Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign IL United States

The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable. Copyright © 2020, The Authors. All rights reserved.

Keywords: speech recognition

CN-Celeb: A Challenging Chinese Speaker Recognition Dataset

IEEE International Conference on Acoustics, Speech and Signal Processing

Authors: Y. Fan; J.W. Kang; L.T. Li; K.C. Li; H.L. Chen; S.T. Cheng; P.Y. Zhang; Z.Y. Zhou; Y.Q. Cai; D. Wang

Affiliations: Key Laboratory of Transient Physics Nanjing University of Science and Technology China Center for Speech and Language Technologies Tsinghua University China Department of Computer Science and Technology Tsinghua University China Beijing National Research Center for Information Science and Technology China

ISBN: (digital) 9781509066315

ISBN: (print) 9781509066322

Recently, researchers set an ambitious goal of conducting speaker recognition in unconstrained conditions where the variations in ambient noise, channel and emotion could be arbitrary. However, most publicly available datasets are collected under constrained environments, i.e., with little noise and limited channel variation. These datasets tend to deliver over-optimistic performance and do not meet the needs of research on speaker recognition in unconstrained conditions. In this paper, we present CN-Celeb, a large-scale speaker recognition dataset collected 'in the wild'. This dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different genres in the real world. Experiments conducted with two state-of-the-art speaker recognition approaches (i-vector and x-vector) show that the performance on CN-Celeb is far inferior to that obtained on Vox-Celeb, a widely used speaker recognition dataset. This result demonstrates that in real-life conditions, the performance of existing techniques might be much worse than previously thought. Our database is free for researchers and can be downloaded from .

Keywords: