Search results - Inner Mongolia University Library

End-to-end multi-speaker speech recognition with transformer

arXiv, 2020

Authors: Chang, Xuankai; Zhang, Wangyou; Qian, Yanmin; Le Roux, Jonathan; Watanabe, Shinji

Affiliations: Center for Language and Speech Processing, Johns Hopkins University, United States; MoE Key Lab of Artificial Intelligence & SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China; United States

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down to 12.1% and 6.4% WER, under the anechoic condition in single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER. Copyright © 2020, The Authors. All rights reserved.

Keywords: Beamforming
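
The restricted self-attention mentioned in the abstract above can be pictured as a banded attention mask. The sketch below is only an illustration under assumptions the abstract does not state: it assumes each frame attends to a window of `left` and `right` neighbouring frames (hypothetical parameter names and values), and it simply counts how many query-key pairs remain compared with full attention.

```python
def restricted_attention_mask(num_frames, left, right):
    """Return a num_frames x num_frames boolean mask.

    mask[t][s] is True when query frame t may attend to key frame s.
    Assumption: the "segment" is a window [t - left, t + right] around
    each frame; the abstract does not specify its exact shape.
    """
    mask = []
    for t in range(num_frames):
        lo = max(0, t - left)
        hi = min(num_frames - 1, t + right)
        mask.append([lo <= s <= hi for s in range(num_frames)])
    return mask


def allowed_pairs(mask):
    """Count the query-key pairs that survive the mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    T = 1000  # hypothetical number of frames
    full = T * T
    restricted = allowed_pairs(restricted_attention_mask(T, left=15, right=15))
    print(f"full attention pairs:       {full}")
    print(f"restricted attention pairs: {restricted}")
```

With a fixed window, the number of attended pairs grows linearly with the sequence length instead of quadratically, which is the kind of saving the abstract refers to.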

Using ASR methods for OCR

15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019

Authors: Arora, Ashish; Garcia, Paola; Watanabe, Shinji; Manohar, Vimal; Shao, Yiwen; Khudanpur, Sanjeev; Chang, Chun Chieh; Rekabdar, Babak; Babaali, Bagher; Povey, Daniel; Etter, David; Raj, Desh; Hadian, Hossein; Trmal, Jan

Affiliations: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, United States; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, United States; Department of Computer Engineering, Sharif University of Technology, Iran; School of Mathematics, Statistics and Computer Sciences, College of Science, University of Tehran, Iran

ISBN: (print) 9781728128610

Hybrid deep neural network hidden Markov models (DNN-HMM) have achieved impressive results on large vocabulary continuous speech recognition (LVCSR) tasks. However, recent DNN-HMM approaches have not been explored much for text recognition. Inspired by the current work in automatic speech recognition (ASR) and machine translation, we present an open vocabulary sub-word text recognition system. The sub-word lexicon and sub-word language model (LM) help in overcoming the challenge of recognizing out-of-vocabulary (OOV) words, and a time delay neural network (TDNN) and convolutional neural network (CNN) based DNN-HMM optical model (OM) efficiently models the sequence dependency in the line image. We present results on 12 datasets with training data varying from 6k lines to 600k lines. The system is built for 8 languages, i.e., English, French, Arabic, Chinese, Farsi, Tamil, Russian, and Korean. We report competitive results on several commonly used handwritten and printed text datasets. © 2019 IEEE.

Keywords: Hidden Markov models
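
To illustrate why a sub-word lexicon helps with out-of-vocabulary words, the sketch below segments a word into units drawn from a small sub-word inventory using greedy longest-match. Both the toy inventory and the greedy strategy are assumptions made for illustration; the abstract does not say how the sub-word units are derived.

```python
def segment(word, subword_units):
    """Greedy longest-match segmentation of `word` into known units.

    Falls back to single characters when no longer unit matches, so any
    word can be spelled out even if it never appeared in training text.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in subword_units or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


if __name__ == "__main__":
    # Hypothetical toy inventory; a real system would learn it from data.
    units = {"re", "cogn", "ition", "iz", "ing", "un", "able"}
    print(segment("recognizing", units))
    print(segment("unrecognizable", units))
```

Because every word decomposes into units the recognizer already knows, a word absent from the training vocabulary can still be produced as a sequence of sub-words.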

End-to-end far-field speech recognition with unified dereverberation and beamforming

arXiv, 2020

Authors: Zhang, Wangyou; Subramanian, Aswin Shanmugam; Chang, Xuankai; Watanabe, Shinji; Qian, Yanmin

Affiliations: MoE Key Lab of Artificial Intelligence & SpeechLab, Department of Computer Science and Engineering, AI Institute, Shanghai Jiao Tong University, Shanghai, China; Center for Language and Speech Processing, Johns Hopkins University, United States

Despite successful applications of end-to-end approaches in multi-channel speech recognition, the performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate the dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multi-source mask-based weighted prediction error (WPE) module is incorporated in the frontend for dereverberation. Second, another novel frontend architecture is proposed, which extends the weighted power minimization distortionless response (WPD) convolutional beamformer to perform simultaneous separation and dereverberation. We derive a new formulation from the original WPD, which can handle multi-source input, and replace eigenvalue decomposition with the matrix inverse operation to make the back-propagation algorithm more stable. The above two architectures are optimized in a fully end-to-end manner, only using the speech recognition criterion. Experiments on both the spatialized wsj1-2mix corpus and REVERB show that our proposed model outperforms the conventional methods in reverberant scenarios. Copyright © 2020, The Authors. All rights reserved.

Keywords: Beamforming

Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

arXiv, 2020

Authors: Raj, Desh; Denisov, Pavel; Chen, Zhuo; Erdogan, Hakan; Huang, Zili; He, Maokui; Watanabe, Shinji; Du, Jun; Yoshioka, Takuya; Luo, Yi; Kanda, Naoyuki; Li, Jinyu; Wisdom, Scott; Hershey, John R.

Affiliations: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, United States; Institute for Natural Language Processing, University of Stuttgart, Germany; Microsoft Corp, Redmond, WA, United States; Google Research, Cambridge, MA, United States; University of Science and Technology of China, Hefei, China; Department of Electrical Engineering, Columbia University, NY, United States

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR. © 2020, CC-BY.

Keywords: Speech recognition
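
Since the abstract above reports its headline result as WER, a minimal sketch of the conventional definition may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is the standard metric, not the paper's scoring tooling, and it does not model the speaker-attributed variant, which additionally requires words to be assigned to the correct speaker. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # At the start of iteration i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the meeting starts at ten"
    hyp = "the meeting start at at ten"
    print(f"WER = {word_error_rate(ref, hyp):.2f}")
```

Note that WER can exceed 100% when the hypothesis contains many insertions, because the denominator counts only reference words.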

A continual learning survey: Defying forgetting in classification tasks

arXiv, 2019

Authors: de Lange, Matthias; Aljundi, Rahaf; Masana, Marc; Parisot, Sarah; Jia, Xu; Leonardis, Aleš; Slabaugh, Gregory; Tuytelaars, Tinne

Affiliations: Center for Processing Speech and Images, Department Electrical Engineering, KU Leuven; Computer Vision Center, UAB; Huawei

Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, with endeavours to extend this knowledge without targeting the original task resulting in catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern (1) a taxonomy and extensive overview of the state of the art; (2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner; (3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks, considering Tiny ImageNet, the large-scale unbalanced iNaturalist, and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time and storage. Copyright © 2019, The Authors. All rights reserved.

Keywords: Neural networks

Speech enhancement via deep spectrum image translation network

arXiv, 2019

Authors: Kashani, Hamidreza Baradaran; Goodarzi, Mohammad Mohsen; Jodeiri, Ata; Rezaei, Iman Sarraf

Affiliations: Electrical Engineering Faculty, Amirkabir University of Technology, Tehran, Iran; Department of Electrical and Computer Engineering, Buein Zahra Technical University, Qazvin, Iran; School of Electrical & Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran; Speech and Language Processing Group, Research Center for Development of Advanced Technologies, Tehran, Iran

Quality and intelligibility of speech signals are degraded under additive background noise which is a critical problem for hearing aid and cochlear implant users. Motivated to address this problem, we propose a novel speech enhancement approach using a deep spectrum image translation network. To this end, we suggest a new architecture, called VGG19-UNet, where a deep fully convolutional network known as VGG19 is embedded at the encoder part of an image-to-image translation network, i.e. U-Net. Moreover, we propose a perceptually-modified version of the spectrum image that is represented in Mel frequency and power-law non-linearity amplitude domains, representing good approximations of human auditory perception model. By conducting experiments on a real challenge in speech enhancement, i.e. unseen noise environments, we show that the proposed approach outperforms other enhancement methods in terms of both quality and intelligibility measures, represented by PESQ and ESTOI, respectively. Copyright © 2019, The Authors. All rights reserved.

Keywords: Audition
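
As a rough illustration of the two perceptual transforms named in the abstract above, the sketch below maps frequencies to the Mel scale and applies power-law compression to magnitudes. The Mel formula is the widely used `2595 * log10(1 + f / 700)` variant and the exponent 0.3 is a placeholder; the abstract does not state which Mel variant or exponent the authors use, so both are assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Common Mel-scale mapping (an assumed variant, not taken from the paper)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def power_law_compress(magnitude, exponent=0.3):
    """Compress a non-negative magnitude; `exponent` is a hypothetical value."""
    return magnitude ** exponent


if __name__ == "__main__":
    for f in (100, 1000, 4000, 8000):
        print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
    for m in (0.01, 0.1, 1.0, 10.0):
        print(f"|X| = {m:5.2f} -> {power_law_compress(m):.3f}")
```

Both transforms compress the upper end of their range: high frequencies are squeezed together on the Mel axis, and large magnitudes are reduced relative to small ones.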

Speech Enhancement via Deep Spectrum Image Translation Network

Iranian Conference of Biomedical Engineering (ICBME)

Authors: Hamidreza Baradaran Kashani; Ata Jodeiri; Mohammad Mohsen Goodarzi; Iman Sarraf Rezaei

Affiliations: Electrical Engineering Faculty, Amirkabir University of Technology, Tehran, Iran; School of Electrical & Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran; Department of Electrical and Computer Engineering, Buein Zahra Technical University, Qazvin, Iran; Research Center for Development of Advanced Technologies, Speech and Language Processing Group, Tehran, Iran

ISBN: (digital) 9781728156637

ISBN: (print) 9781728156644

Quality and intelligibility of speech signals are degraded under additive background noise which is a critical problem for hearing aid and cochlear implant users. Motivated to address this problem, we propose a novel speech enhancement approach using a deep spectrum image translation network. To this end, we suggest a new architecture, called VGG19-UNet, where a deep fully convolutional network known as VGG19 is embedded at the encoder part of an image-to-image translation network, i.e. U-Net. Moreover, we propose a perceptually-modified version of the spectrum image that is represented in Mel frequency and power-law non-linearity amplitude domains, representing good approximations of human auditory perception model. By conducting experiments on a real challenge in speech enhancement, i.e. unseen noise environments, we show that the proposed approach outperforms other enhancement methods in terms of both quality and intelligibility measures, represented by PESQ and ESTOI, respectively.

Keywords: Speech enhancement; Noise measurement; Feature extraction; Training; Computer architecture; Convolution; Frequency-domain analysis

MIMO-speech: End-to-end multi-channel multi-speaker speech recognition

arXiv, 2019

Authors: Chang, Xuankai; Zhang, Wangyou; Qian, Yanmin; Le Roux, Jonathan; Watanabe, Shinji

Affiliations: Center for Language and Speech Processing, Johns Hopkins University, United States; SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China; United States

Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It is comprised of: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopted a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model can achieve more than 60% WER reduction compared to the single-channel system with high quality enhanced signals (SI-SDR = 23.1 dB) obtained by the above separation function. Copyright © 2019, The Authors. All rights reserved.

Keywords: Beamforming
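
The abstract above quotes enhancement quality as SI-SDR. A minimal sketch of the usual scale-invariant SDR definition follows: the estimate is projected onto the reference, and the ratio of projected-target energy to residual energy is expressed in dB. The toy signals are invented for illustration, and the sketch omits the mean removal that some implementations apply first.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB.

    Both inputs are equal-length sequences of samples.
    """
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy                  # optimal scaling of the reference
    target = [scale * r for r in reference]   # part of the estimate explained by the reference
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10(target_energy / noise_energy)


if __name__ == "__main__":
    ref = [math.sin(0.1 * n) for n in range(400)]
    # Estimate = rescaled reference plus a small interfering tone.
    est = [0.5 * r + 0.01 * math.cos(0.37 * n) for n, r in enumerate(ref)]
    print(f"SI-SDR = {si_sdr(ref, est):.1f} dB")
```

Because the reference is rescaled before the comparison, multiplying the estimate by any non-zero gain leaves the score unchanged, which is the property the "scale-invariant" name refers to.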

Speaker Embedding Extraction with Virtual Phonetic Information

IEEE Global Conference on Signal and Information Processing (GlobalSIP)

Authors: S. Sreekanth; Shaik Mohammad Rafi B; K Sri Rama Murty; Saurabhchand Bhati

Affiliations: Department of Electronics and Communications Engineering, IIIT RK Valley, RGUKT-AP; Department of Electrical Engineering, Indian Institute of Technology Hyderabad, India; Center for Language and Speech Processing, The Johns Hopkins University, USA

In the recent past, deep neural networks have been successfully employed to extract fixed-dimensional speaker embeddings from the speech signal. The commonly used x-vectors are extracted by projecting the magnitude spectral features extracted from the speech signal onto a speaker-discriminative space. As the x-vectors do not explicitly capture the speaker-specific phonological pronunciation variability, phonetic vectors extracted from an automatic speech recognition (ASR) engine were supplied as auxiliary information to improve the performance of the x-vector system. However, the development of an ASR engine requires a huge amount of manually transcribed speech data. In this paper, we propose to transcribe the speech signal in an unsupervised manner with the cluster labels obtained from a mixture of autoencoders (MoA) trained on a large amount of speech data. The unsupervised labels, referred to as virtual phonetic transcriptions, are used to extract the phonetic vectors. The virtual phonetic vectors extracted using MoA are supplied as auxiliary information to the x-vector system. The performance of the proposed system is compared with the state-of-the-art x-vector system on NIST SRE-2010 data. The proposed unsupervised auxiliary information provides a relative improvement of 12.08%, 3.61% and 16.66% over the x-vector system on core-core, core-10sec and 10sec-10sec conditions, respectively.
