Search results - Inner Mongolia University Library

Cumulative adaptation for BLSTM acoustic models

arXiv 2019

Authors: Kitza, Markus; Golik, Pavel; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

This paper addresses the robust speech recognition problem as an adaptation task. Specifically, we investigate the cumulative application of adaptation methods. A bidirectional Long Short-Term Memory (BLSTM) based neural network, capable of learning temporal relationships and translation-invariant representations, is used for robust acoustic modeling. Further, i-vectors were used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing an 8% relative improvement in word error rate on the NIST Hub5 2000 evaluation test set. By enhancing the first-pass i-vector based adaptation with a second-pass adaptation using speaker- and environment-dependent transformations within the network, a further relative improvement of 5% in word error rate was achieved. We have reevaluated the features used to estimate i-vectors and their normalization to achieve the best performance in a modern large-scale automatic speech recognition system. Copyright © 2019, The Authors. All rights reserved.

Keywords: Speech recognition
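
The two relative improvements quoted in the abstract above (8%, then a further 5%) compound rather than add. The sketch below works through that arithmetic. It assumes the 5% is measured relative to the i-vector-adapted system, which the abstract does not state explicitly, and the baseline word error rate of 10.0 is a hypothetical value chosen only for illustration.

```python
def apply_relative_improvement(wer, rel):
    """Return the WER left after a relative reduction `rel` (0.08 means 8%)."""
    return wer * (1.0 - rel)


def combined_relative_improvement(*rels):
    """Combine successive relative reductions into one overall reduction."""
    remaining = 1.0
    for rel in rels:
        remaining *= 1.0 - rel
    return 1.0 - remaining


# Hypothetical baseline WER in percent; NOT a figure from the paper.
baseline_wer = 10.0

# First pass: i-vector adaptation (8% relative, as quoted in the abstract).
after_first_pass = apply_relative_improvement(baseline_wer, 0.08)

# Second pass: assumed to be 5% relative to the first-pass result.
after_second_pass = apply_relative_improvement(after_first_pass, 0.05)

# Overall relative improvement over the unadapted baseline.
overall = combined_relative_improvement(0.08, 0.05)

print(f"after first pass:  {after_first_pass:.2f}")
print(f"after second pass: {after_second_pass:.2f}")
print(f"overall relative improvement: {overall:.1%}")
```

Under these assumptions the overall relative improvement is about 12.6%, slightly less than the 13% a simple sum would suggest.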

On using 2D sequence-to-sequence models for speech recognition

arXiv 2019

Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. In these architectures, one-dimensional input and output sequences are related by an attention mechanism, which replaces more explicit alignment processes such as those in classical HMM-based modeling. In contrast, here we apply a novel two-dimensional long short-term memory (2DLSTM) architecture to directly model the input/output relation between audio/feature vector sequences and word sequences. The proposed model is an alternative in which, instead of any type of attention component, a 2DLSTM layer assimilates the context from both input observations and output transcriptions. The experimental evaluation on the Switchboard 300h automatic speech recognition task shows word error rates for the 2DLSTM model that are competitive with those of an end-to-end attention-based model. Copyright © 2019, The Authors. All rights reserved.

Keywords: Speech recognition

Sample drop detection for distant-speech recognition with asynchronous devices distributed in space

arXiv 2019

Authors: Raissi, Tina; Pascual, Santiago; Omologo, Maurizio. Affiliations: Human Language Technology and Pattern Recognition, RWTH Aachen University, Aachen, Germany; Universitat Politècnica de Catalunya, Barcelona, Spain; Center for Information and Communication Technology, Fondazione Bruno Kessler, Trento, Italy

In many applications of multi-microphone multi-device processing, the synchronization among different input channels can be affected by the lack of a common clock and by isolated drops of samples. In this work, we address the issue of sample drop detection in the context of a conversational speech scenario, recorded by a set of microphones distributed in space. The goal is to design a neural-based model that, given a short window in the time domain, detects whether one or more devices have been subjected to a sample drop event. The candidate time windows are selected from a set of large time intervals, possibly including a sample drop, by means of a preprocessing step. This step is based on the application of normalized cross-correlation between signals acquired by different devices. The architecture of the neural network relies on a CNN-LSTM encoder, followed by multi-head attention. The experiments are conducted using both artificial and real data. Our proposed approach obtained an F1 score of 88% on an evaluation set extracted from the CHiME-5 corpus. Comparable performance was found in a larger set of experiments conducted on a set of multi-channel artificial scenes. Copyright © 2019, The Authors. All rights reserved.

Keywords: Speech recognition
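
The abstract above says only that candidate windows are selected using normalized cross-correlation between devices. As a rough illustration of why that signal is informative, the sketch below estimates the best-aligning lag between two channels before and after a simulated drop. The function names, the window positions, the lag search range and the synthetic noise signal are all assumptions made for this example, not details from the paper.

```python
import math
import random


def normalized_cross_correlation(x, y, lag):
    """NCC between x[n] and y[n + lag], computed over the overlapping samples."""
    pairs = [(x[n], y[n + lag]) for n in range(len(x)) if 0 <= n + lag < len(y)]
    if len(pairs) < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    den = math.sqrt(var_x * var_y)
    return num / den if den > 0 else 0.0


def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] that maximizes the NCC between x and y."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: normalized_cross_correlation(x, y, lag))


# Synthetic stand-in for two devices recording the same source.
rng = random.Random(0)
device_a = [rng.uniform(-1.0, 1.0) for _ in range(400)]
# Device B loses 3 samples at position 200 (a simulated sample drop).
device_b = device_a[:200] + device_a[203:]

lag_before = best_lag(device_a[50:150], device_b[50:150], max_lag=5)
lag_after = best_lag(device_a[250:350], device_b[250:350], max_lag=5)
print(lag_before, lag_after)
```

A change in the best lag between two windows hints that a drop occurred somewhere between them. The paper's neural model then makes the actual decision on short windows; this sketch covers only the correlation idea.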

Language modeling with deep transformers

arXiv 2019

Authors: Irie, Kazuki; Zeyer, Albert; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. The positional encoding is an essential augmentation for the self-attention mechanism, which is invariant to sequence ordering. However, in an autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal on its own. The analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models. Copyright © 2019, The Authors. All rights reserved.

Keywords: Speech recognition

Calibration of Deep Probabilistic Models with Decoupled Bayesian Neural Networks

arXiv 2019

Authors: Maroñas, Juan; Paredes, Roberto; Ramos, Daniel. Affiliations: PRHLT - Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de Valencia, Spain; AUDIAS - Audio Data Intelligence and Speech, Universidad Autónoma de Madrid, Spain

Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy in many tasks. However, recent works have pointed out that the outputs provided by these models are not well calibrated, seriously limiting their use in critical decision scenarios. In this work, we propose to use a decoupled Bayesian stage, implemented with a Bayesian Neural Network (BNN), to map the uncalibrated probabilities provided by a DNN to calibrated ones, consistently improving calibration. Our results show that incorporating uncertainty provides more reliable probabilistic models, a critical condition for achieving good calibration. We report an extensive collection of experimental results using high-accuracy DNNs in standardized image classification benchmarks, showing the good performance, flexibility and robust behavior of our approach with respect to several state-of-the-art calibration methods. Code for reproducibility is provided. Copyright © 2019, The Authors. All rights reserved.

Keywords: Calibration
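
To make "not well calibrated" in the abstract above concrete, the sketch below computes the expected calibration error (ECE), a commonly used summary of the gap between a classifier's confidence and its accuracy. The abstract does not say which metric the authors report, so the choice of ECE, the equal-width binning and the toy predictions are all assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_confidence - accuracy)
    return ece


# Toy predictions (hypothetical): top-class confidence and whether it was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False, True, False]

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

In this toy case the model claims 95% confidence on predictions that are right only 75% of the time. A calibration stage such as the one the abstract proposes aims to shrink this kind of gap.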

Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing

4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018

Authors: Granell, Emilio; Martínez-Hinarejos, Carlos-D.; Romero, Verónica. Affiliations: Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Camí de Vera s/n, València 46022, Spain

The transcription of digitalised documents eases digital access to their contents. Natural language technologies, such as Automatic Speech Recognition (ASR) for speech audio signals and Handwritten Text Recognition (HTR) for text images, have become common tools for assisting transcribers, by providing a draft transcription of the digital document that they may amend. This draft is useful when its error rate is low enough to make the amending process more comfortable than a complete transcription from scratch. The work described in this thesis focuses on improving the transcription offered by an HTR system in three scenarios: multimodality, interactivity and crowdsourcing. The transcription of an image can also be obtained by dictating its textual contents to an ASR system. Besides, when both sources of information (image and speech) are available, a multimodal combination is possible, and this can be used to provide assistive systems with additional sources of information. Moreover, speech dictation can be used in a multimodal crowdsourcing platform, where collaborators may provide their speech by using mobile devices. Different solutions for each scenario were tested on two Spanish historical manuscripts, obtaining statistically significant improvements. © 4th International Conference, IberSPEECH 2018.

Keywords: Crowdsourcing

Improving Transcription of Manuscripts with Multimodality and Interaction

4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018

State-of-the-art natural language recognition systems allow transcribers to speed up the transcription of audio, video or image documents. These systems provide transcribers with an initial draft transcription that can be corrected with less effort than transcribing the documents from scratch. However, even the drafts offered by the most advanced systems based on Deep Learning contain errors. Therefore, the supervision of those drafts by a human transcriber is still necessary to obtain the correct transcription. This supervision can be eased by using interactive and assistive transcription systems, where the transcriber and the automatic system cooperate in the amending process. Moreover, the interactive system can combine different sources of information in order to improve its performance, such as text line images and the dictation of their textual contents. In this paper, the performance of a multimodal interactive and assistive transcription system is evaluated on one Spanish historical manuscript. Although the quality of the draft transcriptions provided by a Handwritten Text Recognition system based on Deep Learning is good, the proposed interactive and assistive approach yields an additional reduction of transcription effort. Besides, this effort reduction increases when speech dictation through an Automatic Speech Recognition system is used, allowing for a faster transcription process. © 4th International Conference, IberSPEECH 2018.

Keywords: Human computer interaction

Improved training of end-to-end attention models for speech recognition

19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018

Authors: Zeyer, Albert; Irie, Kazuki; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52062, Germany; AppTek, United States; NNAISENSE, Switzerland

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme that starts with a high time reduction factor and lowers it during training, which is crucial both for convergence and for final performance. In some experiments, we also use an auxiliary CTC loss function to help convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvement in WER over the attention baseline without a language model. © 2018 International Speech Communication Association. All rights reserved.

Keywords: Speech recognition
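
Shallow fusion, mentioned in the abstract above, is usually described as adding a weighted language model log-probability to the recognizer's own log-probability when scoring each candidate token. The sketch below shows that scoring rule for a single decoding step. The two candidate tokens, their probabilities and the weight of 0.3 are hypothetical values, and a real decoder would apply the rule inside beam search rather than to one step in isolation.

```python
import math


def shallow_fusion_score(log_p_asr, log_p_lm, lm_weight):
    """Combined score: log p_asr(token) + lm_weight * log p_lm(token)."""
    return log_p_asr + lm_weight * log_p_lm


def pick_next_token(asr_log_probs, lm_log_probs, lm_weight):
    """Choose the candidate token with the highest fused score."""
    return max(
        asr_log_probs,
        key=lambda token: shallow_fusion_score(
            asr_log_probs[token], lm_log_probs[token], lm_weight
        ),
    )


# Hypothetical single-step distributions over two acoustically similar words.
asr_log_probs = {"their": math.log(0.45), "there": math.log(0.55)}
lm_log_probs = {"their": math.log(0.90), "there": math.log(0.10)}

without_lm = pick_next_token(asr_log_probs, lm_log_probs, lm_weight=0.0)
with_lm = pick_next_token(asr_log_probs, lm_log_probs, lm_weight=0.3)
print(without_lm, with_lm)
```

With the weight set to zero the acoustic preference wins; with a positive weight the language model's strong preference for the other word changes the choice. The weight is normally tuned on held-out data.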

On the choice of modeling unit for sequence-to-sequence speech recognition

arXiv 2019

Authors: Irie, Kazuki; Prabhavalkar, Rohit; Kannan, Anjuli; Bruguier, Antoine; Rybach, David; Nguyen, Patrick. Affiliations: Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen D-52056, Germany; Google, Mountain View, CA 94043, United States