With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like ...
Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (AD...
We present our transducer model on Librispeech. We study variants to include an external language model (LM) with shallow fusion and subtract an estimated internal LM. This is justified by a Bayesian interpretation wh...
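For context, shallow fusion with internal language model subtraction typically combines scores during search roughly as follows; the symbols and scales below are generic placeholders, not necessarily the notation used in the paper:

\log Q(y \mid x) = \log p_{\mathrm{trans}}(y \mid x) + \lambda_{1} \log p_{\mathrm{ext}}(y) - \lambda_{2} \log p_{\mathrm{int}}(y)

Here p_trans is the transducer posterior, p_ext the external LM, p_int the estimated internal LM, and the two lambdas are scales tuned on held-out data. Under a Bayesian reading, the transducer posterior already contains an implicit label prior, so subtracting the internal LM score before adding the external LM score amounts to replacing that implicit prior with the stronger external LM.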
Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to comp...
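For reference, a standard instance of such a criterion is the maximum mutual information (MMI) objective; the notation below is generic rather than the paper's:

F_{\mathrm{MMI}}(\theta) = \sum_{r} \log \frac{p_{\theta}(X_r \mid W_r)^{\kappa} \, p(W_r)}{\sum_{W} p_{\theta}(X_r \mid W)^{\kappa} \, p(W)}

Here X_r is the r-th training utterance, W_r its reference transcription, and κ an acoustic scale. The denominator is exactly the sum over all possible word sequences W mentioned above, which is intractable to compute exactly and therefore has to be approximated, e.g., with lattices or other restricted hypothesis sets.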
The recently proposed conformer architecture has been successfully used for end-to-end automatic speech recognition (ASR) architectures achieving state-of-the-art performance on different datasets. To our best knowled...
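Since the conformer block is referenced without being spelled out here, the following is a minimal PyTorch sketch of one such block, assuming the standard macaron layout (half-step feed-forward, self-attention, convolution module, half-step feed-forward, final layer norm); plain multi-head self-attention stands in for the relative-position variant, and all dimensions and the kernel size are illustrative choices, not values from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardModule(nn.Module):
    # Half-step ("macaron") feed-forward module with swish activation.
    def __init__(self, dim, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * expansion),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(dim * expansion, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConvModule(nn.Module):
    # Pointwise conv + GLU, depthwise conv, batch norm, swish, pointwise conv.
    def __init__(self, dim, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.batch_norm = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)       # -> (batch, dim, time)
        y = F.glu(self.pointwise1(y), dim=1)
        y = F.silu(self.batch_norm(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)               # -> (batch, time, dim)


class ConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4, dropout=0.1):
        super().__init__()
        self.ff1 = FeedForwardModule(dim, dropout=dropout)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvModule(dim, dropout=dropout)
        self.ff2 = FeedForwardModule(dim, dropout=dropout)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ff1(x)              # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)                   # convolution module
        x = x + 0.5 * self.ff2(x)              # second half-step feed-forward
        return self.final_norm(x)


x = torch.randn(2, 100, 256)                   # (batch, time, features)
y = ConformerBlock()(x)                        # output keeps the same shape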
Recent publications on automatic speech recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures, which tend to suffer from over-fitting in low resource scenarios. One solution to tackle t...
ISBN: 9781509066315 (digital), 9781509066322 (print)
We propose simple architectural modifications in the standard Transformer with the goal of reducing its total state size (defined as the number of self-attention layers times the sum of the key and value dimensions, times position) without loss of performance. Large-scale Transformer language models have been empirically shown to give very good performance. However, scaling up results in a model that needs to store large states at evaluation time. This can dramatically increase the memory requirement for search, e.g., in speech recognition (first-pass decoding, lattice rescoring, or shallow fusion). In order to efficiently increase the model capacity without increasing the state size, we replace the single-layer feed-forward module in the Transformer layer with a deeper network and decrease the total number of layers. We also evaluate the effect of key-value tying, which directly halves the state size. On TED-LIUM 2, we obtain a model with a state size 4 times smaller than that of the standard Transformer, with only a 2% relative loss in terms of perplexity, which makes the deployment of Transformer language models more convenient.
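As a back-of-the-envelope illustration of the state-size definition above, the following sketch compares a baseline against a variant with half as many self-attention layers plus key-value tying; the concrete layer counts, dimensions, and history length are assumed for illustration and are not the configurations used in the paper:

def total_state_size(num_layers, key_dim, value_dim, positions,
                     tie_key_value=False):
    # State values a decoder must keep per hypothesis during search:
    # layers x (key dim + value dim, or just key dim when tied) x positions.
    per_position = key_dim if tie_key_value else key_dim + value_dim
    return num_layers * per_position * positions


positions = 128   # assumed history length kept during first-pass decoding

# Baseline: many layers, each with a single-layer feed-forward module.
baseline = total_state_size(num_layers=24, key_dim=512, value_dim=512,
                            positions=positions)

# Modified: half as many self-attention layers (capacity moved into deeper
# feed-forward modules, which add parameters but no search state) plus
# key-value tying, which halves the per-position state.
modified = total_state_size(num_layers=12, key_dim=512, value_dim=512,
                            positions=positions, tie_key_value=True)

print(f"baseline: {baseline:,}")                 # 3,145,728
print(f"modified: {modified:,}")                 # 786,432
print(f"reduction: {baseline / modified:.1f}x")  # 4.0x

With these assumed numbers, halving the layer count and tying keys and values compound to the roughly 4-fold state-size reduction reported above.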
ISBN: 9781509066315 (digital), 9781509066322 (print)
Training deep neural networks is often challenging in terms of training stability. It often requires careful hyperparameter tuning or a pretraining scheme to converge. Layer normalization (LN) has been shown to be a crucial ingredient in training deep encoder-decoder models. We explore various layer-normalized long short-term memory (LSTM) recurrent neural network (RNN) variants by applying LN to different parts of the internal recurrency of LSTMs. There is no previous work that investigates this. We carry out experiments on the Switchboard 300h task for both hybrid and end-to-end ASR models and show that LN improves the final word error rate (WER) and the stability during training, allows training even deeper models, requires less hyperparameter tuning, and works well even without pretraining. We find that applying LN to both forward and recurrent inputs globally, a variant we denote Global Joined Norm, gives a 10% relative improvement in WER.
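To illustrate what applying LN jointly to the forward and recurrent inputs can look like, here is a minimal PyTorch sketch of a layer-normalized LSTM cell in the spirit of the Global Joined Norm variant; the exact placement of the normalization in the paper may differ, so this is a sketch of the idea rather than a reproduction of the authors' setup:

import torch
import torch.nn as nn


class GlobalJoinedNormLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.w_x = nn.Linear(input_dim, 4 * hidden_dim, bias=False)
        self.w_h = nn.Linear(hidden_dim, 4 * hidden_dim, bias=True)
        # One "global" norm over the joined forward + recurrent pre-activation.
        self.norm = nn.LayerNorm(4 * hidden_dim)

    def forward(self, x, state):
        h, c = state
        pre = self.norm(self.w_x(x) + self.w_h(h))  # join, then normalize once
        i, f, g, o = pre.chunk(4, dim=-1)           # input, forget, cell, output
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


cell = GlobalJoinedNormLSTMCell(input_dim=80, hidden_dim=512)
x = torch.randn(4, 80)                              # (batch, features)
h0 = c0 = torch.zeros(4, 512)
out, (h1, c1) = cell(x, (h0, c0))                   # one recurrent step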
This paper studies the practicality of the current state-of-the-art unsupervised methods in neural machine translation (NMT). In ten translation tasks with various data settings, we analyze the conditions under which ...
Phoneme-based acoustic modeling of large vocabulary automatic speech recognition takes advantage of phoneme context. The large number of context-dependent (CD) phonemes and their highly varying statistics require tyin...
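To make the scale of the context-dependent inventory concrete (the phoneme count here is only an assumed, typical value): with roughly 40 phonemes, a triphone context alone already yields 40^3 = 64,000 context-dependent units, most of which occur rarely or never in the training data, which is why some form of parameter tying across contexts is needed.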