To improve the performance of state-of-the-art automatic speech recognition systems it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via lo...
详细信息
Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (AD...
详细信息
We present our transducer model on Librispeech. We study variants to include an external language model (LM) with shallow fusion and subtract an estimated internal LM. This is justified by a Bayesian interpretation wh...
详细信息
Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to comp...
详细信息
The mismatch between an external language model (LM) and the implicitly learned internal LM (ILM) of RNN-Transducer (RNN-T) can limit the performance of LM integration such as simple shallow fusion. A Bayesian interpr...
详细信息
With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like ...
详细信息
The recently proposed conformer architecture has been successfully used for end-to-end automatic speech recognition (ASR) architectures achieving state-of-the-art performance on different datasets. To our best knowled...
详细信息
Recent publications on automatic-speech-recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures which tend to suffer from over-fitting in low resource scenarios. One solution to tackle t...
详细信息
Background A crucial element of human-machine interaction,the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning *** vital challenge in speech e...
详细信息
Background A crucial element of human-machine interaction,the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning *** vital challenge in speech emotion recognition(SER)is learning robust and discriminative representations from *** machine learning methods have been widely applied in SER research,the inadequate amount of available annotated data has become a bottleneck impeding the extended application of such techniques(e.g.,deep neural networks).To address this issue,we present a deep learning method that combines knowledge transfer and self-attention for SER ***,we apply the log-Mel spectrogram with deltas and delta-deltas as ***,given that emotions are time dependent,we apply temporal convolutional neural networks to model the variations in *** further introduce an attention transfer mechanism,which is based on a self-attention algorithm to learn long-term *** self-attention transfer network(SATN)in our proposed approach takes advantage of attention transfer to learn attention from speech recognition,followed by transferring this knowledge into *** evaluation built on Interactive Emotional Dyadic Motion Capture(IEMOCAP)dataset demonstrates the effectiveness of the proposed model.
Sparse models require less memory for storage and enable a faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stab...
详细信息
暂无评论