Code-switching, the combination of more than one language within a single utterance, is popular on social media sites for informal communication, generating a huge amount of data that remains largely unanalysed for knowledge extraction due to a lack of documentation. Moreover, code-mixed data is error-prone, and to detect and correct different types of errors a model requires a large amount of erroneous code-mixed language data, which is not publicly available. This paper first defines generic rules for writing Bengali-English code-mixed language in English script, considering the inherent complexity of the Bengali language. Different types of typographical and cognitive errors are induced to obtain a large erroneous corpus, based on human behaviour and perception as determined in consultation with language experts. The errors considered here are applicable to other code-mixed Indic languages and would be beneficial to researchers. To demonstrate the applicability of the model, we have also induced these errors in Hindi-English code-mixed data. An attention-based two-level deep network architecture (using LSTM as the basic unit) is employed for error detection, error correction, and translation of code-mixed sentences into monolingual sentences. Results are reported in terms of accuracy, ROUGE scores, and BLEU scores at the word and sentence levels for both the Bengali-English and Hindi-English code-mixed languages.
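The typographical errors described above can be induced programmatically. Below is a minimal sketch of one such noising step, assuming three common error types (adjacent-character swap, deletion, and keyboard-neighbour substitution); the adjacency map, function names, and the sample romanized Bengali-English sentence are illustrative, not taken from the paper.

```python
import random

# Hypothetical (partial) QWERTY adjacency map for substitution errors.
ADJACENT = {"a": "sq", "e": "wr", "i": "uo", "o": "ip", "n": "bm"}

def induce_typo(word, rng):
    """Apply one random typographical error: swap, delete, or substitute."""
    if len(word) < 2:
        return word
    op = rng.choice(["swap", "delete", "substitute"])
    i = rng.randrange(len(word) - 1)
    if op == "swap":                       # transpose adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":                     # drop one character
        return word[:i] + word[i + 1:]
    repl = rng.choice(ADJACENT.get(word[i], word[i]))
    return word[:i] + repl + word[i + 1:]  # substitute a neighbouring key

rng = random.Random(0)
# Illustrative romanized code-mixed sentence, one induced error per word.
noisy = [induce_typo(w, rng) for w in "ami school jabo na".split()]
```

Pairing each clean sentence with many such noised variants yields the parallel (erroneous, correct) data needed to train the detection/correction model.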
Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attention are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for a better user experience. However, a common limitation of the conventional softmax-based online attention approaches is that they introduce an additional hyperparameter related to the length of the attention window, requiring multiple rounds of model training to tune it. To deal with this problem, we propose a novel softmax-free attention method and its modified formulation for online attention, which needs no additional hyperparameter at the training phase. Through a number of ASR experiments, we demonstrate that the tradeoff between latency and performance of the proposed online attention technique can be controlled by merely adjusting a threshold at the test phase. Furthermore, the proposed methods showed performance competitive with the conventional global and online attentions in terms of word error rate (WER).
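The key idea, a test-time threshold instead of a trained window-length hyperparameter, can be illustrated with a toy softmax-free weighting. The paper's exact formulation differs; this sketch merely shows how raising a threshold at inference shrinks the attended window with no retraining. All names and the ReLU-plus-normalisation scheme are assumptions for illustration.

```python
import numpy as np

def relu_attention(scores, threshold=0.0):
    """Softmax-free attention sketch: ReLU the raw scores, zero out
    entries below `threshold`, then normalise by the sum.  Raising
    `threshold` at test time prunes low-scoring frames, trading
    accuracy for a shorter (lower-latency) attention window."""
    w = np.maximum(scores, 0.0)
    w[w < threshold] = 0.0
    total = w.sum()
    return w / total if total > 0 else w

scores = np.array([0.1, 0.9, 0.4, -0.2, 0.05])   # toy frame scores
w_full = relu_attention(scores)                   # full window
w_trunc = relu_attention(scores, threshold=0.3)   # pruned window
```

With `threshold=0.3` only two frames keep nonzero weight, so the decoder attends over a strictly smaller window than in the unthresholded case.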
Sequence-to-sequence (seq2seq) automatic speech recognition (ASR) has recently achieved state-of-the-art performance with fast decoding and a simple architecture. On the other hand, it requires a large amount of training data and cannot use text-only data for training. In our previous work, we proposed a method for applying text data to seq2seq ASR training by leveraging text-to-speech (TTS). However, we observe that the log Mel-scale filterbank (lmfb) features produced by the Tacotron 2-based model are blurry, particularly along the time dimension. This problem is mitigated by introducing the WaveNet vocoder to generate speech of better quality, or spectrograms of better time resolution, which makes it possible to train waveform-input end-to-end ASR. Here we use CNN filters and apply a masking method similar to SpecAugment. We compare the waveform-input model with two kinds of lmfb-input models: (1) lmfb features directly generated by TTS, and (2) lmfb features converted from the waveform generated by TTS. Experimental evaluations show that the combination of waveform-output TTS and the waveform-input end-to-end ASR model outperforms the lmfb-input models in two domain adaptation settings.
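A SpecAugment-style mask of the kind mentioned above simply zeroes a contiguous run of frames in the feature matrix. The sketch below shows time masking only (SpecAugment also masks frequency bins and warps time); function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def time_mask(features, mask_width, start, mask_value=0.0):
    """Zero out `mask_width` consecutive frames starting at `start`.

    `features` is a (frames, bins) matrix, e.g. lmfb features.  In
    practice `start` and `mask_width` would be drawn at random per
    training example; fixed values are used here for clarity."""
    masked = features.copy()
    masked[start:start + mask_width, :] = mask_value
    return masked

feats = np.ones((100, 80))                       # 100 frames, 80 bins
out = time_mask(feats, mask_width=10, start=20)  # frames 20..29 masked
```

For the waveform-input model the same idea applies, with the mask zeroing a span of raw samples rather than feature frames.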
ISBN: (Print) 9781510872219
Acoustic-to-word speech recognition based on attention-based encoder-decoder models achieves better accuracy with much lower latency than conventional speech recognition systems. However, acoustic-to-word models require a very large amount of training data, and it is difficult to prepare such data for a new domain such as elderly speech. To address this problem, we propose domain adaptation based on transfer learning with layer freezing. Layer freezing first pre-trains a network on the source domain data, and then a subset of the parameters is re-trained for the target domain while the rest are fixed. In the attention-based acoustic-to-word model, the encoder part is frozen to maintain its generality, and only the decoder part is re-trained to adapt to the target domain. This effectively adapts the latent linguistic capability of the decoder to the target domain. Using a large-scale Japanese spontaneous speech corpus as the source, the proposed method is applied to three target domains: a call-center task and two voice search tasks, by adults and by the elderly. The models trained with the proposed method achieved better accuracy than the baseline models, which were trained from scratch or entirely re-trained on the target domain.
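The freezing scheme above amounts to excluding the encoder's parameters from the optimizer update during target-domain fine-tuning. A minimal numerical sketch (the "encoder"/"decoder" grouping mirrors the paper's scheme, but the model and SGD step are stand-ins, not the actual AED network):

```python
import numpy as np

def sgd_step(params, grads, frozen, lr=0.1):
    """One SGD update that skips any parameter group named in `frozen`.

    In a real framework this is done by disabling gradients for the
    frozen layers (e.g. requires_grad=False in PyTorch); here each
    group is just a flat numpy vector for illustration."""
    return {name: p if name in frozen else p - lr * grads[name]
            for name, p in params.items()}

params = {"encoder": np.ones(4), "decoder": np.ones(4)}
grads = {"encoder": np.full(4, 0.5), "decoder": np.full(4, 0.5)}

# Fine-tune on target-domain data with the encoder frozen.
adapted = sgd_step(params, grads, frozen={"encoder"})
```

After the step the encoder parameters are untouched while the decoder has moved, which is exactly the adaptation behaviour the abstract describes.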