Recently end-to-end speech recognition has obtained much attention. One of the popular models to achieve end-to-end speech recognition is attention based encoder-decoder model, which usually generating output sequence...
详细信息
ISBN:
(纸本)9781510848764
Recently end-to-end speech recognition has obtained much attention. One of the popular models to achieve end-to-end speech recognition is attention based encoder-decoder model, which usually generating output sequences iteratively by attending the whole representations of the input sequences. However. predicting outputs until receiving the whole input sequence is not practical for online or low time latency speech recognition. In this paper, we present a simple but effective attention mechanism which can make the encoder-decoder model generate outputs without attending the entire input sequence and can apply to online speech recognition. At each prediction step, the attention is assumed to be a time-moving gaussian window with variable size and can be predicted by using previous input and output information instead of the content based computation on the whole input sequence. To further improve the online performance of the model, we employ deepconvolutional neural networks as encoder. Experiments show that the gaussian prediction based attention works well and under the help of deepconvolutional neural networks the online model achieves 19.5% phoneme error rate in TIMIT ASR task.
暂无评论