Keyword spotting in continuous speech is a challenging problem with relevance to applications such as audio indexing and music retrieval. In this work, the problem of keyword spotting is addressed by exploiting the complementary information present in the spectral and prosodic features of the speech signal. A thorough analysis of this complementary information is performed on a large Hindi-language database developed for this purpose. Phonetic and prosodic distribution analyses are performed toward this end, using canonical correlation and the Student's t-distance function. Motivated by these analyses, novel methods for spectral and prosodic information fusion that optimize a combined error function are proposed. The fusion methods are developed at both the feature and the model level. Improved syllable sequence prediction and keyword spotting performance are obtained with these methods compared to conventional keyword spotting methods. Additionally, to enable comparison with state-of-the-art deep-learning-based methods, a novel method for improved syllable sequence prediction using deep denoising autoencoders is proposed. The performance of the proposed methods is evaluated for keyword spotting using a syllable sliding protocol over a large Hindi database. Reasonable performance improvements are observed in the experimental results on syllable sequence prediction, keyword spotting, and audio retrieval.
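The canonical-correlation analysis mentioned in the abstract measures how strongly two feature sets (here, spectral and prosodic) co-vary. A minimal numpy sketch of that step is below; the feature matrices and dimensions are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def canonical_correlations(X, Y, eps=1e-8):
    """Canonical correlations between feature sets X (n x p) and Y (n x q).

    X could hold per-frame spectral features and Y prosodic features;
    the returned values (in [0, 1]) quantify their shared information.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Cxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten each set; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)
```

When Y is (approximately) a linear transform of X, the leading canonical correlation approaches 1, indicating largely redundant rather than complementary features.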
Background noise is a critical issue for hearing aid users; a common solution to this problem is speech enhancement (SE). Recently, a novel SE approach based on deep learning, called the deep denoising autoencoder (DDAE), has been proposed. Previous studies show that the DDAE SE approach provides superior noise suppression and produces less distortion in the processed speech than classical SE approaches. Motivated by these results, we propose a multi-objective learning-based DDAE (M-DDAE) SE approach in this study, and evaluate its speech quality and intelligibility improvements using seven typical hearing-loss audiograms. Our objective evaluations show that the M-DDAE approach achieves significantly better results than the DDAE approach under most test conditions. The proposed M-DDAE SE approach can therefore potentially be used to further improve the listening performance of hearing aid devices in noisy conditions.
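The core DDAE idea is a network trained to map noisy spectral frames to their clean counterparts. The toy numpy sketch below trains a single-hidden-layer denoising autoencoder on synthetic data; real DDAEs stack several layers and operate on log-power spectra, and all sizes and data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "clean" spectral frames and noisy observations of them.
clean = rng.normal(size=(256, 16))
noisy = clean + 0.3 * rng.normal(size=clean.shape)

d, h = 16, 32                                   # frame dim, hidden dim
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)

def forward(x):
    z = np.tanh(x @ W1 + b1)        # encoder
    return z, z @ W2 + b2           # decoder (linear output)

losses, lr = [], 0.05
for _ in range(200):
    z, out = forward(noisy)
    err = out - clean               # error against the CLEAN target
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation by hand for the two layers.
    gW2 = z.T @ err / len(noisy); gb2 = err.mean(axis=0)
    dz = (err @ W2.T) * (1 - z ** 2)            # through tanh
    gW1 = noisy.T @ dz / len(noisy); gb1 = dz.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

The M-DDAE of the paper extends this single reconstruction loss to a weighted sum over several learning targets; the sketch keeps only the basic denoising objective.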
ISBN: (Print) 9781538646595
Reverberation, generally caused by sound reflections from walls, ceilings, and floors, can severely degrade the performance of acoustic applications. Owing to a complicated combination of attenuation and time-delay effects, the reverberation property is difficult to characterize, and it remains a challenging task to effectively recover anechoic speech signals from reverberant ones. In the present study, we propose a novel integrated deep and ensemble learning algorithm (IDEA) for speech dereverberation. The IDEA consists of offline and online phases. In the offline phase, we train multiple dereverberation models, each aiming to precisely dereverberate speech signals in a particular acoustic environment; a unified fusion function is then estimated to integrate the information from the multiple dereverberation models. In the online phase, an input utterance is first processed by each of the dereverberation models, and the outputs of all models are integrated accordingly to generate the final anechoic signal. We evaluated the IDEA in designed acoustic environments, including both matched and mismatched conditions between the training and testing data. Experimental results confirm that the proposed IDEA outperforms a single deep-neural-network-based dereverberation model with the same model architecture and training data.
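The online phase described above (every per-environment model processes the input, then a fusion function combines the outputs) can be sketched as follows. The "models" here are trivial stand-in gains and the fusion weights are fixed by hand; in the IDEA both are DNN-based and learned offline.

```python
import numpy as np

# Stand-ins for per-environment dereverberation models: each maps a
# (frames x bins) spectrogram to a dereverberated estimate.
def make_model(gain):
    return lambda frames: frames * gain

models = [make_model(g) for g in (0.6, 0.9, 1.2)]

# Fusion function: here a simple convex combination of model outputs,
# with weights that the offline phase would estimate from data.
fusion_weights = np.array([0.2, 0.5, 0.3])

def dereverb(frames):
    # Online phase: run every model, then fuse all outputs into the
    # final anechoic estimate.
    outputs = np.stack([m(frames) for m in models])   # (K, T, F)
    return np.tensordot(fusion_weights, outputs, axes=1)
```

The fusion step is what lets the ensemble cover mismatched test conditions: no single model must match the unseen room, only the weighted combination.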
ISBN: (Print) 9781479999880
This paper compares unsupervised sequence training techniques for deep neural network (DNN) acoustic models for broadcast transcription. Recent progress in the digital archiving of broadcast content has made it easier to access large amounts of speech data. Such archived data are helpful for acoustic/language modeling in live-broadcast captioning based on automatic speech recognition (ASR). In Japanese broadcasts, however, archived programs, e.g., sports news, do not always have the closed captions typically used as references. Thus, unsupervised adaptation techniques are needed for performance improvement even when a DNN is used as the acoustic model. In this paper, we compare three unsupervised sequence adaptation techniques: maximum a posteriori (MAP), entropy minimization, and Bayes risk minimization. Experimental results for transcribing sports news programs show that the best ASR performance is achieved by Bayes risk minimization, which reflects information about expected errors, while comparable results are obtained with MAP, the simplest of the unsupervised sequence adaptation techniques.
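Of the three criteria compared, entropy minimization is the simplest to state: adapt the model so that its state posteriors on unlabeled audio become sharper. A small numpy sketch of the quantity being minimized, with made-up posterior matrices for illustration:

```python
import numpy as np

def frame_entropy(posteriors, eps=1e-12):
    """Average per-frame entropy of DNN state posteriors (frames x states).

    Entropy-minimization adaptation updates the model parameters to
    drive this quantity down on unlabeled data, i.e. toward confident
    (low-entropy) predictions; MAP and Bayes risk minimization instead
    use prior-anchored and expected-error-weighted objectives.
    """
    p = np.clip(posteriors, eps, 1.0)
    return float(-(p * np.log(p)).sum(axis=1).mean())

# Confident vs. maximally uncertain posteriors over 3 states.
sharp = np.array([[0.98, 0.01, 0.01],
                  [0.01, 0.98, 0.01]])
flat = np.full((2, 3), 1.0 / 3.0)
```

The uniform posterior attains the maximum entropy log(3), while confident posteriors score much lower, which is what makes the criterion usable without reference transcripts.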