MLP-based front-ends have been shown to be significantly complementary to conventional spectral features. As part of the DARPA GALE program, different MLP features were developed for Mandarin ASR. In this paper, all the...
Dear editor, Although face-sketch synthesis generates a sketch from a given face photo automatically [1], it remains an open research problem in computer vision [2–4]. Recently, several deep neural network (DNN) methods for face-sketch synthesis have been proposed with considerable results.
Current speech recognition systems mainly use features based on the Short-Time Fourier Transform, such as MFCC. Dropping the short-time stationarity assumption for voiced speech, this paper introduces non-stationary signal analysis into the ASR framework. We present new acoustic features extracted by a pitch-adaptive Gammatone filter bank. Noise robustness was demonstrated on the AURORA 2 and 4 tasks, where the proposed features outperform standard MFCC. Furthermore, successful combination experiments via ROVER indicate that the new features differ from MFCC.
In this work, we present a model for document-grounded response generation in dialog that is decomposed into two components according to Bayes' theorem. One component is a traditional ungrounded response generatio...
On dedicated websites, people can upload videos and share them with the rest of the world. Currently these videos are categorized manually with the help of the user community. In this paper, we propose a combination of co...
ISBN: 9781538646595 (print)
We present a new architecture and a training strategy for an adaptive mixture of experts with applications to domain-robust language modeling. The proposed model is designed to benefit from the scenario where the training data are available in diverse domains, as is the case for YouTube speech recognition. The two core components of our model are an ensemble of parallel long short-term memory (LSTM) expert layers, one for each domain, and another LSTM-based network which generates state-dependent mixture weights for combining the expert LSTM states by linear interpolation. The resulting model is a recurrent adaptive mixture model (RADMM) of domain experts. We train our model on 4.4B words from YouTube speech recognition data and report results on the YouTube speech recognition test set. Compared with a background LSTM model, we obtain up to 12% relative improvement in perplexity and an improvement in word error rate from 12.3% to 12.1% when using lattice rescoring with strong pruning.
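The combination step described above, state-dependent mixture weights that linearly interpolate the expert states, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the softmax normalization of the gating scores are all hypothetical, since the abstract states only that the weights are state dependent and that the combination is a linear interpolation. In the actual model both the expert states and the gating scores would come from LSTM layers.

```python
import math


def softmax(scores):
    """Normalize raw gating scores into weights that sum to 1.

    Assumption: the abstract does not say how the mixture weights are
    normalized; softmax is used here purely for illustration.
    """
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_expert_states(expert_states, gate_scores):
    """Combine per-domain expert states by linear interpolation.

    expert_states: one hidden-state vector per domain expert
                   (stand-ins for the expert LSTM outputs).
    gate_scores:   one raw score per expert (stand-ins for the output
                   of the gating network at the current time step).
    Returns the mixture weights and the interpolated state.
    """
    weights = softmax(gate_scores)
    dim = len(expert_states[0])
    mixed = [0.0] * dim
    for weight, state in zip(weights, expert_states):
        for i in range(dim):
            mixed[i] += weight * state[i]
    return weights, mixed


# Toy example: three experts with two-dimensional states and equal scores.
expert_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_scores = [0.0, 0.0, 0.0]
weights, mixed = mix_expert_states(expert_states, gate_scores)
```

Because the weights are recomputed at every time step from the gating scores, the mixture can shift toward different domain experts as the input changes, which is the sense in which the weights are state dependent.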
In this work, we present novel warping algorithms for full 2D pixel-grid deformations for face recognition. Due to high variation in face appearance, face recognition is considered a very difficult task, especially if only a single reference image per face, for example a mug-shot, is available. Usually model-based approaches with additional training data are used to cope with several types of variation occurring in facial imaging. In contrast, image warping yields a distance measure which is invariant with regard to several types of variation. This allows for precise recognition even when using only very few reference observations. Because optimal 2D warping is computationally complex, the pseudo-2D warping-based approaches of the past represented strong approximations of the original problem, and were mainly successful on data with low variability or on rectified images. We propose a novel 2D warping method which is globally optimal and makes no prior assumptions on the data variability besides two-dimensional smoothness constraints, which both avoid local mirroring and gaps and significantly speed up the optimization. Furthermore, we show that occlusion handling is imperative to obtain smooth warpings in a variety of domains. We evaluate our novel algorithm on various well-known databases, such as the AR-Face and CMU-PIE databases, and provide a detailed comparison to existing warping approaches. We show that by using simple relative 2D constraints, strong local features and a kernel which is robust w.r.t. occlusions, our computationally complex approaches outperform state-of-the-art results for recognizing faces under varying expressions, occlusions and poses. Most interestingly, we achieve higher accuracy using fewer training instances per class compared to methods learning a model of the 3D shape.
Sentiment analysis, mostly based on text, has been rapidly developing in the last decade and has attracted widespread attention in both academia and industry. However, information in the real world usually comes from ...
We present an analysis of music modeling and recognition techniques in the context of mobile music matching, substantially improving on the techniques presented in [1]. We accomplish this by adapting the features specifically to this task, and by introducing new modeling techniques that enable using a corpus of noisy and channel-distorted data to improve mobile music recognition quality. We report the results of an extensive empirical investigation of the system's robustness under realistic channel effects and distortions. We show that recognition accuracy improves through explicit duration modeling of music phonemes and through integrating the expected noise environment into the training process. Finally, we propose the use of frame-to-phoneme alignment for high-level structure analysis of polyphonic music.