ISBN: (Print) 9781457705380
Log-linear acoustic models have been shown to be competitive with Gaussian mixture models in speech recognition. Their high training time can be reduced by feature selection. We compare a simple univariate feature selection algorithm with ReliefF, an efficient multivariate algorithm. An alternative to feature selection is ℓ1-regularized training, which leads to sparse models. We observe that this gives no speedup when sparse features are used, hence feature selection methods are preferable. For dense features, ℓ1-regularization can reduce training and recognition time. We generalize the well-known Rprop algorithm for the optimization of ℓ1-regularized functions. Experiments on the Wall Street Journal corpus showed that a large number of sparse features could be discarded without loss of performance. Strong regularization led to slight performance degradation, but can be useful on large tasks where training the full model is not tractable.
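The abstract names a generalization of Rprop to ℓ1-regularized objectives without spelling out the update rule here. As a minimal sketch under that gap, the snippet below runs plain sign-based Rprop (the variant without weight backtracking) on a smooth loss plus an ℓ1 subgradient; the function name, hyperparameters, and the toy least-squares usage are hypothetical, and the paper's actual ℓ1 handling, which would be needed to obtain exact zeros, is not reproduced.

import numpy as np

def rprop_l1(grad_fn, w0, lam=1e-4, eta_plus=1.2, eta_minus=0.5,
             delta0=0.1, delta_min=1e-6, delta_max=1.0, iters=100):
    # Sign-based Rprop (no backtracking) on a smooth loss plus an l1 subgradient.
    # Illustrative sketch only; not the paper's generalized Rprop.
    w = w0.astype(float).copy()
    delta = np.full_like(w, delta0)
    prev_sign = np.zeros_like(w)
    for _ in range(iters):
        g = grad_fn(w) + lam * np.sign(w)        # subgradient of lam * ||w||_1
        s = np.sign(g)
        same = prev_sign * s                     # >0: same sign as last step, <0: sign flip
        delta = np.where(same > 0, np.minimum(delta * eta_plus, delta_max), delta)
        delta = np.where(same < 0, np.maximum(delta * eta_minus, delta_min), delta)
        w -= s * delta                           # step size depends only on the gradient sign
        prev_sign = s
    return w

# toy usage: l1-regularized least squares on random data (hypothetical)
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
w_hat = rprop_l1(lambda w: A.T @ (A @ w - b) / len(b), np.zeros(10))

The sign-only update is what makes Rprop insensitive to gradient magnitudes, which is part of its appeal for the high-dimensional log-linear models discussed above.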
ISBN: (Print) 9781457705380
The use of statically compiled search networks for ASR systems using huge vocabularies and complex language models often becomes challenging in terms of memory requirements. Dynamic network decoders introduce additional computations in favor of significantly lower memory consumption. In this paper we investigate the properties of two well-known search strategies for dynamic network decoding, namely history conditioned tree search and WFST-based search using dynamic transducer composition. We analyze the impact of the differences in search graph representation, search space structure, and language model look-ahead techniques. Experiments on an LVCSR task illustrate the influence of the compared properties.
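As a rough illustration of the dynamic transducer composition mentioned above, the sketch below expands the outgoing arcs of a composed state (A ∘ B) only when that state is reached during search, rather than building the full static network in advance. The dict-based transducer representation, the arc tuple layout, and the function name are assumptions; epsilon arcs, composition filters, and language model look-ahead, all essential in a real decoder, are omitted.

def compose_arcs(state_a, state_b, arcs_a, arcs_b):
    # On-demand expansion of one composed WFST state: match A's output labels
    # against B's input labels. Arcs are (in_label, out_label, weight, next_state).
    expanded = []
    for in_a, out_a, w_a, next_a in arcs_a.get(state_a, []):
        for in_b, out_b, w_b, next_b in arcs_b.get(state_b, []):
            if out_a == in_b:                    # labels must match to compose
                expanded.append((in_a, out_b, w_a + w_b, (next_a, next_b)))
    return expanded

# toy usage with two tiny transducers (hypothetical labels and weights)
arcs_a = {0: [("a", "x", 0.5, 1)]}
arcs_b = {0: [("x", "A", 0.2, 1), ("y", "B", 0.1, 2)]}
print(compose_arcs(0, 0, arcs_a, arcs_b))        # [('a', 'A', 0.7, (1, 1))]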
We have recently proposed an EM-style algorithm to optimize log-linear models with hidden variables. In this paper, we use this algorithm to optimize a hidden conditional random field, i.e., a conditional random field with hidden variables. Similar to hidden Markov models, the alignments are the hidden variables in the examples considered. Here, EM-style algorithms are iterative optimization algorithms which are guaranteed to improve the training criterion in each iteration without the need for tuning step sizes, sophisticated update schemes or numerical line optimization (with hardly predictable complexity). This is a rather strong property which conventional gradient-based optimization algorithms do not have. We present experimental results for a grapheme-to-phoneme conversion task and compare the convergence behavior of the EM-style algorithm with L-BFGS and Rprop.
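As a sketch of the training criterion such an algorithm optimizes (not of the EM-style update itself, which the abstract does not detail), the snippet below evaluates the marginal log-likelihood log p(y|x) of a log-linear model whose hidden variable is the alignment. The feature function, alignment enumerator, and class set are hypothetical placeholders, and the brute-force enumeration stands in for the dynamic programming a real implementation would use.

import numpy as np
from scipy.special import logsumexp

def hidden_loglinear_loglik(theta, feats, alignments, x, y, classes):
    # Hidden CRF criterion: log p(y|x) = log sum_a exp(theta . f(x, y, a)) - log Z(x),
    # where the alignment a is the hidden variable and Z(x) sums over all classes
    # and their alignments. Enumeration is only feasible for tiny toy examples.
    def score(label, align):
        return float(theta @ feats(x, label, align))
    numerator = logsumexp([score(y, a) for a in alignments(x, y)])
    log_z = logsumexp([score(c, a) for c in classes for a in alignments(x, c)])
    return numerator - log_z

# toy usage: two classes, two candidate alignments each (all values hypothetical)
feats = lambda x, c, a: np.array([len(x), float(a), 1.0 if c == "A" else 0.0])
alignments = lambda x, c: [0, 1]
theta = np.array([0.1, 0.2, 0.3])
print(hidden_loglinear_loglik(theta, feats, alignments, "abc", "A", ["A", "B"]))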
Conditional Random Fields (CRFs) are a state-of-the-art approach to natural language processing tasks like grapheme-to-phoneme (g2p) conversion, which is used to produce pronunciations or pronunciation variants for almost all ASR pronunciation lexica. One drawback of CRFs is that training requires an alignment between graphemes and phonemes, usually even a 1-to-1 alignment. The quality of the g2p result heavily depends on this alignment. Since these alignments are usually not annotated within the corpora, external models have to be used to produce such an alignment in a preprocessing step. In this work, we propose two approaches to integrate the alignment generation directly and efficiently into the CRF training process. Whereas the first approach relies on linear segmentation as a starting point, the second approach considers all possible alignments given certain constraints. Both methods have been evaluated on two English g2p tasks, namely NETtalk and Celex, on which state-of-the-art results have been reported in the literature. The proposed approaches lead to results comparable to the state of the art.
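To make the first approach's starting point concrete, the sketch below produces a linear segmentation of a phoneme sequence over the grapheme positions, i.e. a crude monotone initial alignment obtained without any external alignment model. The function name and the example word are made up for illustration.

def linear_segmentation(graphemes, phonemes):
    # Distribute the phonemes over the grapheme positions proportionally,
    # giving each grapheme a contiguous (possibly empty) phoneme span.
    n, m = len(graphemes), len(phonemes)
    alignment = []
    for i, g in enumerate(graphemes):
        lo, hi = i * m // n, (i + 1) * m // n    # phoneme span for grapheme i
        alignment.append((g, phonemes[lo:hi]))
    return alignment

# toy usage (hypothetical pronunciation)
print(linear_segmentation(list("mixing"), ["m", "ih", "k", "s", "ih", "ng"]))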
Does using our hands help us to add the value of a set of coins? We test the benefits and costs of direct interaction with a mental arithmetic task in a computerized yoked design in which groups of participants vary i...
We propose to improve speech recognition performance on speaker-independent, mixed-language speech by asymmetric acoustic modeling. Mixed language is either inter-sentential code switching from the source matrix language to a foreign language, or intra-sentential code mixing between the matrix language and embedded foreign words or phrases. In either case, the foreign phrases are pronounced by the matrix-language speaker with varying degrees of accent. Our proposed system, using selective decision tree merging between a bilingual model and an accented embedded speech model, outperforms the previous approaches of using a bilingual model with model retraining by 21.51% and of using adaptation by 15.88%. It outperforms all models on both code-mixing and code-switching cases. We successfully improved recognition of embedded foreign speech without degrading performance on the matrix-language speech.
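The abstract does not spell out the merging criterion. As one hedged illustration of selective merging, the sketch below keeps the bilingual model's tied-state Gaussian unless the corresponding accented-model Gaussian is close under a symmetric KL divergence, in which case the parameters are averaged; the threshold, the per-state dictionaries, and the averaging rule are all assumptions rather than the paper's method.

import numpy as np

def sym_kl_diag_gauss(mu1, var1, mu2, var2):
    # Symmetric KL divergence between two diagonal Gaussians.
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1.0)
    return kl12 + kl21

def selective_merge(bilingual_leaves, accented_leaves, threshold=5.0):
    # For each tied state keep the bilingual Gaussian, unless the accented
    # embedded-speech Gaussian is close enough, in which case average the two
    # (hypothetical merge rule for illustration only).
    merged = {}
    for state, (mu_b, var_b) in bilingual_leaves.items():
        if state in accented_leaves:
            mu_a, var_a = accented_leaves[state]
            if sym_kl_diag_gauss(mu_b, var_b, mu_a, var_a) < threshold:
                merged[state] = ((mu_b + mu_a) / 2, (var_b + var_a) / 2)
                continue
        merged[state] = (mu_b, var_b)
    return merged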
The length of the test speech greatly influences the performance of a GMM-UBM based text-independent speaker recognition system: when the length of valid speech is as short as 1–5 seconds, performance decreases significantly, because the GMM-UBM approach is a statistical method whose foundation is sufficient data. Considering that the use of text information can help speaker recognition, a multi-model method is proposed to improve short-utterance speaker recognition (SUSR) in Chinese. We build several phoneme-class models for each speaker to represent different parts of the characteristic space and fuse the scores of the test data on these models, with the aim of increasing the matching degree between the training models and the test utterance. Experimental results showed that the proposed method achieved a relative EER reduction of about 26% compared with the traditional GMM-UBM method.
Author:
Thomas Fang Zheng, Center for Speech and Language Technologies
Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University
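As a rough sketch of the multi-model scoring step described above, the snippet below fuses per-phoneme-class GMM-UBM log-likelihood-ratio scores of a short test utterance, weighting each class by how many test frames fall into it. The class inventory, the frame-to-class assignment, and the duration-weighted fusion are assumptions; the models are only assumed to expose a .score(X) method returning the mean frame log-likelihood (as, e.g., a trained sklearn GaussianMixture does).

import numpy as np

def fuse_phone_class_scores(test_feats, class_frames, speaker_models, ubm_models):
    # Score one speaker on a short utterance using per-phoneme-class models.
    # class_frames[c] holds the indices of the test frames assigned to class c.
    scores, weights = [], []
    for cls, idx in class_frames.items():
        if len(idx) == 0 or cls not in speaker_models:
            continue                             # skip classes unseen in this utterance
        x = test_feats[idx]
        llr = speaker_models[cls].score(x) - ubm_models[cls].score(x)
        scores.append(llr)
        weights.append(len(idx))                 # weight each class by its duration
    return np.average(scores, weights=weights)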
Speaker clustering is an important step in multi-speaker detection tasks, and its performance directly affects speaker detection performance. It is observed that the shorter the average length of single-speaker speech segments after segmentation, the worse the performance of the subsequent speaker recognition. A reasonable path to better multi-speaker detection performance is therefore to enlarge the average length of post-segmentation single-speaker segments, which is equivalent to clustering as many segments from the same speaker into one as possible. In other words, the average class purity of each speaker segment should be as high as possible. Accordingly, a speaker-clustering algorithm based on the class-purity criterion is proposed, in which a Reference Speaker Model (RSM) scheme is adopted to calculate the distance between speech segments, and maximal class purity, or equivalently minimal within-class dispersion, is taken as the criterion. Experiments on the NIST SRE 2006 database showed that, compared with the conventional Hierarchical Agglomerative Clustering (HAC) algorithm, for speech segments with average lengths of 2, 5, and 8 seconds, the proposed algorithm increased the valid class speech length by 2.7%, 3.8%, and 4.6%, respectively, and increased the target speaker detection recall rate by 7.6%, 6.2%, and 5.1%, respectively.
Author:
Thomas Fang Zheng, Center for Speech and Language Technologies
Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University
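To make the RSM distance and the within-class-dispersion criterion concrete, the sketch below represents each segment as its vector of scores against a set of reference speaker models and then merges clusters greedily so that total within-class dispersion grows as little as possible. The greedy procedure, the score normalization, and the fixed target number of clusters are illustrative assumptions rather than the paper's exact algorithm.

import numpy as np

def rsm_embedding(segment_feats, reference_models):
    # Represent a segment by its scores against reference speaker models
    # (models assumed to expose a .score(X) method), length-normalized.
    v = np.array([m.score(segment_feats) for m in reference_models])
    return v / np.linalg.norm(v)

def cluster_by_purity(embeddings, n_clusters):
    # Greedy agglomeration: always merge the pair of clusters whose union adds
    # the least within-class dispersion (a proxy for maximal class purity).
    clusters = [[i] for i in range(len(embeddings))]
    def dispersion(idx):
        pts = embeddings[idx]
        return float(np.sum((pts - pts.mean(axis=0)) ** 2))
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = (dispersion(clusters[a] + clusters[b])
                        - dispersion(clusters[a]) - dispersion(clusters[b]))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# toy usage: four segment embeddings, clustered into two groups (hypothetical values)
emb = np.array([[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]])
print(cluster_by_purity(emb, 2))                 # [[0, 1], [2, 3]]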
Performance degradation over time is a generally acknowledged phenomenon in speaker recognition, and it is widely assumed that speaker models should be updated from time to time to maintain representativeness. However, this is costly, user-unfriendly, and sometimes unrealistic, which hinders the technology in practical applications. From a pattern recognition point of view, the time-varying issue in speaker recognition calls for features that are speaker-specific and as stable as possible across sessions recorded at different times. Therefore, after searching for and analyzing the most stable parts of the feature space, a Discrimination-emphasized Mel-frequency-warping method is proposed. In implementation, each frequency band is assigned a discrimination score that takes into account both speaker and session information, and Mel-frequency warping is applied in feature extraction to emphasize bands with higher scores. Experimental results show that on the time-varying voiceprint database, this method not only improves speaker recognition performance, with an EER reduction of 19.1%, but also alleviates the performance degradation caused by time variation, with a reduction of 8.9%.
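As one way to picture the warping step, the sketch below reallocates the Mel axis so that bands with higher discrimination scores occupy a wider share of it, and returns a piecewise-linear warping function. The band layout, the score values, and this particular reallocation rule are assumptions; the abstract does not specify how the scores are turned into a warp.

import numpy as np

def discrimination_warp(band_edges_mel, band_scores):
    # Reallocate the Mel range in proportion to score-weighted band widths and
    # return a piecewise-linear warp from the original to the emphasized axis.
    scores = np.asarray(band_scores, dtype=float)
    widths = np.diff(band_edges_mel)
    weighted = widths * scores
    new_widths = weighted / weighted.sum() * (band_edges_mel[-1] - band_edges_mel[0])
    new_edges = band_edges_mel[0] + np.concatenate(([0.0], np.cumsum(new_widths)))
    return lambda mel: np.interp(mel, band_edges_mel, new_edges)

# toy usage: 24 Mel bands, pretending the middle bands score highest (hypothetical)
edges = np.linspace(0.0, 2840.0, 25)
scores = np.ones(24); scores[8:16] = 2.0
warp = discrimination_warp(edges, scores)
print(warp(edges[8]), warp(edges[16]))           # ~710 and ~2130: the high-score region widens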