We address the memory problem of maximum entropy language models (MELMs) with very large feature sets. Randomized techniques are employed to remove all large, exact data structures from MELM implementations. To avoid the dictionary structure that maps each feature to its corresponding weight, the feature hashing trick [1], [2] can be used. We also replace the explicit storage of features with a Bloom filter. We show with extensive experiments that false-positive errors of Bloom filters and random hash collisions do not degrade model performance. Both perplexity and WER improvements are demonstrated by building MELMs that would otherwise be prohibitively large to estimate or store.
Authors:
Thomas Fang Zheng, Center for Speech and Language Technologies
Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University
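To make the two randomized structures in the abstract above concrete, here is a minimal Python sketch that stores MaxEnt feature weights in a hashed array (no feature-to-index dictionary) and records observed features in a Bloom filter. The table sizes, the MD5-based hashing, and the n-gram feature string are illustrative assumptions, not the paper's implementation.

import hashlib

class HashedWeights:
    """Weight table indexed by a hash of the feature string (hashing trick)."""
    def __init__(self, num_buckets=1 << 20):
        self.num_buckets = num_buckets
        self.weights = [0.0] * num_buckets   # collisions are allowed by design

    def _bucket(self, feature):
        h = hashlib.md5(feature.encode("utf-8")).digest()
        return int.from_bytes(h[:8], "little") % self.num_buckets

    def add(self, feature, delta):
        self.weights[self._bucket(feature)] += delta

    def get(self, feature):
        return self.weights[self._bucket(feature)]

class BloomFilter:
    """Approximate set membership for the feature vocabulary."""
    def __init__(self, num_bits=1 << 22, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Usage: a feature is only scored if the Bloom filter says it was seen in training.
weights = HashedWeights()
seen = BloomFilter()
feature = "w-2=we|w-1=address|w=the"   # hypothetical n-gram feature encoding
seen.add(feature)
weights.add(feature, 0.37)
if feature in seen:
    print(weights.get(feature))

A Bloom-filter false positive simply scores an unseen feature with whatever its hash bucket holds; the abstract reports that this kind of noise does not degrade model performance.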
Speaker clustering is an important step in multi-speaker detection tasks, and its performance directly affects speaker detection performance. It is observed that the shorter the average length of the single-speaker speech segments produced by segmentation, the worse the subsequent speaker recognition performance. A reasonable route to better multi-speaker detection is therefore to enlarge the average length of these single-speaker segments, which is equivalent to clustering as many segments from the same true speaker into one cluster as possible; in other words, the average class purity of each cluster should be as high as possible. Accordingly, a speaker-clustering algorithm based on a class purity criterion is proposed, in which a Reference Speaker Model (RSM) scheme is adopted to calculate the distance between speech segments, and maximal class purity, or equivalently minimal within-class dispersion, is taken as the clustering criterion. Experiments on the NIST SRE 2006 database showed that, compared with the conventional Hierarchical Agglomerative Clustering (HAC) algorithm, for speech segments with average lengths of 2, 5 and 8 seconds, the proposed algorithm increased the valid class speech length by 2.7%, 3.8% and 4.6%, respectively, and in turn increased the target speaker detection recall rate by 7.6%, 6.2% and 5.1%, respectively.
Authors:
Thomas Fang Zheng, Center for Speech and Language Technologies
Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University
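The clustering criterion described above can be approximated with a short sketch: segments are represented by score vectors (for example, against a set of reference speaker models), and merges are chosen greedily to minimize the growth of within-class dispersion. The stopping threshold, the toy 2-D vectors, and the exact RSM scoring are assumptions for illustration only.

import numpy as np

def within_dispersion(cluster):
    """Sum of squared deviations from the cluster mean."""
    x = np.asarray(cluster)
    return float(((x - x.mean(axis=0)) ** 2).sum())

def cluster_segments(segment_vectors, max_dispersion_increase=1.0):
    clusters = [[v] for v in segment_vectors]   # start with one segment per cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                # increase in within-class dispersion caused by this merge
                cost = (within_dispersion(merged)
                        - within_dispersion(clusters[i])
                        - within_dispersion(clusters[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        if cost > max_dispersion_increase:   # stop before class purity degrades too much
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy usage: 2-D "RSM score" vectors for six short segments.
rng = np.random.default_rng(0)
segs = list(rng.normal(size=(6, 2)))
print(len(cluster_segments(segs)))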
Performance degradation over time is a generally acknowledged phenomenon in speaker recognition, and it is widely assumed that speaker models should be updated from time to time to maintain their representativeness. However, such updating is costly, user-unfriendly and sometimes unrealistic, which hinders the technology in practical applications. From a pattern recognition point of view, the time-varying issue in speaker recognition calls for features that are speaker-specific and as stable as possible across sessions separated in time. Therefore, after searching for and analyzing the most stable parts of the feature space, a Discrimination-emphasized Mel-frequency-warping method is proposed. In implementation, each frequency band is assigned a discrimination score that takes into account both speaker and session information, and Mel-frequency warping is performed during feature extraction to emphasize bands with higher scores. Experimental results on the time-varying voiceprint database show that this method not only improves speaker recognition performance, with an EER reduction of 19.1%, but also alleviates the performance degradation caused by time varying, with a reduction of 8.9%.
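A rough sketch of the warping idea, under the assumption that it can be approximated by reallocating filter-bank resolution according to per-band discrimination scores: filter centre frequencies are placed uniformly along the cumulative score curve, so higher-scoring bands receive more filters. The band layout and score values below are made up for illustration.

import numpy as np

def warped_centres(band_edges_hz, band_scores, num_filters):
    """Place filter centres uniformly in a score-warped frequency domain."""
    edges = np.asarray(band_edges_hz, dtype=float)           # len = num_bands + 1
    scores = np.asarray(band_scores, dtype=float)
    cum = np.concatenate([[0.0], np.cumsum(scores)])          # warping curve samples
    cum /= cum[-1]                                             # normalise to [0, 1]
    targets = np.linspace(0.0, 1.0, num_filters + 2)[1:-1]    # uniform in warped axis
    # invert the piecewise-linear warping to get centre frequencies in Hz
    return np.interp(targets, cum, edges)

# Toy usage: 8 bands spanning 0-4 kHz; bands 4-6 judged most speaker-discriminative.
edges = np.linspace(0, 4000, 9)
scores = [1, 1, 1, 3, 4, 3, 1, 1]
print(warped_centres(edges, scores, num_filters=12).round(1))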
The length of the test speech greatly influences the performance of GMM-UBM based text-independent speaker recognition systems; for example, when the length of valid speech is as short as 1-5 seconds, performance decreases significantly, because the GMM-UBM approach is a statistical one whose foundation is sufficient data. Considering that the use of text information can help speaker recognition, a multi-model method is proposed to improve short-utterance speaker recognition (SUSR) in Chinese. We build several phoneme-class models for each speaker to represent different parts of the characteristic space, and fuse the scores of the test data against these models so as to increase the matching degree between the training models and the test utterance. Experimental results showed that the proposed method achieved a relative EER reduction of about 26% compared with the traditional GMM-UBM method.
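A hedged sketch of the multi-model scoring idea: one GMM per (speaker, phoneme class), test frames routed by class, and the per-class average log-likelihoods fused into one speaker score. The class labels, model sizes, synthetic 13-dimensional frames and the equal-weight fusion rule are assumptions, not the paper's exact recipe.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(frames_by_class, n_components=4):
    """frames_by_class: dict phoneme_class -> (N, dim) array of training frames."""
    return {c: GaussianMixture(n_components=n_components, covariance_type="diag",
                               random_state=0).fit(x)
            for c, x in frames_by_class.items()}

def fused_score(models, test_frames_by_class):
    """Average the per-class mean log-likelihoods into a single speaker score."""
    scores = []
    for c, x in test_frames_by_class.items():
        if c in models and len(x):
            scores.append(models[c].score(x))   # mean log-likelihood per frame
    return float(np.mean(scores)) if scores else float("-inf")

# Toy usage with synthetic 13-dim "MFCC" frames split into two phoneme classes.
rng = np.random.default_rng(1)
train = {"vowel": rng.normal(size=(200, 13)), "nasal": rng.normal(1, 1, (150, 13))}
test = {"vowel": rng.normal(size=(30, 13)), "nasal": rng.normal(1, 1, (20, 13))}
spk = train_speaker_models(train)
print(fused_score(spk, test))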
In this paper, we present an in-car Chinese noise corpus that can be used to simulate complicated car environments for robust speech recognition research and experiments. The corpus was collected in mainland China in 2009 and 2010 and covers a diversity of car conditions, including different car speeds, open or closed windows, weather conditions and surrounding environments. In particular, rumble strips are also taken into account because of the characteristic noise generated as a car passes over them. To make the corpus easier to use, we performed acoustic analyses on the noise data, focusing mainly on stationarity and on the energy distribution in the frequency domain. We also performed ASR experiments using selected noise data from the corpus, adding the noise to clean speech to simulate the in-car environment. This is the first in-car Chinese noise corpus of its kind, providing abundant and diversified samples for in-car noisy speech recognition tasks.
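As an example of the simulation step mentioned above (adding corpus noise to clean speech), the following sketch mixes a noise signal into clean speech at a chosen SNR; the synthetic signals and the SNR value are placeholders for waveforms taken from the corpus.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    if len(noise) < len(clean):                       # loop short noise recordings
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Toy usage with synthetic signals (replace with waveforms read from the corpus).
rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.normal(scale=0.1, size=8000)
noisy = mix_at_snr(clean, noise, snr_db=10)
print(noisy.shape)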
With billions of triples in the Linked Open Data cloud, which continues to grow exponentially, very challenging tasks begin to emerge related to large-scale reasoning. A considerable amount of work has been done on using Information Retrieval methods to address these problems. However, although the applied models work at Web scale, they downgrade the semantics contained in an RDF graph by treating each resource as a 'bag of words' (URIs/literals). Distributional statistics methods can address this problem by capturing the structure of the graph more effectively, but they continually run into efficiency and scalability problems on serial computing architectures due to their computational complexity. In this paper, we describe a parallelization of one such method, Random Indexing, based on the Message Passing Interface (MPI), which enables efficient utilization of high-performance parallel computers. Our evaluation results show significant performance improvement.
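A sketch of the general scheme, not the paper's code: Random Indexing accumulates sparse ternary index vectors of neighbouring tokens into per-token context vectors, documents are split across MPI ranks, and an Allreduce sums the per-rank results. The use of mpi4py, the vector dimensionality and the toy corpus are assumptions.

import hashlib
import numpy as np
from mpi4py import MPI   # run with e.g. `mpiexec -n 4 python random_indexing.py`

DIM, NONZERO = 512, 8

def index_vector(token):
    """Deterministic sparse ternary index vector, identical on every rank."""
    seed = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    v = np.zeros(DIM)
    pos = rng.choice(DIM, NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], NONZERO)
    return v

def accumulate(docs, window=2):
    """Sum index vectors of neighbouring tokens into each token's context vector."""
    context = {}
    for doc in docs:
        toks = doc.split()
        for i, t in enumerate(toks):
            vec = context.setdefault(t, np.zeros(DIM))
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    vec += index_vector(toks[j])
    return context

comm = MPI.COMM_WORLD
corpus = ["subject predicate object", "resource has property value"]  # toy stand-in for RDF text
local = accumulate(corpus[comm.Get_rank()::comm.Get_size()])

# Sum per-rank context vectors over a shared vocabulary.
vocab = sorted({w for doc in corpus for w in doc.split()})
local_mat = np.stack([local.get(w, np.zeros(DIM)) for w in vocab])
global_mat = np.zeros_like(local_mat)
comm.Allreduce(local_mat, global_mat, op=MPI.SUM)
if comm.Get_rank() == 0:
    print(global_mat.shape)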
Multiple accents are often present in Mandarin speech, as most Chinese speakers have learned Mandarin as a second language. We propose generating reliable accent-specific units, together with dynamic Gaussian mixture selection, for multi-accent speech recognition. Time-alignment phoneme recognition is used to generate these units and to model accent variations explicitly and accurately. The dynamic Gaussian mixture selection scheme builds a dynamic observation density for each frame during decoding, leading to more efficient use of Gaussian mixture components. The method increases coverage of the diverse accent variations present in multi-accent speech, and alleviates the performance degradation caused by pruned beam search without increasing the model size. The effectiveness of this approach is evaluated on three typical Chinese accents: Chuan, Yue and Wu. Our approach significantly outperforms the traditional acoustic model reconstruction approach, with Syllable Error Rate (SER) reductions of 6.30%, 4.93% and 5.53%, respectively, without degrading performance on standard speech.
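The dynamic selection step can be illustrated with a small sketch: for each frame, only the top-K mixture components by weighted likelihood contribute to the observation density, with their weights renormalised. The value of K and the diagonal-covariance GMM parameters below are invented for illustration; this is not the paper's implementation.

import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian, broadcast over components."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def dynamic_density(frame, weights, means, vars_, top_k=2):
    log_wl = np.log(weights) + log_gauss_diag(frame, means, vars_)  # per component
    keep = np.argsort(log_wl)[-top_k:]                  # select the best components
    w = weights[keep] / weights[keep].sum()             # renormalise their weights
    ll = log_gauss_diag(frame, means[keep], vars_[keep])
    return float(np.log(np.sum(w * np.exp(ll - ll.max()))) + ll.max())  # log p(x)

# Toy usage: an 8-component, 13-dimensional diagonal GMM and one frame.
rng = np.random.default_rng(3)
M, D = 8, 13
weights = np.full(M, 1 / M)
means, vars_ = rng.normal(size=(M, D)), np.ones((M, D))
print(dynamic_density(rng.normal(size=D), weights, means, vars_, top_k=3))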
Usually, ambiguous words contained in an article appear several times. Almost all existing methods for unsupervised word sense disambiguation make use of information contained only in the ambiguous sentence. This paper presen...