Prosody structure prediction plays an important role in text-to-speech (TTS) conversion systems: it is a necessary step that precedes parametric prosody prediction. Dynamic programming (DP) and decision trees (DT) are widely used for prosody structure prediction [1][2][3], but both have well-known limitations. This paper proposes two new methods: a combination of dynamic programming with a decision tree, and a combination of a decision tree with a finite state machine (FSM). The four methods are then compared comprehensively on a manually labeled corpus. The experiments indicate that the combination of dynamic programming with a decision tree is the best choice for prosodic word boundary prediction, and that the combination of a decision tree with an FSM is the best candidate for prosodic phrase boundary prediction.
Because Chinese is a tonal language, how to apply tonal information is a particular issue in Mandarin ASR. In this paper we investigate three strategies for integrating tonal information based on a multi-stream model. The evaluation uses a Chinese name recognition task whose vocabulary consists of 100 tonally confusable pairs. Our experimental results show that any of the combination strategies yields a significant improvement in recognition accuracy, and that hypothesis combination achieves the largest reduction in word error rate (WER), 83.2%. Overall, hypothesis combination appears to be the best solution in terms of both recognition accuracy and flexibility for isolated-word speech recognition.
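The abstract does not say how hypothesis combination is carried out. As a minimal sketch only, assuming each stream (spectral and tonal) returns an N-best list of log-scores and that the lists are merged by a weighted sum, the reranking step might look like the following. The function name `combine_hypotheses`, the `tone_weight` parameter, the `floor` score, and all numbers are hypothetical and are not taken from the paper.

```python
def combine_hypotheses(spectral_nbest, tonal_nbest, tone_weight=0.5, floor=-1e9):
    """Rerank candidates by a weighted sum of per-stream log-scores.

    Both arguments map a hypothesis string to a log-score. A hypothesis that is
    missing from one stream receives the (assumed) floor score for that stream.
    """
    candidates = set(spectral_nbest) | set(tonal_nbest)
    combined = {
        hyp: (1.0 - tone_weight) * spectral_nbest.get(hyp, floor)
        + tone_weight * tonal_nbest.get(hyp, floor)
        for hyp in candidates
    }
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)


# Illustrative scores for one tonally confusable name pair (made-up values).
spectral = {"li3 ming2": -10.0, "li4 ming2": -10.2}  # spectral stream barely separates them
tonal = {"li3 ming2": -6.0, "li4 ming2": -2.0}       # tonal stream clearly prefers tone 4

print(combine_hypotheses(spectral, tonal, tone_weight=0.5))
```

With `tone_weight=0.0` the ranking falls back to the spectral stream alone, which gives one simple way to see what the tonal stream contributes for a confusable pair.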
We describe new methods for continuous Putonghua speech recognition. We have augmented the IBM HMM-based continuous speech recognition system [1-3] with the following features. First, we treat tones in Putonghua as attributes of certain phonemes rather than of syllables; we call these tone-bearing phonemes tonemes. Second, instantaneous pitch is treated as a variable in the acoustic feature vector, in the same way as cepstra or energy. Third, a set of word-segmentation rules converts continuous Chinese text into segmented text, so that the trigram language model works effectively. With these methods, a speaker-independent, very-large-vocabulary continuous Putonghua dictation system can be constructed.
This paper proposes a new method of speaker segmentation based on model scoring. The method consists of three phases: (1) pre-segmentation, (2) model scoring, and (3) clustering. In the first phase, we divide the utterances into small fragments, assuming that each fragment contains only one speaker. In the model scoring phase, the spectral features extracted from these fragments are mapped to score features by calculating their likelihood over a series of cohort speaker models. In the third phase, we merge the fragments that are closest to each other to form the utterance of one speaker. Higher accuracy can then be achieved by performing resegmentation. The experimental results demonstrate that this method produces satisfactory accuracy on speaker segmentation.
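The model scoring and clustering phases can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each cohort model is reduced to a single diagonal-covariance Gaussian, the score feature is the average log-likelihood per frame, fragments are compared by Euclidean distance between cluster centroids, and the number of speakers is given in advance. The names `log_gauss`, `score_vector`, and `cluster` are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def score_vector(fragment, cohort):
    """Map a fragment (list of frames) to one average log-likelihood per cohort model."""
    return [
        sum(log_gauss(frame, mean, var) for frame in fragment) / len(fragment)
        for mean, var in cohort
    ]


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(scores, n_speakers):
    """Merge the two closest clusters (by centroid distance) until n_speakers remain."""
    clusters = [[i] for i in range(len(scores))]
    dim = len(scores[0])

    def centroid(members):
        return [sum(scores[i][d] for i in members) / len(members) for d in range(dim)]

    while len(clusters) > n_speakers:
        _, i, j = min(
            (euclid(centroid(a), centroid(b)), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy 1-D example: two cohort models and four pre-segmented fragments (made-up values).
cohort = [([0.0], [1.0]), ([5.0], [1.0])]
fragments = [
    [[0.1], [-0.2]],
    [[5.1], [4.8]],
    [[0.3], [0.0]],
    [[4.9], [5.2]],
]
scores = [score_vector(f, cohort) for f in fragments]
print(cluster(scores, n_speakers=2))
```

In this toy input, fragments 0 and 2 lie near the first cohort model and fragments 1 and 3 near the second, so their score vectors fall into two groups. The resegmentation step mentioned in the abstract is not shown.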
We studied F0 reset and F0 range change in relation to a hierarchical multiple-phrase prosody framework for fluent speech that accounts for cumulative fluent speech output. Both F0 reset and F0 range modifications were analyzed with respect to levels of boundary breaks within and across phrases. We found that F0 reset occurred after higher levels of boundary breaks; F0 range may be reduced for transitional phrases that occur at the beginning of a speech paragraph, causing the first F0 reset to move forward to the next phrase. We also found gender-related differences in speaking style and in overall global F0 contour patterns. The findings are instrumental in enhancing both our prosody framework and its application to synthesis output.
The Gaussian mixture model (GMM) has been widely used in speaker recognition. It can be regarded as a one-state hidden Markov model (HMM) with multiple mixture components. As with the HMM, maximum-likelihood (ML) estimation is considered a good training approach for GMM-based speaker identification. However, it considers only the likelihood of a single speaker: each speaker model is estimated separately from that speaker's labeled training utterances. To compare a speaker's likelihood against those of similar speakers and to maximize the differences between them, this paper proposes another training algorithm for the GMM, named maximum model distance (MMD). Experimental results on the TI46-Word and TIMIT databases show that it is an attractive alternative for GMM training.
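For context, the identification step that both ML-trained and MMD-trained models feed into can be sketched as below: each speaker's GMM scores the test frames, and the speaker with the highest log-likelihood is chosen. This is a minimal sketch assuming 1-D features and hand-picked parameters; it does not implement the MMD training criterion, which the abstract does not specify. The names `gmm_log_likelihood` and `identify` and the speaker labels are hypothetical.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of 1-D frames under a Gaussian mixture model."""
    total = 0.0
    for x in frames:
        # Per-component log(weight * density), combined with log-sum-exp for stability.
        comps = [
            math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(comps)
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total


def identify(frames, speaker_models):
    """Return the speaker whose GMM gives the test frames the highest log-likelihood."""
    return max(
        speaker_models,
        key=lambda spk: gmm_log_likelihood(frames, *speaker_models[spk]),
    )


# Two made-up speaker models: (weights, means, variances) per speaker.
speaker_models = {
    "A": ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0]),
    "B": ([0.5, 0.5], [4.0, 5.0], [1.0, 1.0]),
}
test_frames = [0.2, 0.9, 0.4]
print(identify(test_frames, speaker_models))
```

Because ML training fits each entry of `speaker_models` independently, nothing in the training step pushes the scores of similar speakers apart; that gap is the motivation the abstract gives for a discriminative criterion such as MMD.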
This paper describes the development, collection, and initial analysis of the EARS (Effective, Affordable, Reusable Speech-to-text) Chinese telephony speech corpus (EARSCTS). The corpus contains 1206 ten-minute natural Mandarin conversations between either strangers or friends, about 200 hours in total. The conversations cover 40 topics, and each conversation focuses on a single topic. All the speech data were recorded over public telephone networks, e.g., landline and cellular channels. All the speech data are annotated manually with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech; time-alignment information is also provided. This corpus can be used for conversational and spontaneous Mandarin speech recognition and other application-dependent tasks. The EARSCTS corpus is the largest and first of its kind for Mandarin conversational telephony speech, providing the sufficient and diversified samples needed for speech training, testing, adaptation, and development.
This paper uses experiments to explain and give insight into the role of pitch information in text-independent speaker verification. The paper first briefly describes a robust fundamental frequency extraction method. It then describes in detail a channel-robust text-independent speaker verification system based on the GMM-UBM model, which fuses the extracted pitch information with acoustic features. Experiments are conducted on the NIST 1999 Speaker Recognition Evaluation corpus. The experimental results show that the pitch features are more robust to handset variability and noise, but not very robust to other factors such as emotion. How to use pitch information effectively in speaker recognition therefore remains a challenge for researchers.
Hierarchical recognition was proposed long ago in the pattern recognition field. Although it is a familiar process when humans perform a recognition task, there is no effective and systematic method for implementing it in speech recognition. This paper presents our recent experimental results on this topic, using the principle of sub-space partition to realize hierarchical recognition and a tree-based architecture to organize multiple recognizers. Although these preliminary evaluations are carried out with an isolated word recognition system for the 10 Chinese digit syllables, we believe that the approach can readily be extended to spontaneous speech recognition. The results show that the proposed algorithm achieves about a 10% error reduction compared with traditional methods. In future work, we will test all Chinese syllables and extend the approach to continuous speech recognition.
The number of digital audio and video documents being recorded and stored in large archives is growing faster than ever before. However, the lack of effective content searching methods is a major barrier that prevents people from using audio and video databases pervasively and intelligently. This paper reviews the state of the art in audio information retrieval and presents an audio searching solution that uses statistical pattern matching techniques. The two-tier speech retrieval solution proposed in this paper, named Audio Search, is composed of a fast searching component followed by an ends-free Viterbi keyword spotter combined with a histogram-based phoneme duration model. Its performance was evaluated on the 1997 HUB4 database for Chinese broadcast news. Compared with systems based on large-vocabulary automatic speech recognition, the Audio Search system shows significant improvements for both long queries (typically 5 to 7 syllables) and out-of-vocabulary (OOV) queries.