We report an experiment in which a high-performance boosting-based NER model originally designed for multiple European languages is instead applied to the Chinese named entity recognition task of the third SIGHAN Chin...
We present twin results on Chinese semantic parsing, with application to English-Chinese cross-lingual verb frame acquisition. First, we describe two new state-of-the-art Chinese shallow semantic parsers that achieve an F-score of 82.01 on simultaneous frame and argument boundary identification and labeling. Second, we propose a model that applies the separate Chinese and English semantic parsers to learn cross-lingual semantic verb frame argument mappings with 89.3% accuracy. The only training data this cross-lingual learning model needs is a pair of non-parallel monolingual Propbanks plus an unannotated parallel corpus. We also present the first reported controlled comparison of maximum entropy and SVM approaches to shallow semantic parsing, using the Chinese data.
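For readers unfamiliar with the metric, the following is a minimal sketch of how an F-score over labeled argument spans is commonly computed. It is not the authors' evaluation script: the function name `f_score` and the tuple layout `(predicate, start, end, label)` are assumptions made for illustration, and a prediction counts as correct only if both its boundaries and its label match the gold annotation exactly.

```python
def f_score(gold, predicted):
    """Balanced F-score (F1) over labeled spans.

    Each item is a hypothetical (predicate, start, end, label) tuple.
    A predicted span is correct only on an exact match with a gold span.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)  # share of predictions that are right
    recall = correct / len(gold)          # share of gold spans that were found
    return 2 * precision * recall / (precision + recall)


# Illustrative data only, not taken from the paper.
gold = [("buy", 0, 1, "ARG0"), ("buy", 3, 5, "ARG1")]
predicted = [("buy", 0, 1, "ARG0"), ("buy", 3, 4, "ARG1"), ("buy", 6, 7, "ARGM-TMP")]
score = f_score(gold, predicted)
```

Multiplying the result by 100 gives a score on the same scale as the 82.01 reported above.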
We present the first known direct measurement of word alignment coverage on an Arabic-English parallel corpus using inversion transduction grammar (ITG) constraints. While direct measurements have been reported for several European and Asian languages, to date no results have been available for Arabic or any other Semitic language, despite much recent activity on Arabic-English spoken language and text translation. Many recent syntax-based statistical MT models operate within the domain of ITG expressiveness, often for efficiency reasons, so it has become important to determine the extent to which the ITG constraint assumption holds. Our results on Arabic provide further evidence that ITG expressiveness appears largely sufficient for core MT models.
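To make the ITG constraint concrete, here is a minimal sketch that tests whether a one-to-one word alignment, written as a permutation of target positions, can be produced by binary ITG reorderings (straight or inverted). It uses the well-known shift-reduce check for binarizable permutations and is only an illustration: the function name `is_itg_reachable` is an assumption, and the measurement described above works on real alignments, which may include unaligned and many-to-one words that this sketch does not handle.

```python
def is_itg_reachable(perm):
    """Return True if the permutation can be built by ITG reorderings.

    perm lists the target position of each source word and must contain
    each integer of a contiguous range exactly once.
    """
    stack = []  # each entry is a (low, high) range of target positions
    for value in perm:
        stack.append((value, value))
        # Merge the top two spans while they cover adjacent target ranges,
        # in straight order (left before right) or inverted order.
        while len(stack) >= 2:
            (lo1, hi1), (lo2, hi2) = stack[-2], stack[-1]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    return len(stack) == 1


# The two classic "inside-out" patterns are the ones ITG cannot express.
inside_out = is_itg_reachable([2, 4, 1, 3])
swapped_blocks = is_itg_reachable([3, 4, 1, 2])
```

Coverage in this simplified setting would be the fraction of sentence pairs whose alignment passes the check.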
This paper describes the development, collection, and initial analysis of the EARS (Effective, Affordable, Reusable Speech-to-text) Chinese telephony speech corpus (EARSCTS). The corpus contains 1206 ten-minute natural Mandarin conversations between either strangers or friends, totaling 200 hours. The conversations cover 40 topics, and each conversation focuses on a single topic. All the speech data are recorded over public telephone networks, e.g., landline and cellular channels. All the speech data are annotated manually with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech; time alignment information is also provided. This corpus can be used for conversational and spontaneous Mandarin speech recognition and other application-dependent tasks. The EARSCTS corpus is the largest and first of its kind for Mandarin conversational telephony speech, providing the sufficient and diversified samples needed for speech training, testing, adaptation, and development.
Code-switching is very common among bilingual speakers. Spoken queries by these speakers are typically in mixed language. In this paper, we propose an unsupervised method for mixed-language query understanding, using ...
We propose a completely unsupervised method for mining parallel sentences from quasi-comparable bilingual texts which have very different sizes, and which include both in-topic and off-topic documents. We discuss and ...
We present the first known result for named entity recognition (NER) in realistic large-vocabulary spoken Chinese. We establish this result by applying a maximum entropy model, currently the single best known approach...
ISBN (print): 0780386787
This paper describes the development of the EARS (effective, affordable, reusable speech-to-text) Chinese corpus, a telephony conversational speech database for speech processing. The EARS database is the first of its kind collected for spontaneous Mandarin Chinese telephony speech. The purpose of developing this EARS Chinese corpus is to collect Mandarin conversations between either strangers or friends, covering a wide range of topics, over landline and cellular channels. All the speech data are annotated with standard Chinese character transcription as well as specific mark-ups for spontaneous speech. This corpus will be used for conversational and spontaneous Mandarin speech recognition tasks under the DARPA EARS framework. This paper introduces the design, development, structure, and initial phonetic analysis of the first 50-hour collection of this corpus. An additional 300 to 500 hours of data will be collected and transcribed between 2004 and 2005.
Extending the phone set is one common approach for dealing with phonetic confusions in spontaneous speech. We propose using a likelihood ratio test as a confidence measure for automatic phone set extension to model phonetic confusions. We first extend the standard phone set using dynamic programming (DP) alignment to cover all possible phonetic confusions in the training data. The likelihood ratio test is then used as a confidence measure to optimize the extended phonetic units so that they represent the acoustic samples lying between two highly confusable standard phonetic units. The optimum set of extended phonetic units is combined with the standard phone set to form a multiple-pronunciation dictionary. The effectiveness of this approach is evaluated on spontaneous Mandarin telephony speech, where it gives an encouraging 1.09% absolute reduction in syllable error rate. Using the extended phone set provides a good balance between the demands of a high-resolution acoustic model and the available training data.
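As a point of reference for the reported figure, the sketch below shows one standard way to compute a syllable error rate with a dynamic programming alignment (Levenshtein distance) between a reference and a recognized syllable sequence. It is an assumed, generic scoring routine and not the paper's evaluation tool; the function name `syllable_error_rate` and the example syllables are hypothetical.

```python
def syllable_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference syllables."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one syllable")
    # dist[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete every remaining reference syllable
    for j in range(m + 1):
        dist[0][j] = j  # insert every hypothesis syllable
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m] / n


# Illustrative data only: one substituted syllable out of four.
reference = ["jin", "tian", "tian", "qi"]
hypothesis = ["jin", "tian", "dian", "qi"]
rate = syllable_error_rate(reference, hypothesis)
```

An absolute reduction is the plain difference between two such rates expressed in percentage points, so a 1.09% absolute reduction means the rate with the extended phone set is 1.09 points lower than the baseline rate.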
The high recognition error rate in spontaneous speech is due in part to poor modeling of pronunciations. In this paper, we propose modeling pronunciation variations through triphone model reconstruction. We first generate a partial change phone model (PCPM) to differentiate pronunciation variations. To improve the resolution of triphone models, the PCPM is used as a hidden model and merged into the pre-trained acoustic model through model reconstruction. To avoid model confusion, auxiliary decision trees are established for the triphone PCPM. The acoustic model reconstruction on triphones is equivalent to decision tree merging. The effectiveness of this approach is evaluated on the 1997 Hub4NE Mandarin Broadcast News Corpus (1997 MBN), which contains different styles of speech. It gives a significant 2.39% absolute reduction in syllable error rate on spontaneous speech.