Search results - Inner Mongolia University Library

5th SIGHAN Workshop on Chinese Language Processing, co-located with COLING/ACL 2006

Authors: Yu, Xiaofeng; Carpuat, Marine; Wu, Dekai. Affiliation: Human Language Technology Center, HKUST, Department of Computer Science and Engineering, University of Science and Technology, Clear Water Bay, Hong Kong

ISBN: (print) 1932432701

We report an experiment in which a high-performance boosting-based NER model originally designed for multiple European languages is instead applied to the Chinese named entity recognition task of the third SIGHAN Chinese language processing bakeoff. Using a simple character-based model along with a set of features that are easily obtained from the Chinese input strings, the system employs boosting, a promising and theoretically well-founded machine learning method, to combine a set of weak classifiers into a final system. Even though we did no other Chinese-specific tuning and used only one-third of the MSRA and CityU corpora to train the system, reasonable results are obtained. Our evaluation shows overall F-measures of 75.07 and 80.51 on the MSRA and CityU test sets, respectively. © 2006 Association for Computational Linguistics.

Keywords: Adaptive boosting
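
The abstract above describes boosting as a way to combine weak classifiers into one final system. As a rough illustration only, the sketch below runs binary AdaBoost over one-dimensional threshold stumps on toy data. It is not the authors' NER system: the character-based features, the multi-class labeling, and the exact boosting variant they used are not given in this record, and the names `stump`, `train_adaboost`, and `predict` are hypothetical.

```python
import math


def stump(x, threshold, polarity):
    """Weak classifier: returns polarity if x > threshold, otherwise -polarity."""
    return polarity if x > threshold else -polarity


def train_adaboost(xs, ys, rounds=10):
    """Binary AdaBoost over threshold stumps; labels in ys must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    values = sorted(set(xs))
    # Candidate thresholds: one below all points, then midpoints between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                # Weighted error of this stump under the current example weights.
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump(x, t, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity)
        err, t, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
model = train_adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

The toy labels are chosen so that no single stump fits them, which is the situation where a weighted vote of several weak classifiers helps.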

Automatic learning of Chinese English semantic structure mapping

IEEE Spoken Language Technology Workshop

Authors: Pascale Fung; Wu Zhaojun; Yang Yongsheng; Dekai Wu. Affiliations: Human Language Technology Center, Department of Electronic and Computer Engineering; Department of Computer Science and Engineering, University of Science and Technology (HKUST), Hong Kong, China

We present twin results on Chinese semantic parsing, with application to English-Chinese cross-lingual verb frame acquisition. First, we describe two new state-of-the-art Chinese shallow semantic parsers leading to an F-score of 82.01 on simultaneous frame and argument boundary identification and labeling. Subsequently, we propose a model that applies the separate Chinese and English semantic parsers to learn cross-lingual semantic verb frame argument mappings with 89.3% accuracy. The only training data needed by this cross-lingual learning model is a pair of non-parallel monolingual Propbanks, plus an unannotated parallel corpus. We also present the first reported controlled comparison of maximum entropy and SVM approaches to shallow semantic parsing, using the Chinese data.

Keywords: Natural languages; Labeling; Training data; Error correction; US Government; Humans; Computer science; Application software; Entropy; Support vector machines

Inversion transduction grammar coverage of Arabic-English word alignment for tree-structured statistical machine translation

IEEE Spoken Language Technology Workshop

Authors: Dekai Wu; Marine Carpuat; Yihai Shen. Affiliation: HKUST, Department of Computer Science and Engineering, Human Language Technology Center, Hong Kong, China

We present the first known direct measurement of word alignment coverage on an Arabic-English parallel corpus using inversion transduction grammar constraints. While direct measurements have been reported for several European and Asian languages, to date no results have been available for Arabic or any Semitic language despite much recent activity on Arabic-English spoken language and text translation. Many recent syntax-based statistical MT models operate within the domain of ITG expressiveness, often for efficiency reasons, so it has become important to determine the extent to which the ITG constraint assumption holds. Our results on Arabic provide further evidence that ITG expressiveness appears largely sufficient for core MT models.

Keywords: Natural languages; Decoding; Context modeling; Hidden Markov models; Error analysis; Humans; Marine technology; Computer science; Oral communication; Formal languages
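
To make "alignment coverage under ITG constraints" in the abstract above concrete, here is a minimal sketch under a strong simplifying assumption: each sentence pair's alignment is one-to-one and given as a permutation of target positions in source order. Real word alignments also contain unaligned words and many-to-many links, which this sketch does not handle, and the record does not describe the authors' actual measurement procedure. The function names `is_itg_reachable` and `itg_coverage` are hypothetical.

```python
def is_itg_reachable(permutation):
    """True if a one-to-one alignment (a permutation of 1..n) can be built
    using only straight and inverted binary combinations of adjacent blocks."""
    stack = []  # each entry is a (low, high) block of contiguous target positions
    for position in permutation:
        stack.append((position, position))
        # Merge the two topmost blocks while they are adjacent on the target side.
        while len(stack) >= 2:
            (lo1, hi1), (lo2, hi2) = stack[-2], stack[-1]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:  # straight or inverted order
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    return len(stack) == 1


def itg_coverage(permutations):
    """Fraction of alignments that satisfy the ITG constraint (non-empty input)."""
    reachable = sum(1 for p in permutations if is_itg_reachable(p))
    return reachable / len(permutations)


# Toy alignments (made up): the last two are the classic "inside-out" patterns.
examples = [[1, 2, 3, 4], [2, 1, 4, 3], [2, 4, 1, 3], [3, 1, 4, 2]]
print([is_itg_reachable(p) for p in examples])
print(itg_coverage(examples))
```

The single left-to-right pass with greedy merging keeps the check linear in sentence length, which matters when it is run over every sentence pair in a corpus.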

EARSCTS: A Chinese Telephony Conversational Corpus for Speech Processing

8th National Conference on Man-Machine Speech Communication (第八届全国人机语音通讯学术会议)

Authors: Pascale Fung; Christopher Cieri. Affiliations: Human Language Technology Center, Department of Electrical and Electronic Engineering, University of Science and Technology, Hong Kong; Linguistic Data Consortium, University of Pennsylvania, U.S.A.

This paper describes the development, collection, and initial analysis of the EARS (Effective, Affordable, Reusable Speech-to-text) Chinese telephony speech corpus (EARSCTS). The corpus contains 1206 ten-minute natural Mandarin conversations between either strangers or friends, 200 hours in total. The conversations cover 40 topics, and each conversation focuses on a single topic. All the speech data are recorded over public telephone networks, e.g., landline and cellular channels. All the speech data are annotated manually with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech; time alignment information is also provided. This corpus can be used for conversational and spontaneous Mandarin speech recognition and other application-dependent tasks. The EARSCTS corpus is the largest and first of its kind for Mandarin conversational telephony speech, providing sufficient and diversified samples needed for speech training, testing, adaptation and development.

Keywords: EARSCTS: A Chinese Telephony Conversational Corpus for Speech Processing

Translation disambiguation in mixed language queries

Machine Translation, 2004, vol. 18, no. 4, pp. 251-273

Authors: Cheung, Percy; Fung, Pascale. Affiliation: Human Language Technology Center, Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Hong Kong

Code-switching is very common among bilingual speakers. Spoken queries by these speakers are typically in mixed language. In this paper, we propose an unsupervised method for mixed-language query understanding, using only a monolingual corpus and a bilingual dictionary. Secondary-language words mixed in a primary-language query are translated into words in the primary language. We found that using a single disambiguation feature for translation is more effective than using multiple features, provided this feature is based on the most salient seed-word, chosen automatically by confidence scoring. We propose and compare four types of disambiguation features that are based on context seed-words. A baseline method uses the nearest neighboring seed-word as the disambiguation feature. Multiple-context seed-word voting is also proposed in order to enlarge the context window. On the other hand, merely using the inverse distance as weights on context words degrades performance, as it runs counter to the potential underlying syntactic relations between words. Our final proposal is a solution that uses multiple-context seed-words and the translation candidates of all mixed-language words to select a single most salient seed-word for translation disambiguation. The translation disambiguation accuracy for this feature is 83.7% for all words in the ATIS spontaneous speech query database, and 66.7% for content words. © 2005 Springer.

Keywords: Information retrieval

Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus

20th International Conference on Computational Linguistics, COLING 2004

Authors: Fung, Pascale; Cheung, Percy. Affiliation: Human Language Technology Center, Department of Electrical and Electronic Engineering, HKUST, Clear Water Bay, Hong Kong

We propose a completely unsupervised method for mining parallel sentences from quasi-comparable bilingual texts which have very different sizes, and which include both in-topic and off-topic documents. We discuss and analyze different bilingual corpora with various levels of comparability. We propose that while better document matching leads to better parallel sentence extraction, better sentence matching also leads to better document matching. Based on this, we use multi-level bootstrapping to improve the alignments between documents, sentences, and bilingual word pairs, iteratively. Our method is the first method that does not rely on any supervised training data, such as a sentence-aligned corpus, or temporal information, such as the publishing date of a news article. It is validated by experimental results that show a 23% improvement over a method without multilevel bootstrapping. © 2004 COLING 2004 - Proceedings of the 20th International Conference on Computational Linguistics. All rights reserved.

Keywords: Iterative methods

Using n-best lists for named entity recognition from Chinese speech

2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics - Short Papers, HLT-NAACL 2004

Authors: Zhai, Lufeng; Fung, Pascale; Schwartz, Richard; Carpuat, Marine; Wu, Dekai. Affiliations: HKUST Human Language Technology Center, Electrical and Electronic Engineering, University of Science and Technology, Clear Water Bay, Hong Kong; BBN Technologies, 9861 Broken Land Parkway, Columbia, MD 21046, United States; HKUST Human Language Technology Center, Department of Computer Science, University of Science and Technology, Clear Water Bay, Hong Kong

ISBN: (print) 1932432248

We present the first known result for named entity recognition (NER) in realistic large-vocabulary spoken Chinese. We establish this result by applying a maximum entropy model, currently the single best known approach for textual Chinese NER, to the recognition output of the BBN LVCSR system on Chinese Broadcast News utterances. Our results support the claim that transferring NER approaches from text to spoken language is a significantly more difficult task for Chinese than for English. We propose re-segmenting the ASR hypotheses as well as applying post-classification to improve the performance. Finally, we introduce a method of using n-best hypotheses that yields a small but nevertheless useful improvement in NER accuracy. We use acoustic, phonetic, language model, NER and other scores as confidence measures. Experimental results show an average of 6.7% relative improvement in precision and 1.7% relative improvement in F-measure. © HLT-NAACL *** right reserved.

Keywords: Speech recognition

Development of a Chinese telephony conversational corpus for speech processing [speech recognition applications]

International Symposium on Chinese Spoken Language Processing

Authors: Liu Yi; P. Fung; S. Huang; C. Cieri; Z. Lufeng; C. Benfeng. Affiliations: Human Language Technology Center, Department of Electrical and Electronic Engineering, University of Science and Technology (HKUST), Hong Kong, China; Linguistic Data Consortium, University of Pennsylvania, USA

ISBN: (print) 0780386787

This paper describes the development of the EARS (effective, affordable, reusable speech-to-text) Chinese corpus, a telephony conversational speech database for speech processing. The EARS database is the first of its kind collected for Mandarin Chinese telephony spontaneous speech. The purpose of developing this EARS Chinese corpus is to collect Mandarin conversations between either strangers or friends, which cover a wide range of topics, over landline and cellular channels. All the speech data are annotated with standard Chinese character transcription as well as specific mark-ups for spontaneous speech. This corpus will be used for conversational and spontaneous Mandarin speech recognition tasks, under the DARPA EARS framework. This paper introduces the design, development, structure, and initial phonetic analysis of the first 50-hour collection of this corpus. An additional 300 to 500 hours of data will be collected and transcribed between 2004 and 2005.

Keywords: Telephony; Speech processing; Databases; Automatic speech recognition; Ear; Natural languages; Speech recognition; Loudspeakers; Speech synthesis; Microphones

Automatic phone set extension with confidence measure for spontaneous speech

8th European Conference on Speech Communication and Technology, EUROSPEECH 2003

Authors: Liu, Yi; Fung, Pascale. Affiliation: Human Language Technology Center, Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Hong Kong, China

Extending the phone set is one common approach for dealing with phonetic confusions in spontaneous speech. We propose using a likelihood ratio test as a confidence measure for automatic phone set extension to model phonetic confusions. We first extend the standard phone set using dynamic programming (DP) alignment to cover all possible phonetic confusions in training data. The likelihood ratio test is then used as a confidence measure to optimize the extended phonetic units to represent the acoustic samples between two standard phonetic units with high confusability. The optimum set of extended phonetic units is combined with the standard phone set to form a multiple pronunciation dictionary. The effectiveness of this approach is evaluated on spontaneous Mandarin telephony speech. It gives an encouraging 1.09% absolute syllable error rate reduction. Using the extended phone set provides a good balance between the demands of a high-resolution acoustic model and the available training data.

Keywords: Dynamic programming
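
The DP alignment step mentioned in the abstract above can be pictured with a small sketch: align a canonical phone sequence against the phones actually observed, then count which canonical phones were replaced by which surface phones. This assumes unit edit costs and uses made-up phone strings. The record does not state the authors' cost function, and the likelihood ratio test that follows in their method is not shown. The names `align_phones` and `confusion_counts` are hypothetical.

```python
from collections import Counter


def align_phones(canonical, surface):
    """Minimum-edit-distance alignment with unit costs (an assumption).
    Returns (canonical_phone, surface_phone) pairs; None marks a gap."""
    n, m = len(canonical), len(surface)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (canonical[i - 1] != surface[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (canonical[i - 1] != surface[j - 1])):
            pairs.append((canonical[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))  # canonical phone deleted
            i -= 1
        else:
            pairs.append((None, surface[j - 1]))  # extra surface phone inserted
            j -= 1
    pairs.reverse()
    return pairs


def confusion_counts(aligned_pairs):
    """Count substitutions, i.e. aligned pairs whose two phones differ."""
    return Counter((c, s) for c, s in aligned_pairs
                   if c is not None and s is not None and c != s)


# Made-up example: a retroflex initial realised as its non-retroflex counterpart.
pairs = align_phones(["zh", "i", "d", "ao"], ["z", "i", "d", "ao"])
print(pairs)
print(confusion_counts(pairs))
```

Counts gathered this way over a training set would give candidate confusion pairs; deciding which of them deserve an extended phonetic unit is the part the abstract assigns to the confidence measure.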

Triphone model reconstruction for Mandarin pronunciation variations

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: P. Fung; Liu Yi. Affiliation: Human Language Technology Center, Department of Electrical and Electronic Engineering, University of Science and Technology (HKUST), Hong Kong, China

The high recognition error rate in spontaneous speech is due in part to the poor modeling of pronunciations. In this paper, we propose modeling pronunciation variations through triphone model reconstruction. We first generate a partial change phone model (PCPM) to differentiate pronunciation variations. In order to improve the resolution of triphone models, PCPM is used as a hidden model and merged into the pre-trained acoustic model through model reconstruction. To avoid model confusion, auxiliary decision trees are established for triphone PCPM. The acoustic model reconstruction on triphones is equivalent to decision tree merging. The effectiveness of this approach is evaluated on the 1997 Hub4NE Mandarin Broadcast News Corpus (1997 MBN) with different styles of speech. It gives a significant 2.39% absolute syllable error rate reduction in spontaneous speech.

Keywords: Hidden Markov models; Decision trees; Merging; Error analysis; Humans; Natural languages; Speech recognition; Broadcasting; Speech analysis; Computational efficiency
