Search results - Inner Mongolia University Library

Pinyin-indexed method for approximate matching in Chinese

Qinghua Daxue Xuebao/Journal of Tsinghua University, 2009, Vol. 49, Suppl. 1, pp. 1328-1332

Authors: Cao, Jiang; Wu, Xiaojun; Xia, Yunqing; Zheng, Fang. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China

Exact keyword matching is central to popular commercial search engines. A Chinese approximate matching method with an index structure was developed to achieve better retrieval when the input contains errors. Three similarity measures between two Chinese strings were developed, based on the character edit-distance, the Pinyin edit-distance and the Pinyin improved edit-distance. The similarity measures are used to expand the user's query so that the approximate matching task can be represented as several exact matching sub-tasks. The results of these exact matches are merged and sorted by their similarity to the original query. Tests on a webpage text database gave a 50.4% recall rate and a 60.4% precision with the Pinyin improved edit-distance, with a small increase in time and space complexity.

Keywords: Search engines
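
The abstract above names character-level and Pinyin-level edit distances and a query-expansion step, but gives no formulas. The following is a minimal sketch of that general idea, not the authors' method: the toy `PINYIN` table, the length-normalised similarity, the `threshold` value and all function names are assumptions made for illustration, and the "Pinyin improved edit-distance" is not reproduced because the abstract does not define it.

```python
from typing import Dict, List, Sequence

# Hypothetical toy character-to-Pinyin table (assumption); a real system
# would load a full pronunciation lexicon.
PINYIN: Dict[str, str] = {
    "清": "qing", "青": "qing", "华": "hua", "花": "hua",
    "大": "da", "学": "xue",
}


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Standard Levenshtein distance over two sequences of comparable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def to_pinyin(text: str) -> List[str]:
    """Map each character to its syllable; unknown characters map to themselves."""
    return [PINYIN.get(ch, ch) for ch in text]


def char_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the character edit distance (assumed normalisation)."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest


def pinyin_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the edit distance over Pinyin syllables."""
    pa, pb = to_pinyin(a), to_pinyin(b)
    longest = max(len(pa), len(pb)) or 1
    return 1.0 - edit_distance(pa, pb) / longest


def expand_query(query: str, vocabulary: List[str], threshold: float = 0.6) -> List[str]:
    """Return indexed terms similar enough to the query, most similar first."""
    scored = [(pinyin_similarity(query, term), term) for term in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for score, term in scored if score >= threshold]
```

Each term returned by `expand_query` could then be looked up with an ordinary exact-match index, which is how the abstract describes reducing approximate matching to several exact-matching sub-tasks. In this sketch a homophone typo such as "青花" for "清华" scores 0 on characters but 1 on Pinyin.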

Automatic grammar inference based on sentence segmentation for spoken Chinese

Qinghua Daxue Xuebao/Journal of Tsinghua University, 2009, Vol. 49, Suppl. 1, pp. 1322-1327

Authors: Zhang, He; Wu, Xiaojun; Wang, Xiaodong; Zheng, Fang. College of Computer and Information Technology, Henan Normal University, Xinxiang 453007, China; Center for Speech and Language Technologies, Tsinghua National Laboratory for Information Science and Development, Tsinghua University, Beijing 100084, China

The grammar for spoken dialogue systems for information enquiry is often manually designed by experts. An automatic grammar inference method based on sentence segmentation was developed using an enhanced context-free grammar for spoken Chinese. The system parses the training sentences with an initial rule set. If the parsed syntactic tree is incomplete, the top-most constituents are used to recursively infer the missing rules after disambiguation and normalization, and then the rule set is updated. The output grammar is improved by adjusting the processing order of the training sentences to refine the process. Evaluations based on weather forecast enquiries gave a parsing accuracy for the output grammar of 64.8% with an empty initial rule set and 86.4% with an initial rule set including only rules for date descriptions.

Keywords: Context free grammars

A scalable decoder for parsing-based machine translation with equivalent language model state maintenance

2nd Workshop on Syntax and Structure in Statistical Translation, SSST 2008

Authors: Li, Zhifei; Khudanpur, Sanjeev. Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States

ISBN: (print) 9781932432169

We describe a scalable decoder for parsing-based machine translation. The decoder is written in Java and implements all the essential algorithms described in Chiang (2007): chart-parsing, m-gram language model integration, beam- and cube-pruning, and unique k-best extraction. Additionally, parallel and distributed computing techniques are exploited to make it scalable. We also propose an algorithm to maintain equivalent language model states that exploits the back-off property of m-gram language models: instead of maintaining a separate state for each distinguished sequence of "state" words, we merge multiple states that can be made equivalent for language model probability calculations due to back-off. We demonstrate experimentally that our decoder is more than 30 times faster than a baseline decoder written in Python. We propose to release our decoder as an open-source toolkit. © 2008 Association for Computational Linguistics

Keywords: Computational linguistics
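
The state-merging idea in the abstract above can be illustrated with a small sketch. This is not the paper's algorithm, only one simple way to exploit back-off: a right-side state is shortened to the longest suffix that the model can still extend. It assumes a normalised back-off model in which any context that is not the prefix of a stored m-gram has back-off weight 1, so dropping its leading word leaves every future probability unchanged. The names `context_set` and `equivalent_state` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Ngram = Tuple[str, ...]


def context_set(ngrams: Iterable[Ngram]) -> Set[Ngram]:
    """Collect every proper prefix of every stored m-gram.

    These are the only word sequences the model can extend without backing off.
    """
    contexts: Set[Ngram] = set()
    for ngram in ngrams:
        for k in range(1, len(ngram)):
            contexts.add(ngram[:k])
    return contexts


def equivalent_state(history: Ngram, contexts: Set[Ngram], order: int) -> Ngram:
    """Shorten a right-side LM state to its longest suffix that is a known context.

    Assumption: contexts absent from `contexts` carry a back-off weight of 1,
    so the dropped leading words can never affect a later probability.
    """
    state = history[-(order - 1):] if order > 1 else ()
    while state and state not in contexts:
        state = state[1:]  # back off: forget the oldest word
    return state
```

For example, with a trigram model that stores ("the", "cat", "sat") and ("cat", "sat"), the histories ("a", "big", "cat") and ("the", "red", "cat") both reduce to the state ("cat",), so a decoder could merge the two chart items instead of keeping them apart.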

Translating Compounds by Learning Component Gloss Translation Models via Multiple Languages

3rd International Joint Conference on Natural Language Processing, IJCNLP 2008

Authors: Garera, Nikesh; Yarowsky, David. Department of Computer Science, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States

This paper presents an approach to the translation of compound words without the need for bilingual training text, by modeling the mapping of literal component word glosses (e.g. "iron-path") into fluent English (e.g. "railway") across multiple languages. Performance is improved by adding component-sequence and learned-morphology models along with context similarity from monolingual text and optional combination with traditional bilingual-text-based translation discovery. © 2008 IJCNLP 2008 - 3rd International Joint Conference on Natural Language Processing, Proceedings of the Conference. All rights reserved.

Keywords: Modeling languages

Unsupervised translation induction for Chinese abbreviations using monolingual corpora

46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-08: HLT

Authors: Li, Zhifei; Yarowsky, David. Department of Computer Science, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States

ISBN: (print) 9781932432046

Chinese abbreviations are widely used in modern Chinese texts. Compared with English abbreviations (which are mostly acronyms and truncations), the formation of Chinese abbreviations is much more complex. Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated. In this paper, we present a novel unsupervised method that automatically extracts the relation between a full-form phrase and its abbreviation from monolingual corpora, and induces translation entries for the abbreviation by using its full-form as a bridge. Our method does not require any additional annotated data other than the data that a regular translation system uses. We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets. © 2008 Association for Computational Linguistics.

Keywords: Computational linguistics

Large-scale discriminative n-gram language models for statistical machine translation

8th Biennial Conference of the Association for Machine Translation in the Americas, AMTA 2008

Authors: Li, Zhifei; Khudanpur, Sanjeev. Department of Computer Science, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States

We extend discriminative n-gram language modeling techniques originally proposed for automatic speech recognition to a statistical machine translation task. In this context, we propose a novel data selection method that leads to good models using a fraction of the training data. We carry out systematic experiments on several benchmark tests for Chinese to English translation using a hierarchical phrase-based machine translation system, and show that a discriminative language model significantly improves upon a state-of-the-art baseline. The experiments also highlight the benefits of our data selection method.

Keywords: Computational linguistics

Mining and modeling relations between formal and informal Chinese phrases from web corpora

2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Co-located with AMTA 2008 and the International Workshop on Spoken Language Translation

Authors: Li, Zhifei; Yarowsky, David. Department of Computer Science, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States

We present a novel method for discovering and modeling the relationship between informal Chinese expressions (including colloquialisms and instant-messaging slang) and their formal equivalents. Specifically, we propose a bootstrapping procedure to identify a list of candidate informal phrases in web corpora. Given an informal phrase, we retrieve contextual instances from the web using a search engine, generate hypotheses of formal equivalents via this data, and rank the hypotheses using a conditional log-linear model. In the log-linear model, we incorporate as feature functions both rule-based intuitions and data co-occurrence phenomena (either as an explicit or indirect definition, or through formal/informal usages occurring in free variation in a discourse). We test our system on manually collected test examples, and find that the (formal-informal) relationship discovery and extraction process using our method achieves an average 1-best precision of 62%. Given the ubiquity of informal conversational style on the internet, this work has clear applications for text normalization in text-processing systems including machine translation aspiring to broad coverage. © 2008 Association for Computational Linguistics.

Keywords: Search engines
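
The ranking step in the abstract above uses a conditional log-linear model. The sketch below shows only the generic form of such a model, a weighted feature sum passed through a softmax over the candidate set. The function name, the feature names and the weights are hypothetical, and nothing here reproduces the paper's actual features or training procedure.

```python
import math
from typing import Dict, List, Tuple


def rank_hypotheses(
    features: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank candidates by P(h | x) = exp(w . f(h, x)) / sum_h' exp(w . f(h', x)).

    `features` maps each candidate to its feature values; `weights` maps
    feature names to learned weights (both are illustrative assumptions).
    """
    if not features:
        return []
    # Linear score for each candidate; features without a weight contribute 0.
    scores = {
        hyp: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for hyp, feats in features.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {hyp: math.exp(score - top) for hyp, score in scores.items()}
    total = sum(exps.values())
    probs = {hyp: value / total for hyp, value in exps.items()}
    # Highest probability first; ties broken alphabetically for determinism.
    return sorted(probs.items(), key=lambda item: (-item[1], item[0]))
```

A caller would build one feature dictionary per hypothesised formal equivalent, for instance a co-occurrence count and a rule-match indicator, and take the first element of the returned list as the 1-best answer.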

Minimally Supervised Multilingual Taxonomy and Translation Lexicon Induction

3rd International Joint Conference on Natural Language Processing, IJCNLP 2008

Authors: Garera, Nikesh; Yarowsky, David. Department of Computer Science, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States

We present a novel algorithm for the acquisition of multilingual lexical taxonomies (including hyponymy/hypernymy, meronymy and taxonomic cousinhood) from monolingual corpora with minimal supervision in the form of seed exemplars, using discriminative learning across the major WordNet semantic relationships. This capability is also extended robustly and effectively to a second language (Hindi) via cross-language projection of the various seed exemplars. We also present a novel model of translation dictionary induction via multilingual transitive models of hypernymy and hyponymy, using these induced taxonomies. Candidate lexical translation probabilities are based on the probability that their induced hyponyms and/or hypernyms are translations of one another. We evaluate all of the above models on English and Hindi. © 2008 IJCNLP 2008 - 3rd International Joint Conference on Natural Language Processing, Proceedings of the Conference. All rights reserved.

Keywords: Taxonomies

A new approach to query expansion in information retrieval

High Technology Letters, 2008, Vol. 14, No. 1, pp. 77-80

Authors: Li Weijiang (李卫疆); Zhao Tiejun; Wang Xian'gang. MOE-MS Key Laboratory of Natural Language Processing and Speech, School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, P.R. China

To eliminate the mismatch between the words of relevant documents and the user's query, and the serious negative effects it has on the performance of information retrieval, a method of query expansion on the basis of a new term co-occurrence representation was put forward by analyzing the process of *** expansion terms were selected according to their correlation to the whole *** the same time, the position information between terms was *** experimental result on the Text REtrieval Conference (TREC) data collection shows that the method proposed in the paper consistently achieves an improvement of 5% to 19% over the language modeling method without *** to the popular approach of query expansion, pseudo feedback, the precision of the proposed method is competitive.

Keywords: information retrieval; language model; query expansion

Exploiting prosodic breaks in language modeling with random forests

4th International Conference on Speech Prosody 2008, SP 2008

Authors: Su, Yi; Jelinek, Frederick. Center for Language and Speech Processing, Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD, United States

ISBN: (print) 9780616220030

We propose a novel method of exploiting prosodic breaks in language modeling for automatic speech recognition (ASR) based on the random forest language model (RFLM), which is a collection of randomized decision tree language models and can potentially ask any questions about the history in order to predict the future. We demonstrate how questions about prosodic breaks can be easily incorporated into the RFLM and present two language models which treat prosodic breaks as observable and hidden variables, respectively. Meanwhile, we show empirically that a finer-grained prosodic break is needed for language modeling. Experimental results showed that given prosodic breaks, we were able to reduce the LM perplexity by a significant margin, suggesting a prosodic N-best rescoring approach for ASR.

Keywords: Decision trees
