The exact matching of keywords is key to popular commercial search engines. A Chinese approximate matching method with an index structure was developed to achieve better retrieval when the input contains errors. Three types of similarity measurement between two Chinese strings were developed, based on the character edit-distance, the Pinyin edit-distance and the Pinyin improved edit-distance. The similarity measurements were used to expand the user's query so that the approximate matching task can be represented as several exact matching sub-tasks. The results of these exact matchings are merged and sorted by their similarity to the original query. Tests on a webpage text database with the Pinyin improved edit-distance gave a 50.4% recall rate and 60.4% precision, with a small increase in time and space complexity.
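As a rough illustration of the first two similarity measures and of query expansion, the sketch below computes a character-level and a syllable-level Pinyin edit distance and keeps the vocabulary terms that are close to the query. It is not the method from the paper: the normalization `1 - d / max(len)`, the `TOY_PINYIN` table, the function names and the threshold are all assumptions made for the example, and the "Pinyin improved edit-distance" is not attempted because the abstract does not define it.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


# Hypothetical toy table (toneless); a real system would need a full
# character-to-Pinyin dictionary.
TOY_PINYIN = {"清": "qing", "青": "qing", "华": "hua", "花": "hua",
              "大": "da", "学": "xue"}


def to_pinyin(text):
    """Map each character to a syllable; unknown characters map to themselves."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def char_similarity(a, b):
    return similarity(a, b)


def pinyin_similarity(a, b):
    return similarity(to_pinyin(a), to_pinyin(b))


def expand_query(query, vocabulary, threshold=0.75):
    """Return vocabulary terms whose Pinyin similarity to the query meets
    the threshold, most similar first. Each returned term can then be
    looked up with ordinary exact matching."""
    scored = [(pinyin_similarity(query, term), term) for term in vocabulary]
    scored.sort(key=lambda pair: -pair[0])
    return [term for score, term in scored if score >= threshold]


vocabulary = ["清华大学", "青花大学", "北京大学"]
candidates = expand_query("青花大学", vocabulary)
```

Here a query typed with homophone errors ("青花大学") differs from "清华大学" in two of four characters, yet the two strings share the same toneless syllables, so the Pinyin measure still treats them as a match.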
The grammar for spoken dialogue systems for information enquiry is often manually designed by experts. An automatic grammar inference method based on sentence segmentation was developed using an enhanced context-free grammar for spoken Chinese. The system parses the training sentences with an initial rule set. If the parsed syntactic tree is incomplete, the top-most constituents are used to recursively infer the missing rules after disambiguation and normalization, and then the rule set is updated. The output grammar is improved by adjusting the processing order of the training sentences to refine the process. Evaluations based on weather forecast enquiries gave a parsing accuracy for the output grammar of 64.8% with an empty initial rule set and 86.4% with an initial rule set including only rules for date descriptions.
We describe a scalable decoder for parsing-based machine translation. The decoder is written in JAVA and implements all the essential algorithms described in Chiang (2007): chart-parsing, m-gram language model integra...
This paper presents an approach to the translation of compound words without the need for bilingual training text, by modeling the mapping of literal component word glosses (e.g. "iron-path") into fluent Eng...
Chinese abbreviations are widely used in modern Chinese texts. Compared with English abbreviations (which are mostly acronyms and truncations), the formation of Chinese abbreviations is much more complex. Due to the r...
We extend discriminative n-gram language modeling techniques originally proposed for automatic speech recognition to a statistical machine translation task. In this context, we propose a novel data selection method that leads to good models using a fraction of the training data. We carry out systematic experiments on several benchmark tests for Chinese to English translation using a hierarchical phrase-based machine translation system, and show that a discriminative language model significantly improves upon a state-of-the-art baseline. The experiments also highlight the benefits of our data selection method.
We present a novel method for discovering and modeling the relationship between informal Chinese expressions (including colloquialisms and instant-messaging slang) and their formal equivalents. Specifically, we propos...
We present a novel algorithm for the acquisition of multilingual lexical taxonomies (including hyponymy/hypernymy, meronymy and taxonomic cousinhood), from monolingual corpora with minimal supervision in the form of s...
To eliminate the mismatch between the words of relevant documents and the user's query, and its serious negative effects on information retrieval performance, a query expansion method based on a new term co-occurrence representation was put forward by analyzing the process of *** expansion terms were selected according to their correlation to the whole *** the same time, the position information between terms was *** experimental result on the Text REtrieval Conference (TREC) data collection shows that the proposed method consistently achieves an improvement of 5% to 19% over the language modeling method without *** to the popular approach of query expansion, pseudo feedback, the precision of the proposed method is competitive.
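The general idea of selecting expansion terms by their co-occurrence with the whole query, while taking term positions into account, might be sketched as follows. The abstract does not give the actual representation or scoring formula, so everything here is an assumption for illustration: the window size, the `1 / distance` weighting, the function names and the toy documents.

```python
from collections import defaultdict


def cooccurrence_scores(docs, query_terms, window=3):
    """Score each non-query term by how often, and how closely, it occurs
    near any query term. Closer occurrences count more (assumed
    1 / distance weighting), and scores are summed over all query terms
    so that a candidate is judged against the whole query."""
    query = set(query_terms)
    scores = defaultdict(float)
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok not in query:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                other = tokens[j]
                if j != i and other not in query:
                    scores[other] += 1.0 / abs(i - j)
    return dict(scores)


def expand(docs, query_terms, k=2, window=3):
    """Append the k best-scoring candidate terms to the original query.
    Ties are broken alphabetically to keep the output deterministic."""
    scores = cooccurrence_scores(docs, query_terms, window)
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return list(query_terms) + [term for term, _ in ranked[:k]]


docs = [
    ["apple", "pie", "recipe"],
    ["apple", "tart", "recipe", "easy"],
]
expanded = expand(docs, ["apple", "recipe"], k=2)
```

In this toy collection "pie" and "tart" each sit directly next to both query terms, so they outrank "easy", which is adjacent to only one of them.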
ISBN: (print) 9780616220030
We propose a novel method of exploiting prosodic breaks in language modeling for automatic speech recognition (ASR) based on the random forest language model (RFLM), which is a collection of randomized decision tree language models and can potentially ask any questions about the history in order to predict the future. We demonstrate how questions about prosodic breaks can be easily incorporated into the RFLM and present two language models which treat prosodic breaks as observable and hidden variables, respectively. Meanwhile, we show empirically that a finer grained prosodic break is needed for language modeling. Experimental results showed that given prosodic breaks, we were able to reduce the LM perplexity by a significant margin, suggesting a prosodic N-best rescoring approach for ASR.