Target phrase selection, a crucial component of the state-of-the-art phrase-based statistical machine translation (PBSMT) model, plays a key role in generating accurate translation hypotheses. Inspired by context-rich word-sense disambiguation techniques, machine translation (MT) researchers have successfully integrated various types of source-language context into the PBSMT model to improve target phrase selection. Among the various types of lexical and syntactic features, lexical syntactic descriptions in the form of supertags, which preserve long-range word-to-word dependencies in a sentence, have proven to be effective. These rich contextual features can disambiguate a source phrase on the basis of the local syntactic behaviour of that phrase. In addition to local contextual information, global contextual information such as the grammatical structure of a sentence, sentence length and n-gram word sequences could provide additional important information to enhance this phrase-sense disambiguation. In this work, we explore various sentence-similarity features by measuring the similarity between a source sentence to be translated and the source side of the bilingual training sentences, and integrate them directly into the PBSMT model. We performed experiments on an English-to-Chinese translation task by applying sentence-similarity features both individually and in combination with supertag-based features. We evaluate the performance of our approach and report a statistically significant relative improvement of 5.25% in BLEU score when adding a sentence-similarity feature together with a supertag-based feature.
We use web-scale N-grams in a base NP parser that correctly analyzes 95.4% of the base NPs in natural text. Web-scale data improves performance. That is, there is no data like more data. Performance scales log-linearly with the number of parameters in the model (the number of unique N-grams). The web-scale N-grams are particularly helpful in harder cases, such as NPs that contain conjunctions.
MLP based front-ends have shown significant complementary properties to conventional spectral features. As part of the DARPA GALE program, different MLP features were developed for Mandarin ASR. In this paper, all the...
ISBN: 9783642044465 (print)
For CLEF 2008, JHU conducted monolingual and bilingual experiments in the ad hoc TEL and Persian tasks. Additionally, we performed several post hoc experiments using previous CLEF ad hoc test sets in 13 languages. In all three tasks we explored alternative methods of tokenizing documents, including plain words, stemmed words, automatically induced segments, a single selected n-gram from each word, and all n-grams from words (i.e., traditional character n-grams). Character n-grams demonstrated consistent gains over ordinary words in each of these three diverse sets of experiments. Measured by mean average precision, we observed relative gains of 50-200% on the TEL task, 5% on the Persian task, and 18% averaged over 13 languages from past CLEF evaluations.
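As an illustration of the "all n-grams from words" tokenization mentioned above, the following is a minimal sketch, not the system used in the experiments. The function name `char_ngrams`, the lowercasing, the whitespace splitting, and the choice to keep short words whole are all assumptions made for this example.

```python
def char_ngrams(text, n=5):
    """Return within-word character n-grams for each whitespace-separated word.

    Hypothetical helper for illustration only: words no longer than n
    are kept whole (an assumption, not a detail taken from the abstract).
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            # Short words cannot yield a full n-gram, so index them as-is.
            grams.append(word)
        else:
            # Slide a window of width n across the word.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


if __name__ == "__main__":
    print(char_ngrams("Persian tasks", n=5))
```

Each overlapping n-gram would then be indexed as a term in place of the original word, so that morphological variants of a word share some indexing terms.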
At CLEF 2009, JHU submitted runs in the ad hoc track for the monolingual Persian evaluation. Variants of character n-gram tokenization provided a 10% relative gain over unnormalized words. A run based on skip n-grams, which allow internal skipped letters, achieved a mean average precision of 0.4938. Using traditional 5-grams resulted in a score of 0.4868, while plain words had a score of 0.4463.
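One way to picture skip n-grams with internal skipped letters is sketched below. This is only an assumed reading of the idea: the abstract does not specify how many letters may be skipped or how skipped positions are represented, so the function `skip_ngrams` and its single-skip rule are hypothetical.

```python
def skip_ngrams(word, n=4):
    """Return n-grams formed by dropping one internal letter from each
    window of n + 1 consecutive characters.

    Hypothetical helper: the single-skip rule and the omission of a
    placeholder symbol for the skipped letter are assumptions.
    """
    grams = []
    for start in range(len(word) - n):
        window = word[start:start + n + 1]
        # Skip only internal positions, never the first or last letter.
        for skip in range(1, n):
            grams.append(window[:skip] + window[skip + 1:])
    return grams


if __name__ == "__main__":
    print(skip_ngrams("words", n=3))
```

Under this reading, words that differ by a single internal letter can still share terms, which is one plausible way such tokens could help with spelling variation.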
ISBN: 9789868473553 (print)
Several dictionaries have been developed in a form-based electronic format. Because these are designed to be used only as manuscript dictionaries, their usefulness for other purposes is limited, such as electronic dictionaries with word associations, translation, question answering, and content retrieval. A content-based dictionary is more flexible for such applications. In this paper, we describe a methodology for transforming a form-based structure into a content-based structure for a Pali-Thai dictionary. This dictionary aims to serve as infrastructure for many applications in the Buddhism domain.
We describe Joshua (Li et al., 2009a), an open source toolkit for statistical machine translation. Joshua implements all of the algorithms required for translation via synchronous context free grammars (SCFGs): chart...
Current information extraction systems can do a good job of discovering entities, relations and events in natural language text. The traditional output of such systems is XML, with the ACE Pilot Format (APF) schema as a common target. We are developing a system that will take the output of an information extraction system as APF documents and directly populate a knowledge base with the information extracted. We report on an initial OWL ontology that covers the APF schema, a simple program to convert a set of APF documents to RDF data, and a demonstration system built with Exhibit to view the results.
We describe Joshua, an open source toolkit for statistical machine translation. Joshua implements all of the algorithms required for synchronous context free grammars (SCFGs): chart-parsing, ngram language model integ...
This paper presents six novel approaches to biographic fact extraction that model structural, transitive and latent properties of biographical data. The ensemble of these proposed models substantially outperforms stan...