At CLEF 2009, JHU submitted runs in the ad hoc track for the monolingual Persian evaluation. Variants of character n-gram tokenization provided a 10% relative gain over unnormalized words. A run based on skip n-grams, which allow internal skipped letters, achieved a mean average precision of 0.4938. Traditional 5-grams scored 0.4868, while plain words scored 0.4463.
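The two tokenizations named in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact skip n-gram definition used in the runs, so the version below is an assumption: each window spans n + 1 characters and exactly one interior character is replaced by a placeholder. The function names and the `_` marker are hypothetical.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of text, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def skip_ngrams(text: str, n: int, marker: str = "_") -> list[str]:
    """Return skip n-grams with exactly one interior letter skipped.

    Assumed definition (not taken from the abstract): slide a window of
    n + 1 characters and, for each non-edge position in the window,
    emit the window with that character replaced by `marker`.
    """
    if n < 2:
        raise ValueError("n must be at least 2 so a window has an interior")
    width = n + 1
    grams = []
    for i in range(len(text) - width + 1):
        window = text[i:i + width]
        for j in range(1, width - 1):  # never skip the first or last letter
            grams.append(window[:j] + marker + window[j + 1:])
    return grams


if __name__ == "__main__":
    print(char_ngrams("retrieval", 5))
    print(skip_ngrams("retrieval", 4))
```

Text shorter than the window yields an empty list from both functions, so short words would need separate handling (for example, indexing them whole) in a real indexer.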
ISBN: (print) 9789868473553
Several dictionaries have been developed in a form-based electronic format. Because these are designed only for use as manuscript dictionaries, their use in other ways is limited, for example as an electronic dictionary with word associations, translation, question answering, content retrieval, and so on. A content-based dictionary is more flexible for such applications. In this paper, we describe a methodology for transforming a form-based structure into a content-based structure in a Pali-Thai dictionary. This dictionary aims to serve as infrastructure for many applications in the Buddhism domain.
We describe Joshua (Li et al., 2009a), an open source toolkit for statistical machine translation. Joshua implements all of the algorithms required for translation via synchronous context free grammars (SCFGs): chart...
Current information extraction systems can do a good job of discovering entities, relations and events in natural language text. The traditional output of such systems is XML, with the ACE Pilot Format (APF) schema as a common target. We are developing a system that will take the output of an information extraction system as APF documents and directly populate a knowledge base with the information extracted. We report on an initial OWL ontology that covers the APF schema, a simple program to convert a set of APF documents to RDF data, and a demonstration system built with Exhibit to view the results.
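The conversion step described above might look roughly like the following sketch, which is not the authors' program. It assumes a simplified APF-like layout (`entity` elements with `ID` and `TYPE` attributes, and mention text under `entity_mention/head/charseq`), a hypothetical namespace `http://example.org/apf#`, and a hypothetical `mentionHead` property; the real ontology and schema coverage are not given in the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in a simplified APF-like layout (an assumption).
SAMPLE_APF = """<source_file>
  <document DOCID="doc1">
    <entity ID="doc1-E1" TYPE="PER">
      <entity_mention ID="doc1-E1-1">
        <head><charseq START="0" END="9">John Smith</charseq></head>
      </entity_mention>
    </entity>
  </document>
</source_file>"""

BASE = "http://example.org/apf#"  # hypothetical ontology namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def apf_to_ntriples(xml_text: str) -> list[str]:
    """Emit one rdf:type triple per entity and one triple per mention head."""
    root = ET.fromstring(xml_text)
    triples = []
    for entity in root.iter("entity"):
        subject = f"<{BASE}{entity.get('ID')}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{BASE}{entity.get('TYPE')}> .")
        for mention in entity.iter("entity_mention"):
            head = mention.find("head/charseq")
            if head is not None and head.text:
                literal = escape_literal(head.text.strip())
                triples.append(f'{subject} <{BASE}mentionHead> "{literal}" .')
    return triples


if __name__ == "__main__":
    for line in apf_to_ntriples(SAMPLE_APF):
        print(line)
```

Writing N-Triples by hand keeps the sketch dependency-free; a real pipeline would more likely use an RDF library and map relations and events as well as entities.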
We describe Joshua, an open source toolkit for statistical machine translation. Joshua implements all of the algorithms required for synchronous context free grammars (SCFGs): chart-parsing, ngram language model integ...
This paper presents six novel approaches to biographic fact extraction that model structural, transitive and latent properties of biographical data. The ensemble of these proposed models substantially outperforms stan...
This paper presents and evaluates several original techniques for the latent classification of biographic attributes such as gender, age and native language, in diverse genres (conversation transcripts, email) and lan...
Automatic Machine Translation (MT) evaluation metrics have traditionally been evaluated by the correlation of the scores they assign to MT output with human judgments of translation performance. Different types of hum...
In this paper, we describe and evaluate a bigram part-of-speech (POS) tagger that uses latent annotations and then investigate using additional genre-matched unlabeled data for self-training the tagger. The use of lat...
We present a scalable joint language model designed to utilize fine-grain syntactic tags. We discuss challenges such a design faces and describe our solutions that scale well to large tagsets and corpora. We advocate ...