We describe a set of techniques for Arabic cross-document coreference resolution. We compare a baseline system of exact mention string-matching to ones that include local mention context information as well as informa...
Morpho Challenge 2008 hosted an extrinsic evaluation of morphological analysis that explored whether unsupervised morphology induction could benefit information retrieval. This paper presents results for alternative methods of word normalization using test sets from the Cross-Language Evaluation Forum (CLEF) ad hoc collections. Preliminary results for the Morpho Challenge 2008 evaluation are consistent with these data. We found that: (1) rule-based stemming is effective in less morphologically complicated languages; (2) alternative methods for stemming, such as unsupervised learning of morphemes and least common n-gram stemming, are helpful; and (3) full character n-gram indexing is the most effective form of tokenization in more morphologically complex languages.
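As an illustration of the character n-gram tokenization mentioned in finding (3), here is a minimal sketch. It is not the paper's implementation: the function name `char_ngrams`, the default n-gram length of 4, the lowercasing, and the choice to keep n-grams within word boundaries are all assumptions made for this example.

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams, word by word.

    Assumptions (not from the abstract): input is lowercased, words are
    separated by whitespace, n-grams do not span word boundaries, and
    words no longer than n characters are kept whole.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            # Short words cannot be split further; index them as-is.
            grams.append(word)
        else:
            # Slide a window of width n across the word, one character at a time.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


if __name__ == "__main__":
    print(char_ngrams("Morphology matters"))
```

Because every inflected form of a word shares most of its n-grams with its other forms, indexing these fragments gives a form of normalization without any language-specific stemming rules.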
This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in...
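The abstract is truncated here, so the following is only a rough single-machine sketch of one way pairwise inner products can be decomposed term by term in a MapReduce style. It is not the paper's algorithm: the function names `build_postings` and `pairwise_similarity`, the dictionary input format, and the in-memory simulation of the map and reduce steps are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing step: group (doc_id, weight) pairs by term.

    `docs` is a hypothetical input format: {doc_id: {term: weight}}.
    """
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def pairwise_similarity(docs):
    """Sum per-term partial products into one inner product per document pair."""
    sims = defaultdict(float)
    for plist in build_postings(docs).values():
        # "Map": each pair of documents sharing this term yields one partial product.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            # "Reduce": partial products with the same pair key are summed.
            sims[(d1, d2)] += w1 * w2
    return dict(sims)


if __name__ == "__main__":
    example = {
        "a": {"x": 1.0, "y": 2.0},
        "b": {"x": 3.0},
        "c": {"y": 0.5, "z": 1.0},
    }
    print(pairwise_similarity(example))
```

Only document pairs that share at least one term ever produce a key, so pairs with zero similarity cost nothing in this sketch.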
Knowing the degree of antonymy between words has widespread applications in natural language processing. Manually-created lexicons have limited coverage and do not include most semantically contrasting word pairs. We present a new automatic and empirical measure of antonymy that combines corpus statistics with the structure of a published thesaurus. The approach is evaluated on a set of closest-opposite questions, obtaining a precision of over 80%. Along the way, we discuss what humans consider antonymous and how antonymy manifests itself in utterances.
This paper describes a computational approach to resolving the true referent of a named mention of a person in the body of an email. A generative model of mention generation is used to guide mention resolution. Result...
Rapid and inexpensive techniques for automatic transcription of speech have the potential to dramatically expand the types of content to which information retrieval techniques can be productively applied, but limitati...
This paper proposes a novel approach called noisecluster HMM interpolation for robust speech *** approach helps alleviate the problem of speech recognition under noisy environments not trained in the *** this method, a new HMM is interpolated from existing noisy-speech HMMs that are best matched to the input *** process is performed on the fly with an acceptable delay time and, hence, there is no need to prepare and store the final model in *** weights among HMMs can be determined by either a direct or a tree-structured *** focusing on speech in unseen noisy environments, the proposed method clearly outperforms a baseline system whose acoustic model for such an unseen environment is selected from a tree structure.
This paper describes an accurate feature representation for continuous clean speech recognition. The main components of the technique involve performing a moderate-order Linear Predictive (LP) analysis and computing the Minimum Variance Distortionless Response (MVDR) spectrum from these LP coefficients. This feature representation, PMCCs, was earlier shown to yield superior performance over MFCCs for different noise conditions, with emphasis on car noise [1]. The performance improvement was then attributed to the better spectrum and envelope modeling properties of the MVDR methodology. This study shows that the representation is also quite efficient for clean speech recognition. In fact, PMCCs are shown to be a more accurate envelope representation and to reduce speaker variability. This, in turn, yields a 12.8% relative word error rate (WER) reduction on the combination of the Wall Street Journal (WSJ) Nov'92 dev/eval sets with respect to MFCCs. Accurate envelope modeling and the reduction in speaker variability also lead to faster decoding, based on efficient pruning in the search stage. The total gain in decoding speed is 22.4% relative to the standard MFCC features. It is also shown that PMCCs are not very demanding in terms of computation when compared to MFCCs. Therefore, we conclude that the PMCC feature extraction scheme is a better representation of clean speech, as well as noisy speech, than the MFCC scheme.