ISBN: (print) 0780379020
This paper presents a system which extracts word-based bi-gram and n-gram collocation information from a 60MB corpus and then locates bi-gram pairs using Strength and Spread as defined in the Xtract system. In order for Xtract to work effectively with Chinese, we have re-adjusted its parameters. To obtain a higher recall rate, we have modified the algorithm to identify collocations with a low frequency of occurrence, a method which works particularly well in the case of bi-grams in which one word is high-frequency and the other low-frequency. In preliminary experiments, our system extracts bi-gram collocations with a precision of 61%, an 8% improvement over the direct use of Smadja's Xtract on Chinese. In addition, we have improved the recall rate by 4.5% while extracting multi-word collocations with 92% precision.
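The Strength and Spread measures referred to above follow Smadja's Xtract definitions: Strength is a z-score of a pair's frequency against all pairs sharing the same headword, and Spread is the variance of the pair's relative-position histogram. A minimal sketch (function names and counts are illustrative, not from the paper):

```python
def strength(freq, avg_freq, std_freq):
    """Strength: z-score of one word pair's frequency against the
    mean/std of all pairs sharing the same headword."""
    return (freq - avg_freq) / std_freq

def spread(positional_freqs):
    """Spread: variance of co-occurrence counts over relative
    positions (e.g. offsets -5..+5 around the headword); a peaked
    histogram signals a rigid collocation."""
    mean = sum(positional_freqs) / len(positional_freqs)
    return sum((f - mean) ** 2 for f in positional_freqs) / len(positional_freqs)

# hypothetical position histograms for two candidate pairs
peaked = [0, 1, 0, 42, 2, 0, 1, 0, 0, 0]   # co-occurs at one fixed offset
flat = [5, 4, 5, 4, 5, 4, 5, 4, 5, 4]      # no preferred offset
assert spread(peaked) > spread(flat)
```

Candidates pass if Strength exceeds a threshold and Spread indicates a peaked distribution; the paper's Chinese-specific threshold adjustments are not reproduced here.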
In this paper, various ways of integrating Chinese word segmentation and part-of-speech tagging, including so-called true integration and pseudo-integration, are tested and compared on a test corpus consisting of 367,114 Chinese characters. A novel true-integration approach, named 'the divide-and-conquer integration', is proposed. The experiments show that this true integration achieves 98.72% accuracy in word segmentation, 95.65% accuracy in part-of-speech tagging, and 94.43% accuracy in combined word segmentation and part-of-speech tagging, outperforming all other combinations to some extent (though not very significantly). The results demonstrate the potential for further improving the performance of Chinese word segmentation and part-of-speech tagging.
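The three accuracy figures above can be computed by aligning predicted (word, tag) tokens with the gold standard on character spans: a word is counted correct if its span matches, and jointly correct if its tag matches too. A toy scorer using this standard span-matching scheme (the paper's exact metric may differ):

```python
def spans(tokens):
    """Map a list of (word, tag) pairs to {(start, end): tag} spans."""
    out, i = {}, 0
    for word, tag in tokens:
        out[(i, i + len(word))] = tag
        i += len(word)
    return out

def scores(gold, pred):
    """Return (segmentation accuracy, joint seg+POS accuracy),
    both measured against the number of gold words."""
    g, p = spans(gold), spans(pred)
    shared = g.keys() & p.keys()
    joint = sum(1 for s in shared if g[s] == p[s])
    return len(shared) / len(g), joint / len(g)

gold = [("研究", "v"), ("生命", "n")]
pred = [("研究", "n"), ("生命", "n")]   # right segmentation, one wrong tag
assert scores(gold, pred) == (1.0, 0.5)
```

Joint accuracy is necessarily no higher than segmentation accuracy, which matches the ordering of the reported 98.72% / 94.43% figures.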
This paper presents an automatic Chinese collocation extraction system using lexical statistics and syntactic information. The system extracts collocations from a manually segmented and tagged Chinese news corpus in three stages. First, bi-directional bi-gram statistical measures, including bi-directional strength and spread and the χ² test value, are employed to extract candidate two-word collocations. These candidate word pairs are then used to extract high-frequency multi-word collocations from their contexts. In the third stage, precision is further improved by using syntactic knowledge of collocation patterns between content words to eliminate pseudo collocations. In a preliminary experiment on 30 selected headwords, this three-stage system achieves a 73% precision rate, a substantial improvement on the 61% achieved by an algorithm we developed earlier based on an improved version of Smadja's Xtract system (itself 53% accurate).
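The χ² score in the first stage can be computed from a 2×2 contingency table of pair and marginal counts; a minimal sketch (counts in the example are hypothetical, not the paper's data):

```python
def chi_square(c_ab, c_a, c_b, n):
    """Pearson chi-square for a word pair (a, b).
    c_ab: pair count; c_a, c_b: marginal word counts; n: total bigrams."""
    o11 = c_ab                    # a followed by b
    o12 = c_a - c_ab              # a followed by anything but b
    o21 = c_b - c_ab              # anything but a, followed by b
    o22 = n - c_a - c_b + c_ab    # neither a nor b
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = c_a * c_b * (n - c_a) * (n - c_b)
    return num / den

# a strongly associated pair scores far above a chance pairing
assert chi_square(30, 40, 50, 10000) > chi_square(2, 400, 500, 10000)
```

High-scoring pairs survive to the multi-word and syntactic-filtering stages described above.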
Robust estimation of a large number of parameters against the limited availability of training data is a crucial issue in triphone-based continuous speech recognition. To cope with this issue, two major context-clustering methods, agglomerative (AGG) and tree-based (TB), have been widely used. In this paper, we analyze the two algorithms with respect to their advantages and disadvantages and introduce a novel combined method that takes advantage of each to cluster and tie similar acoustic states for highly detailed acoustic modeling. In addition, we devise a two-level clustering approach for TB, which uses tree-based state tying for rare acoustic-phonetic events. In LVCSR experiments, the results showed that performance could be greatly improved by using the proposed combined method, compared with using the popular TB method alone.
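The core idea of agglomerative state clustering is to merge acoustically similar under-trained states until every tied state has enough data. A toy one-dimensional sketch (the real methods cluster Gaussian HMM states under likelihood criteria; this greedy mean-distance version is illustrative only):

```python
def tie_states(states, min_count):
    """states: list of (mean, frame_count). Repeatedly merge each
    under-trained state with its acoustically nearest neighbour until
    every tied state has at least min_count training frames."""
    states = [list(s) for s in states]
    while any(c < min_count for _, c in states) and len(states) > 1:
        i = min(range(len(states)), key=lambda k: states[k][1])   # rarest state
        j = min((k for k in range(len(states)) if k != i),        # nearest mean
                key=lambda k: abs(states[k][0] - states[i][0]))
        (m1, c1), (m2, c2) = states[i], states[j]
        merged = [(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]       # count-weighted mean
        states = [s for k, s in enumerate(states) if k not in (i, j)]
        states.append(merged)
    return states

# two rare, similar states get tied; the well-trained one is untouched
tied = tie_states([(1.0, 5), (1.2, 5), (9.0, 100)], min_count=8)
assert len(tied) == 2
```

Tree-based clustering instead splits states top-down with phonetic questions, which is why it can also assign unseen (rare) triphones to a leaf, the property the two-level approach above exploits.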
In supervised methods of word sense disambiguation, sense-tagged samples are required for training classifiers. Since sense tagging for ambiguous words is expensive and labor-intensive, it is worth looking for a reasonable substitute. This paper suggests such a substitute for Chinese sense-tagged data, called pseudo training data. The suggestion is based on a linguistic phenomenon in Chinese whereby some multi-character words inherit only one sense, and some syntactic features, from ambiguous words. Samples derived from unambiguous multi-character words are employed as pseudo training data. Pseudo training data have the advantage that they can be collected automatically. Our experiments show that classifiers trained on a moderate amount of pseudo training data outperform classifiers trained on small quantities of sense-tagged samples for ambiguous Chinese word senses; further experiments show that combining a small set of tagged data with a large quantity of pseudo training data is an even more promising approach to word sense disambiguation.
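The collection step can be pictured as follows: sentences containing an unambiguous multi-character carrier word are relabelled with the single sense that word inherits from the ambiguous target. A toy sketch (the carrier-to-sense mapping below is invented for illustration and is not from the paper):

```python
# Hypothetical mapping: unambiguous multi-character words assumed to
# inherit exactly one sense of an ambiguous target word.
SENSE_CARRIERS = {"打击": "hit", "打电话": "call"}   # illustrative only

def pseudo_examples(sentences, target="打"):
    """Turn sentences containing an unambiguous carrier word into
    automatically sense-labelled contexts for the ambiguous target."""
    data = []
    for s in sentences:
        for carrier, sense in SENSE_CARRIERS.items():
            if carrier in s:
                # substitute the bare ambiguous word, keeping the
                # surrounding context as training features
                data.append((s.replace(carrier, target), sense))
    return data

assert pseudo_examples(["他打电话给我"]) == [("他打给我", "call")]
```

The resulting (context, sense) pairs can then feed any supervised classifier, with no manual tagging in the loop.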
Spontaneous speech includes a broad range of linguistic phenomena characteristic of spoken language, and therefore a statistical approach should be effective for robust parsing of spoken language. Although a large-scale syntactically annotated corpus is required for stochastic parsing, its construction demands a great deal of human effort. This paper proposes a method of efficiently constructing a spoken-language corpus annotated with dependency analyses. The method uses an existing spoken-language corpus. A stochastic dependency parser is employed to tag spoken-language sentences with dependency structures, and the results are corrected manually. The tagged corpus is constructed in a spiral fashion, wherein the corrected data is utilized as statistical information for the automatic parsing of further data. Using this spiral approach reduces parsing errors, which in turn allows us to reduce the correction effort. An experiment using 10,995 Japanese utterances shows the spiral approach to be effective for efficient corpus construction.
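The spiral loop above is: parse a batch with the current model, correct it by hand, fold the corrections back into the training data, retrain, and move to the next batch. A minimal sketch with the parse/correct/retrain steps passed in as callables (the stand-ins in the usage example are trivial placeholders, not real parser components):

```python
def spiral_annotation(batches, parse, correct, retrain):
    """Spiral corpus construction: each batch is parsed with the
    current model, corrected manually, and added to the corpus the
    next model is trained on, so later batches need fewer fixes."""
    corpus, model = [], None
    for batch in batches:
        parsed = [parse(model, sentence) for sentence in batch]
        corrected = [correct(p) for p in parsed]
        corpus.extend(corrected)
        model = retrain(corpus)   # improved model for the next round
    return corpus, model

# toy stand-ins: the "model" is just the number of corrected sentences
corpus, model = spiral_annotation(
    batches=[["s1", "s2"], ["s3"]],
    parse=lambda m, s: s.upper(),
    correct=lambda p: p.lower(),
    retrain=len,
)
assert corpus == ["s1", "s2", "s3"] and model == 3
```

The efficiency gain comes from the retrain step: as the corpus grows, the parser's output needs progressively less manual correction.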
We present an integrated phrase segmentation/alignment algorithm (ISA) for statistical machine translation. Without the need to build an initial word-to-word alignment or to first segment the monolingual text into phrases, as other methods do, this algorithm segments the sentences into phrases and finds their alignments simultaneously. For each sentence pair, ISA builds a two-dimensional matrix to represent the pair, where the value of each cell corresponds to the point-wise mutual information (MI) between the source and target words. Based on the similarities of MI values among cells, we identify the aligned phrase pairs. Once all the phrase pairs are found, we know both how to segment each sentence into phrases and the alignments between the source and target phrases. We use monolingual bigram language models to estimate the joint probabilities of the identified phrase pairs. The joint probabilities are then normalized to conditional probabilities, which are used by the decoder. Despite its simplicity, this approach yields phrase-to-phrase translations with significantly higher precision than our baseline system, where phrase translations are extracted from the HMM word alignment. When we combine the phrase-to-phrase translations generated by this algorithm with the baseline system, the improvement in translation quality is even larger.
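The matrix ISA operates on can be sketched directly from co-occurrence counts: one point-wise MI value per source/target word pair of the sentence pair. A minimal illustration (the counts are hypothetical; the real algorithm then grows rectangular blocks of similar MI values into phrase pairs, which is not shown here):

```python
import math

def pmi(c_st, c_s, c_t, n):
    """Point-wise mutual information from co-occurrence counts:
    log of observed pair frequency over the independence estimate."""
    return math.log((c_st * n) / (c_s * c_t))

def pmi_matrix(src, tgt, c_st, c_s, c_t, n):
    """One PMI cell per (source word, target word) of a sentence pair."""
    return [[pmi(c_st[(s, t)], c_s[s], c_t[t], n) for t in tgt] for s in src]

M = pmi_matrix(
    ["la", "maison"], ["the", "house"],
    c_st={("la", "the"): 90, ("la", "house"): 5,
          ("maison", "the"): 10, ("maison", "house"): 35},
    c_s={"la": 100, "maison": 50},
    c_t={"the": 120, "house": 40},
    n=1000,
)
# translation pairs dominate their row and column
assert M[0][0] > M[0][1] and M[1][1] > M[1][0]
```

Cells belonging to a true phrase pair form a contiguous region of similarly high MI, which is the signal ISA segments and aligns on.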
Ontologies provide the advantages of knowledge reusability, sharing, and greater robustness when used to build large knowledge-based systems. However, translating between English statements and a specific ontology requires skill in knowledge engineering and an understanding of formal logic and the ontology itself. A knowledge engineer must be familiar with the concepts in the ontology, the fine distinctions between terms, and the specific way the ontology conceptualizes the world. We are developing a tool, CELT (Controlled English to Logic Translation), to enable non-programmers to add knowledge expressed in terms of an ontology. CELT is an automatic translation tool that converts controlled English to KIF formulas using ontologies built with the Suggested Upper Merged Ontology (SUMO). WordNet provides a base lexicon and a default preference for word senses. We do not expect CELT to obviate the need for knowledge engineers but instead to better leverage their time, just as current machine translation tools assist professional human translators. CELT uses Discourse Representation Theory to handle the translation of multiple sentences, the use of logical quantifiers, and the resolution of anaphoric references. The sentences are parsed using a Definite Clause Grammar augmented with feature grammar extensions. CELT is domain-independent but can be customized for particular domains by providing domain-specific ontologies and lexicons. Domain lexicons can specify both technical terms and domain-specific preferred word senses for common words. CELT translates sentences into assertions and queries for a first-order logic theorem prover.
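The shape of the controlled-English-to-KIF mapping can be shown with a toy one-pattern translator. This is vastly simpler than CELT's DCG-plus-DRT pipeline and uses an invented pattern, purely to illustrate the target representation:

```python
import re

def to_kif(sentence):
    """Toy translation of one controlled-English pattern
    ('Subject Verbs Object.') into a KIF-style formula.
    Real CELT parses with a Definite Clause Grammar and builds
    Discourse Representation Structures; this is only a sketch."""
    m = re.fullmatch(r"(\w+) (\w+)s (\w+)\.?", sentence)
    if not m:
        raise ValueError("outside the controlled fragment")
    subject, verb, obj = m.groups()
    return f"({verb} {subject} {obj})"

assert to_kif("John likes Mary.") == "(like John Mary)"
```

A controlled language makes this mapping deterministic: each accepted surface pattern has exactly one logical form, which is what lets non-programmers author theorem-prover input.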
In this study we present blind equalization techniques for the ETSI standard Distributed Speech Recognition (DSR) front-end which compensate for acoustic mismatch caused by input devices. The DSR front-end employs vector quantization (VQ) for feature parameter compression, so the mismatch not only causes a shift of the parameters but also increases VQ distortion. Although cepstral mean subtraction (CMS) is one of the most effective methods of compensating for the shift, it cannot decrease VQ distortion in general. To compensate for the shift and decrease VQ distortion simultaneously, the proposed methods estimate the shift in the input data necessary to match the VQ codebook. The methods do not need the acoustic likelihood, which is calculated in a decoder on the server side; therefore, they are applicable to the DSR framework. The Japanese Newspaper Article Sentences (JNAS) database was used for the equalization experiments. While the word error rate (WER) for the ETSI standard DSR front-end was 18.6% under the acoustically mismatched condition, our proposed method yielded a rate of 12.3%.
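CMS, the baseline compensation the abstract compares against, simply removes the per-utterance mean of each cepstral coefficient, cancelling any constant channel shift. A minimal sketch with toy two-dimensional frames (the proposed codebook-matching estimation itself is not reproduced here):

```python
def cms(frames):
    """Cepstral Mean Subtraction: subtract each coefficient's
    per-utterance mean from every frame, removing a constant
    convolutional-channel offset in the cepstral domain."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

# a constant channel offset disappears after CMS
clean = [[1.0, 2.0], [3.0, 4.0]]
shifted = [[x + 0.5 for x in f] for f in clean]
assert cms(shifted) == cms(clean)
```

The paper's point is that such a shift correction alone does not reduce the extra VQ distortion the DSR compression introduces, motivating equalization that explicitly targets the VQ codebook.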
The levels-of-processing theory proposes that there are many ways to process and code information. The level of processing adopted will determine the quality of the representation used to store the information in the ...