This paper proposes a Chinese semantic role classification approach on the basis of feature combination. First we define a set of effective basic features. Then a statistics-based feature combination method is develop...
Duplicate emails, which are widespread on the internet and are mainly caused by mailing lists, not only waste storage resources but also burden users with junk. In this paper, based on the structure and text features of email, we put forward the concept of Mail-Duplicate-Degree, which defines email duplication for the first time. Based on this definition, we develop a clustering-based algorithm to detect duplicate emails. By introducing a hash function provided by a TRIE tree to improve efficiency, the algorithm overcomes the slow processing speed of traditional clustering methods. Experimental results on a large-scale email collection show that the algorithm achieves high precision.
In the research and development of various natural language processing systems, such as Q&A systems and text-to-scene conversation systems, we have found that knowledge of text entailment helps considerably in improving system performance. Systems with text entailment knowledge will be smarter than those without it. Many research teams are currently working on text entailment, including recognition, generation and extraction. Of these, entailment extraction is the main method for building an entailment knowledge database. Moreover, because events are central to stories, our main goal for the text-to-scene conversation system is to find a method for event entailment extraction. In this paper, we propose a method for extracting event entailment from a corpus based on EM iteration, an approach that has not been used before.
Selection of wavelet type, decomposition level and fusing rule is a key problem when wavelet transform is applied to image fusion. 2916 kinds of different fusing methods (54×5×9, including 54 wavelet types, 5...
Information Extraction is the task of identifying information in texts and converting it into a predefined format. In this paper, we build an information integration system that focuses on information about computer science teachers in Chinese universities. The goal of the system is to automatically extract useful information from heterogeneous sources and reorganize it into a structured format. The system includes four main modules: a web page retrieval module, a web page structure classification module, an information extraction module and an information updating module. We have successfully applied the system to 107 universities in China, which demonstrates the effectiveness of the proposed system.
In our study of Text-to-Scene conversation (TTS), which automatically translates natural language into animations, we realized that event entailment knowledge is useful in generating scenes, since the main purpose of a scene is to show an event. In this paper, we report some results of our attempt to extract event entailment knowledge. We use entailment chains instead of traditional entailment rules because a sequence of events forms a process, which makes chains useful in TTS. The results show that the work merits further study.
This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. In practice, content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on comment spam identification, precision reaches 93% on Chinese and 82% on English. The results show the validity and language independence of this approach. Compared with conventional spam filtering approaches, our method requires no training, no rule sets and no link relationships. The proposed approach can also handle new comments as well as existing ones.
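The three content features named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, and the URL-prefix link counter are all assumptions made for illustration (Chinese text in particular would need a proper word segmenter), and the abstract does not say how the features are combined into a spam decision.

```python
from collections import Counter
from math import sqrt


def count_links(comment: str) -> int:
    """Count URL prefixes in a comment (hypothetical stand-in for link detection)."""
    lowered = comment.lower()
    return lowered.count("http://") + lowered.count("https://")


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of term-frequency vectors (a simple vector space model)."""
    # Assumption: whitespace tokenization; swap in a segmenter for Chinese.
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

The length of the string returned by `longest_common_substring`, together with how often it recurs across comments (for example via `str.count`), would give one possible repetitiveness signal; any thresholds on these values would have to be chosen empirically.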
This paper presents a new method to acquire Domain-Ontology structure and examples from semi-structured data sources. Firstly, extract Domain-Ontology structure, including candidate attributes extraction using certain...
ISBN: 9789898111845 (print)
In this paper, an adaptive feature-weight adjusted image classification method is proposed, based on SVM and the fusion of multiple features. First, a classifier is constructed separately for each image feature; the weight coefficient of each feature is then learned automatically from the training data set and the constructed classifiers. Finally, a composite classifier is created by combining the separate classifiers with their corresponding weight coefficients. The experimental results show that our scheme improves image classification performance and is adaptive compared with the general approach. Moreover, the scheme is robust to some degree because it avoids the impact of the differing dimensionality of each feature.
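The fusion step described in this abstract can be sketched as score-level weighting of per-feature classifiers. The sketch below is an illustration under stated assumptions, not the paper's method: the abstract does not give the weight-learning rule, so weights here are simply set proportional to each classifier's training accuracy, and the per-feature SVMs are represented by arbitrary callables returning a signed score. The names `learn_weights` and `fused_score` are hypothetical.

```python
from typing import Callable, List, Sequence

# Assumption: each per-feature classifier maps a sample to a signed score,
# positive for the target class and negative otherwise (as an SVM decision
# function would).
Classifier = Callable[[Sequence[float]], float]


def learn_weights(
    classifiers: Sequence[Classifier],
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> List[float]:
    """Weight each classifier by its training accuracy, normalized to sum to 1."""
    accuracies = []
    for clf in classifiers:
        correct = sum(
            1 for x, y in zip(samples, labels) if (clf(x) >= 0) == (y > 0)
        )
        accuracies.append(correct / len(samples))
    total = sum(accuracies)
    if total == 0:
        # Fall back to uniform weights if every classifier is always wrong.
        return [1.0 / len(classifiers)] * len(classifiers)
    return [a / total for a in accuracies]


def fused_score(
    classifiers: Sequence[Classifier],
    weights: Sequence[float],
    sample: Sequence[float],
) -> float:
    """Combine per-feature scores into one decision value."""
    return sum(w * clf(sample) for w, clf in zip(weights, classifiers))
```

Because only scalar scores are combined, features of very different dimensionality never share one concatenated vector, which is one plausible reading of the robustness claim in the abstract.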
ISBN: 9789898111920 (print)
This paper proposes a framework for analyzing SMT translation output from a hierarchical phrase decoder. A tree display tool shows the translation process of the SMT model. An interactive operation tool provides an adjustment mechanism for improving translation quality. The work explores automatic or semi-automatic identification and correction of some translation errors based on a comparison between hierarchical phrase structures and linguistic phrase structures. Parts of the framework have been implemented, and preliminary results are presented.