In the 1960s, researchers at Harbin Institute of Technology (HIT) began relevant research on natural language *** more than 40 years of effort, HIT has established three research laboratories for Chinese information processing, *** Machine Intelligence and Translation Laboratory (MI&T Lab), the Intelligent Technology and Natural Language Processing Laboratory (ITNLP), and the Information Retrieval Laboratory (IR-Lab). At present, it has a well-balanced research team of over 200 people, and its research interests have extended to language processing, machine translation, text retrieval, and other *** Institute of Technology has accumulated a body of key techniques and data resources and won many prizes in technical evaluations at home and *** Institute of Technology has become one of the most important natural language processing bases for teaching and scientific research in China *** paper gives an introduction to the achievements in NLP at HIT.
The paper presents the main progress and achievements in Chinese information processing. It focuses on six areas: Chinese syntactic analysis, Chinese semantic analysis, machine translation, information retrieval, information extraction, and speech recognition and synthesis. The important techniques and the likely key problems of each branch in the near future are discussed as well.
In practical applications of information retrieval, such as search engines, the query a user submits usually contains only a few keywords. This causes word mismatches between relevant documents and the user's query and seriously degrades retrieval performance. Based on an analysis of how queries are produced, this paper puts forward a new query expansion method based on a statistical machine translation model. The approach extracts terms that relate documents to queries through the statistical machine translation model, then expands the query with them. Experiments on TREC data collections show that the method consistently achieves a 4-17% improvement over the language model method without expansion. Compared with pseudo feedback, the method achieves competitive average precision.
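The expansion step described in this abstract can be illustrated with a minimal Python sketch. It assumes a translation table of probabilities t(document term | query term) has already been estimated; the table contents, the function name `expand_query`, and the `top_k` parameter are hypothetical and are not taken from the paper.

```python
def expand_query(query_terms, translation_table, top_k=2):
    """Append the top_k candidate terms with the highest summed translation probability.

    translation_table maps a query term to {document term: probability}.
    It stands in for a trained statistical translation model (an assumption).
    """
    scores = {}
    for q in query_terms:
        for term, prob in translation_table.get(q, {}).items():
            if term not in query_terms:  # do not re-add terms already in the query
                scores[term] = scores.get(term, 0.0) + prob
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:top_k]]


# Toy table with made-up probabilities, for illustration only.
toy_table = {
    "car": {"automobile": 0.4, "vehicle": 0.3, "car": 0.2},
    "price": {"cost": 0.5, "vehicle": 0.05},
}
expanded = expand_query(["car", "price"], toy_table, top_k=2)
```

Summing probabilities over query terms favors candidates related to several query words at once; other scoring rules are equally plausible.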
This paper proposes a novel model for statistical machine translation based on structure alignments. The meta-structure of a parse tree and the sequence of meta-structures are defined. During translation, a parse tree is decomposed to deal with structural divergence, and alignments can be constructed at different levels of recombination of meta-structure (RM). This method can perform structure mapping across sub-tree structures between languages. As a result, we obtain not only the translation in the target language but also the sequence of meta-structures of its parse tree. Experiments show that the model, within a log-linear framework, has better generative ability and significantly outperforms Pharaoh, a phrase-based system.
Natural language parsing is a task of great importance and extreme difficulty. In this paper, we present a full Chinese parsing system based on a two-stage approach. Rather than identifying all phrases with a uniform model, we use a divide-and-conquer strategy. We propose an effective and fast method based on a Markov model to identify the base phrases. Then we make the first attempt to extend one of the best English parsing models, the head-driven model, to recognize Chinese complex phrases. Our two-stage approach is superior to the uniform approach in two respects. First, it creates synergy between the Markov model and the head-driven model. Second, it reduces the complexity of full Chinese parsing and makes the parsing system efficient in space and time. We evaluate our approach with PARSEVAL measures on the open test set, where the parsing system achieves 87.53% precision and 87.95% recall.
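For readers unfamiliar with the PARSEVAL measures mentioned here, the sketch below computes labeled bracket precision and recall in Python. Representing a constituent as a `(label, start, end)` tuple and the function name `parseval` are illustrative assumptions, and the toy brackets are invented rather than drawn from the paper's test set.

```python
from collections import Counter


def parseval(gold, predicted):
    """Return (precision, recall) over labeled constituents.

    Each constituent is a (label, start, end) tuple. Multisets are used so
    that duplicated brackets are matched at most as often as they occur.
    """
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    matched = sum((gold_counts & pred_counts).values())  # multiset intersection
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    return precision, recall


# Invented example: 2 of 4 predicted brackets match 2 of 3 gold brackets.
gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
predicted = [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5), ("NP", 3, 5)]
precision, recall = parseval(gold, predicted)
```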
This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. In practice, content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on Chinese and English comments, precision reaches 93% and 82% respectively. The results show the validity and language independence of this approach. Compared with conventional spam filtering approaches, our method requires no training, no rule sets, and no link relationships. The proposed approach can handle new comments as well as existing ones.
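Two of the features named in this abstract can be sketched in Python as follows. This is a minimal illustration rather than the paper's implementation: whitespace tokenization, raw term counts as vector weights, and the function names are all assumptions, and how the paper combines substring length and frequency into a repetitiveness score is not shown.

```python
import math
from collections import Counter


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the match ending at i-1, j-1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts under a bag-of-words vector space model."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0
```

A long substring shared by many comments, or a low similarity between a comment and the post it is attached to, would then count as evidence of spam under whatever thresholds are chosen.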
The popularity of blogs and the amount of information in the blogosphere are increasing so fast that it is difficult for Internet users to find the information they care about. Compared with conventional websites, links in the blogosphere are more abundant and conversations between bloggers are more frequent. This paper proposes a method of ranking bloggers based on link analysis, which reflects the characteristics of blogs and reduces the influence of link spamming. The method also makes it more convenient for users to read blogs and offers a new methodology for information retrieval in the blogosphere. To ensure the reliability of the ranking results, some evaluation indicators for important bloggers are proposed, and the ranking of bloggers produced by the proposed method is compared with the rankings produced by other indicators. Finally, correlation analysis is used to verify the consistency between the proposed method and the evaluation indicators.
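The abstract does not say which link-analysis algorithm is used. As a generic illustration only, the Python sketch below ranks nodes of a small link graph with standard PageRank by power iteration; PageRank is a stand-in here, not the paper's method, and the graph, damping factor, and iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as {source: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[n] / len(nodes)
                for m in nodes:
                    new_rank[m] += share
        rank = new_rank
    return rank


# Hypothetical bloggers: "a" and "b" both link to "c", and "c" links back to "a".
blog_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(blog_links)
```

In this toy graph "c" ends up with the highest score because it receives links from both other bloggers.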
Event recognition and temporal information analysis are important subtasks in information extraction (IE). In this paper, event recognition based on time series characteristics is proposed. In the pipeline of event re...
This paper presents a weakly-supervised transfer learning based text categorization method, which does not need to tag new training documents when facing classification tasks in new area. Instead, we can take use of t...
The optimization of search results has always been the research hotspot in the area of search engine. More concretely, topic partition by clustering proved to be a good way. However, the clusters, some of which still ...