Annotating linguistic data is often a complex, time-consuming and expensive endeavor. Even with strict annotation guidelines, human subjects often deviate in their analyses, each bringing different biases, interpretation...
ISBN: 9781479939046 (print)
Identifying backbone nodes in public opinion diffusion can help people understand how opinion spreads. Previous work relies on how the opinion diffuses over time and yields disappointing results. This paper presents a novel method for identifying backbone nodes in public opinion diffusion that can be applied to different platforms; we take Sina microblog as an example platform. Besides traditional factors, our model takes personal contribution degree and physical contribution degree into consideration by estimating personal features and diffusion scale, respectively. Finally, we employ a visual graph to show a person's role in public opinion diffusion intuitively. Experimental results show that this method performs well in identifying backbone nodes.
ISBN: 9781479939046 (print)
As a popular Internet information exchange platform, microblogs such as Twitter attract a large number of users who share information through short, noisy messages. In this paper, we aim to discover microblog users' interests using a topic model. In the topic model, user metadata such as labels are taken as new features and added to the user document, which is then used to infer the user's interests. Experimental results indicate that this method yields satisfactory user interests and is suitable for real-world projects. This paper also introduces two applications based on the detected user interests: 1) keyword extraction based on interest (we calculate word entropy using the word-topic distribution as a new feature); 2) user clustering based on user interest.
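The word-entropy feature mentioned above can be illustrated with a minimal sketch. The abstract does not give the formula, so this assumes the Shannon entropy of each word's distribution over topics, with low entropy taken to mean that a word is concentrated in few topics and is therefore a plausible interest keyword. The names `word_topic_entropy` and `rank_keywords`, and the toy weights, are hypothetical.

```python
import math


def word_topic_entropy(topic_weights):
    """Shannon entropy (in nats) of one word's distribution over topics.

    topic_weights: non-negative weights, one per topic (assumed input format).
    The weights are normalised here, so raw counts work as well as probabilities.
    """
    total = sum(topic_weights)
    if total <= 0:
        raise ValueError("topic weights must contain a positive value")
    entropy = 0.0
    for weight in topic_weights:
        if weight > 0:
            p = weight / total
            entropy -= p * math.log(p)
    return entropy


def rank_keywords(word_topic, top_k=3):
    """Return the top_k words with the lowest topic entropy.

    Assumption: a word concentrated in few topics is a better keyword.
    """
    ranked = sorted(word_topic, key=lambda w: word_topic_entropy(word_topic[w]))
    return ranked[:top_k]


# Toy word-topic weights over three topics (illustrative values only).
toy_word_topic = {
    "football": [0.90, 0.05, 0.05],
    "today": [0.34, 0.33, 0.33],
    "election": [0.10, 0.80, 0.10],
}
print(rank_keywords(toy_word_topic, top_k=2))
```

In this toy input, "today" is spread almost evenly over the topics, so it has the highest entropy and ranks last.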
ISBN: 9781467352512 (print)
It is well known that statistical machine translation (SMT) performance suffers when a model is applied to out-of-domain data. It is also known that the more similar the test domain and the training domain are, the more efficient the training data are for SMT performance. Hence, measuring the similarity of domains is an important task for selecting appropriate training data. The most widely used method uses the cosine similarity function and word frequency. The lack of exploration of other approaches motivates us to propose and compare several similarity measures. Aiming for better SMT performance, we compared 10 similarity measures, formed by combining 2 feature representations with 5 similarity functions. The results show that using the relative word frequency as the feature representation and the skew divergence as the similarity function performs best among the 10 measures and outperforms random data selection.
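To make the two measures named above concrete, here is a minimal sketch that computes cosine similarity and skew divergence over relative word frequencies. The abstract does not state its exact formulation, so this assumes one common definition of skew divergence, KL(p || alpha*q + (1 - alpha)*p), with alpha = 0.99 as an illustrative value. The function names and the toy corpora are hypothetical.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency vectors (higher = more similar)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def skew_divergence(p, q, alpha=0.99):
    """Skew divergence KL(p || alpha*q + (1-alpha)*p) (lower = more similar).

    Mixing a little of p into q keeps every term finite, even for words
    that never occur in q. Requires 0 <= alpha < 1.
    """
    divergence = 0.0
    for word, p_w in p.items():
        mixed = alpha * q.get(word, 0.0) + (1.0 - alpha) * p_w
        divergence += p_w * math.log(p_w / mixed)
    return divergence


# Toy data: pick the training corpus closest to the test domain.
test_rf = relative_frequencies("the patient received a dose".split())
medical_rf = relative_frequencies("the patient was given a dose of the drug".split())
news_rf = relative_frequencies("the market fell after the election".split())

for name, corpus_rf in [("medical", medical_rf), ("news", news_rf)]:
    print(name, cosine_similarity(test_rf, corpus_rf), skew_divergence(test_rf, corpus_rf))
```

On this toy input both measures prefer the medical corpus. Note that skew divergence is asymmetric, so the argument order matters.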
ISBN: 9781479902590 (print)
Named Entity Recognition (NER) is one of the most important problems in natural language processing (NLP). NER also has broad application prospects and important research value. There are many methods and techniques for solving the NER problem. In this paper, for a specific application background, a new semi-supervised NER method based on multi-pattern fusion is proposed. We first use a soft-matching method on the entity internal pattern. Then, through a bootstrapping process on the training corpus, we obtain an entity external pattern. Finally, we fuse the internal and external patterns to complete the named entity recognition. Experiments were performed on Chinese weapon names from the People's Daily corpus and some military news articles. They showed that when the internal characteristic is significant and the training corpus has a higher similarity with the test corpus, this method performs better than the soft-matching method and the external-pattern-based bootstrapping method, improving named entity recognition precision by 18.2%.
ISBN: 9781479902590 (print)
Traditional search engines rarely consider features of the document set, so the retrieval results are unsatisfactory after new documents are added to the retrieval system. In this paper we combine the features of the document set with traditional retrieval models and propose an incremental learning strategy to optimize the retrieval results. We built a feature thesaurus by extracting features from the document set. Then we collected new features from the newly added documents and refreshed the feature thesaurus. Finally, the search results were reordered according to how well they matched the feature thesaurus for a given query. Several sets of experiments show that, on the top 10 results, this method improves on traditional retrieval methods by an average of 9.4% in precision, 14.9% in MAP, and 4.6% in DCG, which means that it handles queries better, performs even better on queries about the newly added documents, and locates the required information faster.
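For readers unfamiliar with the three metrics reported above, the following sketch shows one standard way to compute them for a single ranked result list. The abstract does not give its formulas, so these are assumptions: precision is taken over the top k results, average precision (the per-query quantity averaged into MAP) is normalised by the number of relevant results retrieved, and DCG uses the rel_i / log2(i + 1) discount. All function names are hypothetical.

```python
import math


def precision_at_k(relevance, k=10):
    """Fraction of the top k results that are relevant (relevance > 0)."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def average_precision(relevance):
    """Average of precision values at each rank where a relevant result appears.

    Assumption: normalised by the relevant results in this list, not by all
    relevant documents in the collection.
    """
    hits = 0
    precision_sum = 0.0
    for rank, r in enumerate(relevance, start=1):
        if r > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def dcg_at_k(relevance, k=10):
    """Discounted cumulative gain: graded relevance discounted by log2(rank + 1)."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(relevance[:k], start=1))


# Relevance of results in ranked order, before and after a hypothetical reordering.
before = [0, 1, 1]
after = [1, 0, 1]
print(precision_at_k(after, k=3), average_precision(after), dcg_at_k(after, k=3))
```

Moving a relevant result to the top leaves precision at k unchanged but raises both average precision and DCG, which is why rank-sensitive metrics are reported alongside precision when results are reordered.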
This paper presents a weakly supervised, transfer-learning-based text categorization method that does not need newly tagged training documents when facing classification tasks in a new area. Instead, we can make use of t...
This paper describes our syllable-based phrase transliteration system for the NEWS 2012 shared task on the English-Chinese track and its back. Grapheme-based transliteration maps the character(s) on the source side to the...
The Chinese word segmenter is the basis for all subsequent applications of natural language processing. The corpus-based statistical method has become the predominant method. However, the training corpora are not enough esp...