ISBN: 9781605588896 (print)
The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and use these rules for de-duplication using just URL strings, without fetching the content explicitly. Our technique consists of mining the crawl logs and using clusters of similar pages to extract transformation rules, which are then used to normalize the URLs belonging to each cluster. Preserving every mined rule for de-duplication is not efficient because of the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site-specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves twice the reduction in duplicates with only half the rules compared to the most recent previous approach. Scalability of the framework is demonstrated by a large-scale evaluation on a set of 3 billion URLs, implemented using the MapReduce framework. Copyright 2010 ACM.
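As a rough illustration of what applying a URL transformation rule for de-duplication can look like, the sketch below normalizes URLs with one hand-written rule and groups URLs that map to the same normalized form. It is not the paper's mining or generalization procedure: the rule, the parameter names in IRRELEVANT_PARAMS, and the function names are all assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule: query parameters assumed not to change page content.
# A mined rule set would be learned from crawl logs; this one is hand-written.
IRRELEVANT_PARAMS = {"sid", "sessionid", "ref"}


def normalize(url: str) -> str:
    """Apply the rule: lowercase scheme/host, drop irrelevant params, sort the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IRRELEVANT_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_duplicates(urls):
    """Group URLs by normalized form, using only the URL strings."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    return groups
```

URLs that land in the same group are treated as candidate duplicates without fetching any page content, which is the property the abstract emphasizes.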
ISBN: 9781605588896 (print)
Tagging plays an important role in many recent websites. Recommender systems can help by suggesting to a user the tags they might want to use for a specific item. Factorization models based on the Tucker Decomposition (TD) model have been shown to provide high-quality tag recommendations, outperforming other approaches such as PageRank, FolkRank and collaborative filtering. The problem with TD models is the cubic core tensor, which results in a runtime that is cubic in the factorization dimension for both prediction and learning. In this paper, we present the factorization model PITF (Pairwise Interaction Tensor Factorization), which is a special case of the TD model with linear runtime for both learning and prediction. PITF explicitly models the pairwise interactions between users, items and tags. The model is learned with an adaptation of the Bayesian personalized ranking (BPR) criterion, which was originally introduced for item recommendation. Empirically, we show on real-world datasets that this model outperforms TD by a large margin in runtime and can even achieve better prediction quality. Besides our lab experiments, PITF also won the ECML/PKDD Discovery Challenge 2009 for graph-based tag recommendation. Copyright 2010 ACM.
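To make the linear-runtime claim concrete, here is a minimal scoring sketch for a pairwise interaction model. The abstract does not give the model equation, so the form used here (a user-tag dot product plus an item-tag dot product, each over k factors) is an assumption, as are all function and variable names. Learning the factors with BPR is not shown.

```python
def pitf_score(user_vec, item_vec, tag_user_vec, tag_item_vec):
    """Score one (user, item, tag) triple in O(k) for k-dimensional factors.

    Assumed form: <user, tag_user> + <item, tag_item>. Each tag carries two
    factor vectors, one for its interaction with users and one with items.
    """
    user_tag = sum(a * b for a, b in zip(user_vec, tag_user_vec))
    item_tag = sum(a * b for a, b in zip(item_vec, tag_item_vec))
    return user_tag + item_tag


def rank_tags(user_vec, item_vec, tag_factors):
    """Return tags sorted by descending score for one (user, item) post.

    tag_factors maps tag -> (tag_user_vec, tag_item_vec); hypothetical layout.
    """
    scores = {
        tag: pitf_score(user_vec, item_vec, tag_user, tag_item)
        for tag, (tag_user, tag_item) in tag_factors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Each score costs two dot products of length k, so prediction grows linearly in the factorization dimension rather than cubically as with a full core tensor.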
ISBN: 9781605588896 (print)
We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled examples is typically unreliable because the learning task is underconstrained. This paper pursues the thesis that much greater accuracy can be achieved by further constraining the learning task, by coupling the semi-supervised training of many extractors for different categories and relations. We characterize several ways in which the training of category and relation extractors can be coupled, and present experimental results demonstrating significantly improved accuracy as a result. Copyright 2010 ACM.
In clustering methods, the estimation of the optimal number of clusters is significant for subsequent analysis. As a simple clustering method, the fuzzy c-means algorithm (FCM) has been widely discussed and applied in...
ISBN: 9780769539232 (print)
Cross-document coreference resolution plays an important part in the field of natural language processing (NLP). It captures the ability to gather documents containing information about a certain entity. Most previous algorithms identify the underlying entity of a given document from the original text, which is unreliable if the original text contains multiple parts on different themes. In this paper, we propose a cross-document coreference resolution algorithm based on an automatic text summary instead of the original text. In our approach, we extract a query-specific, informative-indicative summary from the original text using the Hobbs algorithm and measure the similarity between two summaries. This automatic text summary-based cross-document coreference resolution (ATSCDCR) system is effective in disambiguating different entities that share the same mention name and in identifying the same entity under different mention names. The results from our experiments show that the macro average of the ATSCDCR system reaches 73.16% and the micro average is 67.34%.
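The difference between the macro and micro averages reported above can be sketched as follows. The abstract does not say which underlying measure is averaged, so this example assumes a simple per-group ratio of correct decisions to total decisions; the function names and the input layout are hypothetical.

```python
def macro_average(groups):
    """Mean of per-group ratios: every group counts equally, whatever its size."""
    ratios = [correct / total for correct, total in groups]
    return sum(ratios) / len(ratios)


def micro_average(groups):
    """Ratio of pooled counts: every individual decision counts equally."""
    total_correct = sum(correct for correct, _ in groups)
    total_items = sum(total for _, total in groups)
    return total_correct / total_items
```

With groups such as (9 correct of 10) and (1 correct of 5), the macro average weights both groups equally, while the micro average is dominated by the larger group, which is why the two figures can differ.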
ISBN: 3642152856 (print)
The proceedings contain 25 papers. The special focus of this conference is on Agents, Knowledge Acquisition, Data Mining, Machine Learning, Neural Nets and Intelligent Systems Engineering. The topics include: involving the human user in the control architecture of an autonomous agent; transferring hopscotch from the schoolyard to the classroom; social relationships as a means for identifying an individual in large information spaces; a methodology for inducing pre-pruned modular classification rules; enhancement of infrequent purchased product recommendation using data mining techniques; a machine learning approach to predicting winning patterns in track cycling omnium; learning motor control by dancing YMCA; analysis and comparison of probability transformations for fusing sensors with uncertain detection performance; a case-based approach to business process monitoring; a survey on the dynamic scheduling problem in astronomical observations; combining determinism and intuition through univariate decision strategies for target detection from multi-sensors; a UML profile oriented to the requirements modeling in intelligent tutoring systems projects; learning by collaboration in intelligent autonomous systems; full text search engine as scalable k-nearest neighbor recommendation system; following a developing story on the web; computer-aided estimation for the risk of development of gastric cancer by image processing; intelligent hybrid architecture for tourism services; knowledge-based geo-risk assessment for an intelligent measurement system; and case-based decision support in time dependent medical domains.
Feature selection has seen growing importance in statistics, pattern recognition, machine learning and data mining. Researchers have demonstrated interest in methods for improving the performance of t...
ISBN: 9783642130588 (print)
This paper details preliminary research into modeling the behavior of Electronic Gaming Machines (EGMs) for the task of proactive fault diagnostics. EGMs operate within a state space, so their behavior was modeled, using supervised learning, as the frequency at which a given machine operates in each particular state. The results indicated that EGMs did exhibit measurably different behavior when they were about to experience a fault, and that these relationships were modeled effectively by several algorithms.
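The state-frequency representation described above can be sketched as a small feature extractor. The state names, the log layout, and the function name are assumptions made for illustration; the abstract does not specify the actual EGM states or which learning algorithms were used.

```python
from collections import Counter


def state_frequencies(observed_states, all_states):
    """Turn a machine's observed state log into a fixed-length feature vector.

    Each entry is the fraction of observations spent in the corresponding
    state of all_states; states never observed get 0.0.
    """
    counts = Counter(observed_states)
    total = len(observed_states)
    if total == 0:
        return [0.0 for _ in all_states]
    return [counts[state] / total for state in all_states]
```

A vector like this, computed per machine over a time window and labeled by whether a fault followed, is the kind of input a supervised classifier could be trained on.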
In this paper, a novel fuzzy support vector machine based image watermarking scheme is proposed. Since the application of support vector machine in the process of watermarking technology is only a simple classificatio...
User Navigation Behavior Mining (UNBM) mainly studies the problem of extracting interesting user access patterns from user access sequences (UAS), which are usually used for user access prediction and web page re...