ISBN: (Print) 9781509012565
In topic-focused (theme) crawlers, the shark-search algorithm does not sufficiently consider web pages from a global perspective. This paper uses the PageRank algorithm to calculate the authority of each URL to make up for this shortcoming, and proposes the shark-PageRank algorithm, which uses the anchor text, the context near the anchor text, and the authority value of the web page to measure the value of a URL. The experimental results show that the new algorithm improves the speed and accuracy of queries and has good stability and scalability.
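As an illustration only, the following Python sketch shows one way the three signals named in the abstract (anchor text, context near the anchor, and page authority) could be combined into a single URL score. The abstract does not give the actual formula, so the weighted sum, the weights, the cosine similarity measure, and the names `cosine`, `pagerank`, and `url_score` are all assumptions made for this example, not the authors' method.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the bag-of-words vectors of two strings."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = set(graph) | {q for outs in graph.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = graph.get(p, [])
            # A page without outlinks spreads its rank evenly over all pages.
            targets = outs if outs else pages
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


def url_score(topic, anchor, context, authority,
              w_anchor=0.4, w_context=0.3, w_auth=0.3):
    """Hypothetical weighted sum of the three signals (weights are assumed).

    `authority` is expected in [0, 1], e.g. a PageRank value divided by
    the largest PageRank value in the crawled graph.
    """
    return (w_anchor * cosine(topic, anchor)
            + w_context * cosine(topic, context)
            + w_auth * authority)
```

In a crawler built along these lines, `pagerank` would be recomputed periodically on the link graph gathered so far, and `url_score` would set the priority of each unvisited URL in the frontier.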
ISBN: (Print) 9781509012572
In topic-focused (theme) crawlers, the shark-search algorithm does not sufficiently consider web pages from a global perspective. This paper uses the PageRank algorithm to calculate the authority of each URL to make up for this shortcoming, and proposes the shark-PageRank algorithm, which uses the anchor text, the context near the anchor text, and the authority value of the web page to measure the value of a URL. The experimental results show that the new algorithm improves the speed and accuracy of queries and has good stability and scalability.
Focused crawling is increasingly seen as a solution to the scalability limitations of existing general-purpose search engines: it traverses the Web to gather only pages that are relevant to a specific topic. A key issue in the design of focused crawlers is how to predict the relevance, to a given topic, of the unvisited pages pointed to by candidate URLs in the crawling frontier. In this paper, we propose a novel approach based on multiple relevance prediction strategies to address this problem. For cross-language crawling, we first introduce a hierarchical taxonomy to describe topics in both English and Chinese. We then present a formal description of the relevance prediction process and discuss four strategies that make use of page contents, anchor texts, URL addresses and link types of Web pages, respectively, to evaluate relevance more accurately; among these, we propose a particular strategy that uses Chinese URL addresses to estimate the relevance of cross-language Web pages. Finally, we obtain a new focused crawling algorithm (FCMRPS, Focused Crawling based on Multiple Relevance Prediction Strategies) by combining these strategies with shark-search, a classic focused crawling algorithm. Experiments show that FCMRPS is more effective than the traditional algorithms, namely Breadth-First, Best-First and shark-search, in terms of precision and sum of information. (C) 2008 Elsevier Ltd. All rights reserved.
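The idea of combining several relevance predictors to order the crawling frontier can be sketched as follows. This is a minimal illustration, not the FCMRPS algorithm itself: the abstract does not say how the four strategies are combined, so the weighted sum, the weights, the term-overlap measure, the link-type labels in `LINK_TYPE_PRIOR`, and the `Frontier` class are all hypothetical choices made for this example.

```python
import heapq
import re

# Hypothetical prior per link type; the labels and values are assumptions.
LINK_TYPE_PRIOR = {"same_site": 0.6, "cross_site": 0.3}


def term_overlap(topic_terms, text):
    """Fraction of topic terms that occur as tokens in the given text."""
    if not topic_terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1 for t in topic_terms if t in tokens) / len(topic_terms)


def predict_relevance(topic_terms, parent_text, anchor, url, link_type,
                      weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of four signals: page content, anchor text, URL, link type."""
    scores = (
        term_overlap(topic_terms, parent_text),  # content of the linking page
        term_overlap(topic_terms, anchor),       # anchor text of the link
        term_overlap(topic_terms, url),          # tokens in the URL address
        LINK_TYPE_PRIOR.get(link_type, 0.0),     # prior for the link type
    )
    return sum(w * s for w, s in zip(weights, scores))


class Frontier:
    """Best-first crawling frontier: pops the highest-scoring unvisited URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Each URL is queued at most once; heapq is a min-heap, so negate.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A crawler loop would pop the best URL, fetch the page, score each outgoing link with `predict_relevance`, and push the results back into the frontier.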