This paper builds a system comprising a distributed crawling module, a database module and an analysis module to collect a large volume of objective data, clean it and present it as business intelligence. The distributed crawling module is built on Hadoop, the database module on SQL Server, and the analysis module on the SAS system. The first two modules collect data in a distributed way and convert unstructured data into structured data for processing by the analysis module. The paper then extends Belt and Road (B&R) research from theory to application: it constructs a rating system covering political risk, economic risk, financial risk, business environment risk and legal risk, establishes 140 rating indices, collects 46,200 sample data points, and applies Principal Component Analysis, the Analytic Hierarchy Process and an efficacy function to assess export credit insurance country risk for 66 Belt and Road countries over 5 consecutive years. The paper also explains the 2015 rating results in detail: Singapore receives the highest credit rating; Latvia, Estonia, Slovakia, Turkey, Malaysia, Russia, Thailand and other countries receive very high credit ratings; and Afghanistan, Ukraine, Laos, Iran, the Syrian Arab Republic, Iraq, Burma, the Republic of Yemen, East Timor and other countries receive poor credit ratings. These conclusions are consistent with those of well-known domestic and overseas rating agencies.
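To make the rating pipeline concrete, the following is a minimal sketch of combining weighted risk dimensions into one composite country score. It assumes AHP-style weights over the five risk dimensions and an efficacy-function mapping of raw indicators onto a bounded score range; the weights, dimension values and score range are illustrative placeholders, not the paper's actual indices or parameters.

# Hypothetical sketch: scoring one country from its five risk-dimension values.
# Weights, raw values and bounds are illustrative, not taken from the paper.
def efficacy_score(x, x_min, x_max, base=60.0, span=40.0):
    """Efficacy-function style scoring: map a raw indicator onto [base, base + span]."""
    return base + span * (x - x_min) / (x_max - x_min)

# AHP-style weights over the five risk dimensions (assumed for illustration).
weights = {"political": 0.25, "economic": 0.25, "financial": 0.20,
           "business_environment": 0.15, "legal": 0.15}

# Raw dimension scores for one hypothetical country and shared indicator bounds.
raw = {"political": 6.1, "economic": 7.4, "financial": 5.8,
       "business_environment": 6.9, "legal": 6.3}
bounds = {k: (0.0, 10.0) for k in raw}

composite = sum(weights[k] * efficacy_score(raw[k], *bounds[k]) for k in raw)
print(f"composite country risk score: {composite:.1f}")

In the paper's setting the weights would come from the AHP/PCA step over the 140 indices rather than being fixed by hand as above.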
ISBN (print): 9780769538174
To solve the task scheduling and load balancing problems of distributed search engines, this paper proposes a GNP-based scheduling strategy for distributed crawling together with a load balancing method. An Internet distance estimation mechanism is adopted as a replacement for large-scale network distance measurement, which not only improves the response speed of the system but also reduces the load the system places on the WAN. By deploying crawling nodes across WANs, we built a distributed search engine and implemented several scheduling strategies. Online experiments show a significant improvement in the system's performance.
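The core scheduling idea can be sketched as follows: GNP-style positioning embeds hosts in a virtual coordinate space so that geometric distance approximates network latency, and each crawl target is then assigned to the nearest crawling node. The node names and coordinates below are invented for illustration, and obtaining the coordinates (e.g., from delay measurements to landmark hosts) is assumed to have happened already.

# Minimal sketch of GNP-style task assignment, assuming precomputed coordinates.
import math

crawler_nodes = {"node-a": (12.0, 3.5), "node-b": (40.2, 18.9), "node-c": (7.7, 55.1)}

def estimated_distance(p, q):
    """Euclidean distance in the virtual coordinate space stands in for measured latency."""
    return math.dist(p, q)

def assign_task(target_coords, nodes=crawler_nodes):
    """Pick the crawling node with the smallest estimated Internet distance to the target."""
    return min(nodes, key=lambda name: estimated_distance(nodes[name], target_coords))

print(assign_task((10.5, 50.0)))   # -> node-c in this toy layout

A load balancer would additionally cap how many tasks each node may hold, falling back to the next-nearest node when the nearest one is saturated.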
ISBN (print): 1932415467
Distributed crawling is able to overcome important limitations of traditional single-sourced web crawling systems. However, the optimal benefit of distributed crawling is usually limited to the sites hosting the crawlers; the rest of the URLs are by and large randomly distributed to the various crawlers. In this work, we propose a location-aware method, called IPMicra, that utilizes an IP address hierarchy and allows links to be crawled in a near-optimal, location-aware manner. Our proposal outperforms earlier distributed crawling schemes, requiring one order of magnitude less time to crawl the same set of sites.
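A rough sketch of the assignment idea follows: each crawler is responsible for a set of IP subnets arranged hierarchically, and a URL's resolved IP is matched against the most specific (longest-prefix) delegated subnet. The subnet-to-crawler delegations below are invented for illustration and are not IPMicra's actual delegation data.

# Sketch of location-aware URL assignment via an IP address hierarchy (assumed delegations).
import ipaddress

delegations = {
    ipaddress.ip_network("192.0.2.0/24"): "crawler-eu",
    ipaddress.ip_network("198.51.100.0/24"): "crawler-us",
    ipaddress.ip_network("198.51.0.0/16"): "crawler-us-backup",
}

def assign_url(ip_str, table=delegations):
    """Return the crawler whose delegated subnet most specifically contains the IP."""
    ip = ipaddress.ip_address(ip_str)
    matches = [net for net in table if ip in net]
    if not matches:
        return "crawler-default"                            # no delegation covers this IP
    best = max(matches, key=lambda net: net.prefixlen)      # longest prefix wins
    return table[best]

print(assign_url("198.51.100.7"))   # -> crawler-us (the /24 is more specific than the /16)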
ISBN (print): 9780769550602
A Web crawler is an important component of a Web search engine. It demands a large amount of hardware resources to crawl data from the rapidly growing and changing Web, and the crawling process should be performed continuously to keep the data up to date. This paper develops a new approach to speeding up the crawling process on a multi-core processor by utilizing the concept of virtualization. In this approach, the multi-core processor is divided into a number of virtual machines (VMs), which can concurrently perform different crawling tasks on different initial data. The paper presents a description, implementation, and evaluation of a VM-based distributed Web crawler. The speedup factor achieved by the VM-based crawler over a non-virtualized crawler, for crawling various numbers of documents, is estimated, and the effect of the number of VMs on the speedup factor is investigated.
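The evaluation quantity is the speedup factor, i.e., sequential crawl time divided by the time taken when the work is split across VMs. The toy sketch below uses worker processes as a stand-in for the paper's VMs and a sleep as a placeholder for per-document download and parse work; it only illustrates how such a speedup figure is measured, not the paper's crawler.

# Toy sketch: partition documents across N workers (processes stand in for VMs)
# and estimate speedup = T_sequential / T_parallel.
import time
from multiprocessing import Pool

def fetch(doc_id):
    time.sleep(0.01)          # placeholder for download + parse work
    return doc_id

def crawl(doc_ids, n_workers):
    start = time.perf_counter()
    if n_workers == 1:
        [fetch(d) for d in doc_ids]
    else:
        with Pool(n_workers) as pool:
            pool.map(fetch, doc_ids)
    return time.perf_counter() - start

if __name__ == "__main__":
    docs = list(range(200))
    t_seq = crawl(docs, 1)
    t_par = crawl(docs, 4)
    print(f"speedup factor with 4 workers: {t_seq / t_par:.2f}")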