Over the last few years, web data extraction has gained popularity. Product information on the Ecommerce website floods the internet with big data. Web-based business sites these days have gotten one of the most signi...
详细信息
ISBN:
(纸本)9781728188768
Over the last few years, web data extraction has gained popularity. Product information on the Ecommerce website floods the internet with big data. Web-based business sites these days have gotten one of the most significant hotspots for getting a large amount of relevant data. Wide range of software application designs to extract relevant data from web pages in order to draw in more business. The extracted data can be used for retailer business and data analysis purposes. The web pages on such sites are based on different technologies, and the generated web documents are in structured or unstructured formats. Manually extract such relevant product data and multimedia type Information from the websites is complex and time-consuming. After extraction of data needs to be classified because web content contains unwanted data e.g. design information, advertising content. This paper describes different Procedures for web document classification and extraction.
An unrecognized significance of the web acts as a driving force for the massive and rapid growth of websites in each domain of social life. For making a successful website, it is necessary for developers to embrace ap...
详细信息
An unrecognized significance of the web acts as a driving force for the massive and rapid growth of websites in each domain of social life. For making a successful website, it is necessary for developers to embrace appropriate web testing and evaluation methodology. Some valuable works in the past have striven to appraise the web applications quantitatively. Various parameters have been considered which are again sub-parameterized to measurable indicators. But their weighing criterion has not been appropriately taken into account according to the domain of the website. Also, the relative degrees of interactions among parameters have not been taken into consideration. The work presented in this paper aims at describing a framework, Quality Index Evaluation Method to gauge the design quality of a website in the form of index value. An automated tool has been designed and coded to measure the metrics quantitatively. A weighing technique based on Fuzzy-DEMATEL (Decision Making Trial and Evaluation Laboratory Method) has been applied on these metrics. Fuzzy trapezoidal numbers have been used for assessment of parameters and the final design quality index value. To verify the use of framework in different website domains, it has been exercised on eight academic (four institutional and four digital libraries), five informative and four commercial websites. The results have been validated through the most widely used method in literature, i.e., user judgment. Opinions of users for each website have been quantified and aggregated with fuzzy aggregation technique. Experimental results show that the proposed framework provides accurate and consistent results in very less time.
With the rapid development of Internet, and surge in the amount of information on the Internet, how to accurately and quickly get the information of the users really need, such as the title, links, and pictures, is th...
详细信息
ISBN:
(纸本)9783037858004
With the rapid development of Internet, and surge in the amount of information on the Internet, how to accurately and quickly get the information of the users really need, such as the title, links, and pictures, is the hotspot. This paper proposed a fast web information extraction method based on html parser, this paper validated the effect of the proposed method by extracting commodities information of e-commerce website, the results show that the accuracy of the information extraction by our method is higher than the extraction method based on regular expressions, and the extraction time is greatly shortened.
In order to distinguish and extract the topic information from other interferential information on the BBC news website for the study in social computing, the BBC News Hunter was proposed in this paper. The whole syst...
详细信息
In order to distinguish and extract the topic information from other interferential information on the BBC news website for the study in social computing, the BBC News Hunter was proposed in this paper. The whole system consists of 6 subsystems, respectively named: UI, Control, Download, Analysis,Storage and Log. Numerical experiments show that satisfactory results can be obtained from the BBC news website, whose average accuracy as well as efficiency are acceptable.
This paper introduces a concept of information entropy to judge web-page types, which associates with the method put forward by Roadrunner that pre-purifying topic pages and then using proportional relation to judge t...
详细信息
ISBN:
(纸本)9780769548968
This paper introduces a concept of information entropy to judge web-page types, which associates with the method put forward by Roadrunner that pre-purifying topic pages and then using proportional relation to judge the type of pages. With some typical pages from large website home, the average precision could be reached to 96.7%, which lays foundation for further information extracting work
暂无评论