ISBN (digital): 9783642340383
ISBN (print): 9783642340376
This paper presents an XML-based web data extraction method. The method translates a web page into an XML document, analyzes the XML document using XPath/XSLT, discovers web page data patterns and similarity using an XML clustering algorithm, and constructs an XPath-based data extraction rule template. This improves the robustness and versatility of the web data extraction system. Experimental results show that the data extraction method has high precision and adapts to web pages from different sites and with different structures.
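As a loose illustration of the XPath rule-template idea (not the authors' implementation; the markup, field names, and XPaths below are invented for the example), a page parsed into a DOM/XML tree can be driven by such a template:

```python
# Illustrative sketch: apply an XPath-based extraction rule template to a page.
from lxml import html

PAGE = """
<html><body>
  <div class="item"><span class="name">Laptop A</span><span class="price">$899</span></div>
  <div class="item"><span class="name">Laptop B</span><span class="price">$1099</span></div>
</body></html>
"""

# A rule template of the kind described above: one XPath per record,
# plus relative XPaths per field (all paths here are hypothetical).
RULE = {
    "record": "//div[@class='item']",
    "fields": {"name": ".//span[@class='name']/text()",
               "price": ".//span[@class='price']/text()"},
}

tree = html.fromstring(PAGE)
records = []
for node in tree.xpath(RULE["record"]):
    records.append({f: (node.xpath(xp) or [""])[0] for f, xp in RULE["fields"].items()})

print(records)  # [{'name': 'Laptop A', 'price': '$899'}, {'name': 'Laptop B', 'price': '$1099'}]
```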
ISBN (print): 9780863419218
A web data extraction system is presented that adopts comparison and analysis of web pages within a website. After web pages are converted to trees and segmented into blocks, the data block of a page is located by comparison and analysis, and the data is then extracted by comparing and judging multiple pages that share the same structure and format, enabling in-depth mining of technical information. The system's architecture and composition are described, along with tests of the system on chemical physical-property databases.
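A minimal sketch of the page-comparison idea, under the simplifying assumption that text which differs across two same-template pages is data and text which is identical is template (the pages below are invented, and this is not the system's actual procedure):

```python
# Compare two pages of the same structure: paths whose text varies hold data.
from lxml import html

def text_by_path(page):
    tree = html.fromstring(page)
    return {tree.getroottree().getpath(el): (el.text or "").strip()
            for el in tree.iter() if (el.text or "").strip()}

PAGE_A = "<html><body><h1>Properties</h1><table><tr><td>Melting point</td><td>98 C</td></tr></table></body></html>"
PAGE_B = "<html><body><h1>Properties</h1><table><tr><td>Melting point</td><td>1538 C</td></tr></table></body></html>"

a, b = text_by_path(PAGE_A), text_by_path(PAGE_B)
data_paths = [p for p in a if p in b and a[p] != b[p]]
print(data_paths)  # the path of the second <td>, i.e. the varying data cell
```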
Web contents usually contain different types of data embedded in complex structures. Existing approaches for extracting data contents from the web are manual wrappers, supervised wrapper induction, or automatic data extraction. The webOMiner system is an automatic extraction system that attempts to extract diverse heterogeneous web contents by modeling web sites as object-oriented schemas. The goal is to generate and integrate various web site object schemas for deeper comparative querying of historical and derived contents of Business-to-Customer (B2C) sites such as BestBuy and Future Shop. The current webOMiner system generates and extracts from only one product list page (e.g., the computer page) of B2C web sites and still needs to generate and extract from more comprehensive web site object schemas (e.g., those of Computer, Laptop, and Desktop products). The current webOMiner system also does not yet handle historical aspects of data objects from different web pages. This thesis extends and advances the webOMiner system to automatically generate a more comprehensive web site object schema, extract and mine structured web contents from different web pages based on similarity matching of object patterns, and store the extracted objects in a historical object-oriented data warehouse. The approaches used include similarity matching of DOM tree tag nodes to identify data blocks and data regions, and automatic Non-Deterministic and Deterministic Finite Automata (NFA and DFA) for generating web site object schemas and extracting contents that contain similar data objects. Experimental results show that our system is effective and able to extract and mine structured data tuples from different websites with 79% recall and 100% precision. The average execution time of our system is 21.8 seconds.
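The DOM tag-node similarity matching step might look roughly like the following sketch, where difflib's SequenceMatcher stands in for whatever similarity measure webOMiner actually uses and the page fragment is invented:

```python
# Adjacent siblings with highly similar tag sequences form a candidate data region.
from difflib import SequenceMatcher
from lxml import html

PAGE = """<div id="products">
  <div class="p"><h3>Laptop A</h3><span>$899</span></div>
  <div class="p"><h3>Laptop B</h3><span>$1099</span></div>
  <div class="ad"><img src="banner.png"/></div>
</div>"""

root = html.fromstring(PAGE)
children = list(root)

def tag_seq(node):
    # Flatten a subtree into its sequence of element tags.
    return [el.tag for el in node.iter()]

for left, right in zip(children, children[1:]):
    sim = SequenceMatcher(None, tag_seq(left), tag_seq(right)).ratio()
    print(left.get("class"), right.get("class"), round(sim, 2))
# The two "p" blocks score 1.0 (same structure); the ad block scores much lower.
```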
The composition of web APIs provides a great opportunity for web engineers, who can reuse existing software components available on the web. Finding the best API fulfilling a set of user requirements, among the many described on the web, is a key step in developing an effective web application; however, web engineers have little support in solving this problem due to poor search mechanisms and the heterogeneity of sources and descriptions. Semantic technologies and matching algorithms provide accurate methods to match user requirements against a set of descriptions. Nonetheless, semantic descriptions of APIs are not available in practice. In this paper, we propose a method to extract information on web APIs published in several web sources and create semantic descriptions that can then be fused to deliver comprehensive descriptions associated with APIs. During the extraction process, we take into account that the collected information has different levels of accuracy, currency, and trustworthiness in order to state a confidence level for the results. The method is based on evaluating the quality of the involved sources, the extracted values, and the overall descriptions. The resulting semantic descriptions are then matched with expressive user requirements to address the API selection problem.
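A toy sketch of quality-weighted fusion of API descriptions gathered from several sources; the sources, quality scores, and attributes are invented, and the real method evaluates quality at the source, value, and description level:

```python
# Fuse conflicting attribute values by weighting each source's vote by its quality.
sources = [
    {"quality": 0.9, "desc": {"protocol": "REST", "format": "JSON"}},
    {"quality": 0.6, "desc": {"protocol": "REST", "format": "XML"}},
    {"quality": 0.4, "desc": {"protocol": "SOAP", "format": "XML"}},
]

def fuse(attr):
    votes = {}
    for s in sources:
        if attr in s["desc"]:
            votes[s["desc"][attr]] = votes.get(s["desc"][attr], 0.0) + s["quality"]
    value, weight = max(votes.items(), key=lambda kv: kv[1])
    confidence = weight / sum(votes.values())  # share of quality mass behind the winner
    return value, round(confidence, 2)

print(fuse("protocol"))  # ('REST', 0.79)
print(fuse("format"))    # ('XML', 0.53)
```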
There is tremendous growth in the volume of information available on the internet, in digital libraries, news sources, and company databases or intranets that contain valuable information. Information from the World Wide Web has been a source that caters to different sectors, ranging from the social and political to the economic sphere, for decision making. Such information would be more valuable if it were available to end users and other application systems in the required formats. This has created a need for tools that assist users in extracting relevant information quickly and effectively. We explore an efficient mechanism for extracting web data through analysis of HTML tags and patterns. HTML constitutes a large percentage of web content. However, much of this content lacks strict structure and a proper schema. Additionally, web content has a high update frequency and semantic heterogeneity compared to other formats such as XML that are firmer in structure. We have produced a customised generic model that can be used to extract unstructured data from the web and populate a database with it. The main contribution is an automated process for locating, extracting, and storing data from HTML web sources. Such data is then available to other application software for analysis and further processing.
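A rough end-to-end sketch of such a locate-extract-store pipeline, assuming lxml and SQLite (the table layout and selectors are illustrative, not the paper's model):

```python
# Locate rows via HTML tag patterns, extract their cells, and store them in a database.
import sqlite3
from lxml import html

PAGE = """<table>
  <tr><td>Gold</td><td>1064</td></tr>
  <tr><td>Iron</td><td>1538</td></tr>
</table>"""

rows = [[td.text_content().strip() for td in tr.xpath("./td")]
        for tr in html.fromstring(PAGE).xpath(".//tr")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE melting_points (substance TEXT, celsius REAL)")
con.executemany("INSERT INTO melting_points VALUES (?, ?)", rows)
print(con.execute("SELECT * FROM melting_points").fetchall())
# [('Gold', 1064.0), ('Iron', 1538.0)]
```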
The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as the title, publication date, and author, need to be extracted from news pages and stored in a structured way for further processing. An automatic, unified approach to extracting such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.
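A toy sketch of the two-step idea (independent visual features propose candidates, dependent features pick the winner); the blocks, font sizes, and thresholds are invented, and a real system would obtain them from a rendering engine:

```python
# Step 1: candidates from independent features; step 2: resolve via layout relations.
blocks = [
    {"text": "World News Network",          "font": 14, "y": 10},
    {"text": "Storm batters coastal towns", "font": 28, "y": 60},
    {"text": "May 3, 2010 | By J. Smith",   "font": 11, "y": 95},
    {"text": "Residents were evacuated...", "font": 13, "y": 130},
]

# Independent feature: a large font suggests a title candidate.
title_candidates = [b for b in blocks if b["font"] >= 20]
title = max(title_candidates, key=lambda b: b["font"])

# Dependent feature: the date/author line usually sits just below the title.
date_line = min((b for b in blocks if b["y"] > title["y"]), key=lambda b: b["y"])

print(title["text"], "|", date_line["text"])
```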
User reviews on forum sites are an important information source for many popular applications (e.g., monitoring and analysis of public opinion) and are usually represented in the form of structured records. To the best of our knowledge, little existing work reported in the literature has systematically investigated the problem of extracting user reviews from forum sites. Beyond the variety of web page templates, user-generated reviews raise two new challenges. First, the inconsistency of review contents in terms of both the document object model (DOM) tree and visual appearance impairs the similarity between review records; second, the review content in a review record corresponds to complicated subtrees rather than single nodes in the DOM tree. To tackle these challenges, we present WeRE, a system that performs automatic user review extraction using sophisticated techniques. Review records are first extracted from web pages based on the proposed level-weighted tree similarity algorithm, and the review contents within the records are then extracted exactly by measuring node consistency. Our experimental results on 20 forum sites indicate that WeRE can achieve high extraction accuracy. (C) 2011 Elsevier Ltd. All rights reserved.
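A level-weighted tree similarity could be sketched as below, with a geometric level decay standing in for WeRE's actual weighting scheme (the review fragments are invented):

```python
# Compare two DOM subtrees level by level, weighting shallow levels more heavily.
from collections import Counter
from lxml import html

def tags_by_level(node, level=0, acc=None):
    acc = acc if acc is not None else {}
    acc.setdefault(level, Counter())[node.tag] += 1
    for child in node:
        tags_by_level(child, level + 1, acc)
    return acc

def level_weighted_sim(a, b, decay=0.5):
    la, lb = tags_by_level(a), tags_by_level(b)
    score = norm = 0.0
    for lvl in set(la) | set(lb):
        w = decay ** lvl
        inter = sum((la.get(lvl, Counter()) & lb.get(lvl, Counter())).values())
        union = sum((la.get(lvl, Counter()) | lb.get(lvl, Counter())).values())
        score += w * (inter / union if union else 1.0)
        norm += w
    return score / norm

r1 = html.fromstring("<div><p>Nice post!</p><span>reply</span></div>")
r2 = html.fromstring("<div><p>Agreed, <b>thanks</b>.</p><span>reply</span></div>")
print(round(level_weighted_sim(r1, r2), 2))  # 0.86: still similar despite differing subtrees
```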
ISBN (print): 9783642239533; 9783642239540
The amount of information available on the web grows at an incredibly high rate. Systems and procedures devised to extract these data from web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust web data mining algorithms that can automatically cope with possible malfunctioning or failures. On the other hand, the literature lacks solutions for the maintenance of these systems. Procedures that extract web data may be tightly interconnected with the structure of the data source itself; thus, malfunctioning or acquisition of corrupted data could be caused, for example, by structural modifications of data sources made by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to creating procedures that extract data from web sources - the so-called web wrappers - which can cope with possible malfunctioning caused by modifications of the structure of the data source and can automatically repair themselves.
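One plausible self-repair strategy, sketched under the assumption that the wrapper stores a sample value from its last successful run (this is an illustration, not the paper's algorithm):

```python
# If the stored XPath stops matching, relocate the node whose text best matches
# a sample value captured when the wrapper last worked, and rewrite the rule.
from difflib import SequenceMatcher
from lxml import html

wrapper = {"xpath": "//span[@class='price']", "sample": "$899.00"}

NEW_PAGE = "<html><body><div class='cost'>$915.00</div></body></html>"
tree = html.fromstring(NEW_PAGE)

nodes = tree.xpath(wrapper["xpath"])
if not nodes:  # structural change detected
    candidates = [el for el in tree.iter() if (el.text or "").strip()]
    best = max(candidates,
               key=lambda el: SequenceMatcher(None, el.text.strip(), wrapper["sample"]).ratio())
    wrapper["xpath"] = tree.getroottree().getpath(best)  # repaired rule

print(wrapper["xpath"], "->", tree.xpath(wrapper["xpath"])[0].text)
```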
ISBN (print): 9783642152450
Most previous work on web news article extraction focuses only on the content and title. To meet the growing demand for various web data integration applications, more useful news attributes, such as the publication date and author, need to be extracted and stored in a structured way for further processing. In this paper, we study the problem of automatically extracting multiple news attributes from news pages. Unlike traditional approaches (e.g., extracting news attributes separately or generating template-dependent wrappers), we propose an automatic, unified approach to extract them based on the visual features of news attributes, which include independent visual features and dependent visual features. The basic idea of our approach is that, first, candidates for each news attribute are extracted from the news page based on their independent visual features, and then the true value of each attribute is identified from the candidates based on dependent visual features (the layout relations among news attributes). Extensive experiments using a large number of news pages show that the proposed approach is highly effective and efficient.
Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation that would make building a complete system straightforward. In this paper, we demonstrate a holistic approach to web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction, and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirm this. More importantly, because a document can be represented as a vector of schemata, it can easily be incorporated into existing systems as the fabric for integration.
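The "document as a vector of schemata" idea can be roughly illustrated by counting tag paths and comparing documents by cosine similarity (this path-frequency representation is an assumption made for the example, not necessarily the authors' exact encoding):

```python
# Represent each document as a vector of tag-path frequencies, then compare vectors.
from collections import Counter
from math import sqrt
from lxml import html

def schema_vector(page):
    tree = html.fromstring(page)
    paths = Counter()
    for el in tree.iter():
        ancestors = [a.tag for a in el.iterancestors()][::-1] + [el.tag]
        paths["/".join(ancestors)] += 1
    return paths

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

d1 = schema_vector("<html><body><div><h2>A</h2><p>x</p></div><div><h2>B</h2><p>y</p></div></body></html>")
d2 = schema_vector("<html><body><div><h2>C</h2><p>z</p></div></body></html>")
print(round(cosine(d1, d2), 2))  # ~0.96: the same embedded structure pattern
```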