ISBN: 9781450322638 (print)
For increasingly sophisticated use cases, end users often need to extract, combine, and aggregate information from various (often dynamically generated) web pages on multiple websites. Current search engines do not focus on combining information from various web pages to answer the user's overall information need. Semantic Web and Linked Data approaches usually take a static view of the data and rely on providers' cooperation. In this paper, we present a novel approach that enables end users to easily extract data from web pages while they browse, store it locally in their browser, and structure, integrate and search such data. We propose Datalog rules for integrating and searching the extracted data. We show how cleaning steps and integration rules can be reused to accelerate the cleaning and integration of extracted data. The proposed approach is implemented as a browser plugin. We present its implementation details and report on our evaluation of the plugin with respect to user experience and savings in browsing time.
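To illustrate what Datalog-style integration and search rules over extracted data might look like, here is a minimal sketch. The relation names (`price`, `review`, `book`, `good`), the sample facts and the rating threshold are all hypothetical; the abstract does not give the paper's actual rules, and the rules are evaluated here as plain Python set comprehensions rather than by a Datalog engine.

```python
# Hypothetical facts, as if extracted from two different websites while browsing.
price_facts = {("Dune", 9.99), ("Emma", 4.50)}      # price(Title, Price)
review_facts = {("Dune", 4.6), ("Ulysses", 3.9)}    # review(Title, Rating)


def integrate(prices, reviews):
    """Integration rule: book(T, P, R) :- price(T, P), review(T, R)."""
    return {(t, p, r) for (t, p) in prices for (t2, r) in reviews if t == t2}


def search(books, min_rating):
    """Search rule: good(T, P) :- book(T, P, R), R >= min_rating."""
    return {(t, p) for (t, p, r) in books if r >= min_rating}


books = integrate(price_facts, review_facts)
good = search(books, 4.0)
print(books)
print(good)
```

The join on the shared title variable is what combines information from the two sources; only titles present in both relations survive the integration rule.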
Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times which needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system.
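The idea of deriving extraction rules by matching web feed values against the blog's HTML can be sketched as follows. This is a simplified illustration, not the BlogForever algorithm: it assumes exact string matching, a rule of the form (tag, class attribute), and hypothetical sample pages and field names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects (tag, class attribute, text) for each piece of text in a page."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of (tag, class) for currently open elements
        self.found = []          # (tag, class, text) triples in document order

    def handle_starttag(self, tag, attrs):
        self.open_elements.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop until the matching tag, tolerating unclosed elements.
        while self.open_elements:
            if self.open_elements.pop()[0] == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_elements:
            tag, css_class = self.open_elements[-1]
            self.found.append((tag, css_class, text))


def text_nodes(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def generate_rules(feed_entry, post_html):
    """For each feed field, find the element whose text equals the feed value."""
    nodes = text_nodes(post_html)
    rules = {}
    for field, value in feed_entry.items():
        for tag, css_class, text in nodes:
            if text == value:
                rules[field] = (tag, css_class)
                break
    return rules


def apply_rules(rules, post_html):
    """Extract each field from another post using the learned (tag, class) rule."""
    nodes = text_nodes(post_html)
    result = {}
    for field, rule in rules.items():
        for tag, css_class, text in nodes:
            if (tag, css_class) == rule:
                result[field] = text
                break
    return result


# Hypothetical feed entry and posts from the same blog template.
FEED_ENTRY = {"title": "Hello World", "author": "Ada"}
KNOWN_POST = ('<html><body><h1 class="post-title">Hello World</h1>'
              '<span class="byline">Ada</span><p>Body text.</p></body></html>')
OTHER_POST = ('<html><body><h1 class="post-title">Second Post</h1>'
              '<span class="byline">Grace</span><p>More text.</p></body></html>')

rules = generate_rules(FEED_ENTRY, KNOWN_POST)
extracted = apply_rules(rules, OTHER_POST)
print(rules)
print(extracted)
```

Because the feed only covers recent posts, rules learned from one feed entry are then applied to posts that the feed no longer lists, which is what makes the approach useful for harvesting a whole blog.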
ISBN: 9780769536996 (print)
The web has become one of the most important connections to various information resources. The most interesting challenge is how to extract important data from a large number of web pages and transform it into more structured, standardized and semantic information that can be queried and analyzed using mature techniques from databases, data warehousing and other fields. We design a wrapper generator by combining data extraction techniques with XBRL technology based on the XBRL-GL taxonomy. The wrapper transforms HTML documents into XML according to an analysis of the HTML document structure, and then uses XPath to locate the data. In this way, we can extract the data accurately and store it in a standard form.
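The last step, locating data with XPath once a page is available as XML, can be sketched with Python's standard library. The page content, the `ledger` table id and the `account`/`amount` class names are hypothetical, the HTML-to-XML conversion is assumed to have already happened, and the mapping to XBRL-GL elements is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already converted from HTML to well-formed XML.
XML_PAGE = """
<html>
  <body>
    <table id="ledger">
      <tr><td class="account">Cash</td><td class="amount">1200.50</td></tr>
      <tr><td class="account">Payables</td><td class="amount">310.00</td></tr>
    </table>
  </body>
</html>
"""


def extract_entries(xml_text):
    """Locate table rows with XPath expressions and return them as dictionaries."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    # ElementTree supports a subset of XPath, including attribute predicates.
    for row in root.findall(".//table[@id='ledger']/tr"):
        account = row.find("td[@class='account']")
        amount = row.find("td[@class='amount']")
        if account is None or amount is None:
            continue  # skip rows that do not follow the expected layout
        entries.append({"account": account.text.strip(),
                        "amount": float(amount.text)})
    return entries


entries = extract_entries(XML_PAGE)
print(entries)
```

Storing the result as plain records keeps the extraction step separate from whatever standard output form (here, XBRL-GL in the paper) the records are later serialized to.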
In this paper, we propose an information extraction (IE) system for extracting data records from semi-structured documents on the Deep Web using a proposed technique called Repetitive Subject Pattern. This technique is based on the hypothesis that data records in a web page must have a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator very flexible: when the automatic process generates the wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in both semi-supervised and unsupervised modes. The experiments show that the proposed technique generates very high-quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.
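The four tasks can be sketched as follows, in the semi-supervised mode where the user supplies a sample subject string. This is an illustration under simplifying assumptions, not the paper's algorithm: the wrapper is just the tag path of the subject item, record boundaries are taken to be the shallowest depth at which subject items stop sharing an ancestor, and the sample page is hypothetical, well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with three repeating data records.
SAMPLE_PAGE = """<html><body>
  <div><a>Home</a></div>
  <ul>
    <li><a>Alpha Camera</a><span>$120</span></li>
    <li><a>Beta Lens</a><span>$80</span></li>
    <li><a>Gamma Tripod</a><span>$35</span></li>
  </ul>
</body></html>"""


def walk(root):
    """Yield, in document order, the chain of elements from the root to each node."""
    stack = [[root]]
    while stack:
        chain = stack.pop()
        yield chain
        for child in reversed(list(chain[-1])):
            stack.append(chain + [child])


def generate_wrapper(root, subject):
    """Tasks 2-3: find the sample subject string; its tag path is the wrapper."""
    for chain in walk(root):
        if (chain[-1].text or "").strip() == subject:
            return tuple(node.tag for node in chain)
    raise ValueError("subject string not found in sample page")


def extract_records(root, wrapper):
    """Task 4: each node with the wrapper's tag path is a subject item."""
    chains = [c for c in walk(root) if tuple(n.tag for n in c) == wrapper]
    # Record boundary: first depth where the subject items have distinct ancestors.
    depth = len(wrapper) - 1
    for i in range(len(wrapper)):
        if len({id(c[i]) for c in chains}) > 1:
            depth = i
            break
    records = []
    for chain in chains:
        texts = [t.strip() for t in chain[depth].itertext() if t.strip()]
        records.append(texts)
    return records


root = ET.fromstring(SAMPLE_PAGE)              # task 1: parse into a tree
wrapper = generate_wrapper(root, "Beta Lens")  # user-supplied sample subject
records = extract_records(root, wrapper)
print(wrapper)
print(records)
```

Note that the navigation link shares the `a` tag with the subject items but has a different tag path, so it is not mistaken for a record.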
ISBN: 9781450325387 (print)
Blogs are a dynamic communication medium which has been widely established on the web. The BlogForever project has developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents a key component of the BlogForever platform, the web crawler. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple and robust algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. This approach leads to a scalable blog data extraction process. Furthermore, we show how we integrate a web browser into the web harvesting process in order to support data extraction from blogs with JavaScript-generated content.
ISBN: 9780769535579 (print)
In this paper, we discuss how to realize Context-Ware Wrapping. We use peer sources to facilitate the matching task and enhance a wrapper's extraction accuracy by leveraging peer wrappers or domain rules. First, we introduce the concept of Context-Ware Wrapping. To realize it, we then propose a Spiral-Decoding Method that synchronizes the extractions by spiral decoding. Finally, we give an algorithm that implements it.
ISBN: 9783037859926 (print)
High-tech talent is an important social resource, like energy and materials, and introducing high-tech talent is an important strategy for the development of national science and technology. To extract information on high-tech talent in a variety of research fields from a large number of websites, we first study the principles of web crawlers and web data extraction. Then, taking U.S. universities as an example, we propose an intelligent method and procedure that can extract scholars' names from websites. Finally, we apply a classification algorithm to identify Chinese scholars working overseas and verify the validity of the method in an experimental system. The accuracy of the classification algorithm is higher than 90%, and the average accuracy of the resulting information is higher than 77%.
To extract structured data from a web page with customized requirements, a user labels some DOM elements on the page with attribute *** common features of the labeled elements are utilized to guide the user through the labeling process to minimize user effort, and are also utilized to retrieve attribute values. To turn the attribute values into a structured result, the attribute pattern needs to be induced. For this purpose, a space-optimized suffix tree called an attribute tree is built to transform the document object model (DOM) tree into a simpler form while preserving its useful properties such as attribute sequence *** pattern is induced bottom-up on the attribute tree, and is further used to build the structured result. Experiments are conducted and show high performance of our approach in terms of precision, recall and structural correctness.
ISBN: 9781479921041 (print)
Extracting data from deep web pages is a challenging problem due to the intricate underlying structures of such pages. A large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are web-page-programming-language-dependent. The contents of web pages are always displayed regularly for users to browse. Deep web data extraction can therefore overcome the limitations of previous work by utilizing some interesting common visual features of deep web pages. In this paper, a vision-based approach that is web-page-programming-language-independent is proposed. This approach utilizes the visual features of web pages to extract data from deep web pages, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the human effort needed to produce an exact extraction of the data. Our implementation on a large set of web databases shows that the proposed vision-based approach is highly effective for data extraction from deep web pages.