ISBN (digital): 9783642340383
ISBN (print): 9783642340376
This paper presents an XML-based web data extraction method. The method translates a web page into an XML document, analyzes the XML document using XPath/XSLT, discovers web page data patterns and similarity using an XML clustering algorithm, and constructs an XPath-based data extraction rule template. This improves the robustness and versatility of the web data extraction system. Experimental results show that the data extraction method has high precision and adapts to web pages from different sites and with different structures.
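As a loose illustration of the XPath rule-template idea (not the authors' implementation; the markup, field names, and XPaths below are invented for the example), a page parsed into a DOM/XML tree can be driven by such a template:

```python
# Illustrative sketch: apply an XPath-based extraction rule template to a page.
from lxml import html

PAGE = """
<html><body>
  <div class="item"><span class="name">Laptop A</span><span class="price">$899</span></div>
  <div class="item"><span class="name">Laptop B</span><span class="price">$1099</span></div>
</body></html>
"""

# A rule template of the kind described above: one XPath per record,
# plus relative XPaths per field (all paths here are hypothetical).
RULE = {
    "record": "//div[@class='item']",
    "fields": {"name": ".//span[@class='name']/text()",
               "price": ".//span[@class='price']/text()"},
}

tree = html.fromstring(PAGE)
records = []
for node in tree.xpath(RULE["record"]):
    records.append({f: (node.xpath(xp) or [""])[0] for f, xp in RULE["fields"].items()})

print(records)  # [{'name': 'Laptop A', 'price': '$899'}, {'name': 'Laptop B', 'price': '$1099'}]
```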
ISBN (print): 9780863419218
A web data extraction system is presented that adopts comparison and analysis of web pages within a website. After web pages are converted to trees and segmented into blocks, the data block of a page is located by comparison and analysis, and the data is then extracted by comparing and judging multiple pages that share the same structure and format, enabling in-depth mining of technical information. The system's architecture and composition are described, along with tests of the system on chemical physical-property databases.
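A minimal sketch of the page-comparison idea, under the simplifying assumption that text which differs across two same-template pages is data and text which is identical is template (the pages below are invented, and this is not the system's actual procedure):

```python
# Compare two pages of the same structure: paths whose text varies hold data.
from lxml import html

def text_by_path(page):
    tree = html.fromstring(page)
    return {tree.getroottree().getpath(el): (el.text or "").strip()
            for el in tree.iter() if (el.text or "").strip()}

PAGE_A = "<html><body><h1>Properties</h1><table><tr><td>Melting point</td><td>98 C</td></tr></table></body></html>"
PAGE_B = "<html><body><h1>Properties</h1><table><tr><td>Melting point</td><td>1538 C</td></tr></table></body></html>"

a, b = text_by_path(PAGE_A), text_by_path(PAGE_B)
data_paths = [p for p in a if p in b and a[p] != b[p]]
print(data_paths)  # the path of the second <td>, i.e. the varying data cell
```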
Web contents usually contain different types of data embedded in complex structures. Existing approaches for extracting data contents from the web are manual wrappers, supervised wrapper induction, or automatic data extraction. The webOMiner system is an automatic extraction system that attempts to extract diverse heterogeneous web contents by modeling web sites as object-oriented schemas. The goal is to generate and integrate various web site object schemas for deeper comparative querying of historical and derived contents of Business-to-Customer (B2C) sites such as BestBuy and Future Shop. The current webOMiner system generates and extracts from only one product list page (e.g., the computer page) of B2C web sites and still needs to generate and extract from more comprehensive web site object schemas (e.g., those of Computer, Laptop, and Desktop products). The current webOMiner system also does not yet handle historical aspects of data objects from different web pages. This thesis extends and advances the webOMiner system to automatically generate a more comprehensive web site object schema, extract and mine structured web contents from different web pages based on similarity matching of object patterns, and store the extracted objects in a historical object-oriented data warehouse. The approaches used include similarity matching of DOM tree tag nodes to identify data blocks and data regions, and automatic Non-Deterministic and Deterministic Finite Automata (NFA and DFA) for generating web site object schemas and extracting contents that contain similar data objects. Experimental results show that our system is effective and able to extract and mine structured data tuples from different websites with 79% recall and 100% precision. The average execution time of our system is 21.8 seconds.
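The DOM tag-node similarity matching step might look roughly like the following sketch, where difflib's SequenceMatcher stands in for whatever similarity measure webOMiner actually uses and the page fragment is invented:

```python
# Adjacent siblings with highly similar tag sequences form a candidate data region.
from difflib import SequenceMatcher
from lxml import html

PAGE = """<div id="products">
  <div class="p"><h3>Laptop A</h3><span>$899</span></div>
  <div class="p"><h3>Laptop B</h3><span>$1099</span></div>
  <div class="ad"><img src="banner.png"/></div>
</div>"""

root = html.fromstring(PAGE)
children = list(root)

def tag_seq(node):
    # Flatten a subtree into its sequence of element tags.
    return [el.tag for el in node.iter()]

for left, right in zip(children, children[1:]):
    sim = SequenceMatcher(None, tag_seq(left), tag_seq(right)).ratio()
    print(left.get("class"), right.get("class"), round(sim, 2))
# The two "p" blocks score 1.0 (same structure); the ad block scores much lower.
```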
The composition of web APIs provides a great opportunity for web engineers, who can reuse existing software components available on the web. Finding the best API fulfilling a set of user requirements, among the many described on the web, is a key step in developing an effective web application; however, web engineers have little support in solving this problem due to poor search mechanisms and the heterogeneity of sources and descriptions. Semantic technologies and matching algorithms provide accurate methods to match user requirements against a set of descriptions. Nonetheless, semantic descriptions of APIs are not available in practice. In this paper, we propose a method to extract information on web APIs published in several web sources and create semantic descriptions that can then be fused to deliver comprehensive descriptions associated with APIs. During the extraction process, we take into account that the collected information has different levels of accuracy, currency, and trustworthiness in order to state a confidence level for the results. The method is based on evaluating the quality of the involved sources, the extracted values, and the overall descriptions. The resulting semantic descriptions are then matched with expressive user requirements to address the API selection problem.
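A toy sketch of quality-weighted fusion of API descriptions gathered from several sources; the sources, quality scores, and attributes are invented, and the real method evaluates quality at the source, value, and description level:

```python
# Fuse conflicting attribute values by weighting each source's vote by its quality.
sources = [
    {"quality": 0.9, "desc": {"protocol": "REST", "format": "JSON"}},
    {"quality": 0.6, "desc": {"protocol": "REST", "format": "XML"}},
    {"quality": 0.4, "desc": {"protocol": "SOAP", "format": "XML"}},
]

def fuse(attr):
    votes = {}
    for s in sources:
        if attr in s["desc"]:
            votes[s["desc"][attr]] = votes.get(s["desc"][attr], 0.0) + s["quality"]
    value, weight = max(votes.items(), key=lambda kv: kv[1])
    confidence = weight / sum(votes.values())  # share of quality mass behind the winner
    return value, round(confidence, 2)

print(fuse("protocol"))  # ('REST', 0.79)
print(fuse("format"))    # ('XML', 0.53)
```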
There is tremendous growth in the volume of information available on the internet, in digital libraries, news sources, and company databases or intranets that contain valuable information. Information from the World Wide Web has been a source that caters to different sectors, ranging from the social and political to the economic sphere, for decision making. Such information would be more valuable if it were available to end users and other application systems in the required formats. This has created a need for tools that assist users in extracting relevant information quickly and effectively. We explore an efficient mechanism for extracting web data through analysis of HTML tags and patterns. HTML constitutes a large percentage of web content. However, much of this content lacks strict structure and a proper schema. Additionally, web content has a high update frequency and semantic heterogeneity compared to other formats such as XML that are firmer in structure. We have produced a customised generic model that can be used to extract unstructured data from the web and populate a database with it. The main contribution is an automated process for locating, extracting, and storing data from HTML web sources. Such data is then available to other application software for analysis and further processing.
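A rough end-to-end sketch of such a locate-extract-store pipeline, assuming lxml and SQLite (the table layout and selectors are illustrative, not the paper's model):

```python
# Locate rows via HTML tag patterns, extract their cells, and store them in a database.
import sqlite3
from lxml import html

PAGE = """<table>
  <tr><td>Gold</td><td>1064</td></tr>
  <tr><td>Iron</td><td>1538</td></tr>
</table>"""

rows = [[td.text_content().strip() for td in tr.xpath("./td")]
        for tr in html.fromstring(PAGE).xpath(".//tr")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE melting_points (substance TEXT, celsius REAL)")
con.executemany("INSERT INTO melting_points VALUES (?, ?)", rows)
print(con.execute("SELECT * FROM melting_points").fetchall())
# [('Gold', 1064.0), ('Iron', 1538.0)]
```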
The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as the title, publication date, and author, need to be extracted from news pages and stored in a structured way for further processing. An automatic, unified approach to extracting such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.
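A toy sketch of the two-step idea (independent visual features propose candidates, dependent features pick the winner); the blocks, font sizes, and thresholds are invented, and a real system would obtain them from a rendering engine:

```python
# Step 1: candidates from independent features; step 2: resolve via layout relations.
blocks = [
    {"text": "World News Network",          "font": 14, "y": 10},
    {"text": "Storm batters coastal towns", "font": 28, "y": 60},
    {"text": "May 3, 2010 | By J. Smith",   "font": 11, "y": 95},
    {"text": "Residents were evacuated...", "font": 13, "y": 130},
]

# Independent feature: a large font suggests a title candidate.
title_candidates = [b for b in blocks if b["font"] >= 20]
title = max(title_candidates, key=lambda b: b["font"])

# Dependent feature: the date/author line usually sits just below the title.
date_line = min((b for b in blocks if b["y"] > title["y"]), key=lambda b: b["y"])

print(title["text"], "|", date_line["text"])
```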
User reviews on forum sites are an important information source for many popular applications (e.g., monitoring and analysis of public opinion) and are usually represented in the form of structured records. To the best of our knowledge, little existing work reported in the literature has systematically investigated the problem of extracting user reviews from forum sites. Beyond the variety of web page templates, user-generated reviews raise two new challenges. First, the inconsistency of review contents in terms of both the document object model (DOM) tree and visual appearance impairs the similarity between review records; second, the review content in a review record corresponds to complicated subtrees rather than single nodes in the DOM tree. To tackle these challenges, we present WeRE, a system that performs automatic user review extraction using sophisticated techniques. Review records are first extracted from web pages based on the proposed level-weighted tree similarity algorithm, and the review contents within the records are then extracted exactly by measuring node consistency. Our experimental results on 20 forum sites indicate that WeRE can achieve high extraction accuracy. (C) 2011 Elsevier Ltd. All rights reserved.
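A level-weighted tree similarity could be sketched as below, with a geometric level decay standing in for WeRE's actual weighting scheme (the review fragments are invented):

```python
# Compare two DOM subtrees level by level, weighting shallow levels more heavily.
from collections import Counter
from lxml import html

def tags_by_level(node, level=0, acc=None):
    acc = acc if acc is not None else {}
    acc.setdefault(level, Counter())[node.tag] += 1
    for child in node:
        tags_by_level(child, level + 1, acc)
    return acc

def level_weighted_sim(a, b, decay=0.5):
    la, lb = tags_by_level(a), tags_by_level(b)
    score = norm = 0.0
    for lvl in set(la) | set(lb):
        w = decay ** lvl
        inter = sum((la.get(lvl, Counter()) & lb.get(lvl, Counter())).values())
        union = sum((la.get(lvl, Counter()) | lb.get(lvl, Counter())).values())
        score += w * (inter / union if union else 1.0)
        norm += w
    return score / norm

r1 = html.fromstring("<div><p>Nice post!</p><span>reply</span></div>")
r2 = html.fromstring("<div><p>Agreed, <b>thanks</b>.</p><span>reply</span></div>")
print(round(level_weighted_sim(r1, r2), 2))  # 0.86: still similar despite differing subtrees
```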
ISBN (print): 9783642239533; 9783642239540
The amount of information available on the web grows at an incredibly high rate. Systems and procedures devised to extract these data from web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust web data mining algorithms that can automatically cope with possible malfunctioning or failures. On the other hand, the literature lacks solutions for the maintenance of these systems. Procedures that extract web data may be tightly interconnected with the structure of the data source itself; thus, malfunctioning or acquisition of corrupted data could be caused, for example, by structural modifications of data sources made by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to creating procedures that extract data from web sources - the so-called web wrappers - which can cope with possible malfunctioning caused by modifications of the structure of the data source and can automatically repair themselves.
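One plausible self-repair strategy, sketched under the assumption that the wrapper stores a sample value from its last successful run (this is an illustration, not the paper's algorithm):

```python
# If the stored XPath stops matching, relocate the node whose text best matches
# a sample value captured when the wrapper last worked, and rewrite the rule.
from difflib import SequenceMatcher
from lxml import html

wrapper = {"xpath": "//span[@class='price']", "sample": "$899.00"}

NEW_PAGE = "<html><body><div class='cost'>$915.00</div></body></html>"
tree = html.fromstring(NEW_PAGE)

nodes = tree.xpath(wrapper["xpath"])
if not nodes:  # structural change detected
    candidates = [el for el in tree.iter() if (el.text or "").strip()]
    best = max(candidates,
               key=lambda el: SequenceMatcher(None, el.text.strip(), wrapper["sample"]).ratio())
    wrapper["xpath"] = tree.getroottree().getpath(best)  # repaired rule

print(wrapper["xpath"], "->", tree.xpath(wrapper["xpath"])[0].text)
```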
ISBN (print): 9783642152450
Most previous work on web news article extraction focuses only on the content and title. To meet the growing demand for various web data integration applications, more useful news attributes, such as the publication date and author, need to be extracted and stored in a structured way for further processing. In this paper, we study the problem of automatically extracting multiple news attributes from news pages. Unlike traditional approaches (e.g., extracting news attributes separately or generating template-dependent wrappers), we propose an automatic, unified approach to extract them based on the visual features of news attributes, which include independent visual features and dependent visual features. The basic idea of our approach is that, first, candidates for each news attribute are extracted from the news page based on their independent visual features, and then the true value of each attribute is identified from the candidates based on dependent visual features (the layout relations among news attributes). Extensive experiments using a large number of news pages show that the proposed approach is highly effective and efficient.
Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation that would make building a complete system straightforward. In this paper, we demonstrate a holistic approach to web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction, and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirm this. More importantly, because a document can be represented as a vector of schemata, it can easily be incorporated into existing systems as the fabric for integration.
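The "document as a vector of schemata" idea can be roughly illustrated by counting tag paths and comparing documents by cosine similarity (this path-frequency representation is an assumption made for the example, not necessarily the authors' exact encoding):

```python
# Represent each document as a vector of tag-path frequencies, then compare vectors.
from collections import Counter
from math import sqrt
from lxml import html

def schema_vector(page):
    tree = html.fromstring(page)
    paths = Counter()
    for el in tree.iter():
        ancestors = [a.tag for a in el.iterancestors()][::-1] + [el.tag]
        paths["/".join(ancestors)] += 1
    return paths

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

d1 = schema_vector("<html><body><div><h2>A</h2><p>x</p></div><div><h2>B</h2><p>y</p></div></body></html>")
d2 = schema_vector("<html><body><div><h2>C</h2><p>z</p></div></body></html>")
print(round(cosine(d1, d2), 2))  # ~0.96: the same embedded structure pattern
```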