Search Results - Inner Mongolia University Library

22nd ACM International Conference on Information and Knowledge Management (CIKM)

Authors: Agarwal, Sudhir; Genesereth, Michael. Affiliation: Stanford Univ, Stanford Comp Sci Dept, 353 Serra Mall, Stanford, CA 94305, USA

ISBN: 9781450322638 (print)

For increasingly sophisticated use cases end users often need to extract, combine, and aggregate information from various (often dynamically generated) web pages from multiple websites. Current search engines do not focus on combining information from various web pages in order to answer the overall information need of the user. Semantic web and Linked data usually take a static view on the data and rely on providers' cooperation. In this paper, we present a novel approach that enables end users to easily extract data from web pages while they browse, store it locally in their browser as well as structure, integrate and search such data. We propose datalog rules for integrating and searching the extracted data. We show how cleaning steps and integration rules can be reused to accelerate the cleaning and integration of extracted data. The proposed approach is implemented as a browser plugin. We present its implementation details and report on our evaluation of the plugin concerning user experience and browsing time saving.

Keywords: web data extraction; web data integration and search

A Scalable Approach to Harvest Modern Weblogs

International Journal on Artificial Intelligence Tools, 2015, Vol. 24, No. 2, pp. 1540005-1540005

Authors: Banos, Vangelis; Blanvillain, Olivier; Kasioumis, Nikos; Manolopoulos, Yannis. Affiliations: Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki 54124, Greece; Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland; CERN, European Org Nucl Res, CH-1211 Geneva 23, Switzerland

Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times which needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system.

Keywords: blog crawler; web data extraction; wrapper generation; interoperability
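
The abstract above describes generating extraction rules by string matching between a blog's web feed and its hypertext. The following is a minimal sketch of that idea, not the BlogForever implementation: it assumes the feed already gives us a known value (here a post title), and it treats a "rule" as a simple tag/class path. The names `generate_rule`, `apply_rule`, and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; skipping them keeps paths stable.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextCollector(HTMLParser):
    """Collect (path, text) pairs, where path is a chain of tag.class steps."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.candidates = []  # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.candidates.append(("/".join(self.stack), text))


def collect(page_html):
    collector = TextCollector()
    collector.feed(page_html)
    collector.close()
    return collector.candidates


def generate_rule(page_html, feed_value):
    """Return the path whose text is most similar to the value from the feed."""
    candidates = collect(page_html)
    if not candidates:
        raise ValueError("page contains no text nodes")
    best_path, _ = max(
        candidates,
        key=lambda c: SequenceMatcher(None, c[1], feed_value).ratio(),
    )
    return best_path


def apply_rule(page_html, rule):
    """Extract the text of every node whose path equals the rule."""
    return [text for path, text in collect(page_html) if path == rule]


# Hypothetical pages: the feed supplies the title of post_a only.
post_a = ('<html><body><div class="nav">Home</div>'
          '<h1 class="entry-title">Harvesting blogs</h1>'
          '<div class="content">Some text.</div></body></html>')
post_b = ('<html><body><div class="nav">Home</div>'
          '<h1 class="entry-title">Second post</h1>'
          '<div class="content">Other text.</div></body></html>')

rule = generate_rule(post_a, "Harvesting blogs")
titles = apply_rule(post_b, rule)
```

The rule learned from one post whose title is known from the feed is then reused on posts that are no longer in the feed, which is what makes the feed useful as a source of training examples.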

Web Data Extraction Based on XBRL-GL Taxonomy

Asia-Pacific Conference on Information Processing (APCIP 2009)

Authors: Luo, Hanyang; Gao, Jinling. Affiliation: Shenzhen Univ, Coll Management, Shenzhen, Peoples R China

ISBN: 9780769536996 (print)

The web has become one of the most important connections to various information resources. The most interesting challenge is how to extract important data from a large number of web pages and transform them to more structural, standard and semantic information, which can be queried and analyzed by using matured techniques in database, data warehouse and other fields. We design a wrapper generator by combining the data extraction technique with XBRL technology based on XBRL-GL taxonomy. The wrapper can transform HTML documents to XML forms according to the analysis of HTML document structure, and then use XPath to locate the data. In this way, we can extract the data accurately and store them in a standard form.

Keywords: web data extraction; XBRL-GL taxonomy; XML; XPath
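
The abstract above describes a wrapper that first turns HTML into an XML form and then uses XPath to locate the data. A minimal sketch of those two steps follows, using only the Python standard library. It is an illustration under assumptions, not the authors' wrapper generator: the class `HtmlToXml`, the sample ledger page, and its `id`/`class` names are hypothetical, and the mapping of extracted values onto XBRL-GL elements is not shown.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that HTML allows without a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Build a well-formed ElementTree from (possibly sloppy) HTML."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close any still-open children before closing this element.
            while self.open_tags:
                open_tag = self.open_tags.pop()
                self.builder.end(open_tag)
                if open_tag == tag:
                    break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)

    def to_tree(self):
        self.close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def extract_rows(root):
    """Locate ledger rows with the XPath subset supported by ElementTree."""
    rows = []
    for tr in root.findall(".//table[@id='ledger']/tr"):
        account = tr.find("td[@class='account']")
        amount = tr.find("td[@class='amount']")
        rows.append((account.text.strip(), float(amount.text)))
    return rows


# Hypothetical page: note the unquoted attribute and the unclosed <br>,
# both of which are valid HTML but not valid XML.
page = """<html><body>
<table id=ledger>
<tr><td class="account">Cash</td><td class="amount">1200.00</td></tr>
<tr><td class="account">Payables<br></td><td class="amount">-450.50</td></tr>
</table>
</body></html>"""

converter = HtmlToXml()
converter.feed(page)
rows = extract_rows(converter.to_tree())
```

Normalizing to XML first means the location step can be expressed declaratively as path expressions instead of ad hoc string searches over the raw markup.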

Information extraction for deep web using repetitive subject pattern

World Wide Web-Internet and Web Information Systems, 2014, Vol. 17, No. 5, pp. 1109-1139

Authors: Thamviset, Wachirawut; Wongthanavasu, Sartra. Affiliation: Khon Kaen Univ, Cellular Automata & Knowledge Engn (CAKE) Lab, Machine Learning & Intelligent Syst (MLIS) Lab, Dept Comp Sci, Fac Sci, Khon Kaen 40002, Thailand

In this paper, we propose an information extraction (IE) system for extracting data records from semi-structured documents on the Deep Web using a proposed technique called Repetitive Subject Pattern. This technique is based on the hypothesis that data records in a web page must have a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator very flexible; when the automatic process generates the wrong wrapper, the user can provide a new sample subject string to generate a better wrapper. As a result, the system can operate in both semi-supervised and unsupervised modes. The experiments show that the proposed technique generates very high quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.

Keywords: information extraction; web data extraction; web content mining; subject pattern; wrapper induction; unsupervised learning
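
The four tasks in the abstract above can be illustrated with a small sketch. It is a deliberate simplification and not the authors' algorithm: it covers only the semi-supervised mode, where the user supplies a sample subject string; the "wrapper" is just the tag path of the subject item; and record boundaries are taken to be each repetition of that path. The names `build_wrapper`, `extract_records`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Elements without closing tags are skipped so they do not distort paths.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathText(HTMLParser):
    """Task 1: walk the page and record (tag path, text) for each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass  # discard unclosed children on the way up

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def parse_items(page_html):
    parser = PathText()
    parser.feed(page_html)
    parser.close()
    return parser.items


def build_wrapper(sample_html, subject_sample):
    """Tasks 2 and 3: find the subject string and keep its path as the wrapper."""
    for path, text in parse_items(sample_html):
        if text == subject_sample:
            return path
    raise ValueError("subject string not found in sample page")


def extract_records(page_html, subject_path):
    """Task 4: every repetition of the subject path opens a new record."""
    region = subject_path.rsplit("/", 1)[0]  # container shared by record fields
    records = []
    for path, text in parse_items(page_html):
        if path == subject_path:
            records.append([text])
        elif records and path.startswith(region + "/"):
            records[-1].append(text)
    return records


# Hypothetical result page with two records and a trailing pager line.
page = """<ul>
<li><b>Alpha Phone</b><span>$199</span><i>In stock</i></li>
<li><b>Beta Tablet</b><span>$349</span><i>Sold out</i></li>
</ul><p>Page 1 of 3</p>"""

wrapper = build_wrapper(page, "Alpha Phone")
records = extract_records(page, wrapper)
```

If the learned path turns out to be wrong, supplying a different sample subject string yields a new wrapper, which mirrors the correction loop described in the abstract.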

BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs

4th International Conference on Web Intelligence, Mining and Semantics (WIMS)

Authors: Blanvillain, Olivier; Kasioumis, Nikos; Banos, Vangelis. Affiliations: Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland; European Org Nucl Res, CERN, CH-1211 Geneva 23, Switzerland; Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki, Greece

ISBN: 9781450325387 (print)

Blogs are a dynamic communication medium which has been widely established on the web. The BlogForever project has developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents a key component of the BlogForever platform, the web crawler. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple and robust algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. This approach leads to a scalable blog data extraction process. Furthermore, we show how we integrate a web browser into the web harvesting process in order to support data extraction from blogs with JavaScript generated content.

Keywords: blog crawler; web data extraction; wrapper generation

A Spiral-Decoding Method for Web Data Extraction

1st International Workshop on Education Technology and Computer Science

Authors: Wan, Lirong; Wang, Xinjun; Chen, Congcong. Affiliation: Shandong Univ, Sch Comp Sci & Technol, Jinan 250100, Peoples R China

ISBN: 9780769535579 (print)

In this paper, we discuss the problem of how to realize Context-Ware Wrapping. We consider the peer sources to facilitate the matching task and enhance a wrapper's extraction accuracy by leveraging the peer wrappers or domain rule. First, we introduce the concept of Context-Ware Wrapping. To address the problem of how to realize it, we then propose a Spiral-Decoding Method that synchronizes the extractions by spiral decoding. Finally, we give an algorithm to realize it.

Keywords: web data extraction; Context-Ware Wrapping; Deep Web

Automatically Extracting University Scholar Names Information and Classification

4th International Conference on Frontiers of Manufacturing and Design Science (ICFMD 2013)

Authors: Su, Chang; Jia, Wenqiang; Shang, Fengjun. Affiliation: Chongqing Univ Posts & Telecommun, Coll Comp Sci & Technol, Chongqing 400065, Peoples R China

ISBN: 9783037859926 (print)

High-tech talent is an important social resource, like energy and materials, and introducing high-tech talent is an important strategy for the development of national science and technology. To extract high-tech talent information in a variety of research fields from massive numbers of websites, we first study the principles of web crawlers and web data extraction. Then, taking U.S. universities as an example, we propose an intelligent method and procedure which can extract scholars' name information from websites. Finally, we apply a classification algorithm to identify Chinese scholars working overseas and verify the validity of the method in an experimental system. The accuracy of the classification algorithm is higher than 90%, and the average accuracy of the resulting information is higher than 77%.

Keywords: web crawler; web data extraction; name recognition; name classification

Web Data Extraction Based on XBRL-GL Taxonomy

2009 Asia-Pacific Conference on Information Processing

Authors: Hanyang Luo; Jinling Gao. Affiliations: College of Management, Shenzhen University, Shenzhen, P.R. China; Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, P.R. China

The web has become one of the most important connections to various information *** most interesting challenge is how to extract important data from a large number of web pages and transform them to more structural,standard and semantic information,which can be queried and analyzed by using matured techniques in database,data warehouse and other *** design a wrapper generator by combining the data extraction technique with XBRL technology based on XBRL-GL *** wrapper can transform HTML documents to XML forms according to the analysis of HTML document structure,and then use XPath to locate the *** this way,we can extract the data accurately and store them in a standard form.

Keywords: web data extraction; XBRL-GL taxonomy; XML; XPath

Creating customized data services from web pages

High Technology Letters, 2013, Vol. 19, No. 2, pp. 203-207

Authors: 季光; Wang Guiling; Han Yanbo. Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences; Research Center for Cloud Computing, North China University of Technology

To extract structured data from a web page with customized requirements, a user labels some DOM elements on the page with attribute *** common features of the labeled elements are utilized to guide the user through the labeling process to minimize user efforts, and are also utilized to retrieve attribute *** turn the attribute values into a structured result, the attribute pattern needs to be *** this purpose, a space-optimized suffix tree called attribute tree is built to transform the document object model (DOM) tree into a simpler form while preserving its useful properties such as attribute sequence *** pattern is induced bottom-up on the attribute tree, and is further used to build the structured *** are conducted and show high performance of our approach in terms of precision, recall and structural correctness.

Keywords: web data extraction; structured data; user labeling; customization; data service

Using Visual Clues Concept for Extracting Main Data from Deep Web Pages

International Conference on Electronic Systems, Signal Processing and Computing

Authors: Satish J. Pusdekar; Shaikh. Phiroj Chhaware. Affiliations: Dept. of Computer Technology, Priyadarshani College of Engineering, Hingana Road, Nagpur, India; Dept. of Computer Technology, Priyadarshani College of Engineering, Hingana, Nagpur, India

ISBN: 9781479921041 (print)

Extracting data from deep web pages is a challenging problem due to the intricate underlying structures of such pages. A large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are web-page-programming-language-dependent. The contents of web pages are always displayed regularly for users to browse. Deep web data extraction can overcome the limitations of previous work by utilizing some interesting common visual features of deep web pages. In this paper, a vision-based approach that is web-page-programming-language-independent is proposed. This approach utilizes the visual features of web pages to extract data from deep web pages, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the human effort needed to produce exact extraction of data. Our implementation on a large set of web databases shows that the proposed vision-based approach is highly effective for data extraction from deep web pages.

Keywords: web data mining; web data extraction; visual features for web pages
