Search Results - Inner Mongolia University Library

Box clustering segmentation: A new method for vision-based web page preprocessing

INFORMATION PROCESSING & MANAGEMENT, 2017, Vol. 53, No. 3, pp. 735-750

Authors: Zeleny, Jan; Burget, Radek; Zendulka, Jaroslav (Brno Univ Technol, Fac Informat Technol, Ctr Excellence IT4Innovat, Bozetechova 2, Brno 61266, Czech Republic)

This paper presents a novel approach to web page segmentation, which is one of the substantial preprocessing steps when mining data from Web documents. Most current segmentation methods are based on algorithms that work on a tree representation of web pages (a DOM tree or a hierarchical rendering model) and produce another tree structure as output. In contrast, our method uses a rendering engine to get an image of the web page, takes the smallest rendered elements of that image, performs clustering using a custom algorithm, and produces a flat set of segments of a given granularity. For the clustering metrics, we use purely visual properties: the distance between elements and their visual similarity. We experimentally evaluate the properties of our algorithm by processing 2400 web pages. On this set of web pages, we show that our algorithm is almost 90% faster than the reference algorithm. We also show that our algorithm's accuracy is between 47% and 133% of the reference algorithm's accuracy, with indirect correlation of our algorithm's accuracy to the depth of the inspected page structure. In our experiments, we also demonstrate the advantages of producing a flat segmentation structure instead of a hierarchy. (C) 2017 Elsevier Ltd. All rights reserved.
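The idea summarized in the abstract (group the smallest rendered boxes using only their distance and visual similarity, and return a flat set of segments) can be illustrated with a minimal Python sketch. This is not the authors' custom algorithm: the `Box` fields, the use of a single color as the visual feature, the `max_gap` and `min_similarity` thresholds, and the union-find merging are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass(frozen=True)
class Box:
    """A rendered element: position, size, and one RGB color (hypothetical model)."""
    x: float
    y: float
    w: float
    h: float
    color: tuple[int, int, int]


def gap(a: Box, b: Box) -> float:
    """Distance between the edges of two boxes (0 if they touch or overlap)."""
    dx = max(b.x - (a.x + a.w), a.x - (b.x + b.w), 0.0)
    dy = max(b.y - (a.y + a.h), a.y - (b.y + b.h), 0.0)
    return hypot(dx, dy)


def similarity(a: Box, b: Box) -> float:
    """Visual similarity in [0, 1], here based only on color difference."""
    diff = sum(abs(p - q) for p, q in zip(a.color, b.color))
    return 1.0 - diff / (3 * 255)


def cluster_boxes(boxes, max_gap=20.0, min_similarity=0.8):
    """Return a flat list of segments, each a sorted list of box indices.

    Two boxes are joined when they are close AND look alike; joins are
    transitive. Looser thresholds give coarser segments.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        close = gap(boxes[i], boxes[j]) <= max_gap
        alike = similarity(boxes[i], boxes[j]) >= min_similarity
        if close and alike:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster_boxes` with a larger `max_gap` or a smaller `min_similarity` merges more boxes into each segment, which loosely mirrors the "given granularity" mentioned in the abstract; the output is a flat list rather than a tree.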

Keywords: Clustering segmentation vision-based page segmentation VIPS

Schema Inference and Data Extraction from Templatized Web pages

International Conference on Pervasive Computing (ICPC)

Authors: Krishna, Shinde Santaji; Dattatraya, Joshi Shashank (Shri Jagdish Prasad Jhabarmal Tibrewala Univ, Dept Comp Engn, Jhunjhunu, Rajasthan, India; Bharati Vidyapeeth Deemed Univ, Coll Engn, Dept Comp Engn, Pune, Maharashtra, India)

ISBN: (print) 9781479962723

The World Wide Web is a vast and rapidly growing source of information. A web data extraction system is a system that extracts data from web pages automatically. Many websites consist mostly of pages that contain structured data. Thus, for Web information integration, an important step is to extract information from the Web documents of these websites. This paper presents an unsupervised approach to the page-level data extraction task. It automatically detects the schema of web pages. Web pages are compared based on visual clues to find fixed-template and variant-template pages. Data regions are then extracted from the web pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.

Keywords: Data Extraction Multiple Tree Merging Schema vision-based page segmentation Web page

A Visual based page segmentation for Deep Web Data Extraction

International Conference on Soft Computing for Problem Solving (SocProS 2011)

Author: Palekar, Vikas R. (MEIT, Prof Ram Meghe Inst Technol & Res, Badnera, Maharashtra, India)

ISBN: (print) 9788132204909

A new web content structure analysis based on visual representation is proposed in this paper. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. This paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout structure based on visual perception. Compared to other existing techniques, such as those based on the DOM tree, our approach is independent of the HTML document representation. Our method can work well even when the HTML structure is quite different from the visual layout structure.

Keywords: page segmentation vision-based page segmentation Web Mining
