This paper presents a novel approach to web page segmentation, which is one of the substantial preprocessing steps when mining data from Web documents. Most current segmentation methods are based on algorithms that work on a tree representation of web pages (a DOM tree or a hierarchical rendering model) and produce another tree structure as output. In contrast, our method uses a rendering engine to obtain an image of the web page, takes the smallest rendered elements of that image, clusters them using a custom algorithm, and produces a flat set of segments of a given granularity. For the clustering metrics, we use purely visual properties: the distance between elements and their visual similarity. We experimentally evaluate the properties of our algorithm by processing 2400 web pages. On this set of web pages, we show that our algorithm is almost 90% faster than the reference algorithm. We also show that our algorithm's accuracy is between 47% and 133% of the reference algorithm's accuracy, with an indirect correlation between our algorithm's accuracy and the depth of the inspected page structure. In our experiments, we also demonstrate the advantages of producing a flat segmentation structure instead of a hierarchy. (C) 2017 Elsevier Ltd. All rights reserved.
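The abstract does not specify the custom clustering algorithm, so the following is only a minimal sketch of the general idea: grouping rendered boxes into a flat set of segments using edge distance and visual similarity. Single-linkage grouping via union-find is a swapped-in technique, and the `Box` type, the color-based similarity measure, and the `max_gap` and `max_color_diff` thresholds are all illustrative assumptions rather than the paper's definitions.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered leaf element: position, size, average RGB color."""
    x: float
    y: float
    w: float
    h: float
    color: tuple  # (r, g, b), each component in 0..255


def gap(a: Box, b: Box) -> float:
    """Euclidean gap between rectangle edges (0 if they touch or overlap)."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0.0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0.0)
    return (dx * dx + dy * dy) ** 0.5


def color_diff(a: Box, b: Box) -> float:
    """Assumed similarity measure: mean absolute RGB difference in 0..1."""
    return sum(abs(p - q) for p, q in zip(a.color, b.color)) / (3 * 255)


def segment(boxes, max_gap=20.0, max_color_diff=0.2):
    """Group boxes into a flat list of segments (single linkage, union-find)."""
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        close = gap(boxes[i], boxes[j]) <= max_gap
        similar = color_diff(boxes[i], boxes[j]) <= max_color_diff
        if close and similar:
            parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return list(groups.values())
```

In this sketch, raising `max_gap` or `max_color_diff` yields fewer, coarser segments, which loosely mirrors the "given granularity" mentioned in the abstract. The output is a plain list of groups with no nesting, in line with the flat segmentation the paper argues for.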
ISBN: 9788132204909 (print)
A new web content structure analysis based on visual representation is proposed in this paper. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. This paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout structure based on visual perception. Compared with existing techniques that rely on the DOM tree, our approach is independent of the underlying HTML document representation. Our method works well even when the HTML structure differs considerably from the visual layout structure.
ISBN: 9781479962723 (print)
The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many web sites consist mostly of pages that contain structured data. Thus, for Web information integration, an important step is to extract information from these sites' Web documents. This paper presents an unsupervised approach to the page-level data extraction task. It automatically detects the schema of web pages. Web pages are compared based on visual clues to identify fixed-template and variant-template pages. Data regions are then extracted from the web pages; if they belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
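The abstract does not describe its tree matching variant, so the sketch below uses the well-known Simple Tree Matching algorithm as a stand-in to illustrate how two page trees might be compared. The `Node` type and the normalized `similarity` score are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ordered, labeled tree node (for example, a DOM element)."""
    tag: str
    children: list = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum order-preserving top-down matching."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matching between the first i children of a
    # and the first j children of b (dynamic programming, as in LCS).
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return table[m][n] + 1  # +1 for the matched roots


def size(node: Node) -> int:
    """Total number of nodes in the tree."""
    return 1 + sum(size(child) for child in node.children)


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs relative to the average tree size."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))
```

Under this normalization, identical trees score 1.0 and trees with different root tags score 0.0. A threshold on the score could be one way to separate fixed-template pages from variant ones, though the paper itself compares pages using visual clues.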
Web page segmentation, that is, dividing a page into distinct semantic blocks, is one of the disciplines of information extraction. This master's thesis deals with the vision-based page segmentation method (VIPS), which divides a page based on the visual properties of its elements. The method is presented in the context of other notable segmentation approaches. The thesis describes the most important steps of the method and illustrates them with examples. VIPS requires cooperation with a web page rendering engine in order to obtain the page's DOM tree. The thesis introduces and describes the four most prominent engines for the Java programming language. The output of this work is an implementation of the VIPS algorithm in Java using the CSSBox engine. The original implementation of the algorithm from Microsoft's laboratories is also presented. The individual stages of developing the library that implements VIPS, and the author's own approach to the solution, are described. Finally, the result is demonstrated on the segmentation of several web pages.