Background: Given a binary tree T with n leaves, each leaf labeled by a string of length at most k, and a binary string alignment function ⊗, an implied alignment can be generated to describe the alignment of a dynamic homology for T. This is done by first decorating each node of T with an alignment context using ⊗ in a post-order traversal and then, during a subsequent pre-order traversal, inferring on which edges insertion and deletion events occurred using those internal node decorations. Results: Previous descriptions of the implied alignment algorithm suggest a technique of "back-propagation" with time complexity O(k^2 * n^2). Here we describe an implied alignment algorithm with complexity O(k * n^2). For well-behaved data, such as molecular sequences, the runtime approaches the best-case complexity of Ω(k * n). Conclusions: The reduction in the time complexity of the algorithm dramatically improves both its utility in generating multiple sequence alignments and its heuristic utility.
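The two-pass structure described in this abstract can be sketched as a traversal skeleton. This is only an illustration of the post-order and pre-order passes, not the paper's algorithm: the names `Node`, `decorate_post_order`, `visit_pre_order`, and `toy_align` are hypothetical, and `toy_align` is a placeholder that stands in for the real alignment function ⊗ (it simply keeps the longer string).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    label: Optional[str] = None       # string on a leaf; None on internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    context: Optional[str] = None     # decoration filled in by the post-order pass


def decorate_post_order(node: Node, align: Callable[[str, str], str]) -> str:
    """Pass 1: decorate children first, then combine them at the parent."""
    if node.left is None and node.right is None:
        node.context = node.label
    else:
        left_ctx = decorate_post_order(node.left, align)
        right_ctx = decorate_post_order(node.right, align)
        node.context = align(left_ctx, right_ctx)
    return node.context


def visit_pre_order(node: Node, parent: Optional[Node], on_edge) -> None:
    """Pass 2: visit each (parent, child) edge from the root downward."""
    if parent is not None:
        on_edge(parent, node)
    for child in (node.left, node.right):
        if child is not None:
            visit_pre_order(child, node, on_edge)


def toy_align(a: str, b: str) -> str:
    # Placeholder assumption, NOT a real alignment: keep the longer string.
    return a if len(a) >= len(b) else b


tree = Node(
    left=Node(left=Node(label="ACGT"), right=Node(label="ACT")),
    right=Node(label="ACGGT"),
)
root_context = decorate_post_order(tree, toy_align)

changed_edges = []


def record_length_change(parent: Node, child: Node) -> None:
    # Crude proxy for "an insertion or deletion happened on this edge".
    if len(parent.context) != len(child.context):
        changed_edges.append((parent.context, child.context))


visit_pre_order(tree, None, record_length_change)
```

A real implementation would replace `toy_align` with a pairwise alignment that records gap positions, and the edge callback would compare those recorded contexts rather than string lengths.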
Automatic data extraction from template pages is an essential task for data integration and data analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton item pages (singleton pages for short), which contain detailed information about a single item, is less addressed and more challenging because the number of data attributes to be aligned is much larger than for list pages. In this paper, we propose a novel alignment algorithm that works on leaf nodes from the DOM trees of input pages for singleton-page data extraction. The idea is to detect mandatory templates via the longest increasing sequence from the landmark equivalence class leaf nodes and recursively apply the same procedure to each segment divided by mandatory templates. With this divide-and-conquer approach, we are able to efficiently conduct local alignment for each segment while effectively handling multi-order attribute-value pairs with a two-pass procedure. The results show that the proposed approach (called Divide-and-Conquer alignment, DCA) outperforms TEX (Sleiman and Corchuelo 2013) and WEIR (Bronzi et al. VLDB 6(10):805-816 2013) by 2% and 12% on selected items of the TEX and WEIR datasets, respectively. The improvement is more obvious in terms of full schema evaluation, with an F-measure of 0.95 (DCA) versus 0.63 (TEX) on 26 websites from TEX and EXALG (Arasu and Molina 2003).
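The role of the longest increasing sequence in the abstract above can be illustrated with a minimal sketch. It is not the DCA algorithm itself. It assumes each landmark label occurs at most once per page, and the function names and the sample labels are hypothetical. Landmarks shared by two pages are mapped to their positions in the second page; the longest increasing subsequence of those positions keeps the largest set of landmarks that appear in the same order on both pages.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return indices of one longest strictly increasing subsequence of seq."""
    tails = []       # tails[k]: index of the smallest tail of a run of length k + 1
    tail_vals = []   # values at those indices, kept sorted for bisect
    prev = [-1] * len(seq)
    for i, x in enumerate(seq):
        k = bisect_left(tail_vals, x)
        if k > 0:
            prev[i] = tails[k - 1]
        if k == len(tails):
            tails.append(i)
            tail_vals.append(x)
        else:
            tails[k] = i
            tail_vals[k] = x
    # Walk the predecessor links back from the end of the longest run.
    out = []
    i = tails[-1] if tails else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def order_consistent_landmarks(page_a, page_b):
    """Landmarks present in both pages that keep the same relative order."""
    pos_in_b = {label: j for j, label in enumerate(page_b)}
    shared = [label for label in page_a if label in pos_in_b]
    keep = longest_increasing_subsequence([pos_in_b[label] for label in shared])
    return [shared[i] for i in keep]


# Hypothetical leaf-node labels from two singleton pages.
page_a = ["Title", "Price", "Author", "ISBN"]
page_b = ["Title", "Author", "Price", "ISBN"]
anchors = order_consistent_landmarks(page_a, page_b)
```

The landmarks kept this way split each page into segments, and the same procedure could then be applied to each segment in turn, which is the divide-and-conquer step the abstract describes.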
OLEPA is a semisupervised information-extraction system that produces extraction rules from semistructured Web documents without requiring detailed annotation of the training documents. It performs well for program-generated Web pages with few training pages and limited user intervention.
The World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way. This paper proposes a pattern discovery approach to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. Previous work in wrapper induction aims at learning extraction rules from user-labeled training examples, which, however, can be expensive in some practical applications. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that discovers extraction patterns from Web pages without user-labeled examples. IEPAD applies several pattern discovery techniques, including PAT-trees, multiple string alignments and pattern matching algorithms. Extractors generated by IEPAD can be generalized over unseen pages from the same Web data source. We empirically evaluate the performance of IEPAD on an information extraction task from 14 real Web data sources. Experimental results show that with the extraction rules discovered from a single page, IEPAD achieves a 96% average retrieval rate, and with fewer than five example pages, IEPAD achieves a 100% retrieval rate for 10 of the sample Web data sources. (C) 2002 Elsevier Science B.V. All rights reserved.
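The idea of discovering repeated patterns in a page without labeled examples can be sketched as follows. This is a deliberately naive stand-in: it counts every token n-gram with a `Counter` instead of building a PAT-tree as IEPAD does, so it is far less efficient and is not the paper's method. The function name `repeated_patterns`, its parameters, and the sample token sequence are hypothetical.

```python
from collections import Counter


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Count every token n-gram and keep those seen at least min_count times.

    Occurrences may overlap; a real system would filter and rank candidates.
    """
    counts = Counter()
    n = len(tokens)
    for length in range(min_len, n // min_count + 1):
        for start in range(n - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


# Hypothetical tag-level encoding of a page that lists three records.
record = ["<li>", "<b>", "TEXT", "</b>", "</li>"]
tokens = record * 3
patterns = repeated_patterns(tokens)
```

Here the full record template appears three times, so it is among the candidates a later alignment step could generalize into an extraction pattern.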