Search results - Inner Mongolia University Library

BMC BIOINFORMATICS, 2020, Vol. 21, No. 1, p. 296

Authors: Washburn, Alex J.; Wheeler, Ward C. Affiliation: Amer Museum Nat Hist, Div Invertebrate Zool, 200 Cent Pk West, New York, NY 10024, USA

Background: Given a binary tree T of n leaves, each leaf labeled by a string of length at most k, and a binary string alignment function ⊗, an implied alignment can be generated to describe the alignment of a dynamic homology for T. This is done by first decorating each node of T with an alignment context using ⊗, in a post-order traversal, then, during a subsequent pre-order traversal, inferring on which edges insertion and deletion events occurred using those internal node decorations. Results: Previous descriptions of the implied alignment algorithm suggest a technique of "back-propagation" with time complexity O(k^2 * n^2). Here we describe an implied alignment algorithm with complexity O(k * n^2). For well-behaved data, such as molecular sequences, the runtime approaches the best-case complexity of Ω(k * n). Conclusions: The reduction in the time complexity of the algorithm dramatically improves both its utility in generating multiple sequence alignments and its heuristic utility.

Keywords: Dynamic homology; Implied alignment; multiple string alignment; Phylogenetics; Sequence alignment; Tree alignment
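
The two traversals named in the abstract above can be illustrated with a minimal Python sketch. It shows only the traversal skeleton: a post-order pass that decorates every node, followed by a pre-order pass that inspects each edge. It is not the implied alignment algorithm from the paper. The `Node` class, `toy_combine` (a stand-in for the alignment function that simply keeps the longer string), and the length-difference report are all hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    label: Optional[str] = None      # leaf string; None for internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    context: str = ""                # decoration filled in by the post-order pass


def decorate(node: Node, combine: Callable[[str, str], str]) -> str:
    """Post-order pass: decorate both children first, then the node itself.

    Assumes a strictly binary tree: a node has either two children or none.
    """
    if node.left is None and node.right is None:
        node.context = node.label or ""
    else:
        node.context = combine(decorate(node.left, combine),
                               decorate(node.right, combine))
    return node.context


def edge_length_changes(node: Node) -> List[Tuple[str, str, int]]:
    """Pre-order pass: report each edge before descending into the child.

    Each entry is (parent, child, len(child context) - len(parent context)),
    a crude placeholder for real insertion/deletion inference.
    """
    out: List[Tuple[str, str, int]] = []
    for child in (node.left, node.right):
        if child is not None:
            out.append((node.name, child.name,
                        len(child.context) - len(node.context)))
            out.extend(edge_length_changes(child))
    return out


def toy_combine(a: str, b: str) -> str:
    """Hypothetical stand-in for the alignment function: keep the longer string."""
    return a if len(a) >= len(b) else b
```

With a real pairwise alignment function passed as `combine`, the decoration would carry an alignment context rather than a plain string; the traversal order would stay the same.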

A novel alignment algorithm for effective web data extraction from singleton-item pages

APPLIED INTELLIGENCE, 2018, Vol. 48, No. 11, pp. 4355-4370

Authors: Yuliana, Oviliani Yenty; Chang, Chia-Hui. Affiliation: Natl Cent Univ, CSIE, Taoyuan 32001, Taiwan

Automatic data extraction from template pages is an essential task for data integration and data analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton item pages (singleton pages for short), which contain detailed information about a single item, is less addressed and more challenging because the number of data attributes to be aligned is much larger than for list pages. In this paper, we propose a novel alignment algorithm working on leaf nodes from the DOM trees of input pages for singleton page data extraction. The idea is to detect mandatory templates via the longest increasing sequence from the landmark equivalence class leaf nodes and recursively apply the same procedure to each segment divided by mandatory templates. With this divide-and-conquer approach, we are able to efficiently conduct local alignment for each segment, while effectively handling multi-order attribute-value pairs with a two-pass procedure. The results show that the proposed approach (called Divide-and-Conquer alignment, DCA) outperforms TEX (Sleiman and Corchuelo 2013) and WEIR (Bronzi et al. VLDB 6(10):805-816 2013) by 2% and 12% on selected items of the TEX and WEIR datasets, respectively. The improvement is more obvious in terms of full schema evaluation, with 0.95 (DCA) versus 0.63 (TEX) F-measure, on 26 websites from TEX and EXALG (Arasu and Molina 2003).

Keywords: Web data extraction; Template pages; Singleton pages; Full-schema; Divide-conquer alignment; multiple string alignment
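
The abstract above mentions picking anchors through a longest increasing sequence. As a rough illustration of that general idea only, the Python sketch below pairs tokens that occur exactly once in each of two leaf-node sequences and keeps the largest subset whose order agrees in both. This is a simplified, hypothetical stand-in and not the DCA algorithm: the function names, the "unique token" landmark rule, and the sample data are assumptions made here.

```python
from typing import List, Tuple


def longest_increasing_subsequence(values: List[int]) -> List[int]:
    """Return the indices of one longest strictly increasing subsequence.

    Uses the simple O(n^2) dynamic program for readability.
    """
    n = len(values)
    if n == 0:
        return []
    best = [1] * n       # best[i]: length of the longest chain ending at i
    prev = [-1] * n      # prev[i]: previous index in that chain, or -1
    for i in range(n):
        for j in range(i):
            if values[j] < values[i] and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(i)
        i = prev[i]
    return chain[::-1]


def anchors(page_a: List[str], page_b: List[str]) -> List[Tuple[int, int]]:
    """Pair tokens unique to both pages, keeping an order-consistent subset.

    Returns (position in page_a, position in page_b) pairs.
    """
    unique = [t for t in page_a
              if page_a.count(t) == 1 and page_b.count(t) == 1]
    # Pairs come out sorted by position in page_a because we scan page_a in order.
    pairs = [(page_a.index(t), page_b.index(t)) for t in unique]
    keep = longest_increasing_subsequence([b for _, b in pairs])
    return [pairs[k] for k in keep]
```

The retained anchors split both sequences into segments, and the same procedure could then be applied to each segment in turn, which is the divide-and-conquer step the abstract describes.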

Olera: Semisupervised web-data extraction with visual support

IEEE INTELLIGENT SYSTEMS, 2004, Vol. 19, No. 6, pp. 56-64

Authors: Chang, CH; Kuo, SC. Affiliation: Natl Cent Univ, Dept Comp Sci & Informat Engn, Chungli, Taiwan

OLERA is a semisupervised information-extraction system that produces extraction rules from semistructured Web documents without requiring detailed annotation of the training documents. It performs well for program-ge...

Keywords: Semistructured Data; Web Data Extraction; multiple string alignment; Rule Generalization

Automatic information extraction from semi-structured Web pages by pattern discovery

DECISION SUPPORT SYSTEMS, 2003, Vol. 35, No. 1, pp. 129-147

Authors: Chang, CH; Hsu, CN; Lui, SC. Affiliations: Natl Cent Univ, Dept Comp Sci & Informat Engn, Chungli 320, Tauyuan, Taiwan; Acad Sinica, Inst Informat Sci, Taipei 115, Taiwan; ChungHwa Telecommun Labs, Yangmei 326, Tauyuan, Taiwan

The World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way. This paper proposes a pattern discovery approach to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. Previous work in wrapper induction aims at learning extraction rules from user-labeled training examples, which, however, can be expensive in some practical applications. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that discovers extraction patterns from Web pages without user-labeled examples. IEPAD applies several pattern discovery techniques, including PAT-trees, multiple string alignments and pattern matching algorithms. Extractors generated by IEPAD can be generalized over unseen pages from the same Web data source. We empirically evaluate the performance of IEPAD on an information extraction task from 14 real Web data sources. Experimental results show that with the extraction rules discovered from a single page, IEPAD achieves 96% average retrieval rate, and with less than five example pages, IEPAD achieves 100% retrieval rate for 10 of the sample Web data sources. (C) 2002 Elsevier Science B.V. All rights reserved.

Keywords: information extraction; semi-structured data; wrapper generation; PAT trees; multiple string alignment
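
The abstract above rests on discovering repeated patterns in a page without labeled examples. The Python sketch below illustrates that idea in the simplest possible way, by counting every contiguous token run that occurs more than once in a tag sequence. It is a naive stand-in and does not use a PAT-tree or reproduce IEPAD: `repeated_patterns`, its parameters, and the sample token stream are hypothetical, and the brute-force enumeration is suitable only for short inputs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def repeated_patterns(tokens: Sequence[str],
                      min_len: int = 2,
                      min_count: int = 2) -> Dict[Tuple[str, ...], List[int]]:
    """Map each repeated contiguous token run to its start positions.

    Brute force: enumerates every run of length >= min_len and keeps those
    seen at least min_count times (overlapping occurrences are counted).
    """
    positions: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    n = len(tokens)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            positions[tuple(tokens[start:start + length])].append(start)
    return {p: s for p, s in positions.items() if len(s) >= min_count}
```

On a token stream such as `["<b>", "TEXT", "</b>", "<br>"]` repeated twice, the longest repeated run is the four-token record itself, found at positions 0 and 4. A suffix-based index such as the PAT-tree named in the abstract would find such repeats far more efficiently.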
