Web data extraction based on structural similarity

Authors: Li, Z; Ng, WK; Sun, AX

Affiliation: Nanyang Technol Univ, Ctr Adv Informat Syst, Sch Comp Engn, Singapore, Singapore

Publication: KNOWLEDGE AND INFORMATION SYSTEMS

Year/Volume/Issue: 2005, Vol. 8, No. 4

Pages: 438-461

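Subject classification: 0711 [Science - Systems Science]; 07 [Science]; 08 [Engineering]; 070105 [Science - Operations Research and Cybernetics]; 081101 [Engineering - Control Theory and Control Engineering]; 0701 [Science - Mathematics]; 071101 [Science - Systems Theory]; 0811 [Engineering - Control Science and Engineering]; 0812 [Engineering - Computer Science and Technology (engineering or science degree may be awarded)]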

Topics: classification; clustering; framework; web data extraction

Abstract: Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirmed this. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
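
Illustrative note: the abstract does not define how document schemata are computed, so the following is only a minimal sketch of the general idea of representing a document as a vector over structural patterns. It assumes, purely for illustration, that a "schema" can be approximated by a root-to-element tag path. The names `TagPathCollector`, `build_vocabulary`, `schema_vector` and `cosine`, and the three sample pages, are hypothetical and not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt

# Elements that never have a closing tag; they are counted but not pushed.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Count root-to-element tag paths (an assumed stand-in for schemata)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open elements
        self.paths = Counter()   # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def build_vocabulary(pages):
    """Sorted list of every tag path seen in any page."""
    return sorted({path for page in pages for path in tag_paths(page)})


def schema_vector(html, vocabulary):
    """Occurrence count of each vocabulary path in one document."""
    counts = tag_paths(html)
    return [counts[path] for path in vocabulary]


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two table-based pages and one paragraph-based page (made-up samples).
page_a = "<html><body><table><tr><td>A</td><td>1</td></tr></table></body></html>"
page_b = ("<html><body><table><tr><td>B</td><td>2</td></tr>"
          "<tr><td>C</td><td>3</td></tr></table></body></html>")
page_c = "<html><body><p>text</p></body></html>"

vocabulary = build_vocabulary([page_a, page_b, page_c])
vec_a, vec_b, vec_c = (schema_vector(p, vocabulary) for p in (page_a, page_b, page_c))

print(cosine(vec_a, vec_b))  # structurally similar pages
print(cosine(vec_a, vec_c))  # structurally different pages
```

Under these assumptions, pages built from the same template yield nearby vectors, which is the kind of shared representation that clustering, classification and wrapper induction could all operate on.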
