Web data extraction techniques often focus on accurate and efficient information acquisition from webpages. However, webpage variants cause frequent extraction failures and result in high maintenance costs. Significant effort has been devoted to robust extraction, but most existing approaches require either complex pre-processing or supplementary files. In this paper, a novel method is proposed to enhance extraction robustness by using the datatype and weight information of path layers. The similarities between the paths of the target node in the original webpage and candidate nodes in page variants are calculated to determine the most likely node. Experiments on a large set of real data show that this method yields better robustness than existing approaches.
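The abstract does not give the similarity formula, so the following is only an illustrative sketch of layer-weighted path comparison; the (tag, datatype) encoding and the depth-based weighting scheme are assumptions, not the paper's definitions.

```python
# Illustrative sketch: a DOM path is a list of (tag, datatype) layers, and
# deeper layers receive higher weights (hypothetical weighting scheme).

def path_similarity(path_a, path_b):
    """Weighted overlap of two DOM paths, compared layer by layer."""
    n = max(len(path_a), len(path_b))
    score, total = 0.0, 0.0
    for i in range(n):
        weight = i + 1  # deeper layers weigh more (assumption)
        total += weight
        if i < len(path_a) and i < len(path_b) and path_a[i] == path_b[i]:
            score += weight
    return score / total if total else 0.0

# The target node's path in the original page vs. candidates in a variant:
original = [("html", None), ("body", None), ("div", "text"), ("span", "price")]
candidates = {
    "a": [("html", None), ("body", None), ("div", "text"), ("span", "price")],
    "b": [("html", None), ("body", None), ("table", "text")],
}
best = max(candidates, key=lambda k: path_similarity(original, candidates[k]))
```

Candidate "a" matches every layer and is selected; candidate "b" diverges at the third layer, so its deeper (heavier) layers contribute nothing.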
Data are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences, and they are collected using advanced web scraping techniques. However, core data extraction engines fail because they cannot adapt to dynamic changes in website layouts. This study investigates an intelligent and adaptive web data extraction system with convolutional and Long Short-Term Memory (LSTM) networks, which enables automated web page detection using the You Only Look Once (Yolo) algorithm and uses Tesseract LSTM to extract product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine and can thus adapt to dynamic changes in website layouts. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.
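As background for the reported figures (not code from the paper), detection precision is conventionally computed by matching predicted boxes to ground-truth boxes via Intersection over Union (IoU); a minimal generic sketch:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def detection_precision(preds, gts, thresh=0.5):
    """Fraction of predictions that match a previously unclaimed ground truth."""
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= thresh:
                matched.add(i)
                tp += 1
                break
    return tp / len(preds) if preds else 0.0

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]    # ground-truth product boxes
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]  # one hit, one false positive
```

Mean average precision extends this by averaging precision over recall levels and object classes; the matching step above is the shared core.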
Web data extraction has seen significant development since its inception in the early nineties. It has evolved from simple manual extraction of data from web pages and documents, to automated extraction, to intelligent extraction using machine learning algorithms, tools, and techniques. Data extraction is one of the key components of the end-to-end life cycle of the web data extraction process, which includes navigation, extraction, data enrichment, and visualization. This paper presents the journey of web data extraction over the years, highlighting the evolution of tools, techniques, frameworks, and algorithms for building intelligent web data extraction systems. The paper also sheds light on challenges, opportunities for future research, and emerging trends in web data extraction, with a specific focus on machine learning techniques. Both traditional and machine learning approaches to manual and automated web data extraction are evaluated, and results are published with a few use cases demonstrating the challenges posed by changes in website layout. This paper introduces novel ideas such as self-healing capability in web data extraction and proactive error detection in the event of changes in website layout as areas of future research. This unique perspective will help readers gain deeper insights into the present and future of web data extraction.
ISBN:
(Print) 9781450367356
Automatic synthesis of web data extraction programs has been explored in a variety of settings, but in practice various robustness and usability challenges remain. In this work we present a novel program synthesis approach which combines the benefits of deductive and enumerative synthesis strategies, yielding a semi-supervised technique with which concise programs expressible in standard languages can be synthesized from very few examples. We demonstrate improvement over existing techniques in terms of overall accuracy, number of examples required, and program complexity. Our method has been deployed as a web extraction feature in the mass-market Microsoft Power BI product.
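The paper's synthesizer is not specified in the abstract; as a toy illustration of the enumerative half of the idea, one can search a tiny space of "text between prefix and suffix" extraction programs and keep the first one consistent with every example. The program space and examples below are illustrative inventions, not the paper's language.

```python
# Toy illustration only -- not the deployed algorithm.

def run(prog, s):
    """Apply a (prefix, suffix) program: take the text between the two."""
    prefix, suffix = prog
    try:
        return s.split(prefix, 1)[1].split(suffix, 1)[0]
    except IndexError:
        return None

def synthesize(examples):
    """Enumerate (prefix, suffix) candidates fitting all (input, output) pairs."""
    src, out = examples[0]
    start = src.find(out)
    end = start + len(out)
    for plen in range(1, start + 1):               # grow the prefix context
        for slen in range(1, len(src) - end + 1):  # grow the suffix context
            cand = (src[start - plen:start], src[end:end + slen])
            if all(run(cand, i) == o for i, o in examples):
                return cand
    return None

examples = [
    ("<b>Price:</b> $9.99<br>", "$9.99"),
    ("<b>Price:</b> $12.50<br>", "$12.50"),
]
prog = synthesize(examples)  # shortest delimiters consistent with both examples
```

Two examples suffice here because the second rules out delimiters that overfit the first; fewer examples per task is exactly the usability axis the abstract highlights.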
Data extraction is one of the most prominent areas in data mining analysis and has been extensively studied, especially in the field of data requirements and reservoirs. The main aim of data extraction with regard to semi-structured data is to retrieve beneficial information from the World Wide Web. Data from the deep web are retrievable, but retrieval requires requests through form submission, because it cannot be performed by search engines. Data mining applications and automatic data extraction are very cumbersome due to the diverse structure of web pages. Most previous data extraction techniques deal with various data types such as text, audio, and video, but research focusing on images as data is still lacking. The Document Object Model (DOM) is an example of a state-of-the-art data extraction technique related to mining image data, and DOM-based methods have been used to solve semi-structured data extraction from the web. However, as HTML documents grow larger, the process of data extraction has been plagued by lengthy processing times and noisy information. In this research work, we propose an improved model, namely Wrapper Extraction of Images using DOM and JSON (WEIDJ), in response to the promising results of mining a higher volume of web data across various image formats, taking into consideration web data extraction from the deep web. To observe the efficiency of the proposed model, we compare its data extraction performance at different levels of page extraction with existing methods such as VIBS, MDR, DEPTA and ViDE. It yields the best results, with a precision of 100, a recall of 97.93103, and an F-measure of 98.9547.
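WEIDJ's internals are not given in the abstract, but its DOM-plus-JSON premise can be sketched with the standard library alone: walk the DOM for `<img>` nodes and serialize their attributes as JSON. The sample document is invented for illustration.

```python
# Minimal sketch of the DOM + JSON idea: collect image nodes, emit JSON.
import json
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Gather the attribute dict of every <img> element in document order."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))

html_doc = """
<html><body>
  <div class="gallery">
    <img src="a.jpg" alt="first">
    <img src="b.png" alt="second">
  </div>
</body></html>
"""

collector = ImageCollector()
collector.feed(html_doc)
payload = json.dumps(collector.images)  # wrapper output as JSON
```

The streaming parser never materializes the full tree, which is one plausible way to sidestep the lengthy processing times the abstract attributes to large HTML documents.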
Automatic data extraction from template pages is an essential task for data integration and data analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton item pages (singleton pages for short), which contain detailed information on a single item, is less addressed and more challenging, because the number of data attributes to be aligned is much larger than for list pages. In this paper, we propose a novel alignment algorithm working on leaf nodes from the DOM trees of input pages for singleton-page data extraction. The idea is to detect mandatory templates via the longest increasing sequence from the landmark equivalence-class leaf nodes and to recursively apply the same procedure to each segment divided by mandatory templates. By this divide-and-conquer approach, we are able to efficiently conduct local alignment for each segment, while effectively handling multi-order attribute-value pairs with a two-pass procedure. The results show that the proposed approach (called Divide-and-Conquer Alignment, DCA) outperforms TEX (Sleiman and Corchuelo, 2013) and WEIR (Bronzi et al., VLDB 6(10):805-816, 2013) by 2% and 12% on selected items of the TEX and WEIR datasets, respectively. The improvement is more obvious in terms of full-schema evaluation, with an F-measure of 0.95 (DCA) versus 0.63 (TEX) on 26 websites from TEX and EXALG (Arasu and Garcia-Molina, 2003).
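The central primitive named above, the longest increasing sequence over landmark leaf nodes, can be computed with patience sorting; the plain integer sequence below stands in for the paper's equivalence-class encoding, which the abstract does not specify.

```python
import bisect

def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence (patience sorting)."""
    tails_val, tails_idx = [], []        # smallest tail value/index per length
    prev = [-1] * len(seq)               # back-pointers for reconstruction
    for i, x in enumerate(seq):
        k = bisect.bisect_left(tails_val, x)
        if k == len(tails_val):
            tails_val.append(x)
            tails_idx.append(i)
        else:
            tails_val[k] = x
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out, i = [], tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(seq[i])
        i = prev[i]
    return out[::-1]

# e.g. positions of shared leaf-node classes across two pages; the LIS keeps
# only the order-consistent ones, a proxy for "mandatory" template landmarks
lis = longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6])
```

Nodes outside the LIS are treated as optional or reordered content, and each gap between consecutive LIS landmarks becomes a segment for the recursive local alignment the abstract describes.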
ISBN:
(Print) 9781538612545
The design and implementation of a support system for Knowledge Discovery is a challenge taken up by many researchers. As data mining is the key step in the Knowledge Discovery in Databases (KDD) process, it is necessary to find a new methodology that combines web data extraction, playing the role of data collection from the web, with data mining techniques on the extracted categorical data in order to discover knowledge. The main contribution of this research is a methodology that applies clustering to categorical web data and uses the clustering results as part of the input for classification conducted on another set of data. Data mining and the related data processing are carried out by developing intelligent tools. The performance of the algorithms used in our methodology is demonstrated on a clustered job-postings dataset and a classified job-seekers dataset, using the three measures accuracy, recall, and precision for the clustering algorithm and the classification error for the classification technique. The results show that our proposed combined approach yields good results in Knowledge Discovery from the web.
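The abstract does not name the clustering algorithm, so the following is a hypothetical sketch of the first stage only: clustering categorical records, k-modes-style, by Hamming distance to cluster "modes". The job-posting tuples and modes are invented for illustration.

```python
# Hypothetical sketch -- the paper's concrete algorithm is not specified.

def hamming(a, b):
    """Number of positions at which two categorical tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(records, modes):
    """Assign each categorical record to the index of its nearest mode."""
    return [min(range(len(modes)), key=lambda k: hamming(r, modes[k]))
            for r in records]

postings = [
    ("python", "remote", "senior"),
    ("python", "remote", "junior"),
    ("java", "onsite", "senior"),
]
modes = [("python", "remote", "senior"), ("java", "onsite", "junior")]
labels = assign(postings, modes)  # cluster label per posting
```

A full pipeline would iterate mode updates to convergence and then feed the resulting cluster labels, as the abstract describes, into the classifier trained on the job-seeker data.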
ISBN:
(Print) 9781450356398
Most modern web scrapers use an embedded browser to render web pages and to simulate user actions. Such scrapers (or wrappers) are therefore expensive to execute in terms of time and network traffic. In contrast, it is orders of magnitude more resource-efficient to use a "browserless" wrapper which directly accesses a web server through HTTP requests and takes the desired data directly from the raw replies. However, creating and maintaining browserless wrappers of high precision requires specialists and is prohibitively labor-intensive at scale. In this paper, we demonstrate the principal feasibility of automatically translating browser-based wrappers into "browserless" wrappers. We present the first algorithm and system performing such an automated translation on suitably restricted types of websites. The system works in the vast majority of test cases and produces very fast and extremely resource-efficient wrappers. We discuss research challenges in extending our approach to a general method applicable to a yet larger number of cases.
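The translation algorithm itself cannot be reproduced from the abstract, but the translation target, a browserless wrapper, can be sketched: fetch the raw HTML over plain HTTP and take the datum straight from the reply, with no rendering or user simulation. The URL in the comment is a placeholder, and the sample reply is invented.

```python
import re
from urllib.request import urlopen  # stdlib; needed only for a live fetch

def extract_title(raw_html):
    """Take the desired datum straight from the raw reply -- no rendering."""
    m = re.search(r"<title>(.*?)</title>", raw_html, re.S)
    return m.group(1).strip() if m else None

# A live run would be: raw = urlopen("https://example.com").read().decode()
raw = "<html><head><title>Product 42</title></head><body>...</body></html>"
title = extract_title(raw)
```

The efficiency gap the abstract quantifies comes from exactly this difference: one HTTP round-trip and a string scan, versus a browser fetching and executing every subresource of the page.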
To automatically extract data records from web pages, a data record extraction algorithm is required to be robust and efficient. However, most existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of web pages. Then, a suffix tree is built on top of this sequence, and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The promising experimental results demonstrate that the proposed STEM is superior to state-of-the-art algorithms such as MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear in the total number of HTML tags contained in the web pages, which indicates the potential applicability of STEM in a wide range of web-scale data record extraction applications.
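The first STEM step named above, mapping tag paths to a sequence of integer identifiers, can be sketched with the stdlib parser; the suffix-tree stage and refining filters are omitted, and the sample page is invented.

```python
# Each distinct root-to-node tag path gets a fresh integer identifier;
# the page then becomes a sequence of these identifiers in document order.
from html.parser import HTMLParser

class TagPathSequencer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # current open-tag path
        self.ids = {}        # tag path -> integer identifier
        self.sequence = []   # identifier sequence for the page

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        path = "/".join(self.stack)
        self.sequence.append(self.ids.setdefault(path, len(self.ids)))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

page = "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>"
s = TagPathSequencer()
s.feed(page)
# repeated data records surface as repeats in s.sequence
```

The three `<li>` records share one tag path and hence one identifier, which is what makes repeated substrings of the sequence, and thus candidate data regions, detectable by a suffix tree in the subsequent stage.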
ISBN:
(Print) 9781467392143
Web data extraction is the process of extracting user-required information from websites. Web documents contain data that is not in a structured format. By web data extraction, we mean the extraction of data present in web documents in HTML format, followed by the removal of unwanted content such as tags, advertisements, and videos, and then learning the information, patterns, or features present in that data. Today, most researchers use web data extractors because the internet contains huge amounts of data, which makes manual information extraction from web documents complicated. In this paper, we study different data extraction techniques, used by different authors, that take the user-required data from a set of web pages. A comparative analysis of web data extraction techniques is given.