Search Results - Inner Mongolia University Library

Unlocking Social Media and User Generated Content as a Data Source for Knowledge Management

INTERNATIONAL JOURNAL OF KNOWLEDGE MANAGEMENT, 2020, Vol. 16, No. 1, pp. 101-122

Authors: Meneghello, James; Thompson, Nik; Lee, Kevin; Wong, Kok Wai; Abu-Salih, Bilal. Affiliations: Optika Solut Perth Australia; Curtin Univ Perth WA Australia; Deakin Univ Software Engn & Internet Things IoT Sch Informat Technol Geelong Vic Australia; Murdoch Univ Sch Engn & Informat Technol Murdoch WA Australia; Univ Jordan Amman Jordan

The pervasiveness of social media and user-generated content has triggered an exponential increase in global data. However, due to collection and extraction challenges, data in embedded comments, reviews and testimonials are largely inaccessible to a knowledge management system. This article describes a KM framework for the end-to-end knowledge management and value extraction from such content. This framework embodies solutions to unlock the potential of UGC as a rich, real-time data source. Three contributions are described in this article. First, a method for automatically navigating webpages to expose UGC for collection is presented. This is evaluated using browser emulation integrated with automated collection. Second, a method for collecting data without any a priori knowledge of the sites is introduced. Finally, a new testbed is developed to reflect the current state of internet sites and shared publicly to encourage future research. The discussion benchmarks the new algorithm alongside existing techniques, providing evidence of the increased amount of UGC data extracted.

Keywords: Content Discovery; Data Acquisition; Data Manipulation; Knowledge Management; Social Mining; User-Generated Content; Web Data Extraction

Smart algorithmic based web crawling and scraping with template autoupdate capabilities

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, Vol. 33, No. 22, pp. e6042-e6042

Authors: Khan, Fazal Qudus; Tsaramirsis, Georgios; Ullah, Naimat; Nazmudeen, Mohamed; Jan, Sadeeq; Ahmad, Awais. Affiliations: King Abdulaziz Univ Fac Comp & IT Dept IT Jeddah Saudi Arabia; Univ Buner Khyber Pakhtunkhwa Pakistan; Univ Technol Brunei Bandar Seri Begawan Brunei; Univ Engn & Technol Peshawar Pakistan; Air Univ Dept Comp Sci Islamabad 44000 Pakistan

Web scraping is the process of extracting data from web pages, and it is an essential part of dataset generation. The field is currently dominated by capable commercial applications; however, there is always a need for web crawling and web scraping applications for custom projects. Developing fit-for-purpose tools for retrieving and structuring data from web services, cloud systems, and big data is a challenging task. Based on empirical studies, the challenges include structural issues, formatting/presentation, availability, denial of service, size, and information-fetching problems with browsers. Additionally, the data become inaccessible after the structure/template of the website changes, for example after a website update, so the dataset cannot be updated in the future without manually modifying the parameters of the web scraper. In this paper we propose an algorithm capable of autocorrecting the template (web scraping parameters) used for locating the target data and of dealing with some common empirical problems. This is very useful when the dataset needs to be updated later, as websites tend to change their pages. Moreover, we introduce an implementation of the algorithm via a tool developed for extracting data from the Unity Asset Store. The tool can capture and store data in XML format. The tool extracted a total of 46 785 (40 611 3D and 6174 2D) items, with 35 successful first retries, 11 second retries and 5 fails.

Keywords: Unity Asset Store dataset; web crawling; web data extraction; web harvesting; web scraping
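
The abstract above does not spell out how the template autocorrection works, so the following is only a minimal sketch of the general idea, assuming a template is a single regular expression and that a list of alternative patterns is available. The function name `extract_with_autoupdate` and both patterns are hypothetical and are not taken from the paper or its tool.

```python
import re


def extract_with_autoupdate(page, template, candidates):
    """Try the current template first, then each fallback candidate.

    Returns (value, template_in_use). When a fallback matches, it is
    returned as the new template so later runs start from it.
    Hypothetical illustration, not the algorithm from the paper.
    """
    patterns = [template] + [c for c in candidates if c != template]
    for pattern in patterns:
        match = re.search(pattern, page)
        if match:
            return match.group(1), pattern
    # Nothing matched: report failure and keep the old template.
    return None, template
```

A caller would store the returned template and reuse it on the next run, which is what lets the scraper survive a page redesign without manual edits.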

Development of Browser Extension for HTML Web Page Content Extraction

2nd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA)

Authors: Karabulut, Murat; Mayda, Islam. Affiliations: Istanbul Esenyurt Univ Bilgisayar Muhendisligi Bolumu Istanbul Turkey; Yildiz Tekn Univ Bilgisayar Muhendisligi Bolumu Istanbul Turkey

ISBN: (digital) 9781728193526

ISBN: (print) 9781728193526

As the amount of content on websites increases, automatic content extraction from web pages becomes more important. Although many studies on this subject exist in the literature, no method that fully solves the problem has emerged, owing to the flexible structure of HTML. The performance of methods that succeed at certain rates also decreases over time as web structure changes and develops. In this study, a browser extension was developed to automatically download text content on web pages. The extension provides an output with a 100% recall rate by cleaning the text content on the web page of all tags and code with a parser that utilizes the Document Object Model (DOM) structure. This browser extension, which operates independently of language, has been tested on different types of popular websites in Turkey and has been shown to work successfully.

Keywords: web content extraction; web data extraction; web scraping
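
As a rough illustration of stripping tags and code from a page to leave only its text, here is a minimal sketch using Python's standard `html.parser`. This is an assumption-laden stand-in: the paper describes a browser extension working on the DOM, whereas this sketch uses an event-based parser outside the browser, and the names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page text, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.chunks)
```

Keeping every non-empty text node favours recall over precision, which matches the recall-oriented goal stated in the abstract; navigation and advertisement text is not filtered out.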

Articulating the Construction of a Web Scraper for Massive Data Extraction

2nd IEEE International Conference on Electrical, Computer and Communication Technologies (IEEE ICECCT)

Authors: Upadhyay, Shreya; Pant, Vishal; Bhasin, Shivansh; Pattanshetti, Mahantesh K. Affiliations: Graph Era Deemed Univ Dept Comp Sci & Engn Dehra Dun India; Graph Era Hill Univ Dept Comp Sci & Engn Dehra Dun India

ISBN: (print) 9781509032396

Massive volumes of data are generated by various users, entities and applications and disseminated online. This copious volume of big data is distributed across millions of websites and is available for various applications. Search engines provide a simple mechanism to access this data, but using them requires a user to spend time and resources to manually click and download. Clearly, such a manual approach is not scalable for the vast majority of real-life applications at the enterprise and organization level. There exist a number of automated approaches to data extraction from the web, but most of them are ad hoc and domain specific. There is therefore a need for a robust, automated, easy-to-use framework for extracting content from the web across domains with minimal human effort. The architecture proposed by the authors for a web scraper addresses this gap to harvest data from the web. The proposed web scraping framework offers an easy and feasible approach for parsing and extracting data on a large scale from multiple websites with minimal human intervention. This paper provides an insight into issues relevant to constructing a web scraper and concludes by describing the implementation of a web scraper for harvesting learning objects for an eLearning application.

Keywords: web scraping; web data extraction; web information extraction; knowledge-based systems; web data analysis

Web Data Extraction for Developing a Mashup

International MultiConference of Engineers and Computer Scientists (IMECS 2012)

Authors: Chaudhari, Poonam. A.; Paikrao, Rahul. L. Affiliations: Gokhale Educ Soc COE Nasik India; Amrutvahini COE Sangamner India

ISBN: (print) 9789881925114

The Web is a huge reservoir of information. The data available is extremely diversified and abundant. Various types of data can be easily extracted from the Internet, although not all of the data is relevant to the users. Most web pages are in unstructured HTML format, making the web data extraction process very time consuming and costly. There is a need to convert unstructured HTML into a structured format such as XML or XHTML. We propose an approach for implementing web data extraction and developing a Mashup from HTML web pages. It also helps to coordinate and integrate the various stages of building a Mashup, i.e., Data Retrieval, Data Source Modeling, Data Cleaning/Filtering, Data Integration and Data Visualization. The data modeling stage renders a Document Object Model (DOM) tree with the help of an HTML parser. Algorithms and rules are then used to analyze the HTML tags and extract the data into the new format. The core algorithm can extract web data tables using a recursive technique while rendering the DOM tree model automatically. Furthermore, our application enables users to perform their tasks without writing a script or program, and without any knowledge of computer programming. The Mashup created will help in the decision-making process, which is the prima facie requirement for success in the corporate world.

Keywords: Web data extraction; Making Mashup; Mashup Stages; HTML; XML; DOM tree
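
The recursive extraction of data tables from a DOM tree into XML, as described in the abstract above, might look roughly like the sketch below. It assumes the input is already well-formed XHTML so that Python's standard `xml.etree.ElementTree` can build the tree; the output element names (`tables`, `table`, `row`, `cell`) and the function names are hypothetical, and the paper's own rules are not reproduced here.

```python
import xml.etree.ElementTree as ET


def cell_text(cell):
    """Join all text inside a cell and collapse runs of whitespace."""
    return " ".join("".join(cell.itertext()).split())


def collect_tables(node, out):
    """Depth-first walk that appends one <table> element per table found."""
    if node.tag == "table":
        table = ET.SubElement(out, "table")
        # iter() also reaches rows of nested tables; in this sketch they
        # are simply folded into the outer table.
        for tr in node.iter("tr"):
            row = ET.SubElement(table, "row")
            for cell in tr:
                if cell.tag in ("td", "th"):
                    ET.SubElement(row, "cell").text = cell_text(cell)
        return
    for child in node:
        collect_tables(child, out)


def tables_to_xml(xhtml: str) -> str:
    """Parse well-formed XHTML and return its tables as an XML string."""
    root = ET.fromstring(xhtml)
    out = ET.Element("tables")
    collect_tables(root, out)
    return ET.tostring(out, encoding="unicode")
```

Real-world HTML is often not well formed, so a tolerant HTML parser would be needed in front of this step; that part is left out of the sketch.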

VB-PTC: Visual Block Multi-Record Text Extraction Based on Sensor Network Page Type Conversion

IEEE ACCESS, 2020, Vol. 8, pp. 167900-167913

Authors: Gong, Jibing; Zhang, Hekai; Du, Weixia; Li, Huanhuan; Wen, Hongnian. Affiliations: Yanshan Univ Sch Informat Sci & Engn Qinhuangdao 066004 Hebei Peoples R China; Yanshan Univ Key Lab Comp Virtual Technol & Syst Integrat Hebe Qinhuangdao 066004 Hebei Peoples R China; Yanshan Univ Key Lab Software Engn Hebei Prov Qinhuangdao 066004 Hebei Peoples R China; Shijiazhuang Inst Railway Technol Sch Informat Sci & Engn Shijiazhuang 050041 Peoples R China

Usually, in addition to the main content, web pages contain additional information in the form of noise, such as navigation elements, sidebars and advertisements. This kind of noise has nothing to do with the main content, and it affects data mining and information retrieval tasks, so that the sensor will be damaged by the wrong data and interference noise. Because of the diversity of web page structure, it is a challenge to distinguish relevant information from noise in order to improve the true reliability of sensor networks. In this paper, we propose a visual block construction method based on page type conversion (VB-PTC). This method combines site-level noise reduction based on a hashtree with page-level noise reduction based on linked clusters to eliminate noise in web articles, and it successfully converts multi-record complex pages to multi-record simple pages, effectively simplifying the rules of visual block construction. For multi-record content extraction, we use different extraction methods according to the characteristics of different fields, combining regular expressions, natural language processing and symbol density detection, which greatly improves the accuracy of multi-record content extraction. VB-PTC can be effectively used for information retrieval, content extraction and page rendering tasks.

Keywords: Web pages; Feature extraction; Visualization; Data mining; Noise reduction; Navigation; Data collection; DOM trees; HTML documents; noise elimination; web data extraction; web mining

Web Scraping: State-of-the-Art and Areas of Application

IEEE International Conference on Big Data (Big Data)

Authors: Diouf, Rabiyatou; Sarr, Edouard Ngor; Sall, Ousmane; Birregah, Babiga; Bousso, Mamadou; Mbaye, Seny Ndiaye. Affiliations: Univ Thies Thies Senegal; UCAO St Michel Dakar Senegal; Univ Technol Troyes Troyes France

ISBN: (print) 9781728108582

The main objective of web scraping is to extract information from one or many websites and process it into simple structures such as spreadsheets, databases or CSV files. However, in addition to being a very complicated task, web scraping is resource and time consuming, mainly when it is carried out manually. Previous studies have developed several automated solutions. The purpose of this article is to revisit the different existing web scraping approaches, categories, and tools, as well as its areas of application.

Keywords: Web Scraping; Data Collection; Web Data Extraction

RED: Redundancy-Driven Data Extraction from Result Pages

World Wide Web Conference (WWW)

Authors: Guo, Jinsong; Crescenzi, Valter; Furche, Tim; Grasso, Giovanni; Gottlob, George. Affiliations: Univ Oxford Oxford England; Univ Roma Tre Rome Italy; Meltwater San Francisco CA USA; Univ Calabria Calabria Italy; TU Wien Vienna Austria

ISBN: (print) 9781450366748

Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not yet been fully exploited for unsupervised data extraction: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of one single object, and links to a page dedicated to the details of that object. We present RED, an automatic approach and a prototype system for extracting data records from sites following this publishing pattern. RED leverages the inherent redundancy between result records and corresponding detail pages to design an effective, yet fully unsupervised and domain-independent method. It is able to extract from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages. With respect to previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology) and can achieve a significantly higher accuracy while automatically selecting only object attributes, a task which is out of the scope of traditional fully unsupervised approaches. With respect to previous supervised or semi-supervised methods, RED can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.

Keywords: web data extraction; Automatic Wrapper Generation; XPath
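
The redundancy idea in the abstract above, keeping only those values of a result record that also appear on the linked detail page, can be illustrated with a deliberately simplified sketch. This is not RED's actual algorithm: it assumes the record has already been split into text fragments and the detail page reduced to plain text, and the function names are hypothetical.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so comparisons are lenient."""
    return " ".join(text.lower().split())


def redundant_values(record_fragments, detail_page_text):
    """Keep the record fragments that also occur in the detail page.

    Fragments that appear only in the result list (buttons, link labels)
    are dropped, on the assumption that they are not object attributes.
    """
    detail = normalize(detail_page_text)
    kept = []
    for fragment in record_fragments:
        needle = normalize(fragment)
        if needle and needle in detail:
            kept.append(fragment)
    return kept
```

Plain substring matching is the simplest possible redundancy test; it would wrongly keep short fragments that occur by chance in the detail page, so it should be read only as an illustration of the principle.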

Performance Analysis for Mining Images of Deep Web

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, Vol. 11, No. 10, pp. 1-7

Authors: Sabri, Ily Amalina Ahmad; Man, Mustafa. Affiliation: Univ Malaysia Terengganu Fac Ocean Engn Technol & Informat Terengganu Malaysia

In this paper, advancing web-scale knowledge extraction and alignment by integrating a few sources is considered by exploring different methods of aggregation and attention in order to focus on image information. An improved model, namely Wrapper Extraction of Image using DOM and JSON (WEIDJ), is proposed to extract images and the related information in the fastest way. Several models, such as Document Object Model (DOM), Wrapper using Hybrid DOM and JSON (WHDJ), WEIDJ and WEIDJ (no-rules), are discussed. The experimental results on real-world websites demonstrate that our models outperform others, such as Document Object Model (DOM) and Wrapper using Hybrid DOM and JSON (WHDJ), in terms of mining a higher volume of web data from various types of image formats and taking into consideration web data extraction from the deep web.

Keywords: Data extraction; Document Object Model; Web data extraction; Wrapper using Hybrid DOM and JSON; Wrapper Extraction of Image using DOM and JSON

Web data extraction research based on wrapper and XPath technology

International Conference on Advanced Materials and Information Technology Processing (AMITP 2011)

Authors: Liu, Hong; Ma, YinXiao. Affiliation: Zhejiang Gongshang Univ Coll Comp & Informat Engn Hangzhou Zhejiang Peoples R China

ISBN: (print) 9783037851579

To satisfy people's various needs, some websites consist of pages that are dynamically generated using a common template populated with data from the web, such as product description pages on e-commerce sites. This paper merges wrapper technology with XPath to form a dependable, robust process for web data extraction. Validating the method in several experiments, we find that it is highly efficient at extracting list pages.

Keywords: Web data extraction; wrapper; XPath; XML; thread
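
A wrapper expressed as XPath rules over a template-generated list page can be sketched as follows. This is an illustrative assumption, not the method from the paper: the `WRAPPER` rules, the `products` class name and the field names are hypothetical, the input is assumed to be well-formed XHTML, and only the XPath subset supported by Python's standard `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: one XPath for the repeated record node and one
# relative XPath per field inside each record.
WRAPPER = {
    "record": ".//ul[@class='products']/li",
    "fields": {
        "name": "./a",
        "price": "./span[@class='price']",
    },
}


def apply_wrapper(xhtml, wrapper):
    """Return one dict per record node matched by the wrapper."""
    root = ET.fromstring(xhtml)
    records = []
    for node in root.findall(wrapper["record"]):
        record = {}
        for field, path in wrapper["fields"].items():
            # findtext returns None when the path matches nothing.
            record[field] = (node.findtext(path) or "").strip()
        records.append(record)
    return records
```

Because every page built from the same template shares the same structure, one set of XPath rules can be reused across all of a site's list pages, which is what makes the wrapper approach efficient for this page type.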
