Incorporating suitable external data from the World Wide Web is an effective way to enrich the data in a data warehouse (DW). The main challenge, however, is the quality-aware selection of web data sources so that the quality of the DW is maintained. In previous work, the quality of web sources was evaluated through expert evaluation only, which makes the process very lengthy. Moreover, since the quality model mixes quality factors from the diverse domains of the web, the DW and the underlying business, finding an expert with expertise in all of these domains is a major bottleneck in the evaluation process. To overcome these issues, this study proposes a novel multi-level approach, web source evaluation with multi-criteria decision-making and web quality testing tools (WSEMQT), together with an underlying web quality model for evaluating web sources for the DW. The authors introduce automated web source quality evaluation at the first level (web source based evaluation) and multiple dimensions of quality evaluation at the second level (expert-based evaluation). At both levels, multi-criteria decision-making methods are applied to the evaluation scores to produce a ranked list of web sources. The authors present a real-world academic web data case study which shows that the proposed approach can be executed successfully on real-world problems.
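The abstract does not say which multi-criteria decision-making method is applied, so the following is only a minimal sketch of the general idea of turning per-criterion evaluation scores into a ranked list. It assumes a plain weighted-sum model with min-max normalization; the function name `rank_sources`, the criteria names and all scores and weights are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def rank_sources(scores: Dict[str, Dict[str, float]],
                 weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank sources by a weighted sum of min-max normalized criterion scores.

    This is an assumed, generic weighted-sum model, not the authors' method.
    Higher raw scores are treated as better for every criterion.
    """
    criteria = list(weights)
    total_weight = sum(weights.values())

    # Per-criterion minimum and maximum, used for min-max normalization.
    lo = {c: min(s[c] for s in scores.values()) for c in criteria}
    hi = {c: max(s[c] for s in scores.values()) for c in criteria}

    ranked = []
    for name, s in scores.items():
        aggregate = 0.0
        for c in criteria:
            span = hi[c] - lo[c]
            # If all sources tie on a criterion, give each full credit.
            normalized = (s[c] - lo[c]) / span if span else 1.0
            aggregate += (weights[c] / total_weight) * normalized
        ranked.append((name, aggregate))

    # Best source first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical evaluation scores for three web sources.
example_scores = {
    "source_a": {"accuracy": 0.9, "timeliness": 0.5},
    "source_b": {"accuracy": 0.6, "timeliness": 0.8},
    "source_c": {"accuracy": 0.3, "timeliness": 0.2},
}
example_weights = {"accuracy": 2.0, "timeliness": 1.0}
example_ranking = rank_sources(example_scores, example_weights)
```

In a two-level scheme like the one described, the same routine could be run once on the automated tool scores and once on the expert scores, each with its own criteria and weights.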
ISBN: (print) 9781450358675
Web data are heterogeneous and unstructured, which poses challenges for data crawling, integration and preprocessing. Many studies are "data-oriented" (i.e. based on the available data), but their results are restricted to their specific data. In contrast, many problems are posed before it is known what data is needed to solve them, and often multiple data sources are needed. In this context, crawling, integrating and preprocessing data appropriately makes it possible to create datasets for solving such problems. This short course therefore addresses these three activities by discussing challenges and practical solutions.
In recent years, the World Wide Web has emerged as the most promising external data source for organizations' data warehouses, supplying valuable insights required for comprehensive decision-making and a competitive edge. However, when a data warehouse uses external data sources from the web without quality evaluation, its quality can be adversely affected. Quality models have been proposed in the research literature to evaluate and select web data sources for integration in a data warehouse, but these models have only been proposed conceptually and not validated empirically. Therefore, in this paper, the authors present an empirical validation conducted with a set of 57 subjects to thoroughly validate the set of 22 quality factors and the initial structure of the multi-level, multi-dimensional webQMDW quality model. The validated and restructured webQMDW model thus obtained can significantly enhance decision-making in the DW by selecting high-quality web data sources.
Nowadays scientific data is inevitably digital and stored in a wide variety of formats in heterogeneous systems. Scientists need to access an integrated view of remote or local heterogeneous data sources with advanced data access, analysis, and visualization tools. Building a digital library for scientific data requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the web, as well as data generated by software. We present an approach to wrapping web data sources, databases, flat files, or data generated by tools through a database view mechanism. Generally, a wrapper has two tasks: first, it sends a query to the source to retrieve data and, second, it builds the expected output with respect to the virtual structure. To perform these two tasks, our wrappers are composed of, respectively, a retrieval component based on an intermediate object view mechanism called search views, which maps the source capabilities to attributes, and an Extensible Markup Language (XML) engine. The originality of the approach consists of: 1) a generic view mechanism to seamlessly access data sources with limited capabilities and 2) the ability to wrap data sources as well as the useful specific tools they may provide. Our approach has been developed and demonstrated as part of the multidatabase system supporting queries via uniform object protocol model (OPM) interfaces.
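The two wrapper tasks named in the abstract (query the source, then build output in the virtual structure) can be illustrated with a small sketch. This is not the authors' search view mechanism or their XML engine; it only assumes that a wrapper holds a mapping from virtual attribute names to source field names. The class `Wrapper`, the `fetch_flat_file` function, the field names and the sample rows are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# Hypothetical flat-file contents, already parsed into rows.
FLAT_FILE_ROWS: List[Record] = [
    {"org": "yeast", "gene_id": "ADH1"},
    {"org": "human", "gene_id": "TP53"},
]


def fetch_flat_file(source_query: Record) -> Iterable[Record]:
    """Hypothetical source access: return rows matching every query field."""
    return [row for row in FLAT_FILE_ROWS
            if all(row.get(k) == v for k, v in source_query.items())]


class Wrapper:
    """Illustrative wrapper: a retrieval step followed by an XML building step."""

    def __init__(self, fetch: Callable[[Record], Iterable[Record]],
                 mapping: Dict[str, str]) -> None:
        self.fetch = fetch      # how to query the underlying source
        self.mapping = mapping  # virtual attribute -> source field

    def retrieve(self, query: Record) -> List[Record]:
        # Task 1: translate the virtual query to source fields and run it.
        source_query = {self.mapping[attr]: value for attr, value in query.items()}
        return list(self.fetch(source_query))

    def build(self, rows: Iterable[Record]) -> str:
        # Task 2: emit the rows using the virtual attribute names.
        root = ET.Element("results")
        for row in rows:
            record = ET.SubElement(root, "record")
            for attr, field in self.mapping.items():
                ET.SubElement(record, attr).text = row.get(field, "")
        return ET.tostring(root, encoding="unicode")

    def run(self, query: Record) -> str:
        return self.build(self.retrieve(query))


wrapper = Wrapper(fetch_flat_file, {"organism": "org", "gene": "gene_id"})
xml_output = wrapper.run({"organism": "yeast"})
```

Keeping retrieval and output building as separate methods mirrors the two-component split described in the abstract, so a different source only needs a different `fetch` function and mapping.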
ISBN: (print) 9783642013461
In the web environment, rich, diverse sources of heterogeneous and distributed data are ubiquitous. In fact, even the information characterizing a single entity - for example, the information related to a web service - is normally scattered over various data sources using various languages such as XML, RDF, and OWL. Hence, there is a strong need for web applications to handle queries over heterogeneous, autonomous, and distributed data sources. However, existing techniques do not provide sufficient support for this task. In this paper we present DeXIN, an extensible framework for providing integrated access over heterogeneous, autonomous, and distributed web data sources, which can be utilized for data integration in modern web applications and Service Oriented Architecture. DeXIN extends the XQuery language by supporting SPARQL queries inside XQuery, thus facilitating the querying of data modeled in XML, RDF, and OWL. DeXIN facilitates data integration in a distributed web and Service Oriented environment by avoiding the transfer of large amounts of data to a central server for centralized data integration, and it removes the need to transform huge amounts of data into a common format for integrated access.