ISBN (print): 9781728112466
The scalability of systems such as Hive and Spark SQL that are built on top of big data platforms has enabled query processing over very large data sets. However, the per-node performance of these systems is typically low compared to traditional relational databases. Conversely, Massively Parallel Processing (MPP) databases do not scale as well as these systems. We present HRDBMS, a fully implemented distributed shared-nothing relational database developed with the goal of improving the scalability of OLAP queries. HRDBMS achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques. While we also support serializable transactions, the system has not been optimized for this use case. HRDBMS runs on a custom distributed and asynchronous execution engine that was built from the ground up to support highly parallelized operator implementations. Our experimental comparison with Hive, Spark SQL, and Greenplum confirms that HRDBMS's scalability is on par with Hive and Spark SQL (up to 96 nodes) while its per-node performance can compete with MPP databases like Greenplum.
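To make the shared-nothing work-distribution idea concrete, the following is a minimal sketch (plain Python, not HRDBMS's actual engine or API; the node count and data are made up): each node aggregates only its own hash partition of the data and a coordinator merges the partial results, so per-node work shrinks as nodes are added.

```python
# Minimal sketch of shared-nothing parallel aggregation (illustrative only,
# not HRDBMS's execution engine): hash-partition the rows, aggregate each
# partition independently, then merge the partial results at a coordinator.
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size

def partition(rows, num_nodes):
    """Assign each row to a node by hashing its grouping key."""
    parts = [[] for _ in range(num_nodes)]
    for key, value in rows:
        parts[hash(key) % num_nodes].append((key, value))
    return parts

def local_aggregate(rows):
    """Per-node partial aggregation (SUM per group)."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return acc

def merge(partials):
    """Coordinator merges the partial aggregates from all nodes."""
    total = defaultdict(int)
    for part in partials:
        for key, value in part.items():
            total[key] += value
    return dict(total)

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partials = [local_aggregate(p) for p in partition(rows, NUM_NODES)]
print(merge(partials))  # group totals: a=4, b=7, c=4 (key order may vary)
```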
ISBN (print): 9781538672327
Many emerging Big Data programming environments, such as Spark and Flink, provide powerful APIs that are inspired by functional programming. However, because of the complexity involved in developing and fine-tuning data analysis applications using the provided APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, current data analysis query languages, which are typically based on the relational model, cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well integrated with the host programming language, as they are based on an incompatible data model and are checked for correctness at runtime, which results in significantly longer program development time. To address these shortcomings, we introduce a new query language for data-intensive scalable computing, called DIQL, that is deeply embedded in Scala, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile time. In contrast to other query languages, our query embedding eliminates impedance mismatch, as any Scala code can be seamlessly mixed with SQL-like syntax without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer can find any possible join in a query, including joins hidden across deeply nested queries, thus unnesting any form of query nesting. Currently, DIQL can run on three Big Data platforms: Apache Spark, Apache Flink, and Twitter's Cascading/Scalding.
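As a rough illustration of the kind of unnesting such an optimizer performs (plain Python, not DIQL's Scala syntax or its actual rewrite rules), the sketch below turns a correlated nested query into a single hash join, so the inner collection is scanned once instead of once per outer element.

```python
# Illustrative sketch only: the rewrite a query unnester performs, turning a
# correlated nested aggregation into a hash join over the inner collection.
from collections import defaultdict

orders = [(1, "alice"), (2, "bob"), (3, "alice")]   # (order_id, customer)
items  = [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)]  # (order_id, price)

# Nested (naive) form: for every order, rescan all items -- O(n * m).
nested = [(cust, sum(p for (oid2, p) in items if oid2 == oid))
          for (oid, cust) in orders]

# Unnested form: build a hash table on the inner collection once, then probe.
totals = defaultdict(float)
for oid, price in items:
    totals[oid] += price
unnested = [(cust, totals[oid]) for (oid, cust) in orders]

assert nested == unnested
print(unnested)  # [('alice', 15.0), ('bob', 7.5), ('alice', 2.5)]
```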
ISBN (print): 9781728101323
In distributed parallel query execution, complex queries are split into partially related simple subqueries, each executed on a different node; in a shared-nothing, grid-based architecture, communication between machines is typically done by message exchange. The integrity of messages can be lost through temporary or permanent interference along the communication network. Fault tolerance strategies are used to keep the system running in the presence of faults. This is traditionally done through query restart, replication, or checkpointing, along with variations of these approaches that improve latency and restoration time and reduce the cost of execution. These processes include monitoring, detection, and tolerance. Transient faults are caused by interference in the exchange medium and may pass undetected while yielding an incorrect query result. Moreover, traditional fault tolerance introduces a strong dependency between the nodes. In this research we propose a fault tolerance model that allows self-detection and tolerates transient faults with less dependency between nodes. The model is compared with the traditional strategies in terms of detection ability, inter-node dependency, and cost of execution.
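A minimal sketch of the self-detection idea, under assumptions not taken from the paper (SHA-256 checksums over JSON-encoded subquery results): each message carries a digest, so a receiving node can detect a transiently corrupted message on its own and re-request only that message, without coordinating with other nodes.

```python
# Minimal sketch (not the paper's exact model): a node attaches a checksum to
# each subquery-result message so the receiver can self-detect transient
# corruption locally, without relying on other nodes.
import hashlib
import json

def make_message(node_id, payload):
    body = json.dumps({"node": node_id, "payload": payload}, sort_keys=True)
    digest = hashlib.sha256(body.encode()).hexdigest()
    return {"body": body, "digest": digest}

def verify_message(message):
    """Return the payload if intact, or None if a transient fault is detected."""
    expected = hashlib.sha256(message["body"].encode()).hexdigest()
    if expected != message["digest"]:
        return None  # corruption detected; caller can re-request just this message
    return json.loads(message["body"])["payload"]

msg = make_message("node-3", [42, 17, 99])
print(verify_message(msg))                       # [42, 17, 99]

msg["body"] = msg["body"].replace("42", "41")    # simulate a bit-level error
print(verify_message(msg))                       # None -> fault detected locally
```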
The goal of query optimization in query federation over linked data is to minimize the response time and the completion time. Communication time has the highest impact on both. Static query optimization can end up with inefficient execution plans due to unpredictable data arrival rates and missing statistics. This study extends the adaptive join operator, which always begins with a symmetric hash join to minimize the response time and can change the join method to a bind join to minimize the completion time. The authors extend the adaptive join operator with a bind-bloom join to further reduce the communication time and, consequently, to minimize the completion time. They compare the new operator with the symmetric hash join, bind join, bind-bloom join, and adaptive join operator with respect to the response time and the completion time. Performance evaluation shows that the extended operator provides optimal response time and further reduces the completion time. Moreover, it has the ability to adapt to different data arrival rates.
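The sketch below illustrates the bind-bloom idea under simplified assumptions (a toy Bloom filter and an in-memory "endpoint"; not the authors' operator): the local side ships a small Bloom filter of its join keys instead of the keys themselves, and the remote endpoint returns only bindings that might match, cutting communication at the cost of occasional false positives.

```python
# Sketch of a bind-bloom style join step (illustrative, not the paper's code):
# ship a Bloom filter of local join keys so the remote side prunes its answers.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Local intermediate results (e.g., already-arrived bindings for one variable).
local_keys = ["alice", "bob", "carol"]
bf = BloomFilter()
for k in local_keys:
    bf.add(k)

# "Remote endpoint": filters its data with the Bloom filter before sending.
remote_data = [("alice", 30), ("dave", 25), ("carol", 41), ("erin", 19)]
sent = [row for row in remote_data if bf.might_contain(row[0])]
print(sent)  # only rows whose key may join locally (false positives possible)
```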
Process management is practical in state-of-the-art Internet of Things research. However, it has become a bottleneck in recent years, since an extreme amount of heterogeneous items has to be recorded and traced with radio frequency identification (RFID) tags. In a typical process synthesis management application, each item is involved in multiple processes, and when those processes are interconnected, an extremely complex network emerges that has to be managed. Existing work on process management systems, however, is usually case-based and focuses only on specific application domains, so generally applicable process management models are rather limited. In this paper, we summarize the characteristics of RFID application domains and, by abstraction, propose a novel RFID processing model. In this model, each RFID data item is treated as an operation record, and all processing services are logically interconnected and organized as a procedure graph. In addition, we abstract and summarize the basic RFID data processes into a few types. The advantage of this design is that the basic RFID data processing logic can be preprogrammed; in a real domain, the system can then be dynamically programmed by automatically constructing procedure graph nodes from those basic processes and mapping the interconnection logic according to the topology of the graph. In the last part of this paper, we design a prototype system for infection control of medical instruments to demonstrate our approach.
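A minimal sketch of the procedure-graph idea follows (names and process types are illustrative, not the paper's API): a handful of preprogrammed basic processes are wired into a domain-specific graph, and each RFID operation record flows along the graph.

```python
# Minimal sketch (hypothetical process names, not the paper's model): basic
# RFID processes are preprogrammed once; a domain is set up by wiring them into
# a procedure graph, and each tag read flows from node to node.
basic_processes = {
    "register":  lambda rec: {**rec, "status": "registered"},
    "sterilize": lambda rec: {**rec, "sterile": True},
    "issue":     lambda rec: {**rec, "status": "in-use"},
    "return":    lambda rec: {**rec, "status": "returned", "sterile": False},
}

# Procedure graph for a hypothetical instrument-tracking domain:
# each node names a basic process and its successor (None = end of procedure).
procedure_graph = {
    "register":  "sterilize",
    "sterilize": "issue",
    "issue":     "return",
    "return":    None,
}

def run(record, start="register"):
    """Push one RFID operation record through the procedure graph."""
    node = start
    while node is not None:
        record = basic_processes[node](record)
        node = procedure_graph[node]
    return record

print(run({"tag": "RFID-0001", "item": "scalpel"}))
```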
ISBN (digital): 9783319671628
ISBN (print): 9783319671628; 9783319671611
Vehicular ad hoc networks (VANETs) have attracted great interest in recent years due to their potential utility for drivers in applications that provide information about relevant events (accidents, emergency braking, etc.), traffic conditions, or even available parking spaces. To accomplish this, the vehicles exchange data among themselves using wireless communications; the data can be obtained from different sources, such as sensors or alerts sent by other drivers. In this paper, we propose searching for parking spaces by using a mobile agent that jumps from one vehicle to another to reach the parking area and obtain the required data directly. We perform an experimental evaluation with promising results that show the feasibility of our proposal.
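The toy sketch below illustrates the agent-hopping idea under simplified assumptions (known vehicle positions and a greedy next-hop rule; not the paper's protocol): the agent repeatedly jumps to the neighbouring vehicle closest to the parking area until it reaches a vehicle that can answer the query directly.

```python
# Toy sketch (simplified assumptions, not the paper's protocol): a mobile agent
# greedily hops to the neighbouring vehicle closest to the parking area.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def agent_route(vehicles, neighbours, start, parking, radius=1.0):
    """Return the sequence of vehicles visited by the agent."""
    current, route = start, [start]
    while dist(vehicles[current], parking) > radius:
        candidates = neighbours.get(current, [])
        if not candidates:
            break  # no vehicle in range to hop to; the query stalls here
        nxt = min(candidates, key=lambda v: dist(vehicles[v], parking))
        if dist(vehicles[nxt], parking) >= dist(vehicles[current], parking):
            break  # no progress possible from this vehicle
        current = nxt
        route.append(current)
    return route

vehicles = {"v1": (0, 0), "v2": (2, 1), "v3": (4, 3), "v4": (6, 5)}
neighbours = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2", "v4"], "v4": ["v3"]}
print(agent_route(vehicles, neighbours, start="v1", parking=(6, 5)))
# ['v1', 'v2', 'v3', 'v4'] -- the agent reaches the parking area hop by hop
```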
ISBN (print): 9789897582110
We address the problem of in-network processing of k-Maximizing Range Sum (k-MaxRS) queries in Wireless Sensor Networks (WSN). The traditional, Computational Geometry version of the MaxRS problem considers the setting in which, given a set of (possibly weighted) 2D points, the goal is to determine the optimal location for a given (axis-parallel) rectangle R so that the sum of the weights (or a simple count) of the input points in R's interior is maximized. In WSN, this corresponds to finding the location of a region R such that the sum of the sensors' readings inside R is maximized. The k-MaxRS problem deals with maximizing the overall sum over k such rectangular regions. Since centralized processing (i.e., transmitting the raw readings and subsequently determining the k-MaxRS in a dedicated sink) incurs communication overheads, we devised an efficient distributed algorithm for in-network computation of k-MaxRS. Our experimental observations show that the novel algorithm provides significant energy/communication savings compared to the centralized approach.
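For reference, a centralized brute-force baseline for the single-rectangle MaxRS case (not the in-network algorithm proposed here) looks as follows; it relies on the fact that an optimal axis-parallel rectangle can be shifted so that its left edge passes through some point's x-coordinate and its bottom edge through some point's y-coordinate.

```python
# Centralized brute-force MaxRS baseline (illustrative only, O(n^3)): try every
# candidate lower-left corner formed by a point's x and a point's y.
def maxrs(points, w, h):
    """points: list of (x, y, weight); returns (best_sum, lower_left_corner)."""
    xs = [x for (x, _, _) in points]
    ys = [y for (_, y, _) in points]
    best_sum, best_corner = 0.0, None
    for cx in xs:                 # left edge aligned with some point's x
        for cy in ys:             # bottom edge aligned with some point's y
            total = sum(wt for (x, y, wt) in points
                        if cx <= x <= cx + w and cy <= y <= cy + h)
            if total > best_sum:
                best_sum, best_corner = total, (cx, cy)
    return best_sum, best_corner

readings = [(1, 1, 2.0), (1.5, 1.2, 3.0), (4, 4, 1.0), (4.2, 4.1, 5.0)]
print(maxrs(readings, w=1.0, h=1.0))   # (6.0, (4, 4))
```

A simple greedy extension toward k-MaxRS would repeat this placement k times, removing the covered points after each step; the paper's contribution is computing such answers in-network so that the raw readings never have to be shipped to the sink.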
ISBN (digital): 9789811033223
ISBN (print): 9789811033223; 9789811033216
In distributed databases, data is replicated and fragmented across multiple disparate sites spread across a computer network. Consequently, a large number of possible query plans can exist for a distributed query, and this number increases with the number of sites containing the replicated data. For large numbers of sites, computing an efficient query processing plan becomes a computationally expensive task. This necessitates devising a distributed query processing strategy capable of generating good quality query plans, from amongst all possible query plans, that minimize the total cost of processing a distributed query. This distributed query plan generation (DQPG) problem, being a combinatorial optimization problem, is addressed in this paper using a modified cuckoo search algorithm (CSA). Accordingly, a modified CSA (mCSA) based DQPG algorithm (DQPGmCSA), which aims to generate good quality Top-K query plans for a given distributed query, is proposed herein. An experimental comparison of DQPGmCSA with the existing GA based DQPG algorithm (DQPGGA) shows that the former is able to generate comparatively better quality Top-K query plans, which, in turn, would reduce the query response time and thereby enable efficient decision making.
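The sketch below shows a cuckoo-search-style plan search in miniature (the replica placement, cost model, and parameters are made up and far simpler than the paper's mCSA): a plan assigns each relation to one site holding a replica, new candidate plans replace worse nests, the worst nests are periodically abandoned, and the best K survivors are returned as Top-K plans.

```python
# Toy cuckoo-search-style plan search (illustrative, not the paper's mCSA):
# a "plan" assigns each relation to one site holding a copy of it, and the
# cost favours plans touching fewer distinct sites (a proxy for less shipping).
import random

replicas = {"R1": [0, 2], "R2": [1, 2, 3], "R3": [0, 2, 3]}   # hypothetical
relations = list(replicas)

def random_plan():
    return {r: random.choice(replicas[r]) for r in relations}

def cost(plan):
    return len(set(plan.values()))      # number of distinct sites involved

def mutate(plan):
    """A 'cuckoo' lays a new solution by re-assigning one random relation."""
    new = dict(plan)
    r = random.choice(relations)
    new[r] = random.choice(replicas[r])
    return new

def cuckoo_search(nests=10, iterations=200, abandon_fraction=0.25, top_k=3):
    population = [random_plan() for _ in range(nests)]
    for _ in range(iterations):
        cuckoo = mutate(random.choice(population))
        worst = max(range(nests), key=lambda i: cost(population[i]))
        if cost(cuckoo) < cost(population[worst]):
            population[worst] = cuckoo            # replace a worse nest
        # abandon a fraction of the worst nests and rebuild them randomly
        population.sort(key=cost)
        for i in range(int(nests * (1 - abandon_fraction)), nests):
            population[i] = random_plan()
    return sorted(population, key=cost)[:top_k]   # Top-K plans

for plan in cuckoo_search():
    print(cost(plan), plan)
```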
ISBN (print): 9789897582554
The benefit of performing Big Data computations over individuals' microdata is manifold, in the medical, energy, or transportation fields to cite only a few, and this interest is growing with the emergence of smart disclosure initiatives around the world. However, these computations often expose microdata to privacy leakages, explaining the reluctance of individuals to participate in studies despite the privacy guarantees promised by statistical institutes. This paper proposes a novel approach to push personalized privacy guarantees into the processing of database queries, so that individuals can disclose different amounts of information (i.e., data at different levels of accuracy) depending on their own perception of the risk. Moreover, we propose a decentralized computing infrastructure based on secure hardware that enforces these personalized privacy guarantees all along the query execution process. A performance analysis conducted on a real platform shows the effectiveness of the approach.
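A minimal sketch of personalized accuracy levels (illustrative only; it ignores the paper's secure-hardware infrastructure and its actual generalization scheme): each participant chooses how coarsely their value is disclosed before it enters the aggregate query.

```python
# Minimal sketch of per-individual accuracy levels (assumed bucket scheme, not
# the paper's mechanism): level 0 discloses the exact value, higher levels
# disclose only the midpoint of a coarser bucket.
def generalize(value, level, bucket_sizes=(1, 5, 10)):
    """Level 0 = exact, higher levels = coarser buckets (midpoint disclosed)."""
    size = bucket_sizes[level]
    low = (value // size) * size
    return low + size / 2 if size > 1 else value

# (age, privacy level chosen by the individual)
participants = [(23, 0), (37, 1), (41, 2), (58, 2), (29, 0)]
disclosed = [generalize(age, level) for age, level in participants]
print(disclosed)                         # mix of exact and coarsened values
print(sum(disclosed) / len(disclosed))   # aggregate computed on disclosed data
```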
Embedded electronic devices are now to be found everywhere. In general, they can be used to collect different sorts of data (e.g. on temperature, humidity, illumination and locations). In some specific domains, such as industrial automation, embedded devices are used for process control. The devices may have a programme that can respond immediately to environmental changes perceived through sensors. In the control of large sites, where there are many devices, higher level decisions are made or processed in dedicated computers far away from the sources (devices) where the initial data are collected. This article shows how it is possible to manage portions of distributed knowledge hosted in embedded devices, making it possible for each embedded device to hold and manage its own piece of knowledge. In addition, the presented approach keeps the locus of control at the embedded device level, where the embedded device can make decisions knowing the status of the rest of the world, device contributions, and their effects on the overall distributed system knowledge base.
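As an illustration of keeping the locus of control at the device (hypothetical rules and fields, not the article's system): each device owns a small local knowledge base and makes decisions from its own facts plus a coarse summary of the rest of the distributed knowledge base.

```python
# Illustrative sketch (hypothetical rules, not the article's system): a device
# holds its own slice of the knowledge base and decides locally, using a
# coarse summary of remote state rather than deferring to a central computer.
class DeviceNode:
    def __init__(self, device_id):
        self.device_id = device_id
        self.local_kb = {}          # facts owned and managed by this device

    def update(self, key, value):
        self.local_kb[key] = value

    def decide(self, remote_summary):
        """Local decision from local facts plus a summary of remote state."""
        temp = self.local_kb.get("temperature", 0)
        if temp > 30 and remote_summary.get("cooling_available", False):
            return "request_cooling"
        if temp > 30:
            return "throttle_locally"
        return "normal_operation"

node = DeviceNode("dev-42")
node.update("temperature", 33)
print(node.decide({"cooling_available": True}))    # request_cooling
print(node.decide({"cooling_available": False}))   # throttle_locally
```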