Accurate peak power evaluation of query processing is fundamental to a power-aware DBMS running in large data centers. To estimate the peak power of the core join operator in query processing, the concept of CPU-boundedness was introduced, i.e., the ratio of CPU-intensive operations per unit time. Power prediction models were constructed with piecewise and continuous fitting methods; the multivariate model was further developed by incorporating both CPU-boundedness and CPU frequency into the model via surface fitting. A non-runtime peak power estimation method is proposed for the four most commonly used join algorithms in DBMSs. To the best of our knowledge, our work is the first attempt at modeling and estimating the peak power of query processing. Extensive experiments have demonstrated the effectiveness of our proposed methods with acceptable mean relative errors. (C) 2014 Elsevier Inc. All rights reserved.
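The CPU-boundedness measure and a piecewise fit of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's method: the breakpoint and coefficients are made-up values, and the real models are fitted from measurements.

```python
def cpu_boundedness(cpu_ops, total_ops):
    """Ratio of CPU-intensive operations to all operations in a time window."""
    return cpu_ops / total_ops if total_ops else 0.0

def peak_power(boundedness, base=40.0, slope_low=20.0, slope_high=55.0, knee=0.6):
    """Piecewise-linear power model (illustrative coefficients, in watts):
    power grows slowly with CPU-boundedness until a knee, then steeply."""
    if boundedness <= knee:
        return base + slope_low * boundedness
    return base + slope_low * knee + slope_high * (boundedness - knee)
```

A multivariate variant would fit a surface over (boundedness, CPU frequency) pairs instead of a single curve.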
Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has some limitations in processing heterogeneous datasets because of the large amount of redundant intermediate records transferred through the network. Several filtering techniques have been developed to improve join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced to perform filter creation and redundant-record elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively; and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
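The core idea of a Bloom join, filtering one input by a compact summary of the other's join keys so that non-joining records never reach the shuffle, can be sketched in a few lines. This is a single-process illustration under assumed parameters (filter size, hash count), not the paper's MapReduce implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; m bits, k hash functions (illustrative defaults)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def bloom_join(left, right):
    """Join two (key, value) datasets; right-side records whose keys
    definitely do not occur on the left are dropped before the join
    (standing in for records dropped before the MapReduce shuffle)."""
    bf = BloomFilter()
    for k, _ in left:
        bf.add(k)
    surviving = [(k, v) for k, v in right if bf.might_contain(k)]
    index = {}
    for k, v in left:
        index.setdefault(k, []).append(v)
    # The exact lookup removes any Bloom false positives from the output.
    return [(k, lv, rv) for k, rv in surviving for lv in index.get(k, [])]
```

An intersection Bloom join would build filters on both inputs and keep only records whose keys pass both filters.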
The performance of query execution in distributed database systems depends mainly on the efficiency of the distributed algorithms used for executing query operations and the efficiency of the query processing strategies. In this paper, we present and evaluate a query processing strategy based on pipelining and data-flow techniques. We develop timing equations for calculating the performance of four join algorithms: 1) nested block, 2) hash, 3) sort-merge, and 4) pipelined sort-merge. They are used to execute the join operation in a query in both distributed and pipelined fashion. Based on these equations and similar sets of equations developed for other relational algebraic operations, we evaluated the performance of query execution using the different join algorithms. The effects of varying the values of 1) processing time, 2) I/O time, 3) communication time, 4) buffer size, and 5) join selectivity on the performance of the pipelined join algorithms are investigated. The results are compared to those obtained by employing the same algorithms to execute queries using the distributed processing approach, which does not exploit the vertical concurrency of the pipelining approach. These results clearly establish the benefits of pipelining.
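A timing equation of the kind developed above can be illustrated for the first of the four algorithms, the nested-block join. This is a textbook-style I/O cost formula under assumed parameters, not the paper's actual equations, which also account for processing and communication time:

```python
def nested_block_cost(pages_r, pages_s, buffer_pages, io_time):
    """I/O time of a nested-block join: the outer relation R is read once,
    and the inner relation S is scanned once per buffer-sized chunk of R
    (one buffer page is reserved for reading S)."""
    chunks = -(-pages_r // (buffer_pages - 1))  # ceiling division
    return (pages_r + chunks * pages_s) * io_time
```

With 100 pages of R, 500 pages of S, and an 11-page buffer, R splits into 10 chunks, giving 100 + 10 x 500 = 5100 page reads.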
Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for large-scale data processing. However, it has some limitations in processing heterogeneous datasets. In this study, we review the state-of-the-art strategies for joining two datasets based on an equi-join condition and provide a detailed implementation for each strategy. We also present an in-depth analysis of the join strategies and discuss their feasibility and limitations to assist the reader in selecting the most efficient strategy. Finally, we outline interesting directions for future join processing systems. (c) 2020 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://***/licenses/by-nc-nd/4.0/).
ISBN:
(Print) 9798400701276
We present a dynamic index structure for join sampling. Built for an (equi-)join Q (let IN be the total number of tuples in the input relations of Q), the structure uses Õ(IN) space, supports a tuple update of any relation in Õ(1) time, and returns a uniform sample from the join result in Õ(IN^ρ* / max{1, OUT}) time with high probability (w.h.p.), where OUT and ρ* are the join's output size and fractional edge covering number, respectively; the notation Õ(·) hides a factor polylogarithmic in IN. We further show how our result justifies the Õ(IN^ρ*) running time of existing worst-case optimal join algorithms (for full result reporting) even when OUT ≪ IN^ρ*. Specifically, unless the combinatorial k-clique hypothesis is false, no combinatorial algorithm (i.e., an algorithm not relying on fast matrix multiplication) can compute the join result in O(IN^(ρ* − ε)) time w.h.p. even if OUT = IN^ε, regardless of how small the constant ε > 0 is.
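For the special case of a two-relation equi-join, uniform sampling from the join result can be sketched directly: weight each left tuple by its join degree, then pick a matching right tuple uniformly. This toy sketch conveys only the sampling invariant; the paper's structure handles general multiway joins with dynamic updates and the stated Õ(·) guarantees, which this does not attempt:

```python
import random

def uniform_join_sample(left, right_index, rng=random):
    """Draw one uniform sample from the equi-join of `left`, a list of
    (key, value) pairs, with a right relation given as key -> list of
    values. Weighting each left tuple by its join degree makes every
    join-result tuple equally likely."""
    weights = [len(right_index.get(k, [])) for k, _ in left]
    total = sum(weights)
    if total == 0:
        return None  # the join result is empty
    k, lv = rng.choices(left, weights=weights, k=1)[0]
    return (k, lv, rng.choice(right_index[k]))
```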
Efficient join processing is one of the most fundamental and well-studied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a new algorithm to process these queries optimally in terms of worst-case data complexity. Our result builds on recent work by Atserias, Grohe, and Marx, who gave bounds on the size of a natural join query in terms of the sizes of the individual relations in the body of the query. These bounds, however, are not constructive: they rely on Shearer's entropy inequality, which is information-theoretic. Thus, the previous results leave open the question of whether there exist algorithms whose runtimes achieve these optimal bounds. An answer to this question may be interesting to database practice, as we show in this article that any project-join style plans, such as ones typically employed in a relational database management system, are asymptotically slower than the optimal for some queries. We present an algorithm whose runtime is worst-case optimal for all natural join queries. Our result may be of independent interest, as our algorithm also yields a constructive proof of the general fractional cover bound by Atserias, Grohe, and Marx without using Shearer's inequality. This bound implies two famous inequalities in geometry: the Loomis-Whitney inequality and its generalization, the Bollobás-Thomason inequality. Hence, our results algorithmically prove these inequalities as well. Finally, we discuss how our algorithm can be used to evaluate full conjunctive queries optimally, to compute a relaxed notion of joins and to optimally (in the worst-case) enumerate all induced copies of a fixed subgraph inside of a given large graph.
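The key departure from project-join plans is joining one attribute at a time, intersecting the candidate values from every relation that mentions the attribute. The sketch below shows that attribute-at-a-time shape on the classic triangle query; it is only illustrative and omits the sorted-index and intersection engineering required for the worst-case optimal bounds:

```python
def generic_join(relations, attrs, binding=None):
    """Attribute-at-a-time join sketch. `relations` is a list of
    (schema, rows) pairs, where schema is a tuple of attribute names and
    rows is a set of tuples; `attrs` fixes the global attribute order."""
    binding = binding or {}
    if len(binding) == len(attrs):
        return [tuple(binding[a] for a in attrs)]
    a = attrs[len(binding)]
    # Intersect candidate values for `a` across all relations mentioning it,
    # restricted to rows consistent with the current partial binding.
    candidates = None
    for schema, rows in relations:
        if a not in schema:
            continue
        idx = schema.index(a)
        vals = {row[idx] for row in rows
                if all(row[schema.index(b)] == binding[b]
                       for b in schema if b in binding)}
        candidates = vals if candidates is None else candidates & vals
    out = []
    for v in sorted(candidates or []):
        out.extend(generic_join(relations, attrs, {**binding, a: v}))
    return out
```

On the triangle query R(A,B) ⋈ S(B,C) ⋈ T(A,C), no binary join plan is worst-case optimal, but the attribute-at-a-time order A, B, C is.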
Distributed query processing algorithms usually perform data reduction by using a semijoin program, but the problem with these approaches is that they still require an explicit join of the reduced relations in the final phase. We introduce an efficient algorithm for join processing in distributed database systems that makes use of bipartite graphs in order to reduce data communication costs and local processing costs. The bipartite graphs represent the tuples that can be joined in two relations, also taking into account the reduction state of the relations. This algorithm fully reduces the relations at each site. We then present an adaptive algorithm for response time optimization that takes into account the system configuration, i.e., the additional resources available and the data characteristics, in order to select the best strategy for response time minimization. We also report on the results of a set of experiments which show that our algorithms outperform a number of the recently proposed methods for total processing time and response time minimization.
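The semijoin reduction that these algorithms build on can be stated in a few lines: ship only the join-key projection of one relation to the other site and discard tuples that cannot join. A minimal single-process sketch (the paper's bipartite-graph method goes further, tracking which tuple pairs actually join):

```python
def semijoin_reduce(r, s, key_r, key_s):
    """Semijoin R ltimes S: keep only the R-tuples whose join key appears
    in S. In a distributed setting, only the key projection of S (the set
    below) would be shipped to R's site, not S itself."""
    s_keys = {t[key_s] for t in s}
    return [t for t in r if t[key_r] in s_keys]
```

The remaining cost, as noted above, is that the reduced relations still have to be joined explicitly afterwards.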
In the past decade, the exponential growth in commodity CPU speed has far outpaced advances in memory latency. A second trend is that CPU performance advances are brought not only by increased clock rates but also by increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling, and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called radix-cluster, which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated by almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
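The radix-cluster idea is to partition in multiple passes, each fanning out on only a few key bits, so the number of clusters written simultaneously in any pass stays within cache and TLB limits. A minimal sketch of that multi-pass structure (Python used only to show the control flow; the real algorithm operates on column arrays in memory):

```python
def radix_partition(rows, key_of, bits, shift):
    """One pass: fan rows out into 2**bits clusters, keyed on `bits` bits
    of the (integer) join key starting at bit position `shift`."""
    clusters = [[] for _ in range(1 << bits)]
    for row in rows:
        clusters[(key_of(row) >> shift) & ((1 << bits) - 1)].append(row)
    return clusters

def radix_cluster(rows, key_of, total_bits, bits_per_pass):
    """Multi-pass radix-cluster: reach 2**total_bits final clusters while
    limiting the per-pass fan-out to 2**bits_per_pass, so each pass stays
    TLB- and cache-friendly."""
    clusters, shift = [rows], 0
    while shift < total_bits:
        step = min(bits_per_pass, total_bits - shift)
        clusters = [sub for c in clusters
                    for sub in radix_partition(c, key_of, step, shift)]
        shift += step
    return clusters
```

After clustering both inputs on the same bits, matching keys land in corresponding clusters, and each cluster pair can be hash-joined within the cache.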
Rank (i.e., top-k) join queries play a key role in modern analytics tasks. However, despite their importance and unlike centralized settings, they have been completely overlooked in cloud NoSQL settings. We attempt to fill this gap: We contribute a suite of solutions and study their performance comprehensively. Baseline solutions are offered using SQL-like languages (like Hive and Pig), based on MapReduce jobs. We first provide solutions that are based on specialized indices, which may themselves be accessed using either MapReduce or coordinator-based strategies. The first index-based solution is based on inverted indices, which are accessed with MapReduce jobs. The second index-based solution adapts a popular centralized rank-join algorithm. We further contribute a novel statistical structure comprising histograms and Bloom filters, which forms the basis for the third index-based solution. We provide (i) MapReduce algorithms showing how to build these indices and statistical structures, (ii) algorithms to allow for online updates to these indices, and (iii) query processing algorithms utilizing them. We implemented all algorithms in Hadoop (HDFS) and HBase and tested them on TPC-H datasets of various scales, utilizing different queries on tables of various sizes and different score-attribute distributions. We ported our implementations to Amazon EC2 and "in-house" lab clusters of various scales. We provide performance results for three metrics: query execution time, network bandwidth consumption, and dollar-cost for query execution.
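The centralized rank-join algorithm that the second index-based solution adapts can be sketched in the HRJN style: consume both score-sorted inputs incrementally and stop once k joined results beat the best combined score any unseen pair could still achieve. This is a single-process sketch of that threshold logic, not the distributed HBase implementation:

```python
def rank_join(left, right, k):
    """HRJN-style rank join sketch. `left` and `right` are lists of
    (key, score) pairs sorted by score descending; returns the top-k
    (combined_score, key) join results without necessarily reading
    either input to the end."""
    seen_l, seen_r = {}, {}
    results, i, j = [], 0, 0
    top_l, top_r = left[0][1], right[0][1]
    while i < len(left) or j < len(right):
        pull_left = j >= len(right) or (i < len(left) and i <= j)
        if pull_left:
            key, s = left[i]; i += 1
            seen_l.setdefault(key, []).append(s)
            results += [(s + s2, key) for s2 in seen_r.get(key, [])]
        else:
            key, s = right[j]; j += 1
            seen_r.setdefault(key, []).append(s)
            results += [(s2 + s, key) for s2 in seen_l.get(key, [])]
        # Best combined score still reachable by an unseen join pair.
        last_l = left[i - 1][1] if i else top_l
        last_r = right[j - 1][1] if j else top_r
        threshold = max(top_l + last_r, last_l + top_r)
        results.sort(reverse=True)
        if len(results) >= k and results[k - 1][0] >= threshold:
            break  # early termination: top-k is final
    return results[:k]
```

The histogram-and-Bloom-filter structure described above serves the same purpose at a coarser grain: it prunes data partitions that cannot contribute a top-k result.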
We present a new algorithm for relational equijoin. The algorithm is a modification of merge join, but promises superior performance for medium-size inputs. In many cases it even compares favorably with both merge join and hybrid hash join, which is shown using analytic cost functions.
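The abstract does not specify the modification itself, so the sketch below shows only the baseline merge join the new algorithm starts from: advance two sorted inputs in lockstep and emit matching pairs. A minimal version over (key, value) lists:

```python
def sort_merge_join(left, right):
    """Baseline sort-merge equijoin over (key, value) lists: sort both
    inputs, then merge, emitting every matching (key, lval, rval) triple."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = left[i][0], right[j][0]
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Emit all right-side matches for this left tuple; `j` stays
            # put so later left tuples with the same key rescan the run.
            j2 = j
            while j2 < len(right) and right[j2][0] == kl:
                out.append((kl, left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return out
```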