We propose a new partition-based join algorithm, called the select-partitioned join, which performs better than the sort-merge and hash-partitioned joins. The proposed algorithm consists of three major steps. The first step determines a partitioning pattern that minimizes the total join cost and chooses the bound values of the buckets using a selection algorithm. The second step partitions both relations into ranged buckets according to the partitioning pattern and the bound values chosen in the previous step. The last step applies a nested-block join to the partitioned bucket pairs. The selection algorithm is based on the cumulative distribution function and requires only a single scan of the smaller relation. The performance of the select-partitioned join is analyzed in terms of the number of I/Os and compared with the sort-merge and hash-partitioned join algorithms. Our join algorithm outperforms the hash-partitioned join, which is generally regarded as the better choice for the join operation. Simulation experiments are conducted for the join algorithms.
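The three steps above can be sketched as a minimal in-memory version. All names are hypothetical, tuples are plain dicts, and simple order statistics stand in for the paper's CDF-based selection algorithm:

```python
from bisect import bisect_right

def select_partitioned_join(r, s, key, num_buckets):
    """Sketch: range-partition both relations on bound values drawn from
    the smaller relation, then nested-block join each bucket pair."""
    small = r if len(r) <= len(s) else s
    # Step 1: choose bucket bound values from a single scan of the smaller
    # relation (order statistics here, not the paper's CDF-based selection).
    keys = sorted(t[key] for t in small)
    bounds = sorted({keys[len(keys) * i // num_buckets]
                     for i in range(1, num_buckets)})

    # Step 2: partition both relations into ranged buckets on those bounds.
    def partition(rel):
        buckets = [[] for _ in range(len(bounds) + 1)]
        for t in rel:
            buckets[bisect_right(bounds, t[key])].append(t)
        return buckets

    # Step 3: nested-block join each corresponding bucket pair.
    out = []
    for br, bs in zip(partition(r), partition(s)):
        out.extend((tr, ts) for tr in br for ts in bs if tr[key] == ts[key])
    return out
```

Because both relations are partitioned with the same bounds, matching keys always land in the same bucket pair, so only aligned buckets need to be joined.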
This paper proposes an algorithm that improves on the sort-based join method. Unlike the sort-based join, it employs both sorting and partitioning to avoid completely sorting both relations, and is therefore referred to as the hybrid join. The algorithm completely sorts only the smaller relation and partitions the other into ranged buckets according to the order statistics of the sorted relation. The final join is performed on the sorted relation and the ranged buckets.
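The sort-one-side, partition-the-other idea reads roughly like the following in-memory sketch. The function name, the binary-search probe within each bucket's sorted range, and the default bucket count are assumptions, not the paper's exact procedure:

```python
from bisect import bisect_left, bisect_right

def hybrid_join(small, large, key, num_buckets=4):
    """Sketch: sort only the smaller relation, range-partition the larger
    one on its order statistics, then probe each bucket into the matching
    sorted range."""
    srt = sorted(small, key=lambda t: t[key])    # the only complete sort
    keys = [t[key] for t in srt]
    n = len(keys)
    # Bucket bounds from order statistics of the sorted relation.
    bounds = sorted({keys[n * i // num_buckets] for i in range(1, num_buckets)})
    buckets = [[] for _ in range(len(bounds) + 1)]
    for t in large:                              # one scan; never sorted
        buckets[bisect_right(bounds, t[key])].append(t)
    # Slice of the sorted relation whose keys fall in each bucket's range.
    cuts = [0] + [bisect_left(keys, b) for b in bounds] + [n]
    out = []
    for i, bucket in enumerate(buckets):
        run, rk = srt[cuts[i]:cuts[i + 1]], keys[cuts[i]:cuts[i + 1]]
        for t in bucket:
            lo, hi = bisect_left(rk, t[key]), bisect_right(rk, t[key])
            out.extend((m, t) for m in run[lo:hi])
    return out
```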
ISBN (print): 9781424451661
Data warehousing continues to play an important role in global information systems for businesses. Meanwhile, applications of data warehousing have evolved from reporting and decision support systems to mission-critical decision-making systems. This requires data warehouses to combine both historical and current data from operational systems. Since the join is one of the most expensive operations in query processing, it is vital to develop effective and efficient join techniques for a distributed warehouse environment. In this paper, we propose an agent-based adaptive join algorithm called Ajoin for effective and efficient online join operations in distributed data warehouses. Ajoin utilises intelligent agents for dynamic optimisation and coordination of join processing at run time. Key aspects of the Ajoin algorithm have been implemented and evaluated against other modern adaptive join algorithms, and in our study Ajoin exhibits better performance under various distributed and dynamic data warehouse environments.
Time intervals are often associated with tuples to represent their valid time in temporal relations, where the overlap join is crucial for many kinds of queries. Many existing overlap join algorithms use indices based on tree structures such as the quad-tree, B+-tree, and interval tree. These algorithms usually incur high CPU cost because deep path traversals are unavoidable, which makes them less competitive than data-partition or plane-sweep based algorithms. This paper proposes an efficient overlap join algorithm based on a new two-layer flat index named the Overlap Interval Inverted Index (i.e., O2i Index). In the first layer, it uses an array to record the end points of intervals and approximates the nesting structures of intervals via two functions; in the second layer, it uses inverted lists to trace all intervals satisfying the approximated nesting structures. With the help of the new index, the join algorithm visits only the lists that must be scanned and skips all others. Analyses and experiments on both real and synthetic datasets show that the proposed algorithm is competitive with the state-of-the-art algorithms.
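The abstract does not specify O2i in enough detail to reproduce it, but the plane-sweep family it is compared against can be sketched. Intervals are assumed to be `(start, end)` tuples with `start <= end`; the function name is hypothetical:

```python
def overlap_join(r, s):
    """Plane-sweep sketch: report every pair of intervals, one from each
    list, whose [start, end] ranges overlap."""
    # Sweep both interval lists in start order, keeping "active" intervals
    # from the other side whose ends have not yet been passed.
    r, s = sorted(r), sorted(s)
    out, i, j, active_r, active_s = [], 0, 0, [], []
    while i < len(r) or j < len(s):
        if j >= len(s) or (i < len(r) and r[i][0] <= s[j][0]):
            a = r[i]; i += 1
            active_s = [b for b in active_s if b[1] >= a[0]]  # drop expired
            out.extend((a, b) for b in active_s)
            active_r.append(a)
        else:
            b = s[j]; j += 1
            active_r = [a for a in active_r if a[1] >= b[0]]  # drop expired
            out.extend((a, b) for a in active_r)
            active_s.append(b)
    return out
```

Each pair is emitted exactly once, when the later-starting interval of the pair enters the sweep; the tree-based alternatives the abstract criticizes would instead probe each interval into an index.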
A join operation selects, from the set of all pairs of records of two files, those pairs that possess some matching property. Because the join operation is so important in database applications, several algorithms have been proposed for performing it efficiently. A sort-based join completely sorts both files, then joins the two sorted files in one pass. The one-way join-during-merge algorithm completely sorts one file, only partially sorts the other, and then performs the join. Two join-during-merge algorithms are described and analyzed, and their superiority over the traditional sort-based algorithm is shown. It is suggested that the join-during-merge algorithm be used in all cases where the sort-based algorithm was considered convenient. Since the optimal algorithm can be chosen before performing the join, the stated gains can be achieved in practice.
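The one-way variant can be sketched in memory, with `heapq.merge` standing in for the final merge of the partially sorted file's runs. The function name, the fixed run size, and the tuple layout are assumptions:

```python
import heapq

def one_way_join_during_merge(sorted_file, other, run_size,
                              key=lambda t: t[0]):
    """Sketch: `other` is only sorted into runs of `run_size`; the merge
    of those runs is interleaved with the merge-join against the file
    that is already completely sorted."""
    runs = [sorted(other[i:i + run_size], key=key)
            for i in range(0, len(other), run_size)]
    merged = heapq.merge(*runs, key=key)  # the join happens during this merge
    out, i, n = [], 0, len(sorted_file)
    for t in merged:                       # keys arrive in nondecreasing order
        while i < n and key(sorted_file[i]) < key(t):
            i += 1                         # skip past smaller keys
        j = i
        while j < n and key(sorted_file[j]) == key(t):
            out.append((sorted_file[j], t))
            j += 1
    return out
```

The second file is never materialized in fully sorted form, which is the source of the savings over the traditional sort-based join.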
ISBN (print): 9781450371810
Today we deal with large-scale applications that generate massive amounts of data and information. Processing such data often requires algorithms with massive parallelism, where the main performance metric is communication cost. Apache Spark is highly scalable and fault-tolerant and can run across many computers. The join is one of the most widely used operations in database systems, but it is also heavily time-consuming. In this work, we survey and critique several implementations of Spark join algorithms, discuss their strengths and weaknesses, present a detailed comparison of these algorithms, and introduce optimization approaches to enhance and tune their performance.
When a multidatabase system contains textual database systems (i.e., information retrieval systems), queries against the global schema of the multidatabase system may contain a new type of join: joins between attributes of textual type. Three algorithms for processing this type of join are presented and their I/O costs are analyzed in this paper. Since such joins often involve document collections of very large size, it is very important to find efficient algorithms for processing them. The three algorithms differ in whether the documents themselves or the inverted files on the documents are used to process the join. Our analysis and simulation results indicate that the relative performance of these algorithms depends on the input document collections, system characteristics, and the input query. For each algorithm, the type of input document collection with which it is likely to perform well is identified. An integrated algorithm that automatically selects the best algorithm to use is also proposed.
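The inverted-file variant of such a textual join can be sketched as follows. The share-a-term match predicate, the whitespace tokenization, and all names are assumptions; the paper's actual join predicate on textual attributes may differ:

```python
from collections import defaultdict

def inverted_file_join(docs_a, docs_b):
    """Sketch: pair documents from two collections that share at least
    one term, probing an inverted file built over the first collection."""
    inv = defaultdict(list)
    for i, doc in enumerate(docs_a):
        for term in set(doc.split()):          # one posting per distinct term
            inv[term].append(i)
    out = set()
    for j, doc in enumerate(docs_b):
        for term in set(doc.split()):
            for i in inv.get(term, []):        # only matching postings read
                out.add((i, j))
    return sorted(out)
```

The document-based alternatives the abstract mentions would instead compare the document texts pairwise, which is why the relative performance depends on collection sizes and term distributions.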
The join is an essential tool for analyzing data collected from different sources. MapReduce has emerged as a prominent programming model for processing massive data. However, traditional MapReduce-based join algorithms are not efficient when handling skewed data: skew in the input leads to considerable load imbalance and performance degradation. This paper proposes a new skew-insensitive method, called fine-grained partitioning for skewed data (FGSD), which improves load balancing for reduce tasks. The proposed method considers the properties of both input and output data through a proposed stream-sampling algorithm. FGSD introduces a new approach to distributing input data that efficiently handles both redistribution skew and join product skew. The experimental results confirm that our solution not only achieves better load balancing but also reduces job execution time under varying degrees of data skew. Furthermore, FGSD does not require any modification to the MapReduce environment and is applicable to complex joins.
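The core idea of splitting a skewed key across several reduce tasks, guided by a sample, can be sketched like this. The frequency threshold, the round-robin splitting, and all names are assumptions, not FGSD's actual rules:

```python
from collections import Counter
from itertools import cycle

def skew_aware_partitioner(sample_keys, num_reducers):
    """Sketch: keys that look heavy in the sample are spread round-robin
    over all reducers; the other join side must then replicate its
    matching tuples to each of them."""
    freq, total = Counter(sample_keys), len(sample_keys)
    # A key is "heavy" if its sampled frequency exceeds one reducer's
    # fair share of the input.
    heavy = {k for k, c in freq.items() if c / total > 1.0 / num_reducers}
    rr = {k: cycle(range(num_reducers)) for k in heavy}

    def partition(key):
        if key in heavy:
            return next(rr[key])            # split a skewed key
        return hash(key) % num_reducers     # default hash partitioning
    return partition, heavy
```

Light keys keep ordinary hash partitioning, so the scheme only pays the replication cost where the sample says skew exists.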
Growing data volumes have put tremendous pressure on query processing and storage, so many studies focus on using the GPU to accelerate the join, one of the most important operations in modern database systems. However, existing research on GPU-accelerated joins is not well suited to joins over big data. This paper therefore speeds up the nested-loop join, hash join, and theta join by combining Hadoop with the GPU, and is the first to use the GPU to accelerate the theta join. After data pre-filtering and pre-processing, and by using MapReduce and HDFS in Hadoop as proposed in this paper, larger data tables can be handled than with existing GPU acceleration methods. With MapReduce in Hadoop, the proposed algorithm can also estimate the number of results more accurately and allocate appropriate storage space without unnecessary cost, making it more efficient. Experimental results show that, compared with a GPU-based approach without Hadoop, our approach increases speed by 1.5-2 times, and compared with Hadoop-based approaches without the GPU, by 1.3-2 times.
One of the major obstacles hindering effective join processing on MapReduce is data skew. Since MapReduce's basic hash-based partitioning cannot solve the problem properly, two alternatives have been proposed: range-based and randomized methods. However, both still have drawbacks: the range-based method does not handle join product skew, and the randomized method performs worse than basic hash-based partitioning when the input relations are not skewed. In this paper, we present a new skew-handling method called multi-dimensional range partitioning (MDRP). The proposed method overcomes the limitations of traditional algorithms in two ways: 1) the number of output records expected at each machine is considered, which leads to better handling of join product skew, and 2) a small number of input records are sampled before the actual join begins, so that an efficient execution plan reflecting the degree of data skew can be created. As a result, in a scalar-skew experiment, the proposed join algorithm is about 6.76 times faster than the range-based algorithm when join product skew exists, and about 5.14 times faster than the randomized algorithm when the input relations are not skewed. Moreover, through worst-case analysis, we show that the input and output imbalances are each at most 2. The proposed algorithm does not require any modification to the original MapReduce environment and is applicable to complex join operations such as theta joins and multi-way joins. (C) 2016 Elsevier Ltd. All rights reserved.
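A loose sketch of the planning step described in points 1) and 2): sample the input, estimate each range's join output from the sampled cell sizes, and balance the estimated output across workers. The splitting rule, the product-based output estimate, the greedy assignment, and all names are assumptions, not the MDRP algorithm itself:

```python
from bisect import bisect_right
from collections import Counter

def skew_aware_plan(sample_r, sample_s, num_splits, num_workers):
    """Sketch: build range splits from the combined sample, estimate each
    range's equi-join output as |R-cell| * |S-cell|, and assign ranges to
    workers greedily, largest estimated output first."""
    keys = sorted(sample_r + sample_s)
    bounds = sorted({keys[len(keys) * i // num_splits]
                     for i in range(1, num_splits)})

    def cell(k):
        return bisect_right(bounds, k)

    r_cnt, s_cnt = Counter(map(cell, sample_r)), Counter(map(cell, sample_s))
    load, assign = [0] * num_workers, {}
    for c in sorted(set(r_cnt) | set(s_cnt),
                    key=lambda c: -(r_cnt[c] * s_cnt[c])):
        w = min(range(num_workers), key=load.__getitem__)  # least-loaded worker
        assign[c] = w
        load[w] += r_cnt[c] * s_cnt[c]          # expected output, not input size
    return bounds, assign
```

Balancing on estimated output rather than input size is what distinguishes this family of plans from the plain range-based method the abstract criticizes.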