Title: Implementation of parallel query processing in PostgreSQL Author: Bc. Daniel Vojtek Department: Department of Software Engineering Supervisor: Mgr. Július Štroffek Supervisor's e-mail address: julo@*** Abstract: Parallel query processing can help with processing huge amounts of data stored in database systems. The aim of this diploma thesis was to explore the possibilities for, analyze, design, and finally implement parallel query processing in the open source database system PostgreSQL. I used a Master/Worker design pattern, in which the standard PostgreSQL backend process is the master. As workers I used processes created from the postmaster. In the thesis I focused on preparing the infrastructure necessary for parallel processing. I defined a new top-level memory context over shared memory, which allows efficient and convenient memory allocations. Then I implemented the creation of new worker processes, based on the master process's requirements. To be able to control these workers, I defined controlling structures using state machines. Then I implemented a parallel sort operation and the SQL operator UNION ALL using this infrastructure. The result of this diploma thesis is not only an implementation of the infrastructure and some parallel operations, but also a description of the problems encountered during the...
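The Master/Worker split described in this abstract can be illustrated with a minimal sketch of a parallel sort. This is not the thesis's implementation: Python threads stand in for the PostgreSQL worker processes, an ordinary list stands in for shared memory, and the name `parallel_sort` and its `workers` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def parallel_sort(rows, workers=4):
    """Master/Worker sort sketch: the master splits the input, each worker
    sorts one chunk, and the master merges the sorted runs."""
    if not rows:
        return []
    # Master: split the input into at most `workers` contiguous chunks.
    chunk = -(-len(rows) // workers)  # ceiling division
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    # Workers: sort each chunk independently (threads are an assumption here).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, parts))
    # Master: k-way merge of the sorted runs.
    return list(heapq.merge(*runs))
```

A call such as `parallel_sort([5, 3, 9, 1, 7, 2, 8], workers=3)` sorts three chunks separately and merges them into one ordered list.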
One of the differences between relational and object-oriented databases (OODB) is that attributes in OODB can be of a collection type (e.g. sets, lists, arrays, and bags) as well as a simple type (e.g. integer, string). Consequently, explicit join queries in OODB may be based on collection attributes. We call this type of join Collection Join Queries. There are three different kinds of collection join queries, namely: Collection-Equi Join, Collection-Intersect Join, and Sub-Collection Join. Basically, a collection-equi join query checks the equality of both collection operands, whereas a collection-intersect join query checks whether there is an intersection between the two join collection attributes. Sub-collection join queries check whether one collection is a sub-collection of the other. In this paper, we present parallel join algorithms for the above three collection join query types based on the sort-merge technique. Some of the proposed algorithms employ a nested-loop construct as well. We also outline the complexity of collection merging in the algorithm. Parallel join algorithms are normally composed of two stages, data partitioning and local join. For the data partitioning stage in collection-intersect and sub-collection join algorithms, we propose a 'Divide and Partial Broadcast' partitioning. The proposed join algorithms play an important role in parallel object-oriented query processing, due to their superiority over the conventional join methods, which are usually in the form of relational division, and also due to the inefficiency of original join predicate processing.
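The three join predicates named in this abstract can be sketched as follows. This shows only a sequential local join, not the paper's parallel algorithms or its 'Divide and Partial Broadcast' partitioning. The function names are hypothetical, and treating sub-collection as set containment is an assumption, since the abstract does not fix list or bag semantics.

```python
def collection_equi(a, b):
    """Collection-equi: sort both operands, then compare them element by element."""
    return sorted(a) == sorted(b)


def collection_intersect(a, b):
    """Collection-intersect: merge two sorted operands, stop at the first match."""
    xs, ys = sorted(a), sorted(b)
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            return True
        if xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return False


def sub_collection(a, b):
    """Sub-collection: every element of a occurs in b (set semantics assumed)."""
    return set(a) <= set(b)


def collection_join(r, s, predicate):
    """Nested-loop local join over (id, collection) pairs."""
    return [(rid, sid) for rid, rc in r for sid, sc in s if predicate(rc, sc)]
```

With `r = [(1, [3, 1, 2]), (2, [5])]` and `s = [("a", [1, 2, 3]), ("b", [5, 6])]`, the equi join matches only object 1 with object "a", while the intersect and sub-collection joins also match object 2 with object "b".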
In parallel database systems, parallelism is utilized to improve the efficiency of query processing. However, parallelism alone does not guarantee high efficiency. Therefore, query optimization techniques should be utilized to improve the efficiency of parallel query processing. In this paper, according to the characteristics of object-oriented databases and their queries, and building on a semi-join-based parallel query processing algorithm, information-flow-based query optimization techniques are proposed. The results of a performance evaluation show that they are efficient and practical.
Collection join queries are join queries based on collection attributes (i.e. non-atomic attributes), which are common in object-oriented databases. We have identified three different kinds of collection join queries, namely: collection-equi join, collection-intersect join, and sub-collection join. In this paper, we propose parallel join algorithms for these three collection join query types based on a combination of sort and hash methods, which we call parallel sort-hash collection join algorithms. The proposed join algorithms play an important role in parallel object-oriented query processing, due to their superiority over the conventional join methods, which are usually in the form of relational division, and also due to the inefficiency of the original join predicate processing. In our implementation of these algorithms on a shared-memory machine, we show that the combination of sort and hash methods performs better than conventional sort-merge and nested-loop based parallel join processing.
The GPGPU paradigm has recently been employed to accelerate the processing of large amounts of data through the utilization of the massive parallelism offered by modern GPUs. To date, several techniques have been proposed for the implementation of simple select, aggregate, and equality join operations on GPUs. In this paper, we study the efficient implementation of theta-join queries between two relations using the CUDA framework. Theta-joins are notoriously slow and thus can benefit from massively parallel execution. However, their GPU-based implementation differs significantly from hash- and sort-based equality joins and needs to be carefully crafted. The implementation is driven by two main objectives. The first is to attain high efficiency in the parallelization through data reuse, which minimizes accesses to the slow global memory. The second is to exploit the available memory as efficiently as possible, given that, in general, it cannot hold the entire input and result. We propose a methodology for processing theta-joins on a GPU, which exploits the heterogeneous nature of GPGPU while addressing memory limitations. Furthermore, we provide a series of implementation optimizations, which yield performance improvements of an order of magnitude.
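The two objectives in this abstract, reusing loaded data and working within memory that cannot hold the whole input or result, can be illustrated with a tiled nested-loop theta-join. This is a CPU-side sketch in plain Python, not the paper's CUDA implementation. The name `tiled_theta_join` and the `tile` parameter are hypothetical.

```python
import operator


def tiled_theta_join(r, s, theta, tile=2):
    """Yield the join result one tile pair at a time.

    Each tile of r is loaded once and reused against every tile of s, and
    only one tile pair's output is materialised at any moment.
    """
    for i in range(0, len(r), tile):
        r_tile = r[i:i + tile]  # loaded once, reused across all s tiles
        for j in range(0, len(s), tile):
            s_tile = s[j:j + tile]
            yield [(a, b) for a in r_tile for b in s_tile if theta(a, b)]


def theta_join(r, s, theta, tile=2):
    """Concatenate the per-tile results into a single list."""
    return [pair for block in tiled_theta_join(r, s, theta, tile) for pair in block]
```

For example, `theta_join([1, 5, 3], [2, 4, 6], operator.lt)` returns every pair `(a, b)` with `a < b`, and any other binary predicate can be passed in place of `operator.lt`.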
With the sharply increasing amount of data, big data processing based on NoSQL has been actively studied. However, NoSQL cannot satisfy the ACID properties of database transactions. Therefore, big data processing based on RDBMS has come into the spotlight. CUBRID Shard stores data in distributed CUBRID servers by dividing the database. However, CUBRID Shard cannot process a query when a user's data is distributed across multiple CUBRID servers. Therefore, in this paper we propose a CUBRID-based middleware that supports distributed data processing. Through performance evaluations, we show that our proposed scheme outperforms the existing work in terms of query processing time.
Modern High-Performance Computing (HPC) data centers routinely store massive data sets resulting in millions of directories and billions of files. To efficiently search and sift through these files and directories we present the Grand Unified File Index (GUFI), a novel file system metadata index that enables both privileged and regular users to rapidly locate and characterize data sets of interest. GUFI uses a hierarchical index that preserves file access permissions such that the index can be securely accessed by users while still enabling efficient, advanced analysis of storage system usage by cluster administrators. Compared with the current state-of-the-art indexing for file system metadata, GUFI is able to provide speedups of 1.5× to 230× for queries executed by administrators on a real production file system namespace. Queries executed by users, which typically cannot rely on cluster-wide indexing, see even greater speedups using GUFI.
This article proposes algorithms for evaluating XPath queries over an XML tree that is partitioned horizontally and vertically, and is distributed across a number of sites. The key idea is based on partial evaluation: it is to send the whole query to each site that partially evaluates the query, in parallel, and sends the results as compact (Boolean) functions to a coordinator that combines these to obtain the result. This approach possesses the following performance guarantees. First, each site is visited at most twice for data-selecting XPath queries, and only once for Boolean XPath queries. Second, the network traffic is determined by the answer to the query, rather than the size of the tree. Third, the total computation is comparable to that of centralized algorithms on the tree stored in a single site, regardless of how the tree is fragmented and distributed. We also present a MapReduce algorithm for evaluating Boolean XPath queries, based on partial evaluation. In addition, we provide algorithms to evaluate XPath queries on very large XML trees, in a centralized setting. We show both analytically and empirically that our techniques are scalable with large trees and complex XPath queries. These results, we believe, illustrate the usefulness and potential of partial evaluation in distributed systems as well as centralized XML stores for evaluating XPath queries and beyond.
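The partial-evaluation idea can be sketched for one very simple Boolean query: does any node in the distributed tree carry a given label? This is a simplified illustration, not the article's algorithm. The fragment layout (nested dicts in which a `ref` node points to another fragment) and the names `partial_eval` and `coordinate` are hypothetical. Each site returns a small Boolean function of the form "found here OR any referenced fragment is true", and the coordinator combines these.

```python
def partial_eval(fragment, label):
    """Run at one site: scan the local fragment once.

    Returns (found, deps): found is True if the label occurs locally, and
    deps lists the referenced fragments whose answers are still unknown.
    """
    found, deps = False, []
    stack = [fragment]
    while stack:
        node = stack.pop()
        if "ref" in node:  # virtual node standing for a remote fragment
            deps.append(node["ref"])
            continue
        if node["label"] == label:
            found = True
        stack.extend(node.get("children", []))
    return found, deps


def coordinate(partials, root):
    """Run at the coordinator: resolve x_f = found_f OR any(x_d for d in deps_f)."""
    def solve(fid):
        found, deps = partials[fid]
        return found or any(solve(d) for d in deps)
    return solve(root)


def evaluate(fragments, root, label):
    # Every site gets the whole query; here the sites are visited in a loop,
    # whereas in a real deployment they would run in parallel.
    partials = {fid: partial_eval(frag, label) for fid, frag in fragments.items()}
    return coordinate(partials, root)
```

In this sketch each fragment is scanned exactly once, and what travels to the coordinator is one Boolean and a list of fragment identifiers per site rather than any part of the tree.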