ISBN (print): 9783319495835; 9783319495828
Execution plans constitute the traditional interface between DBMS front-ends and back-ends; similar networks of interconnected operators are also found outside database systems. Tasks like adapting execution plans for distributed or heterogeneous runtime environments require a plan transformation mechanism which is simple enough to produce predictable results while general enough to express advanced communication schemes, required for instance in skew-resistant partitioning. In this paper, we describe the BobolangNG language, designed to express execution plans as well as their transformations, based on hierarchical models known from many environments but enhanced with a novel compile-time mechanism of component multiplication. Compared to approaches based on general graph rewriting, the plan transformation in BobolangNG is not iterative; therefore the consequences and limitations of the process are easier to understand, and developing distribution strategies and experimenting with distributed plans are easier and safer.
ISBN (print): 9781509043200
The increasing need for computing power today justifies the continuous search for techniques that decrease the time to answer usual computational problems. To take advantage of new hybrid parallel architectures composed of multithreading and multiprocessor hardware, our current efforts involve the design and validation of highly parallel algorithms that efficiently exploit the characteristics of such architectures. In this paper, we propose an automatic tuning methodology to easily exploit multicore, multi-GPU and coprocessor systems. We present an optimization of an algorithm for solving triangular systems (TRSM), based on block decomposition and asynchronous task assignment, and discuss some results.
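To make the idea of block decomposition for a triangular solve concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a lower-triangular matrix, a single right-hand side (TRSM proper handles many), and a hypothetical function name `solve_lower_blocked`. The asynchronous task assignment and auto-tuning described in the abstract are not modeled; the comments only mark where independent work appears.

```python
def solve_lower_blocked(L, b, block=2):
    """Solve L x = b for a lower-triangular L by block forward substitution.

    Illustrative only: the name, signature, and block size are assumptions.
    """
    n = len(b)
    x = [0.0] * n
    rhs = [float(v) for v in b]
    for start in range(0, n, block):
        end = min(start + block, n)
        # Step 1: solve the small diagonal block by ordinary
        # forward substitution (inherently sequential).
        for i in range(start, end):
            s = rhs[i]
            for j in range(start, i):
                s -= L[i][j] * x[j]
            x[i] = s / L[i][i]
        # Step 2: subtract the solved block's contribution from the
        # trailing rows. Each row update is independent of the others,
        # which is the work a task-based runtime could distribute.
        for i in range(end, n):
            for j in range(start, end):
                rhs[i] -= L[i][j] * x[j]
    return x


L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [2.0, 7.0, 32.0]
x = solve_lower_blocked(L, b, block=2)
```

The block size controls the split between the sequential diagonal solves and the independent trailing updates, so it is a natural parameter for a tuning procedure to search over.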
ISBN (print): 9783319499567; 9783319499550
The LogP model was used to measure the effects of latency, occupancy and bandwidth on distributed memory multiprocessors. The idea was to characterize distributed memory multiprocessors using these key parameters, studying their impact on performance in simulation environments. This work proposes a new model, based on LogP, that describes the impact on performance of applications executing on a heterogeneous cluster. This model can be used, in the near future, to help choose the best way to split a parallel application to be executed on this architecture. The model considers a heterogeneous cluster to be composed of distinct types of processors, accelerators and networks.
ISBN (print): 9781509052523
It is now a trend that computing power through parallelism is provided by multi-core systems or heterogeneous architectures for High Performance Computing (HPC) and scientific computing. Although many algorithms have been proposed and implemented using sequential computing, alternative parallel solutions provide more suitable, higher-performance solutions to the same problems. In this paper, three parallelization strategies are proposed and implemented for a dynamic programming based cloud smoothing application, using both shared memory and non-shared memory approaches. The experiments are performed on NVIDIA GeForce GT750m and Tesla K20m, two GPU accelerators of the Kepler architecture. Detailed performance analysis is presented on partition granularity at block and thread levels, memory access efficiency and computational complexity. The evaluations described show that the parallel implementations approximate the results closely with high efficiency, and these strategies can be adopted in similar data analysis and processing applications.
ISBN (print): 9783319495835; 9783319495828
The optimal directed acyclic graph search problem constitutes searching for a DAG with a minimum score, where the score of a DAG is defined on its structure. This problem is known to be NP-hard, and the state-of-the-art algorithm requires exponential time and space. It is thus not feasible to solve large instances using a single processor. Some parallel algorithms have therefore been developed to solve larger instances. A recently proposed parallel algorithm can solve an instance of 33 vertices, and this is the largest solved size reported thus far. In the study presented in this paper, we developed a novel parallel algorithm designed specifically to operate on a parallel computer with a torus network. Our algorithm crucially exploits the torus network structure, thereby obtaining good scalability. Through computational experiments, we confirmed that a run of our proposed method using up to 20,736 cores showed a parallelization efficiency of 0.94 as compared to a 1296-core run. Finally, we successfully computed an optimal DAG structure for an instance of 36 vertices, which is the largest solved size reported in the literature.
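The scaling figure quoted above can be unpacked with a little arithmetic. This is a minimal sketch, assuming the usual definition of relative parallel efficiency (speedup over the baseline run divided by the ratio of core counts); the abstract does not state the definition it uses, and the function name `relative_speedup` is illustrative.

```python
def relative_speedup(efficiency, base_cores, cores):
    """Speedup over the baseline run implied by a relative efficiency.

    Assumes efficiency = speedup / (cores / base_cores).
    """
    return efficiency * cores / base_cores


# Figures taken from the abstract: 0.94 efficiency, 1296 -> 20,736 cores.
core_ratio = 20736 // 1296
speedup = relative_speedup(0.94, 1296, 20736)
print(f"core ratio: {core_ratio}, implied speedup: {speedup:.2f}")
```

Under that assumption, an efficiency of 0.94 on 16 times as many cores corresponds to a speedup of roughly 15 over the 1296-core run.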
ISBN (print): 9781509021291
As a representation of highly connected objects, graphs are receiving growing attention. Because graph data are so interconnected, current general-purpose parallel data processing systems are ill-suited to processing graphs effectively. Thus, a wide spectrum of dedicated graph processing systems has emerged. In this paper, we give an overview of the classical types of graph processing systems. We discuss key features and the corresponding challenges of graph processing from the perspectives of graph data, graph algorithms, and the implementation of the computation. We then specify four strategies that should be taken into account when designing a graph processing system. In the last part of the paper, we compare typical present-day graph processing systems and identify their suitable application areas.
ISBN (print): 9783319495835; 9783319495828
There is no doubt that data compression is very important in computer engineering. However, most lossless data compression and decompression algorithms are very hard to parallelize, because they use dictionaries that are updated sequentially. The main contribution of this paper is to present a new lossless data compression method that we call Light Loss-Less (LLL) compression. It is designed so that decompression can be highly parallelized and run very efficiently on the GPU. This makes sense for many applications in which compressed data is read and decompressed many times, so that decompression is performed more frequently than compression. We show optimal sequential and parallel algorithms for LLL decompression and implement them to run on a Core i7-4790 CPU and a GeForce GTX 1080 GPU, respectively. To show the potential of the LLL compression method, we have evaluated the running time using five images and compared it with the well-known compression methods LZW and LZSS. Our GPU implementation of LLL decompression runs 91.1-176 times faster than the CPU implementation. Also, the GPU running times in our experiments show that LLL decompression is 2.49-9.13 times faster than LZW decompression and 4.30-14.1 times faster than LZSS decompression, although their compression ratios are comparable.
ISBN (print): 9783319495835; 9783319495828
The High Efficiency Video Coding (HEVC) standard has nearly doubled the compression efficiency of prior standards. Nonetheless, this increase in coding efficiency involves a notably higher computing complexity that must be overcome in order to achieve real-time encoding. For this reason, this paper focuses on applying parallel processing techniques to the HEVC encoder with the aim of significantly reducing its computational cost without affecting the compression performance. First, we propose a coarse-grained slice-based parallelization technique that is executed on a multi-core CPU, and then, at a finer level of parallelism, a GPU-based motion estimation algorithm. Together, the two techniques define a heterogeneous parallel coding architecture for HEVC. Results show that speed-ups of up to 4.06x can be obtained on a quad-core platform with low impact on coding performance.
In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only ap...
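For readers unfamiliar with the technique named above, the following is a minimal sequential sketch of standard k-means++ seeding: the first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen. It is not the method of the paper listed here (whose abstract is truncated); the names `sq_dist` and `kmeanspp_seeds` are illustrative, and the sketch assumes the input has at least `k` distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers with k-means++ seeding.

    Assumes `points` contains at least k distinct points; otherwise all
    sampling weights can reach zero and random.choices raises ValueError.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Sample the next center with probability proportional to d2.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        d2 = [min(d, sq_dist(p, points[idx])) for p, d in zip(points, d2)]
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.0, 0.2)]
seeds = kmeanspp_seeds(points, 3)
```

Each new center requires a pass over all points to update the distances, and the passes depend on one another, which is why seeding is a common target for parallelization work.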