ISBN:
(print) 9783319499567; 9783319499550
Task-based programming provides programmers with an intuitive abstraction to express parallelism, and runtimes with the flexibility to adapt the schedule and load balancing to the hardware. Although many profiling tools have been developed to understand these characteristics, the interplay between task scheduling and data reuse in the cache hierarchy has not been explored. These interactions are particularly intriguing due to the flexibility task-based runtimes have in scheduling tasks, which may allow them to improve cache behavior. This work presents StatTask, a novel statistical cache model that can predict cache behavior for arbitrary task schedules and cache sizes from a single execution, without programmer annotations. StatTask enables fast and accurate modeling of data locality in task-based applications for the first time. We demonstrate the potential of this new analysis for scheduling by examining applications from the BOTS benchmark suite and identifying several important opportunities for reuse-aware scheduling.
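The abstract does not detail StatTask's model, but statistical cache models of this kind are typically built on reuse (stack) distances. A minimal Python sketch of the underlying idea — predicting the hit ratio of an LRU cache of any size from reuse distances measured in a single trace — might look like this (the function names are illustrative, not StatTask's API):

```python
from collections import OrderedDict

def reuse_distances(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (inf on first touch)."""
    last_seen = OrderedDict()  # keys kept in recency order (most recent last)
    dists = []
    for addr in trace:
        if addr in last_seen:
            keys = list(last_seen)
            # distance = number of distinct addresses more recent than addr
            dists.append(len(keys) - 1 - keys.index(addr))
            last_seen.move_to_end(addr)
        else:
            dists.append(float("inf"))
            last_seen[addr] = True
    return dists

def lru_hit_ratio(trace, cache_lines):
    """A fully associative LRU cache of `cache_lines` entries hits exactly
    when the reuse distance is smaller than the cache size."""
    ds = reuse_distances(trace)
    hits = sum(1 for d in ds if d < cache_lines)
    return hits / len(ds)
```

Because the distances are computed once, the hit ratio for every cache size follows from the same single profiling run — which is the property the abstract highlights.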
ISBN:
(print) 9783319499567; 9783319499550
OpenACC has been in development for a few years now. The OpenACC 2.5 specification was recently made public, and there are several initiatives to develop full implementations of the standard to make use of accelerator capabilities. There is much to be done yet, but OpenACC for GPUs is currently reaching a good maturity level in various implementations of the standard, using CUDA and OpenCL as backends. Nvidia is investing in this effort and has released an OpenACC Toolkit, including the PGI compiler. There are, however, more developments out there. In this work, we analyze the different available OpenACC compilers that have been developed by companies and universities in recent years. We check their performance and maturity, keeping in mind that OpenACC is designed to be used without extensive knowledge of parallel programming. Our results show that the compilers are on their way to a reasonable maturity, presenting different strengths and weaknesses.
ISBN:
(print) 9783319495835; 9783319495828
Execution plans constitute the traditional interface between DBMS front-ends and back-ends; similar networks of interconnected operators are also found outside database systems. Tasks like adapting execution plans for distributed or heterogeneous runtime environments require a plan transformation mechanism that is simple enough to produce predictable results, yet general enough to express the advanced communication schemes required, for instance, in skew-resistant partitioning. In this paper, we describe the BobolangNG language, designed to express execution plans as well as their transformations. It is based on hierarchical models known from many environments but enhanced with a novel compile-time mechanism of component multiplication. Compared to approaches based on general graph rewriting, the plan transformation in BobolangNG is not iterative; therefore the consequences and limitations of the process are easier to understand, and developing distribution strategies and experimenting with distributed plans is easier and safer.
ISBN:
(print) 9783319495835; 9783319495828
After the emergence of the new High Efficiency Video Coding (HEVC) standard, several strategies have been followed to take advantage of the parallel features available in it. Many of the parallelization approaches in the literature target the decoder side, aiming at real-time decoding. However, the most complex part of the HEVC codec is the encoding side. In this paper, we perform a comparative analysis of two parallelization proposals: one based on tiles, employing shared-memory architectures, and the other based on Groups Of Pictures (GOPs), employing distributed shared-memory architectures. The results show that good speed-ups are obtained for the tile-based proposal, especially for high-resolution video sequences, but the scalability decreases for low-resolution video sequences. The GOP-based proposal outperforms the tile-based proposal as the number of processes increases, and this benefit grows when low-resolution video sequences are compressed.
ISBN:
(print) 9783319499567; 9783319499550
In this paper we present an approach to the parallel simulation of the electrical activity of the heart using the finite element method with the help of the FEniCS automated scientific computing framework. FEniCS allows scientific software development using near-mathematical notation and provides automatic parallelization on MPI clusters. We implemented the ten Tusscher-Panfilov (TP06) cell model of cardiac electrical activity. Scalability testing of the implementation was performed using up to 240 CPU cores, and a 95x speedup was achieved. We evaluated various combinations of the parallel Krylov linear solvers and preconditioners available in FEniCS. The best performance was provided by the conjugate gradient and biconjugate gradient stabilized solvers with the successive over-relaxation preconditioner. Since the FEniCS-based implementation of the TP06 model uses notation close to the mathematical one, it can be utilized by computational mathematicians, biophysicists, and other researchers without extensive parallel computing skills.
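The FEniCS code itself is not shown in the abstract. As an illustration of the solver class the authors found fastest, here is a minimal unpreconditioned conjugate gradient sketch in NumPy — a toy, not FEniCS's implementation, and without the SOR preconditioner the paper pairs it with:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Minimal unpreconditioned CG for a symmetric positive definite A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)   # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p   # conjugate direction update
        rs_old = rs_new
    return x
```

In a real FEniCS run the solver and preconditioner are selected by name rather than hand-coded, which is what makes comparing the combinations reported in the abstract inexpensive.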
ISBN:
(print) 9783319495835; 9783319495828
For particular real-world combinatorial optimization problems, e.g., the longest common subsequence problem (LCSSP) from bioinformatics, determining multiple optimal solutions (DMOS) is quite useful for experts. However, for large problem sizes this may be too time consuming, hence the resort to parallel computing. We address here the parallelization of an algorithm for DMOS for the LCSSP. Starting from the dynamic programming algorithm solving it, we derive a generic algorithm for DMOS (A-DMOS). Since the latter is a non-perfect DO-loop nest, we adopt a three-step approach: the first step transforms the A-DMOS into a perfect nest; the second chooses the granularity; and the third carries out a dependency analysis to determine the type of each loop, i.e., either parallel or serial. The practical performance of our approach is evaluated through experiments on benchmark inputs and random DNA sequences, targeting a parallel multicore machine.
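The A-DMOS algorithm itself is not given in the abstract, but the sequential idea it builds on — an LCS dynamic programming table followed by a backtrack that collects every optimal solution rather than just one — can be sketched as follows (a sequential toy, not the parallelized loop nest):

```python
def all_lcs(x, y):
    """Fill the standard LCS length table, then backtrack along every
    path that preserves the optimal length, collecting all distinct
    longest common subsequences."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    def back(i, j):
        if i == 0 or j == 0:
            return {""}
        if x[i - 1] == y[j - 1]:
            return {s + x[i - 1] for s in back(i - 1, j - 1)}
        out = set()
        # follow every predecessor cell that carries the optimal length
        if L[i - 1][j] == L[i][j]:
            out |= back(i - 1, j)
        if L[i][j - 1] == L[i][j]:
            out |= back(i, j - 1)
        return out

    return sorted(back(m, n))
```

The table fill is the regular loop nest the paper parallelizes; the branching backtrack is what makes DMOS costlier than finding a single optimum.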
ISBN:
(print) 9783319495835; 9783319495828
The optimal directed acyclic graph (DAG) search problem consists of searching for a DAG with a minimum score, where the score of a DAG is defined on its structure. This problem is known to be NP-hard, and the state-of-the-art algorithm requires exponential time and space. It is thus not feasible to solve large instances using a single processor. Some parallel algorithms have therefore been developed to solve larger instances. A recently proposed parallel algorithm can solve an instance of 33 vertices, the largest solved size reported thus far. In the study presented in this paper, we developed a novel parallel algorithm designed specifically to operate on a parallel computer with a torus network. Our algorithm crucially exploits the torus network structure, thereby obtaining good scalability. Through computational experiments, we confirmed that a run of our proposed method using up to 20,736 cores showed a parallelization efficiency of 0.94 compared to a 1,296-core run. Finally, we successfully computed an optimal DAG structure for an instance of 36 vertices, which is the largest solved size reported in the literature.
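The abstract does not specify the score function or the search algorithm. As a toy illustration of why the search is exponential, here is a brute-force optimal-DAG search over node orderings and parent sets; `local_score` is a hypothetical user-supplied function giving the per-node contribution of a decomposable score, not the one used in the paper:

```python
from itertools import combinations, permutations

def best_dag(nodes, local_score):
    """Try every topological ordering and, for each node, every parent
    set drawn from its predecessors; keep the structure whose summed
    local scores are minimal.  Exponential in len(nodes)."""
    best_total, best_parents = float("inf"), None
    for order in permutations(nodes):
        total, parents = 0.0, {}
        for k, v in enumerate(order):
            preds = order[:k]
            # every subset of the predecessors is a candidate parent set
            cands = [(local_score(v, c), c)
                     for r in range(len(preds) + 1)
                     for c in combinations(preds, r)]
            s, ps = min(cands, key=lambda t: t[0])
            total += s
            parents[v] = ps
        if total < best_total:
            best_total, best_parents = total, parents
    return best_total, best_parents
```

Even this tiny sketch visits n! orderings times 2^k parent sets per node, which is why state-of-the-art exact solvers need exponential space and, for 30+ vertices, thousands of cores.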
ISBN:
(print) 9783319499567; 9783319499550
The rapid growth of supercomputer technologies has become a driver for the development of the natural sciences. Most discoveries in astronomy, in elementary particle physics, in the design of new materials, and in DNA research are connected with numerical simulation and with supercomputers. Supercomputer simulation has become an important tool for processing the great volume of observational and experimental data accumulated by mankind. Modern scientific challenges make work on computer systems and on scientific software design highly relevant. The architecture of future exascale systems is still being discussed; nevertheless, it is necessary to develop the algorithms and software for such systems right now. This means developing software that is capable of using tens to hundreds of thousands of processors and of transmitting and storing large volumes of data. In the present work, a technology for the development of such algorithms and software is proposed. As an example of its use, the software development process is considered for some problems of astrophysics.
ISBN:
(print) 9783319495835; 9783319495828
There is no doubt that data compression is very important in computer engineering. However, most lossless data compression and decompression algorithms are very hard to parallelize, because they use dictionaries that are updated sequentially. The main contribution of this paper is to present a new lossless data compression method that we call Light Loss-Less (LLL) compression. It is designed so that decompression can be highly parallelized and run very efficiently on the GPU. This makes sense for the many applications in which compressed data is read and decompressed many times, so that decompression is performed more frequently than compression. We show optimal sequential and parallel algorithms for LLL decompression and implement them to run on a Core i7-4790 CPU and a GeForce GTX 1080 GPU, respectively. To show the potential of the LLL compression method, we evaluated the running time using five images and compared it with the well-known compression methods LZW and LZSS. Our GPU implementation of LLL decompression runs 91.1-176 times faster than the CPU implementation. Moreover, our experiments show that LLL decompression on the GPU is 2.49-9.13 times faster than LZW decompression and 4.30-14.1 times faster than LZSS decompression, while their compression ratios are comparable.
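LLL itself is defined in the paper, not the abstract. The sequential dictionary coupling that the authors identify as the obstacle to parallelization can be seen in textbook LZW, sketched here: every emitted code extends the dictionary, so step k cannot be decoded until steps 1..k-1 have run.

```python
def lzw_compress(data):
    """Textbook LZW: the dictionary grows with every output code, so
    each step depends on all previous steps -- the sequential coupling
    that makes parallel decompression hard."""
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = len(table)  # dictionary mutated mid-stream
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes):
    """Rebuilds the same dictionary entry by entry while decoding."""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        # the code may reference the entry being built this very step
        entry = table[code] if code in table else w + w[:1]
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)
```

A decompression-friendly format like LLL is presumably designed so that this chain of dictionary dependencies is broken, letting many GPU threads decode independent pieces.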
ISBN:
(print) 9783319495835; 9783319495828
The High Efficiency Video Coding (HEVC) standard has nearly doubled the compression efficiency of prior standards. Nonetheless, this increase in coding efficiency involves a notably higher computing complexity that must be overcome in order to achieve real-time encoding. For this reason, this paper focuses on applying parallel processing techniques to the HEVC encoder with the aim of significantly reducing its computational cost without affecting the compression performance. First, we propose a coarse-grained slice-based parallelization technique executed on a multi-core CPU, and then, at a finer level of parallelism, a GPU-based motion estimation algorithm. Both techniques define a heterogeneous parallel coding architecture for HEVC. Results show that speed-ups of up to 4.06x can be obtained on a quad-core platform with low impact on coding performance.