ISBN: (print) 9781665441742
Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system.
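The kind of dependence the abstract refers to can be illustrated with a toy reduction. This sketch is ours, not code from the paper: a histogram loop carries a textual read-modify-write dependence that a conservative compiler must serialize, while a developer who knows the increments commute can split the work and merge partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def histogram_serial(values, nbins):
    # Each iteration reads and writes bins[b]: a loop-carried data
    # dependence that an automatic parallelizer must conservatively respect.
    bins = [0] * nbins
    for v in values:
        bins[v % nbins] += 1
    return bins

def histogram_parallel(values, nbins, nworkers=4):
    # Domain knowledge: increments are commutative and associative, so
    # partial histograms can be built independently and merged afterwards.
    chunks = [values[i::nworkers] for i in range(nworkers)]
    with ThreadPoolExecutor(nworkers) as ex:
        partials = list(ex.map(lambda c: histogram_serial(c, nbins), chunks))
    return [sum(p[b] for p in partials) for b in range(nbins)]
```

The language extensions the paper evaluates let the developer communicate exactly this kind of semantic freedom to the compiler instead of rewriting the loop by hand.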
Stream processing applications are spread across different sectors of industry and people's daily lives. The growing volume of data we produce, such as audio, video, images, and text, demands fast and efficient computation. This can be achieved through stream parallelism, which remains a challenging task mostly reserved for experts. We introduce a stream processing framework for assessing parallel programming interfaces (PPIs). Our framework targets multi-core architectures and C++ stream processing applications, providing an API that abstracts the details of these applications' stream operators. Users can therefore easily identify all the basic operators and implement parallelism through different PPIs. In this paper, we present the proposed framework, implement three applications using its API, and show how it works by using it to parallelize and evaluate the applications with the PPIs Intel TBB, FastFlow, and SPar. The performance results were consistent with the literature.
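The operator abstraction described here can be sketched minimally (this is our illustration, not the framework's actual API): a stream application is a source emitting items, a chain of transforming operators, and a sink consuming results. Once the operators are isolated like this, the middle stages are the natural place to plug in a PPI's parallel pipeline.

```python
def run_pipeline(source, operators, sink):
    # source: any iterable that emits the stream's items
    # operators: transforming stages applied to each item in order
    # sink: consumes each transformed item (here, collecting into a list)
    out = []
    for item in source:
        for op in operators:
            item = op(item)
        sink(item, out)
    return out

# usage sketch: square each item of a small stream and collect the results
result = run_pipeline(range(5), [lambda x: x * x], lambda x, out: out.append(x))
```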
ISBN: (print) 9781665410380
This is the first edition of the PPEE workshop. Upcoming exascale systems will impose new requirements on application developers and programming systems that target platforms with hundreds of homogeneous and heterogeneous cores. The four critical challenges for exascale systems are extreme parallelism, power demand, data movement, and reliability. These systems aim to solve problems that were previously out of reach and to improve the parallel performance of applications by a factor of 50x. The power budget for achieving a billion billion (quintillion) floating-point operations per second (an exaflop) should be within 20-30 MW. Moving data on these systems relative to the computation will be challenging due to complex memory hierarchies, so it will be essential to keep the CPUs and accelerators busy once they have the data in order to avoid memory bottlenecks. Failures on these systems are anticipated to occur many times a day, so existing resiliency approaches, such as checkpoint and restart, will not work.
Authors: Rodriguez, Diego A.; Oteiza, Paola P.; Brignole, Nelida B.
Affiliations: UNS, CONICET, Planta Piloto Ingn Quim (PLAPIQUI), Bahia Blanca, Buenos Aires, Argentina; UNS, DIQ, Bahia Blanca, Buenos Aires, Argentina; UNS, Lab Invest & Desarrollo Comp Cient (LIDECC), DCIC, Bahia Blanca, Buenos Aires, Argentina; Univ Nacl Salta (UNSa), Dept Informat, Fac Ciencias Exactas, Salta, Argentina
An innovative optimization strategy based on hyper-heuristics is proposed. It consists of a parallel combination of three metaheuristics. In view of the need both to escape from local optima and to achieve high diversity, the algorithm cooperatively combines simulated annealing with genetic algorithms and ant colony optimization. A location routing problem (LRP), which aims at the design of transport networks, was adopted for the performance evaluation of the proposed algorithm. Information exchanges took place effectively between the metaheuristics and sped up the search process. Moreover, the parallel implementation was useful since it allowed several metaheuristics to run simultaneously, achieving a significant reduction in computational time. The algorithmic efficiency and effectiveness were confirmed for a medium-sized city. The proposed optimization algorithm not only accelerated computations but also helped to improve solution quality.
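The cooperation mechanism can be sketched with a toy example (our illustration, with hypothetical names, not the paper's algorithm): independent metaheuristic workers, here simulated-annealing runs on a one-dimensional objective, publish improvements to a shared best-solution pool and adopt the pool's best when it beats their own, which is the kind of information exchange that speeds up the joint search.

```python
import random

def anneal(f, x, shared_best, steps=200, temp=1.0, rng=None):
    # one simulated-annealing worker; shared_best is a one-element list
    # acting as the cooperative pool shared by all workers
    rng = rng or random.Random(0)
    best = x
    for _ in range(steps):
        cand = x + rng.uniform(-1, 1)
        if f(cand) < f(x) or rng.random() < temp:
            x = cand                      # accept improving or random move
        if f(x) < f(best):
            best = x
        temp *= 0.98                      # cool down
        # cooperation step: exchange with the shared pool
        if f(shared_best[0]) < f(best):
            best = shared_best[0]         # adopt the pool's better solution
        else:
            shared_best[0] = best         # publish our best to the pool
    return best

f = lambda x: (x - 3.0) ** 2              # toy objective, minimum at x = 3
pool = [10.0]                             # shared pool, poor initial solution
for seed in (1, 2, 3):                    # three cooperating runs
    anneal(f, 10.0, pool, rng=random.Random(seed))
```

In the paper's setting the workers are different metaheuristics (SA, GA, ACO) running in parallel rather than sequential SA runs, but the pool-exchange pattern is the same.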
ISBN: (digital) 9781665419604
ISBN: (print) 9781665429986
The world of parallel computing underwent a change as accelerators were gradually embraced in today's high-performance computing clusters. A hybrid CPU-GPU cluster is required to speed up complex computations using parallel programming paradigms. This paper deals with the performance evaluation of sequential, parallel, and hybrid programming paradigms on a hybrid CPU-GPU cluster using sorting strategies such as quick sort, heap sort, and merge sort. In this research work, a performance comparison of C, MPI, and hybrid [MPI+CUDA] on CPU-GPU hybrid systems is performed using these sorting strategies. From the analysis it is observed that the parallel programming paradigm MPI performs better than the sequential programming model. The work also evaluates the performance of CUDA on GPUs and the hybrid programming model [MPI+CUDA] on a CPU+GPU cluster using merge sort, and finds that the hybrid model outperforms both the traditional approach and the parallel paradigms MPI and CUDA. When the overall performance of all three programming paradigms is compared, MPI+CUDA on the CPU+GPU environment gives the best speedup.
Analysis of processing time and similarity of images generated between CPU and GPU architectures and sequential and parallel programming. For image processing a computer with AMD FX-8350 processor and an Nvidia GTX 96...
ISBN: (digital) 9781728195377
ISBN: (print) 9781728195384
CPU-GPU based cluster computing in today's world encompasses the domain of complex and high-intensity computation. To exploit efficient resource utilization of a cluster, the traditional programming paradigm is not sufficient. Therefore, in this article, the performance of parallel programming paradigms, OpenMP on a CPU cluster and CUDA on a GPU cluster, is analyzed using BFS and DFS graph algorithms. The article analyzes the time efficiency of traversing graphs with a given number of nodes on two different processors. Here, the CPU with the OpenMP platform and the GPU with the CUDA platform support multi-threaded processing to yield results for various node counts. From the experimental results, it is observed that parallelization with the OpenMP programming model does not boost the performance of the CPU; instead, it decreases performance by adding overheads such as idling time, inter-thread communication, and excess computation. On the other hand, the CUDA parallel programming paradigm on the GPU yields better results: the implementation achieves a speed-up of 187 to 240 times over the CPU implementation. This comparative study helps programmers select the better choice between the OpenMP and CUDA parallel programming paradigms.
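GPU BFS implementations typically use a level-synchronous formulation, where each level's frontier is expanded in bulk; that bulk per-level loop is what maps well onto a GPU's data parallelism, while per-edge work is too fine-grained for CPU threads. A sequential sketch of the structure (our illustration, not the article's code):

```python
def bfs_levels(adj, src):
    # adj: dict mapping node -> list of neighbours
    # returns dict mapping each reachable node -> its BFS depth
    depth = {src: 0}
    frontier = [src]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # this frontier-expansion loop is the part a GPU runs in parallel,
        # one thread (or warp) per frontier vertex
        for u in frontier:
            for v in adj.get(u, []):
                if v not in depth:
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth
```

On a CPU, spawning threads for each small frontier adds exactly the idling and communication overheads the article measures; on a GPU the same structure amortizes well over thousands of lightweight threads.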
The number of qubits in current quantum computers is a major restriction on their wider application. To address this issue, Ying conceived of using two or more small-capacity quantum computers to produce a larger-capa...
ISBN: (print) 9781665410380
Multicore processors are ubiquitous. Prior research has emphasized the need for high-productivity parallel programming models that require minimal changes to the sequential program and can still deliver high performance using runtime-based approaches on various architectures. In this paper, we present the structure of, and our experience teaching, the Foundations of Parallel Programming course (FPP) at IIIT Delhi using a task-based parallel programming model, the Habanero C/C++ Library (HClib). FPP covers a wide breadth of topics in parallel programming but emphasizes both high productivity and high performance. It has been offered at IIIT Delhi in the spring semester for undergraduate and postgraduate students since 2017. We describe our novel approach in which students start the learning process using traditional parallel programming models, discover their underlying limitations, and build runtime solutions to achieve high performance.
In this work, we take up the challenge of performance-portable programming of heterogeneous stencil computations across a wide range of modern shared-memory systems. An important example of such computations is the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), the second major part of the dynamic core of the EULAG geophysical model. For this aim, we develop a set of parametric optimization techniques and a four-step procedure for customizing the MPDATA code. Among these techniques are: an islands-of-cores strategy, (3+1)D decomposition, exploiting data parallelism and simultaneous multithreading, data-flow synchronization, and vectorization. The proposed adaptation methodology helps us develop an automatic transformation of the MPDATA code that achieves high sustained, scalable performance on all tested ccNUMA platforms with recent generations of Intel processors. This means that, for a given platform, the sustained performance of the new code is kept at a similar level independently of the problem size. The highest performance utilization rate, about 41-46% of the theoretical peak measured across all benchmarks, is achieved on any of the two-socket servers based on the Skylake-SP (SKL-SP), Broadwell, and Haswell CPU architectures. At the same time, the four-socket server with SKL-SP processors achieves the highest sustained performance of around 1.0-1.1 Tflop/s, which corresponds to about 33% of peak.
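The decomposition idea can be illustrated with a toy one-dimensional three-point stencil (our sketch; the coefficients, dimensionality, and block sizes are illustrative and not MPDATA's): the interior of the domain is split into independent blocks, each of which could be assigned to its own island of cores, while the inner loop stays vector-friendly.

```python
def stencil_sweep(u, c=0.1):
    # reference sweep: u[i] + c*(u[i-1] - 2*u[i] + u[i+1]) at interior
    # points, with fixed boundary values
    return [u[0]] + [u[i] + c * (u[i-1] - 2 * u[i] + u[i+1])
                     for i in range(1, len(u) - 1)] + [u[-1]]

def blocked_sweep(u, nblocks=2, c=0.1):
    # the same sweep with the interior split into independent blocks;
    # each block reads only the unmodified input, so blocks can run on
    # separate cores with no synchronization inside the sweep
    out = list(u)
    n = len(u)
    bounds = [1 + k * (n - 2) // nblocks for k in range(nblocks + 1)]
    for k in range(nblocks):              # one block per island of cores
        for i in range(bounds[k], bounds[k + 1]):
            out[i] = u[i] + c * (u[i-1] - 2 * u[i] + u[i+1])
    return out
```

The real (3+1)D scheme additionally groups several time steps per block to improve data reuse in cache, which is where most of the sustained-performance gain comes from.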