ISBN (e-book): 9783319527093
ISBN (print): 9783319527093; 9783319527086
Programmers are faced with many challenges in obtaining performance on machines with increasingly capable, yet increasingly complex, hardware. A trend towards task-parallel and asynchronous many-task programming models aims to alleviate the burden of parallel programming on a vast array of current and future platforms. One such model, Concurrent Collections (CnC), provides a programming paradigm that emphasizes the separation of concerns: domain experts concentrate on their algorithms and correctness, whereas performance experts handle mapping and tuning to a target platform. A deep understanding of parallel constructs and behavior is not necessary to write parallel applications that run on various multi-threaded and multi-core platforms when using the CnC model. However, performance can vary greatly depending on the granularity of the tasks and data declared by the programmer. These program-specific decisions are not part of the CnC tuning capabilities and must be tuned within the program itself. We analyze the performance behavior of a CnC implementation of the LULESH application as the elements in each collection are tuned, and demonstrate the effects of different techniques for modifying task and data granularity in CnC collections. Our fully tiled CnC implementation outperforms its OpenMP counterpart by 3x on 48 processors. Finally, we propose guidelines for emulating the techniques used to obtain high performance while improving programmability.
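The granularity trade-off described here can be pictured without CnC. The sketch below is plain C++ with OpenMP tasks, not the paper's CnC code: it coarsens per-element work into one task per tile, the same kind of tiling decision the paper tunes; the tile size B and the element update are illustrative placeholders.

```cpp
// Hypothetical illustration of coarsening task granularity by tiling (compile with -fopenmp).
#include <vector>
#include <cstddef>

void update_tiled(std::vector<double>& a, std::size_t n, std::size_t B) {
    #pragma omp parallel
    #pragma omp single
    for (std::size_t start = 0; start < n; start += B) {
        std::size_t end = start + B < n ? start + B : n;
        // One task per tile of B elements instead of one task per element,
        // trading scheduling overhead against available parallelism.
        #pragma omp task firstprivate(start, end) shared(a)
        for (std::size_t i = start; i < end; ++i)
            a[i] = 0.5 * a[i] + 1.0;    // placeholder element update
    }
    // The implicit barrier at the end of the parallel region waits for all tasks.
}
```

Choosing B is exactly the kind of program-specific decision the abstract points out: too small and scheduling overhead dominates, too large and parallelism is lost.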
In an introductory computational physics class of the type that many of us give, time constraints lead to hard choices on topics. Everyone likes to include their own research in such a class, but an overview of many areas is paramount. Parallel programming with MPI is one important topic. Both the principles and the need to break the "fear barrier" of using a large machine with a queuing system via ssh must be successfully passed on. Due to the plateau in chip development and to power considerations, future HPC hardware choices will include heavy use of GPUs, so the need to introduce these at the level of an introductory course has arisen. Just as for parallel coding, an explanation of the benefits and simple examples to guide the hesitant first-time user should be selected. Several student projects using GPUs that include how-to pages were proposed at the Technion. Two of the more successful ones were a lattice Boltzmann code and a finite element code, and we present these in detail.
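For readers who want a concrete idea of the "simple examples to guide the hesitant first-time user", a minimal MPI program of the usual classroom kind is sketched below; the file name and build line in the comments are assumptions, not taken from the course.

```cpp
// Minimal MPI "first contact" example.
// Compile: mpicxx hello_mpi.cpp -o hello_mpi     Run: mpirun -np 4 ./hello_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                  // start the MPI runtime
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes
    std::printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                          // shut the runtime down cleanly
    return 0;
}
```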
ISBN (print): 9780769561493
In this paper, we provide a comparison of the language features and runtime systems of commonly used threading parallel programming models for high performance computing, including OpenMP, Intel Cilk Plus, Intel TBB, OpenACC, Nvidia CUDA, OpenCL, C++11 and PThreads. We then report our performance comparison of OpenMP, Cilk Plus and C++11 for data and task parallelism on CPUs using benchmarks. The results show that performance varies with factors such as runtime scheduling strategies, the overhead of enabling parallelism and synchronization, load balancing, and the uniformity of task workloads among threads. Our study summarizes and categorizes the latest development of threading programming APIs for supporting existing and emerging computer architectures, and provides tables comparing the features of the different APIs. It can serve as a guide for users choosing an API for their applications according to the features, interfaces and performance reported.
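To make the kind of comparison concrete, the sketch below writes the same vector-scaling kernel once with an OpenMP parallel for and once with hand-partitioned C++11 threads; it is an illustrative stand-in, not one of the paper's benchmarks.

```cpp
// The same data-parallel kernel expressed in two of the compared models (compile with -fopenmp).
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void scale_openmp(std::vector<double>& v, double s) {
    #pragma omp parallel for                       // schedule chosen by the OpenMP runtime
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= s;
}

void scale_threads(std::vector<double>& v, double s, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;
    std::vector<std::thread> pool;
    std::size_t chunk = (v.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(v.size(), lo + chunk);
        // Fixed block partitioning done by hand; differences in scheduling and
        // thread-management overhead are among the factors the study measures.
        pool.emplace_back([&v, s, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) v[i] *= s;
        });
    }
    for (auto& th : pool) th.join();
}
```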
ISBN (print): 9781538626528
The ubiquity of multi- and many-core processors means that many general-purpose programmers are beginning to face the difficult task of using runtime systems designed for large-scale parallelism. Not only do they have to find and exploit irregular parallelism through tasking, but they also have to deal with runtime systems that require expert tuning of task granularity and scheduling for performance. This paper provides hands-on experience to help programmers select an appropriate tasking model and design their programs. It investigates the scheduling strategies of three different runtime tasking models: Cilk, OpenMP and High Performance ParalleX (HPX-5). Six simple benchmarks are used to expose how well each runtime performs when given untuned implementations of irregular code fragments. The benchmarks, which have irregular and dynamic structures, provide information about the pros and cons of each system's runtime model, particularly the differences that help-first and work-first scheduling present to the programmer.
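The help-first versus work-first distinction is easiest to see on an untuned recursive benchmark. The sketch below is a deliberately cutoff-free Fibonacci written with OpenMP tasks, in the spirit of such benchmarks but not taken from the paper.

```cpp
// Untuned recursive task benchmark: every call spawns a task, no granularity cutoff.
#include <cstdio>

long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);       // spawned child: a work-first runtime dives into it,
                          // a help-first runtime queues it for stealing
    y = fib(n - 2);       // continuation executed by the spawning thread
    #pragma omp taskwait  // wait for the child before combining results
    return x + y;
}

int main() {
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib(30);
    std::printf("fib(30) = %ld\n", r);
    return 0;
}
```

With no cutoff, the cost of the scheduling policy itself dominates, which is precisely what makes such untuned codes useful for exposing differences between runtimes.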
ISBN (print): 9780769561493
Cost models play an important role in the efficient implementation of software systems. These models can be embedded in operating systems and execution environments to optimize execution at run time. Even though non-uniform memory access (NUMA) architectures dominate today's server landscape, there is still a lack of parallel cost models that represent NUMA systems sufficiently. Therefore, the existing NUMA models are analyzed, and a two-step performance assessment strategy is proposed that incorporates low-level hardware counters as performance indicators. To support the two-step strategy, multiple tools are developed, each accumulating and enriching specific hardware event counter information, to explore, measure, and visualize these low-overhead performance indicators. The tools are showcased and discussed alongside specific experiments in the realm of performance assessment.
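As a rough illustration of the locality effect such a cost model has to capture, the sketch below times the same traversal over memory bound to each NUMA node using libnuma; the paper's own tooling relies on hardware event counters instead, and the buffer size here is an arbitrary choice. Link with -lnuma and pin the thread (e.g. via numactl) to decide which node counts as local.

```cpp
// Time a read traversal over memory placed on each NUMA node in turn.
#include <numa.h>
#include <chrono>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
    const std::size_t n = std::size_t(1) << 24;                 // 16 Mi doubles (128 MiB)
    for (int node = 0; node <= numa_max_node(); ++node) {
        double* buf = static_cast<double*>(numa_alloc_onnode(n * sizeof(double), node));
        if (!buf) continue;
        for (std::size_t i = 0; i < n; ++i) buf[i] = 1.0;       // touch pages on that node
        auto t0 = std::chrono::steady_clock::now();
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += buf[i];      // traversal being timed
        auto t1 = std::chrono::steady_clock::now();
        std::printf("node %d: %.2f ms (sum=%g)\n", node,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
        numa_free(buf, n * sizeof(double));
    }
    return 0;
}
```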
ISBN (print): 9788394625375
The work describes a flexible framework built to generate various (parallel) software versions and to benchmark them. The framework is written in Python with some support from the gnuplot plotting program. An example use of this tool shows the tuning of a matrix factorization on different architectures (Intel Haswell and Intel Knights Corner) with various parameters for parallelization, vectorization, blocking, etc.
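The underlying idea, sweeping a tuning parameter over generated variants and timing each one, can be reduced to a few lines. The C++ sketch below varies only a block size for a blocked matrix transpose; the actual framework is written in Python, generates full source variants, and plots the results with gnuplot, so everything here (kernel, sizes, block values) is illustrative.

```cpp
// Benchmark one tuning knob: the block size of a blocked out-of-place transpose.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

static void transpose_blocked(const std::vector<double>& a, std::vector<double>& b,
                              std::size_t n, std::size_t B) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    b[j * n + i] = a[i * n + j];     // column-wise writes, blocked for cache
}

int main() {
    const std::size_t n = 2048;
    std::vector<double> a(n * n, 1.0), b(n * n, 0.0);
    for (std::size_t B : {16u, 32u, 64u, 128u, 256u}) {       // the swept parameter
        auto t0 = std::chrono::steady_clock::now();
        transpose_blocked(a, b, n, B);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("B=%zu: %.2f ms\n", B,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return 0;
}
```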
ISBN (print): 9781538640081
String matching refers to the search for each and every occurrence of a string in another string. Nowadays, this problem arises in a great many areas, from standard programs for text editing and processing, through databases, to various applications in other sciences. There are numerous efficient algorithms to solve this problem. One of them is the Rabin-Karp algorithm, which has a complexity of O(m(n-m+1)), whereas the complexity of the proposed advanced Rabin-Karp algorithm is O(n-m). However, the main focus of this research is to apply the concepts of parallelism to improve the performance of the algorithm. There are many parallel processing application programming interfaces (APIs) available, such as OpenMP, MPI, CUDA, MapReduce, etc.; of these we have chosen OpenMP and CUDA to achieve parallelism. Comparing the results of the serial and parallel implementations gives us insight into how performance and efficiency are achieved through various techniques of parallelism.
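A hedged sketch of the OpenMP side of such a parallelization follows: the candidate start positions are split into ranges and each thread runs its own rolling hash over its range. The base, modulus and number of chunks are illustrative choices, not taken from the paper, and the CUDA variant would partition the positions analogously.

```cpp
// Rabin-Karp over disjoint ranges of start positions, one rolling hash per range
// (compile with -fopenmp).
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::size_t> rk_search(const std::string& text, const std::string& pat) {
    const long long d = 256, q = 1000000007LL;       // base and modulus (illustrative)
    const std::size_t n = text.size(), m = pat.size();
    std::vector<std::size_t> hits;
    if (m == 0 || n < m) return hits;
    const unsigned char* t = reinterpret_cast<const unsigned char*>(text.data());
    const unsigned char* p = reinterpret_cast<const unsigned char*>(pat.data());

    long long ph = 0, hmax = 1;                      // pattern hash and d^(m-1) mod q
    for (std::size_t i = 0; i + 1 < m; ++i) hmax = hmax * d % q;
    for (std::size_t i = 0; i < m; ++i) ph = (ph * d + p[i]) % q;

    const std::size_t nchunks = 8;                   // fixed split of start positions
    #pragma omp parallel
    {
        std::vector<std::size_t> local;
        #pragma omp for schedule(static) nowait
        for (int c = 0; c < (int)nchunks; ++c) {
            std::size_t lo = std::size_t(c) * (n - m + 1) / nchunks;
            std::size_t hi = std::size_t(c + 1) * (n - m + 1) / nchunks;
            if (lo >= hi) continue;
            long long h = 0;                         // rolling hash local to this range
            for (std::size_t i = lo; i < lo + m; ++i) h = (h * d + t[i]) % q;
            for (std::size_t s = lo; s < hi; ++s) {
                if (h == ph && text.compare(s, m, pat) == 0) local.push_back(s);
                if (s + 1 < hi)                      // slide the window by one position
                    h = ((h - t[s] * hmax % q + q) % q * d + t[s + m]) % q;
            }
        }
        #pragma omp critical
        hits.insert(hits.end(), local.begin(), local.end());   // unordered across chunks
    }
    return hits;
}
```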
ISBN (print): 9780983567875
The FMCAD Student Forum provides a platform for graduate students at any career stage to introduce their research to the wider Formal Methods community and solicit feedback. In 2017, the event took place in Vienna, Austria, as an integral part of the FMCAD conference. Thirteen students were invited to give a short talk and present a poster illustrating their work. The presentations covered a broad range of topics in the field of verification, such as automated reasoning; model checking of hardware, software, and parameterized systems; verification of concurrent programs; and checking of floating-point properties.
The BSP model (Bulk Synchronous Parallel) simplifies the construction and evaluation of parallel algorithms with its simplified synchronization structure and cost model. Nevertheless, imperative BSP programs can suffer from synchronization errors. Programs with textually aligned barriers are free from such errors, and this structure eases program comprehension. We propose a simplified formalization of barrier inference as a data flow analysis that statically verifies whether an imperative BSP program has replicated synchronization, a sufficient condition for textual barrier alignment.
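The property being verified can be illustrated with the classic BSPlib C interface (an assumption here; the analysis targets imperative BSP programs in general, not this library): a barrier guarded by a replicated condition is textually aligned, while one guarded by the process id is not.

```cpp
// Both functions are assumed to run inside an SPMD section (bsp_begin/bsp_end).
#include <bsp.h>

void replicated_sync() {
    // Replicated synchronization: the condition evaluates identically on every
    // process, so all processes reach the same textual bsp_sync() call.
    if (bsp_nprocs() > 1) {
        /* ... superstep work ... */
        bsp_sync();
    }
}

void misaligned_sync() {
    // Not replicated: only process 0 executes this bsp_sync(), so the other
    // processes wait at a different barrier (or none) -- the synchronization
    // error the static analysis is designed to rule out.
    if (bsp_pid() == 0) {
        bsp_sync();
    }
}
```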
ISBN (print): 9783319654829; 9783319654812
We describe our approach to augmenting the BEAGLE library for high-performance statistical phylogenetic inference to support concurrent computation of independent partial-likelihood arrays. Our solution involves identifying independent likelihood estimates in analyses of partitioned datasets and in proposed tree topologies, and configuring concurrent computation of these likelihoods via the CUDA and OpenCL frameworks. We evaluate the effect of each increase in concurrency on the throughput of our partial-likelihood kernel for a four-state nucleotide substitution model on a variety of parallel computing hardware, such as NVIDIA and AMD GPUs and Intel multicore CPUs, observing up to 16-fold speedups over our previous implementation. Finally, we evaluate the effect of these gains on a domain application program, MrBayes. For a partitioned nucleotide-model analysis we observe an average speedup in overall run time of 2.1-fold over our previous parallel implementation and 10-fold over native MrBayes with SSE.
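The concurrency pattern, evaluating the likelihoods of independent partitions at the same time and combining them afterwards, can be sketched on the CPU with standard C++. The code below is such an analogue only, not BEAGLE's CUDA/OpenCL implementation, and the partition contents and the per-partition function are placeholders.

```cpp
// CPU analogue: evaluate independent partition likelihoods concurrently.
#include <cstdio>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for a per-partition partial-likelihood evaluation.
double partition_log_likelihood(const std::vector<double>& site_terms) {
    return std::accumulate(site_terms.begin(), site_terms.end(), 0.0);
}

int main() {
    std::vector<std::vector<double>> partitions = {
        std::vector<double>(1000, -0.1),
        std::vector<double>(2000, -0.2),
        std::vector<double>(1500, -0.3),
    };
    std::vector<std::future<double>> jobs;
    for (const auto& p : partitions)                        // one concurrent evaluation
        jobs.push_back(std::async(std::launch::async,       // per independent partition
                                  partition_log_likelihood, std::cref(p)));
    double total = 0.0;
    for (auto& j : jobs) total += j.get();                  // combine when all have finished
    std::printf("total log-likelihood: %f\n", total);
    return 0;
}
```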