ISBN:
(Print) 9781509036820
All many-core systems require fine-grained shared-memory parallelism; however, the most efficient way to extract such parallelism is far from trivial. Fine-grained parallel algorithms face various performance trade-offs related to tasking, accesses to global data structures, and use of shared cache. While programming models provide high-level abstractions, such as data and task parallelism, algorithmic choices remain open on how best to implement irregular algorithms, such as sparse factorizations, while taking these trade-offs into account. In this paper, we compare these performance trade-offs for task and data parallelism on different hardware architectures such as Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8. We do this by comparing the scaling of a new task-parallel incomplete sparse Cholesky factorization called Tacho and a new data-parallel incomplete sparse LU factorization called Basker. Both solvers use the Kokkos programming model and were developed within the ShyLU package of Trilinos. Using these two codes, we demonstrate how high-level programming changes affect performance and overhead costs on multiple multi-/many-core systems. We find that Kokkos provides comparable performance with both parallel_for and task/futures on traditional x86 multicores. However, the choice of which high-level abstraction to use on many-core systems depends on both the architecture and the input matrices.
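As a minimal sketch of the data-parallel abstraction contrasted above (not code from Tacho or Basker; the view names and the axpy kernel body are invented for illustration), a Kokkos parallel_for expresses a flat loop once and lets the selected backend map it to the target architecture; the task/futures abstraction would instead spawn dependent work units through a Kokkos task scheduler:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1 << 20;
    Kokkos::View<double*> x("x", N), y("y", N);

    // Data-parallel flat loop: one index range, one body, compiled for
    // whichever backend (OpenMP, CUDA, ...) Kokkos was configured with.
    Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
      y(i) = 2.0 * x(i) + y(i);
    });
    Kokkos::fence();  // wait for the kernel to complete
  }
  Kokkos::finalize();
}
```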
This special issue on heterogeneous computing is a follow-on to two well-established workshops in the domain: HCW, the IEEE Heterogeneous Computing Workshop (held in Santa Fe in April 2004, in conjunction with IPDPS), and HeteroPar, the International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks (held in Cork in July 2004, in conjunction with ISPDC). Networks of computers are now the most commonly available parallel architecture. Unlike dedicated parallel computer systems, networks are inherently heterogeneous: they consist of diverse computers with different performance characteristics, interconnected via mixed network equipment providing communication links of different speeds and bandwidths. Traditional parallel algorithms and tools are aimed at homogeneous multiprocessors and cannot be used efficiently for parallel computing on heterogeneous networks. New ideas, dedicated algorithms, and tools are needed to use this new type of parallel architecture efficiently.
To deal with users' diverse query requirements on uncertain data, an uncertain-data query semantics for requirement extension, named RU-Topk, is introduced. In high-load application environments, the top-k que...
ISBN:
(Print) 9781467385176
Efficiently finding and computing statistics about "halos" (regions of high density) are essential analysis steps for N-body cosmology simulations. However, in state-of-the-art simulation codes, these analysis operators do not currently take advantage of the shared-memory data parallelism available on multi-core and many-core architectures. The Hardware/Hybrid Accelerated Cosmology Code (HACC) is designed as an MPI+X code, but its analysis operators are parallelized only among MPI ranks, because of the difficulty of porting different X implementations (e.g., OpenMP, CUDA) across all architectures on which it runs. In this paper, we present portable data-parallel algorithms for several variations of halo finding and halo center finding. These are implemented with the PISTON component of the VTK-m framework, which uses NVIDIA's Thrust library to construct data-parallel algorithms that allow a single implementation to be compiled to multiple backends targeting a variety of multi-core and many-core architectures. Finally, we compare the performance of our halo and center finding algorithms against the original HACC implementations on the Moonlight, Stampede, and Titan supercomputers. The portability of Thrust allowed the same code to run efficiently on each of these architectures. On Titan, the performance improvements from our code have enabled halo analysis on a very large data set (8192³ particles across 16,384 nodes of Titan) for which analysis using only the existing CPU algorithms was not feasible.
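To make the data-parallel style concrete, here is a toy sketch (not PISTON or HACC code; the halo ids and masses are invented) that computes per-halo total mass with two Thrust primitives, the kind of single-source implementation Thrust retargets across backends:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <vector>

int main() {
  // Toy inputs: a halo id per particle and a per-particle mass.
  std::vector<int>   h_id   = {2, 0, 1, 2, 0, 2};
  std::vector<float> h_mass = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
  thrust::device_vector<int>   halo_id(h_id.begin(), h_id.end());
  thrust::device_vector<float> mass(h_mass.begin(), h_mass.end());

  // Group particles by halo, then sum masses per halo: two data-parallel
  // primitives that run on whichever backend Thrust is compiled for.
  thrust::sort_by_key(halo_id.begin(), halo_id.end(), mass.begin());
  thrust::device_vector<int>   out_id(halo_id.size());
  thrust::device_vector<float> out_mass(mass.size());
  auto ends = thrust::reduce_by_key(halo_id.begin(), halo_id.end(),
                                    mass.begin(),
                                    out_id.begin(), out_mass.begin());
  // out_id[0..n) now holds the halo ids and out_mass[0..n) their total
  // masses, where n = ends.first - out_id.begin().
  int n_halos = static_cast<int>(ends.first - out_id.begin());
  return n_halos > 0 ? 0 : 1;
}
```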
ISBN:
(Print) 9780769546766
The abundant availability of multi-core computers makes "parallel computers" commonplace and makes teaching Computer Science students to design and develop parallel algorithms an urgent task. Most students recognize the need to develop skills in parallel programming. However, since their Computer Science curricula are mostly taught in terms of sequential computers, introducing a new way of analyzing and solving problems can be difficult. Making students interested in the subject can have a pivotal effect on the learning outcomes. In this short paper, we show several approaches we have been using to excite our students about learning parallel processing in our Concurrent Systems class, where parallel processing and parallel programming are taught. These approaches include showing students algorithms with seemingly impossibly high performance, showing them simple steps to achieve 100% CPU utilization on multi-core computers, combining sequential algorithms they learned in the past to create new parallel algorithms, and challenging them to implement some rather complex parallel algorithms.
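As one hypothetical classroom demonstration of the "100% CPU utilization" step (not taken from the paper; the iteration count is arbitrary), spinning one busy thread per hardware core makes every core light up in a process monitor:

```cpp
#include <atomic>
#include <thread>
#include <vector>

int main() {
  // One busy worker per hardware core; while they run, a process
  // monitor shows every core near 100% utilization.
  unsigned n = std::thread::hardware_concurrency();
  if (n == 0) n = 4;  // fallback when the core count is unknown
  std::atomic<unsigned long long> total{0};
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n; ++t)
    workers.emplace_back([&total] {
      for (unsigned long long i = 0; i < 200000000ULL; ++i)
        total.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& w : workers) w.join();
  return static_cast<int>(total.load() & 0xff);
}
```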
ISBN:
(Print) 0818682596
This paper presents the BaLinda model, based on last-in/first-out threads that interact via a shared tuplespace, and discusses the idea of using function-based objects as the basic unit of parallel execution and a hierarchical structure to partition tuplespaces. It is argued that this two-level parallel execution, both within and between objects, is well suited to scalable parallel platforms with shared-memory nodes connected by high-speed networks.
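For readers unfamiliar with tuplespace coordination, the following is a rough single-level sketch (not BaLinda's interface or its hierarchical tuplespaces; the class and method names are invented, after the classic Linda out/in operations):

```cpp
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

// Minimal Linda-style tuplespace: out() deposits a tuple, in() blocks
// until one is available and removes it (tuples are plain strings here).
class TupleSpace {
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::string> tuples_;
public:
  void out(std::string t) {
    { std::lock_guard<std::mutex> lk(m_); tuples_.push_back(std::move(t)); }
    cv_.notify_one();
  }
  std::string in() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !tuples_.empty(); });
    std::string t = std::move(tuples_.front());
    tuples_.pop_front();
    return t;
  }
};

int main() {
  TupleSpace ts;
  std::thread producer([&ts] { ts.out("work-item"); });
  std::cout << ts.in() << "\n";  // blocks until the producer deposits
  producer.join();
}
```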
ISBN:
(Print) 9781595936677
Dynamic programming is an efficient technique for solving combinatorial search and optimization problems, and many parallel dynamic programming algorithms exist. The purpose of this paper is to study a family of dynamic programming algorithms in which data dependences appear between non-consecutive stages; in other words, the data dependence is non-uniform. This kind of dynamic programming is typically called nonserial polyadic dynamic programming. Owing to the non-uniform data dependence, it is harder to optimize this problem for parallelism and locality on parallel architectures. In this paper, we address the challenge of exploiting fine-grained parallelism and locality of nonserial polyadic dynamic programming on a multi-core architecture. We present a programming and execution model for multi-core architectures with a memory hierarchy. In the framework of the new model, both parallelism and locality benefit from a data dependence transformation. We propose a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators. The new parallel algorithm tolerates memory access latency by using multiple threads and is easily improved with this technique. We formulate and analytically solve the optimization problem of determining the size that minimizes the total execution time. Experiments on a simulator validate the proposed model and show that the fine-grained parallel algorithm achieves sub-linear speedup and potentially high scalability on multi-core architectures.
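The recurrence below is a standard nonserial polyadic example (matrix-chain style), parallelized with the textbook wavefront over anti-diagonals; it is a baseline illustration, not the paper's pipelined, latency-tolerant algorithm, and the weight function is a placeholder:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Nonserial polyadic DP recurrence:
//   c[i][j] = min over i<k<j of (c[i][k] + c[k][j]) + w(i, j)
// The dependence of cell (i,j) on non-adjacent diagonals is what makes
// the dependence pattern non-uniform. Cells on one anti-diagonal are
// mutually independent, so each diagonal can be filled in parallel.
int main() {
  const int n = 512;
  std::vector<std::vector<long>> c(n, std::vector<long>(n, 0));
  auto w = [](int i, int j) { return static_cast<long>(j - i); }; // placeholder

  for (int d = 2; d < n; ++d) {        // diagonals, in dependence order
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n - d; ++i) {  // independent cells on diagonal d
      int j = i + d;
      long best = std::numeric_limits<long>::max();
      for (int k = i + 1; k < j; ++k)
        best = std::min(best, c[i][k] + c[k][j]);
      c[i][j] = best + w(i, j);
    }
  }
  return static_cast<int>(c[0][n - 1] & 0xff);  // keep the result live
}
```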
ISBN:
(Print) 0818675829
The rise of explicit parallel programming brings new problems: a lack of structure for parallel algorithms and their ad hoc development. We use skeletons to characterize and design parallel algorithms and define a process to refine the designs step by step into programs. This paper introduces a high-level library on top of MPI, derived from the skeleton concept, to achieve better programmability and portability. We conclude with a CFD application that demonstrates the idea.
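As an illustration of the skeleton idea (not the paper's library; the function name skel_map and the even-block-distribution assumption are invented), a "map" skeleton can hide the MPI scatter/compute/gather pattern behind a single call:

```cpp
#include <mpi.h>
#include <vector>

// Toy 'map' skeleton: scatter a block to each rank, apply a user
// function element-wise, gather results back on rank 0.
template <typename F>
std::vector<double> skel_map(const std::vector<double>& in, F f,
                             MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  int chunk = static_cast<int>(in.size()) / size;  // assumes size divides in.size()

  std::vector<double> local(chunk), out;
  MPI_Scatter(in.data(), chunk, MPI_DOUBLE,
              local.data(), chunk, MPI_DOUBLE, 0, comm);
  for (double& x : local) x = f(x);                // user-supplied 'map' body
  if (rank == 0) out.resize(in.size());
  MPI_Gather(local.data(), chunk, MPI_DOUBLE,
             out.data(), chunk, MPI_DOUBLE, 0, comm);
  return out;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  std::vector<double> data(1024, 1.0);
  auto result = skel_map(data, [](double x) { return 2.0 * x; },
                         MPI_COMM_WORLD);
  MPI_Finalize();
}
```

The point of the skeleton is that the communication structure is written once and verified once; the application programmer supplies only the sequential body.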
FFT is a widely used algorithm whose parallelization is a very important topic. There has been a great deal of work in this field, and many parallel algorithms have been published over several decades. In this paper, an algorithm ...
ISBN:
(Print) 0769522106
There are two main approaches to designing parallel languages. The first holds that parallel computing demands new programming concepts and radical intellectual changes in the way we think about programming, as compared to sequential computing; the design of such a parallel language must therefore present new constructs and new programming methodologies. The second holds that there is no need to reinvent the wheel: serial languages can be extended to support parallelism. The motivation behind this approach is to keep the language as friendly as possible for the programmer, who is the main bridge toward wider acceptance of the new language. In this paper we present a qualitative evaluation of two contemporary parallel languages: OpenMP-C and Unified Parallel C (UPC). Both are explicit parallel programming languages based on the ANSI C standard. OpenMP-C was designed for shared-memory architectures and extends the base language with compiler directives that annotate the original source code. UPC, on the other hand, was designed for distributed shared-memory architectures and extends the base language with new parallel constructs. We deconstruct each parallel language into its basic components, show examples, make a detailed analysis, compare the two, and finally draw some conclusions.
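To show the stylistic difference concretely, here is a rough sketch: an OpenMP-annotated loop in the C/C++ base language, with the approximate UPC counterpart given in a comment (the array name and size are invented; consult the respective specifications for exact semantics):

```cpp
#include <cstdio>

// OpenMP style: an ordinary serial loop, made parallel by a directive;
// the base language itself is unchanged.
//
// The UPC counterpart would instead use a new language construct,
// roughly:
//   shared double a[N];
//   upc_forall (int i = 0; i < N; i++; &a[i]) a[i] = 2.0 * a[i];
// where the fourth clause assigns each iteration to the thread with
// affinity to a[i].
int main() {
  const int N = 1 << 20;
  static double a[N];  // zero-initialized static storage
  #pragma omp parallel for
  for (int i = 0; i < N; ++i)
    a[i] = 2.0 * a[i];
  std::printf("%f\n", a[0]);
}
```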