Algorithmic skeletons are polymorphic higher-order functions that represent common parallelization patterns and that are implemented in parallel. They can be used as the building blocks of parallel and distributed applications by embedding them into a sequential language. In this paper, we present a new approach to programming with skeletons. We integrate the skeletons into an imperative host language enhanced with higher-order functions and currying, as well as with a polymorphic type system. We thus obtain a high-level programming language which can be implemented very efficiently. We then present a compile-time technique for the implementation of the functional features which has an important positive impact on the efficiency of the language. After describing a series of skeletons which work with distributed arrays, we give two examples of parallel algorithms implemented in our language, namely matrix multiplication and Gaussian elimination. Run-time measurements for these and other applications show that we come within a factor of 1 to 1.5 of the efficiency of message-passing C. (C) 1998 Elsevier Science B.V. All rights reserved.
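To make the idea concrete, here is a minimal, purely illustrative C++ sketch of a map skeleton as a polymorphic higher-order function. It is sequential and operates on an ordinary vector; in the paper the skeletons run in parallel over distributed arrays, and the host language is not C++.

```cpp
// Illustrative only: a "map" skeleton as a polymorphic higher-order
// function. The parallel, distributed-array version is what the paper
// actually provides; this sequential stand-in shows the interface idea.
#include <vector>
#include <iostream>

// Applies f to every element of xs and returns the results.
template <typename T, typename F>
std::vector<T> map_skel(const std::vector<T>& xs, F f) {
    std::vector<T> out(xs.size());
    for (size_t i = 0; i < xs.size(); ++i) out[i] = f(xs[i]);
    return out;
}

int main() {
    std::vector<int> row = {1, 2, 3, 4};
    auto doubled = map_skel(row, [](int x) { return 2 * x; });
    for (int v : doubled) std::cout << v << ' ';   // prints: 2 4 6 8
}
```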
ISBN (print): 9781450359337
Parallel programming for an infrastructure of multi-core or many-core clusters is a challenge for developers without experience in this domain. Developers need to use several libraries, such as MPI, OpenMP, and CUDA, to efficiently use hardware that may include additional accelerators such as GPUs. Low-level optimizations are also required in order to reach high performance. One approach to overcoming these issues is the concept of Algorithmic Skeletons. These are instances of typical patterns for parallel programming, such as map, fold, and zip, which an application programmer can simply compose without dealing with low-level programming aspects. We propose a domain-specific language called Musket that includes algorithmic skeletons as domain abstractions which seamlessly integrate with sequential code while aligning with the C++ programming language for fast learnability. For improved usability, the editing component validates the correctness of models and provides solution hints in the integrated development environment. From the naive program specification, automatic transformations are applied in order to optimize the code for parallel execution. Subsequently, low-level C++ programs are generated which are optimized for multi-core parallelism on a cluster infrastructure. We evaluate the language using benchmark models written in our DSL and compare the execution time and speedup achieved through model preprocessing and code generation. Our experimental results show that the performance of Musket programs can be significantly improved through intermediate optimizations. The DSL approach thus simplifies multi-core application development and enables performance optimizations through model transformations.
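The following sketch is not Musket syntax; it is a plain sequential C++ illustration of the abstraction level the DSL offers, composing zip and fold (here into a dot product) with no visible threading or MPI code. Musket would generate the optimized parallel C++ for such a composition from the model.

```cpp
// Not Musket's actual syntax: a sequential C++ analogue of composing the
// zip and fold skeletons the abstract names, with no low-level code.
#include <vector>
#include <numeric>
#include <functional>
#include <iostream>

template <typename T, typename F>
std::vector<T> zip_with(const std::vector<T>& a, const std::vector<T>& b, F f) {
    std::vector<T> out(a.size());
    for (size_t i = 0; i < a.size(); ++i) out[i] = f(a[i], b[i]);
    return out;
}

template <typename T, typename F>
T fold(const std::vector<T>& xs, T init, F f) {
    return std::accumulate(xs.begin(), xs.end(), init, f);
}

int main() {
    std::vector<double> a = {1, 2, 3}, b = {4, 5, 6};
    // Dot product expressed as zip (elementwise multiply) then fold (sum).
    auto prods = zip_with(a, b, [](double x, double y) { return x * y; });
    std::cout << fold(prods, 0.0, std::plus<double>{}) << "\n";   // 32
}
```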
Multithreading is the core of mainstream heterogeneous programming methods such as CUDA and OpenCL. However, multithreaded parallel programming requires programmers to handle low-level runtime details, making the programming process complex and error-prone. This paper presents no-threading (NoT), a high-level no-threading programming method. It introduces the association structure, a new language construct, to provide a declarative, runtime-free expression of different kinds of data parallelism and to avoid the use of multithreading. The NoT method defines a C-like syntax for the association structure and implements a compiler and runtime system using OpenCL as an intermediate language. We demonstrate the effectiveness of our techniques with multiple benchmarks. The size of the NoT code is comparable to that of the serial code and far less than that of the benchmark OpenCL code. The compiler generates efficient OpenCL code, yielding performance competitive with or equivalent to that of the manually optimized benchmark OpenCL code on both a GPU platform and an MIC platform.
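The paper's association-structure syntax is not reproduced here. As a rough, hypothetical C++ analogue, the helper below (associate is an invented name, not the paper's construct) declares how each output element is computed and leaves all parallelization to the implementation, which is the spirit of NoT's declarative, runtime-free style.

```cpp
// Hypothetical illustration only: a declarative per-element "association"
// in C++. A compiler like NoT's could lower such a declaration to an
// OpenCL kernel; no threads or work-item IDs appear in user code.
#include <vector>
#include <iostream>

// Invented helper: out[i] = f(i) for all i; the loop is freely parallelizable.
template <typename T, typename F>
void associate(std::vector<T>& out, F f) {
    for (size_t i = 0; i < out.size(); ++i) out[i] = f(i);
}

int main() {
    std::vector<float> a = {1, 2, 3, 4}, b = {5, 6, 7, 8}, c(4);
    associate(c, [&](size_t i) { return a[i] + b[i]; });   // declarative vadd
    for (float v : c) std::cout << v << ' ';               // 6 8 10 12
}
```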
Over the past decade, the widespread adoption of RNA-seq methodology for transcript-level monitoring has resulted in a surge of biological data requiring comprehensive analysis. The BioSkel project aims to develop a framework for RNA sequencing analysis on multi/many-core machines. This framework relies on generic and modular high-level parallel patterns, enabling biologists to customize their data processing to their specific needs while abstracting away the complexities of parallelization. In this study, we introduce the initial prototype of BioSkel for RNA sequencing analysis, which comprises three main steps: sequence alignment, feature counting, and differential expression analysis. This prototype leverages FastFlow as a back-end for parallelizing the execution, in both shared- and distributed-memory settings. We provide experimental validations of our approach, considering different architectures and dataset sizes. As a valuable byproduct, we introduce a distributed HPC version of the Bowtie2 tool, to our knowledge the first publicly available one.
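A hedged sketch of what such a three-stage pipeline can look like with FastFlow's building blocks, following FastFlow's documented ff_node_t/ff_Pipe style: the stage bodies are placeholders and the Chunk type is invented for illustration.

```cpp
// Sketch of a three-stage FastFlow pipeline mirroring BioSkel's steps:
// alignment -> feature counting -> differential expression (collapsed
// here into trivial placeholder bodies). Chunk is a made-up record type.
#include <ff/ff.hpp>
#include <string>
#include <iostream>
using namespace ff;

struct Chunk { std::string reads; };

// Stage 1: emits chunks of input onto the stream, then ends it.
struct Reader : ff_node_t<Chunk> {
    int n = 0;
    Chunk* svc(Chunk*) override {
        if (n == 4) return EOS;                      // end of stream
        return new Chunk{"reads-" + std::to_string(++n)};
    }
};
// Stage 2: stand-in for sequence alignment.
struct Align : ff_node_t<Chunk> {
    Chunk* svc(Chunk* c) override { /* call aligner here */ return c; }
};
// Stage 3: stand-in for counting / differential expression.
struct Count : ff_node_t<Chunk> {
    Chunk* svc(Chunk* c) override {
        std::cout << c->reads << "\n";
        delete c;
        return GO_ON;                                // consume, emit nothing
    }
};

int main() {
    Reader r; Align a; Count c;
    ff_Pipe<> pipe(r, a, c);                         // three-stage pipeline
    return pipe.run_and_wait_end() < 0 ? 1 : 0;
}
```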
We present the new distributed-memory run-time system (RTS) of the C++-based open-source structured parallel programming library FastFlow. The new RTS enables the execution of FastFlow shared-memory applications written using its Building Blocks (BBs) on distributed systems with minimal changes to the original program. The changes required are all high-level and deal with introducing distributed groups (dgroups), i.e., logical partitions of the BBs composing the application streaming graph. A dgroup, which in turn is implemented using FastFlow's BBs, can be deployed and executed on a remote machine and communicate with other dgroups according to the original shared-memory FastFlow streaming programming model. We describe how distributed groups are defined and how we addressed data serialization and communication performance tuning through transparent message batching and scheduling. Finally, we present a study of the overhead introduced by dgroups, considering several benchmarks on a sixteen-node cluster.
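The sketch below follows the paper's description of annotating an existing pipeline with dgroups. The header name, DFF_Init, and createGroup are taken from FastFlow's distributed-RTS tutorial and should be treated as assumptions that may vary across versions.

```cpp
// Assumed API, after FastFlow's distributed-RTS tutorial: a two-stage
// shared-memory pipeline is partitioned into two dgroups that can be
// deployed on different machines. Treat exact signatures as assumptions.
#include <ff/dff.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) override {
        for (long i = 0; i < 10; ++i) ff_send_out(new long(i));
        return EOS;
    }
};
struct Sink : ff_node_t<long> {
    long* svc(long* t) override { delete t; return GO_ON; }
};

int main(int argc, char* argv[]) {
    if (DFF_Init(argc, argv) < 0) return -1;  // bootstrap the distributed RTS
    Source s; Sink k;
    ff_pipeline pipe;
    pipe.add_stage(&s);
    pipe.add_stage(&k);
    // High-level change described in the abstract: logical partitions
    // (dgroups) of the streaming graph, one per remote machine.
    auto G1 = pipe.createGroup("G1");
    auto G2 = pipe.createGroup("G2");
    G1 << &s;
    G2 << &k;
    return pipe.run_and_wait_end();
}
```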
Similarity joins are recognized to be among the most used data processing and analysis operations. We introduce a C++-based high-level parallel pattern implemented on top of FastFlow Building Blocks to provide the programmer with ready-to-use similarity join computations. The SimilarityJoin pattern is implemented according to the MapReduce paradigm, enriched with locality-sensitive hashing (LSH) to optimize the whole computation. The new parallel pattern can be used with any C++ serializable data structure and executed on shared- and distributed-memory machines. We present experimental validations of the proposed solution considering two different clusters and small and large input datasets to evaluate in-core and out-of-core executions. The performance assessment of the SimilarityJoin pattern has been conducted by comparing its execution time against that of the original hand-tuned Hadoop-based implementation of the LSH-based similarity join algorithms, as well as a Spark-based version. The experiments show that the SimilarityJoin pattern: (1) offers a significant performance improvement for small and medium datasets; (2) is also competitive for computations on large input datasets that force out-of-core execution.
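The pattern's actual API is not shown in the abstract; the self-contained C++ fragment below only illustrates the LSH idea it builds on: records sharing a hash bucket become candidate pairs, so the expensive similarity check runs on candidates rather than on the full cross product. The lsh and similar functions are deliberately toy stand-ins.

```cpp
// Conceptual core of an LSH-accelerated similarity join (not the
// SimilarityJoin pattern's API): bucket one side by a locality-sensitive
// hash, then verify only the pairs that collide in a bucket.
#include <unordered_map>
#include <vector>
#include <string>
#include <iostream>

// Toy LSH stand-in: buckets by length. A real LSH family (e.g. MinHash
// bands over shingles) would replace this.
size_t lsh(const std::string& s) { return s.size() / 3; }

// Toy verification step standing in for the real similarity measure.
bool similar(const std::string& a, const std::string& b) {
    return a.front() == b.front();
}

int main() {
    std::vector<std::string> left  = {"apple", "apply", "banana"};
    std::vector<std::string> right = {"ample", "bandana"};

    // "Map" phase: bucket the right-hand side by hash.
    std::unordered_map<size_t, std::vector<const std::string*>> buckets;
    for (auto& r : right) buckets[lsh(r)].push_back(&r);

    // Join phase: verify only colliding candidates.
    for (auto& l : left)
        for (auto* r : buckets[lsh(l)])
            if (similar(l, *r)) std::cout << l << " ~ " << *r << "\n";
}
```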
We present the third generation of the C++-based open-source skeleton programming framework SkePU. Its main new features include new skeletons, new data container types, support for returning multiple objects from skeleton instances and user functions, support for specifying alternative platform-specific user functions to exploit e.g. custom SIMD instructions, generalized scheduling variants for the multicore CPU backends, and a new cluster backend targeting the custom MPI interface provided by the StarPU task-based runtime system. We have also revised the smart data containers' memory consistency model for automatic data sharing between main and device memory. The new features are the result of a two-year co-design effort collecting feedback from HPC application partners in the EU H2020 project EXA2PRO, and target especially the HPC application domain and HPC platforms. We evaluate the performance effects of the new features on high-end multicore CPU and GPU systems and on HPC clusters.
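A minimal usage sketch in SkePU's documented style follows; it requires SkePU's source-to-source precompiler toolchain, and the exact names should be treated as approximate rather than authoritative.

```cpp
// Sketch of SkePU-style skeleton programming: a user function is lifted
// into a Map skeleton instance and applied to smart containers. The
// backend (sequential, OpenMP, GPU, or the new cluster backend) is
// selected by SkePU, not in user code.
#include <skepu>

float add(float a, float b) { return a + b; }   // plain user function

int main() {
    auto vsum = skepu::Map<2>(add);             // Map with 2 elementwise args
    skepu::Vector<float> a(1000, 1.0f), b(1000, 2.0f), res(1000);
    vsum(res, a, b);                            // res[i] = a[i] + b[i]
    return 0;
}
```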
Parallel programming has become ubiquitous; however, it is still a low-level and error-prone task, especially when accelerators such as GPUs are used. Thus, algorithmic skeletons have been proposed to provide well-defined programming patterns in order to assist programmers and shield them from low-level aspects. As the complexity of problems, and consequently the need for computing capacity, grows, we have directed our research toward simultaneous CPU-GPU execution of data-parallel skeletons to achieve a performance gain. GPUs are optimized with respect to throughput and designed for massively parallel computations. Nevertheless, we analyze whether the additional utilization of the CPU for data-parallel skeletons in the Muenster Skeleton Library leads to speedups or causes reduced performance because of the smaller computational capacity of CPUs compared to GPUs. We present a C++ implementation based on a static distribution approach. In order to evaluate the implementation, four different benchmarks, including matrix multiplication, N-body simulation, Frobenius norm, and ray tracing, have been conducted. The ratio of CPU and GPU execution has been varied manually to observe the effects of different distributions. The results show that a speedup can be achieved by distributing the execution among CPUs and GPUs. However, both the results and the optimal distribution highly depend on the available hardware and the specific algorithm.
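A conceptual C++ sketch of the static distribution approach, not Muesli's API: a manually tuned fraction of the data-parallel range is assigned to the CPU while the remainder runs concurrently on the "GPU", which is simulated here by a plain function standing in for a real kernel launch.

```cpp
// Conceptual only: static CPU/GPU split of a data-parallel map. The
// cpuFraction knob mirrors the manually varied distribution ratio from
// the evaluation; gpu_map is a stand-in for an actual GPU kernel.
#include <vector>
#include <thread>
#include <functional>
#include <cstdio>

void cpu_map(std::vector<float>& v, size_t lo, size_t hi) {
    for (size_t i = lo; i < hi; ++i) v[i] = v[i] * v[i];   // CPU portion
}
// Placeholder for a device launch (e.g. a CUDA kernel in the real setting).
void gpu_map(std::vector<float>& v, size_t lo, size_t hi) {
    for (size_t i = lo; i < hi; ++i) v[i] = v[i] * v[i];
}

int main() {
    std::vector<float> data(1 << 20, 2.0f);
    double cpuFraction = 0.25;                 // tuned by hand, as in the paper
    size_t split = static_cast<size_t>(cpuFraction * data.size());

    std::thread cpu(cpu_map, std::ref(data), 0, split);  // CPU part
    gpu_map(data, split, data.size());                   // "GPU" part, concurrent
    cpu.join();
    std::printf("%f\n", data[0]);                        // 4.0
}
```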
Multi-core processors and clusters of multi-core processors are ubiquitous. They provide scalable performance, yet introduce complex and low-level programming models for shared and distributed memory programming. Thus, fully exploiting the potential of shared and distributed memory parallelization can be a tedious and error-prone task: programmers must take care of low-level threading and communication (e.g., message passing) details. In order to assist programmers in developing performant and reliable parallel applications, Algorithmic Skeletons have been proposed. They encapsulate well-defined, frequently recurring parallel and distributed programming patterns, thus shielding programmers from low-level aspects of parallel and distributed programming. In this paper, we take on the design and implementation of the well-known Farm skeleton. In order to address the hybrid architecture of multi-core clusters, we present a two-tier implementation built on top of MPI and OpenMP. On the basis of three benchmark applications, including a simple ray tracer, an interacting particle system, and an application for calculating the Mandelbrot set, we illustrate the advantages of both skeletal programming in general and this two-tier approach in particular.
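A hedged sketch of the two-tier idea under simplifying assumptions (a fixed task pool, N divisible by the number of processes): MPI distributes blocks of tasks across nodes, and OpenMP's dynamic scheduling plays the role of the intra-node farm workers.

```cpp
// Two-tier farm sketch: MPI = inter-node tier, OpenMP = intra-node tier.
// The task body is a placeholder; a real farm would stream tasks instead
// of using a fixed, evenly divisible pool.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

double work(double x) { return x * x; }          // placeholder task body

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;                       // assume N % size == 0
    const int chunk = N / size;
    std::vector<double> local(chunk), all;

    // Tier 2: dynamic scheduling balances uneven tasks within a node.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < chunk; ++i)
        local[i] = work(1.5 + rank * chunk + i);

    // The farmer (rank 0) collects all partial results.
    if (rank == 0) all.resize(N);
    MPI_Gather(local.data(), chunk, MPI_DOUBLE,
               all.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("first result: %f\n", all[0]);
    MPI_Finalize();
}
```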
Hardware accelerators such as GPUs or the Intel Xeon Phi comprise hundreds or thousands of cores on a single chip and promise to deliver high performance. They are widely used to boost the performance of highly parallel applications. However, because of their diverging architectures, programmers face diverging programming paradigms. Programmers also have to deal with low-level concepts of parallel programming, which make it a cumbersome task. In order to assist programmers in developing parallel applications, Algorithmic Skeletons have been proposed. They encapsulate well-defined, frequently recurring parallel programming patterns, thereby shielding programmers from low-level aspects of parallel programming. The main contribution of this paper is a comparison of two skeleton library implementations, one in C++ and one in Java, in terms of library design and programmability. In addition, on the basis of four benchmark applications, we evaluate the performance of the presented implementations on two test systems, a GPU cluster and a Xeon Phi system. The two implementations achieve comparable performance, with a slight advantage for the C++ implementation. Xeon Phi performance ranges between CPU and GPU performance.