The Generate-Test-Aggregate (GTA for short) algorithm is modeled on a simple and straightforward programming pattern for combinatorial problems: first, generate all candidates; second, test and filter out invalid ones; finally, aggregate the valid ones to produce the final result. These three processing steps can be specified by three building blocks, namely a generator, a tester, and an aggregator. Despite the simplicity of the algorithm design, implementing a GTA algorithm naively following the three processing steps, i.e., by brute force, results in an exponential-cost computation and is thus impractical for processing large data. The theory of GTA shows that if the definitions of the generator, tester, and aggregator satisfy certain conditions, an efficient (usually near-linear cost) MapReduce program can be automatically derived from the GTA algorithm. The principle of GTA is attractive, but making it practically useful remains an important and challenging problem due to the complexity of GTA program transformations. In this paper, we report on our study and implementation of a practical GTA library (written in the functional language Scala) which provides a systematic parallel-programming approach for big-data analysis with MapReduce. The library provides a simple functional-style programming interface and hides all the internal transformations. With this library, users can write parallel programs in a sequential manner in terms of the GTA algorithm, and the efficiency of the generated MapReduce programs is guaranteed systematically. Parallel programming for many problems therefore need no longer be a tough job. We demonstrate the usefulness of our GTA library on some interesting problems involving large data and show that many applications can be easily and efficiently solved using it. (C) 2013 Elsevier B.V. All rights reserved.
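The abstract does not reproduce the library's Scala interface, but the pattern itself is easy to state in code. The following is a minimal brute-force sketch in C++ (illustrative only, not the library's API), computing the maximum even sum over all sub-multisets of a list:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Brute-force Generate-Test-Aggregate over all sub-multisets of xs:
    // generate every selection, test that its sum is even, aggregate by
    // taking the maximum.  Cost is O(2^n * n) -- exactly the exponential
    // computation the GTA transformation is designed to avoid.
    long long maxEvenSum(const std::vector<long long>& xs) {
        long long best = 0;  // the empty selection has sum 0, which is even
        const std::size_t n = xs.size();
        for (std::uint64_t mask = 0; mask < (1ULL << n); ++mask) {  // generate
            long long sum = 0;
            for (std::size_t i = 0; i < n; ++i)
                if (mask & (1ULL << i)) sum += xs[i];
            if (sum % 2 != 0) continue;        // test: keep even sums only
            best = std::max(best, sum);        // aggregate: take the maximum
        }
        return best;
    }

    int main() {
        std::vector<long long> xs{3, 5, 8, -2, 7};
        std::cout << maxEvenSum(xs) << '\n';   // prints 20 (5 + 8 + 7)
    }

The point of the GTA theory is that a program of this shape, when its three blocks satisfy the stated conditions, can be systematically transformed into a near-linear MapReduce program instead of the 2^n loop shown here.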
Complex algorithms and enormous data sets require parallel execution of programs to attain results in a reasonable amount of time. Both aspects are combined in the domain of three-dimensional stencil operations, for example in computational fluid dynamics. This work contributes to the research on high-level parallel programming by discussing the generalizable implementation of a three-dimensional stencil skeleton that works in heterogeneous computing environments. Two exemplary programs, a gas simulation using the Lattice Boltzmann method and a mean blur, are executed in a multi-node, multi-GPU environment, demonstrating the runtime improvements of heterogeneous computing environments over a sequential program.
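The abstract does not give the skeleton's interface, but the computational core is easy to picture. Below is a minimal sequential C++ sketch of the 3D mean-blur example (illustrative only, with a hypothetical flat-grid layout); a stencil skeleton's job is to apply such a per-cell neighbourhood function across the grid, the nodes, and the GPUs.

    #include <cstddef>
    #include <vector>

    // Sequential sketch of a 3D mean-blur stencil: every interior cell is
    // replaced by the average of itself and its six face neighbours.  A
    // stencil skeleton parallelizes exactly this access pattern, so only
    // the per-cell update would remain user code.
    using Grid = std::vector<float>;  // flattened nx * ny * nz grid

    static std::size_t idx(std::size_t x, std::size_t y, std::size_t z,
                           std::size_t ny, std::size_t nz) {
        return (x * ny + y) * nz + z;
    }

    void meanBlur(const Grid& in, Grid& out,
                  std::size_t nx, std::size_t ny, std::size_t nz) {
        for (std::size_t x = 1; x + 1 < nx; ++x)
            for (std::size_t y = 1; y + 1 < ny; ++y)
                for (std::size_t z = 1; z + 1 < nz; ++z)
                    out[idx(x, y, z, ny, nz)] =
                        (in[idx(x, y, z, ny, nz)] +
                         in[idx(x - 1, y, z, ny, nz)] + in[idx(x + 1, y, z, ny, nz)] +
                         in[idx(x, y - 1, z, ny, nz)] + in[idx(x, y + 1, z, ny, nz)] +
                         in[idx(x, y, z - 1, ny, nz)] + in[idx(x, y, z + 1, ny, nz)]) / 7.0f;
    }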
This paper aims to exploit the massive parallelism of Field-Programmable Gate Arrays (FPGAs) by programming them in OCaml, a multiparadigm and statically typed language. It first presents O2B, an implementation of the OCaml virtual machine using a softcore processor to run the entire OCaml language on an FPGA. It then introduces Macle, a language to express, in ML style, hardware-accelerated user-defined functions, implemented as gates and registers on the same FPGA. Macle makes it possible to implement pure computations and compose them in parallel. It also supports the processing of dynamic data structures, such as arrays, matrices and trees, allocated by the OCaml runtime in the memory of the softcore processor. Macle functions can then be called, as hardware accelerators, by OCaml programs executed by O2B. This combination of Macle and OCaml code in a single source program makes it easy to prototype FPGA applications mixing numeric and symbolic computations.
We analyze the performance portability of the skeleton-based, single-source, multi-backend high-level programming framework SkePU across multiple different CPU-GPU heterogeneous systems. Thereby, we provide a systematic characterization of the application efficiency of SkePU-generated code in comparison to equivalent hand-written code in more low-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of the STREAM benchmark suite and of a part of the NAS Parallel Benchmark suite to SkePU. We show that for STREAM and the EP benchmark, SkePU regularly scores efficiency values above 80%, and that in particular for CPU systems, SkePU can outperform hand-written code.
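The abstract does not show SkePU code, but the hand-written baselines it measures against are well known. Below is a sketch of the STREAM "triad" kernel in plain C++ with OpenMP, the style of low-level code a SkePU-generated version would be compared with; roughly speaking, an efficiency above 80% means the generated code reaches at least 80% of the performance of such a kernel.

    #include <cstddef>
    #include <vector>

    // STREAM "triad" kernel, hand-written with OpenMP: a[i] = b[i] + s * c[i].
    // This is the kind of low-level baseline against which the efficiency
    // of SkePU-generated code is reported.
    void triad(std::vector<double>& a, const std::vector<double>& b,
               const std::vector<double>& c, double s) {
        const long n = static_cast<long>(a.size());
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    }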
Grid technologies aim to harness the computational capabilities of widely distributed collections of computers. Due to the heterogeneous and dynamic nature of the set of grid resources, the programming and optimisation burden of a low-level approach to grid computing is clearly unacceptable for large-scale, complex applications. The development of grid applications can be simplified by using high-level programming environments. In the present work, we address the problem of mapping a high-level grid application onto the computational resources. In order to optimise the mapping of the application, we propose to automatically generate performance models from the application using the process algebra PEPA. We target applications written with the high-level environment ASSIST, since the use of such a structured environment allows us to automate the study of the application more effectively.
Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code-crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent; that is, an optimization preferred on one machine may not be suitable for another. Second, as the possible optimization search space increases, manually finding an optimized configuration is hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims at generating new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++ which combines several models of concurrency, including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) it performs automatic code translation between concurrency platforms targeting multi-core architectures; (2) it provides a high-level language for expressing concurrency, as in the fork-join model, the SIMD paradigm and pipelining parallelism; (3) it generates parallel code from serial code, with an emphasis on code depending on machine or program parameters (e.g., cache size, number of processors, number of threads per thread block); (4) it optimizes code depending on parameters that are unknown at compile time.
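The abstract does not show MetaFork's concrete syntax, so here is the fork-join model it extends C/C++ with, expressed with standard C++ futures as an illustrative stand-in (not MetaFork code): one branch of the recursion is forked as a task, the other runs in place, and the join collects the result.

    #include <future>
    #include <iostream>

    // Fork-join parallelism sketched with standard C++: fork a task for
    // fib(n - 1), compute fib(n - 2) in place, then join.  Fork-join
    // languages express this same pattern with dedicated keywords, which
    // a framework like MetaFork can then translate between platforms.
    long fib(int n) {
        if (n < 2) return n;
        if (n < 20)                        // serial cut-off: limits task overhead
            return fib(n - 1) + fib(n - 2);
        auto child = std::async(std::launch::async, fib, n - 1);  // fork
        long b = fib(n - 2);
        return child.get() + b;            // join
    }

    int main() { std::cout << fib(30) << '\n'; }  // prints 832040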
This paper introduces SPar, an internal C++ Domain-Specific Language (DSL) that supports the development of classic stream parallel applications. The DSL uses standard C++ attributes to introduce annotations tagging the notable components of stream parallel applications: stream sources and stream processing stages. A set of tools processes SPar code (C++ code annotated with the SPar attributes) to generate FastFlow C++ code that exploits the stream parallelism denoted by the SPar annotations while targeting shared-memory multi-core architectures. We outline the main SPar features along with the main implementation techniques and tools. We also show the results of experiments assessing the feasibility of the entire approach as well as SPar's performance and expressiveness.
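A short sketch of what such annotated code looks like may help. The attribute names below (spar::ToStream, spar::Stage, spar::Input, spar::Output, spar::Replicate) follow those reported for SPar, but treat their exact spelling and placement as indicative rather than authoritative; a plain C++ compiler simply ignores the unknown attributes, while the SPar tools use them to generate the FastFlow pipeline.

    #include <cctype>
    #include <iostream>
    #include <string>

    // Indicative SPar-style annotated stream loop: the while loop is the
    // stream source, stage 1 transforms each item (replicated 4 ways),
    // stage 2 emits results.  Without the SPar tools this is plain C++.
    static std::string process(std::string s) {
        for (auto& c : s) c = std::toupper(static_cast<unsigned char>(c));
        return s;
    }

    int main() {
        std::string line;
        [[spar::ToStream, spar::Input(line)]]
        while (std::getline(std::cin, line)) {
            [[spar::Stage, spar::Input(line), spar::Output(line), spar::Replicate(4)]]
            { line = process(line); }       // stage 1: per-item transform
            [[spar::Stage, spar::Input(line)]]
            { std::cout << line << '\n'; }  // stage 2: emit result
        }
    }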