ISBN: 0780320182 (print)
The application of artificial neural networks (ANNs) in real-time embedded systems demands high-performance computers. Miniaturized massively parallel architectures are suitable computation platforms for this task. An important question is how to establish an effective mapping from ANN algorithms to hardware. In this paper, we demonstrate how an effective mapping can be achieved with our programming environment in close combination with an optimized architecture design targeted at neuro-computing.
A parallel computing procedure for computing the bounds on the J-integral in functionally graded materials is presented based on a Neumann element a-posteriori error bound. The finite element solution of J-integral is...
ISBN: 0818675829 (print)
We evaluate the High Performance Fortran (HPF) language for the compact expression and efficient implementation of conjugate gradient iterative matrix solvers on High-Performance Computing and Communications (HPCC) platforms. We discuss the use of intrinsic functions, data distribution directives and explicitly parallel constructs to optimize performance by minimizing communication requirements in a portable manner. We focus on implementations using the existing HPF definitions but also discuss issues that may influence a revised definition for HPF-2. Some of the codes discussed are available on the World Wide Web at http://***/hpfa/, along with other educational and discussion material related to applications in HPF.
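For readers unfamiliar with the solver being discussed, the following is a minimal serial sketch of the conjugate gradient iteration in Python. It is not the HPF code from the paper; the function names (`dot`, `matvec`, `conjugate_gradient`) and the tolerance parameters are illustrative assumptions. It does show where communication would arise on a distributed machine: the dot products are global reductions, and the matrix-vector product depends on how the matrix rows are distributed.

```python
# Illustrative serial sketch of conjugate gradient for a symmetric
# positive-definite system A x = b. All names are hypothetical; this is
# not the HPF implementation evaluated in the paper.

def dot(u, v):
    # On a distributed machine this would be a global reduction.
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    # Each row is independent, so rows can be distributed across processors.
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b)
```

Each iteration needs one matrix-vector product and two dot products, which is why reducing the cost of reductions and data redistribution matters for the parallel version.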
ISBN: 9781728176284 (print)
A common pattern in high-performance scientific computing is the structured grid pattern, in which one or more elements of a matrix are computed by a stencil operation over neighbouring matrix elements. Since there are multiple options to efficiently implement this pattern on modern computing architectures, we provide a comparison of the performance of a number of parallel implementations on a multi-core system with GPU capabilities and on an FPGA embedded in an SoC. The application used for this case study implements the propagation of wireless signals in a two-dimensional environment, considering reflections and signal attenuation. The parallel programming paradigms examined in this paper include CUDA, TBB, Rust, and OpenMP, plus HLS as a hardware description paradigm; CUDA proved to be the fastest implementation.
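The structured grid pattern can be illustrated with a generic five-point averaging stencil. The sketch below is a serial Python example under stated assumptions: the function name `stencil_step` and the averaging rule are hypothetical and do not reproduce the paper's wireless propagation model, its reflection handling, or any of its CUDA, TBB, Rust, OpenMP, or HLS implementations.

```python
# Illustrative five-point stencil on a 2D grid (hypothetical example).
# Every interior cell is replaced by the average of its four neighbours.
# Reads come from the old grid and writes go to a new one, so each cell
# update is independent and could be assigned to a separate thread.

def stencil_step(grid):
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary cells are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


grid = [
    [0.0, 4.0, 0.0],
    [4.0, 0.0, 4.0],
    [0.0, 4.0, 0.0],
]
result = stencil_step(grid)  # result[1][1] == 4.0
```

The double-buffering choice (separate input and output grids) is what removes dependencies between cell updates, and it is the property that the parallel paradigms compared in the abstract exploit in their different ways.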
ISBN: 081867038X (print)
The proceedings contain 42 papers. The topics discussed include: improvement of duplication scheduling heuristic algorithm with nonstrict triggering of program graph nodes; cohesion: an efficient distributed shared memory system supporting multiple memory consistency models; supercompilers for massively parallel architectures; investigation of some hardware accelerators for relational algebra operations; implementing higher-order gamma on MasPar: a case study; a framework for visual parallel programming; parallelizing a PDE solver: experiences with PISCES-MP; efficient scalable mesh algorithms for merging, sorting and selection; and constructing parallel implementations with algebraic programming tools.
As multi-core processor systems become more and more widespread, the demand for efficient parallel algorithms also propagates into the field of computer graphics. This is especially true for physically based simulation, which is notorious for expensive numerical methods. In this work, we explore possibilities for accelerating physically based simulation algorithms on multi-core architectures. Two components of physically based simulation represent a great potential for bottlenecks in parallelisation: implicit time integration and collision handling. From the parallelisation point of view these two components are substantially different. Implicit time integration can be treated efficiently using static problem decomposition. The linear system arising in this context is solved using a data-parallel preconditioned conjugate gradient algorithm. The collision handling stage, however, requires a different approach, due to its dynamic structure. This stage is handled using multi-threaded programming with fully dynamic task decomposition. In particular, we propose a new task splitting approach based on a reasonable estimation of work, which analyses previous simulation steps. Altogether, the combination of different parallelisation techniques leads to a concise and yet versatile framework for highly efficient physical simulation. (C) 2008 Elsevier Ltd. All rights reserved.
ISBN: 9781479975754 (print)
In this paper, we present a domain-specific language, referred to as OptiSDR, that matches high-level digital signal processing (DSP) routines for software-defined radio (SDR) to their generic parallel executable patterns targeted at heterogeneous computing architectures (HCAs). These HCAs include a combination of hybrid GPU-CPU and DSP-FPGA architectures that are programmed using different programming paradigms such as C/C++, CUDA, OpenCL, and/or VHDL. OptiSDR presents an intuitive, single high-level source code and near-specification-level approach for optimization and facilitation of HCAs. OptiSDR uses an optimized embedded domain-specific language (DSL) compiler framework called Delite. Our focus is on the programming language expressiveness for parallel programming and optimization of typical DSP algorithms for deployment on SDR HCAs. We demonstrate the capability of OptiSDR to express the solution to the issues of parallel DSP low-level implementation complexities in the closest way to the original parallel programming of SDR systems. We achieve this by focusing on three generic parallel executable patterns suitable for DSP routines such as cross-correlation, convolution in FIR-filter-based Hilbert transformers, and fast Fourier transforms for spectral analysis. The paper concludes with a performance analysis using DSP algorithms that test automatically generated code against hand-crafted solutions.
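To make the idea of a "generic parallel executable pattern" concrete, the sketch below writes cross-correlation of two real sequences as a map over output lags with a reduction inside each lag. This is plain Python, not OptiSDR or Delite code; the function name `cross_correlate` and the lag convention are assumptions chosen for illustration.

```python
# Illustrative cross-correlation of two real sequences, written as a map
# (over lags) containing a reduction (the sum). Hypothetical example, not
# OptiSDR syntax. Convention assumed here:
#   r[lag] = sum over n of x[n + lag] * y[n]
# for lag from -(len(y) - 1) to len(x) - 1, skipping out-of-range indices.

def cross_correlate(x, y):
    def one_lag(lag):
        # Reduction: accumulate products for a single output lag.
        return sum(
            x[n + lag] * y[n]
            for n in range(len(y))
            if 0 <= n + lag < len(x)
        )

    # Map: every lag is independent of the others.
    return [one_lag(lag) for lag in range(-(len(y) - 1), len(x))]


r = cross_correlate([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
# r == [0.5, 2.0, 3.5, 3.0, 0.0]
```

Because no output lag depends on another, a compiler that recognizes the map and reduce structure is free to distribute lags across GPU threads or other processing elements.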
A multidimensional binary partition (MBP) is a data structure determined by a set of points in n-dimensional space. On certain parallel architectures, this data structure can be easily distributed across the processing nodes of the machine and can provide a natural technique for load balancing and partitioning of application problems that depend on a distribution of dynamically changing points in multidimensional space. This paper describes parallel algorithms for generating and using MBPs on a hypercube parallel machine. It is also shown how these distributed data structures allow efficient parallel searches of the data set. The performance of an implementation of these algorithms on an NCUBE hypercube is presented.
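The abstract does not give the exact construction of an MBP, so the following is only a sketch of one related, well-known technique: recursive median bisection of a point set along alternating coordinate axes (a k-d-tree-style split). The names `build_partition` and `leaves`, the `leaf_size` parameter, and the axis-cycling rule are all assumptions made for illustration and should not be read as the paper's algorithm.

```python
# Illustrative recursive median bisection of points in n-dimensional space
# (k-d-tree style). This is an assumed stand-in, not the paper's MBP
# construction. Each split halves the point count, so the leaves hold
# nearly equal numbers of points, which is the load-balancing property.

def build_partition(points, depth=0, leaf_size=1):
    if len(points) <= leaf_size:
        return {"points": points}          # leaf region
    axis = depth % len(points[0])          # cycle through the coordinates
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "axis": axis,
        "split": ordered[mid][axis],       # coordinate of the cutting plane
        "left": build_partition(ordered[:mid], depth + 1, leaf_size),
        "right": build_partition(ordered[mid:], depth + 1, leaf_size),
    }


def leaves(node):
    # Collect the point sets of all leaf regions, left to right.
    if "points" in node:
        return [node["points"]]
    return leaves(node["left"]) + leaves(node["right"])


pts = [(0, 0), (1, 5), (2, 3), (3, 7), (4, 1), (5, 6), (6, 2), (7, 4)]
tree = build_partition(pts, leaf_size=2)
regions = leaves(tree)  # four regions of two points each
```

With eight points and `leaf_size=2`, the three binary cuts yield four regions; assigning one region per node is the kind of mapping that would suit a machine whose node count is a power of two, such as a hypercube.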
ISBN: 9781605587080 (print)
Exploiting the emerging reality of affordable multi-core architectures requires providing programmers with simple abstractions that enable them to easily turn their sequential programs into concurrent ones that expose as much parallelism as possible. While transactional memory promises to make concurrent programming easy for a wide programmer community, current implementations either disallow nested transactions to run in parallel or do not scale to arbitrary parallel nesting depths. This is an important obstacle to the central goal of transactional memory, as programmers can only start parallel threads in restricted parts of their code. This paper addresses the intrinsic difficulty behind the support for parallel nesting in transactional memory, and proposes a novel solution that, to the best of our knowledge, is the first practical solution to meet the lowest theoretical upper bound known for the problem. Using a synthetic workload configured to test parallel transactions on a multi-core machine, a practical implementation of our algorithm yields substantial speed-ups (up to 22x with 33 threads) relative to serial nesting, and shows that the time to start and commit transactions, as well as to detect conflicts, is independent of nesting depth.
ISBN: 9780769546766 (print)
Application portability between different pairs of multicore architecture and parallel programming paradigm/tool is a major problem today, often leading to a complete rewrite of an application when switching from one architecture-paradigm pair to another. This is caused by a wide variety of architectural properties that require different optimization techniques for different architectures, typically hiding the essence of the (parallel) computation defined by the application. In this paper, we introduce the Multi-Core Portability Abstraction (MCPA), which simplifies the portability and implementation of parallel applications that use shared memory. It abstracts away typical architecture-dependent effects caused by latency, synchronization, and partitioning, and acts as an executable intermediate abstraction/reference implementation as well as a tool for analyzing the intrinsic parallelism of the application and the relative suitability of architectures for executing it. We give a short application example with performance measurements.