ISBN: 9781467355254; 9781467355247 (print)
Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is the redundant execution of instructions that are common across multiple threads, which wastes energy through excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis reduces instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%.
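As a rough illustration of what a variance analysis computes, the sketch below propagates "thread-variant" status through a toy straight-line program until a fixed point is reached. It is not the paper's algorithm: the instruction format, the names `variant_values`, `kernel`, `tid`, `ptr_arg`, and `n_arg` are all hypothetical, and the control-flow (convergence) part of the analysis is omitted entirely.

```python
from typing import List, Set, Tuple

# Toy IR (an assumption for illustration): each instruction is
# (destination name, list of source operand names).
Instr = Tuple[str, List[str]]


def variant_values(instrs: List[Instr], seeds: Set[str]) -> Set[str]:
    """Return the names whose value may differ between threads.

    A value is variant if it is a seed (e.g. the thread ID) or if any of
    its source operands is variant. Everything else is a scalar candidate.
    """
    variant = set(seeds)
    changed = True
    while changed:  # iterate until no new variant value is discovered
        changed = False
        for dest, srcs in instrs:
            if dest not in variant and any(s in variant for s in srcs):
                variant.add(dest)
                changed = True
    return variant


# Hypothetical kernel body: only `tid` differs between threads at entry.
kernel: List[Instr] = [
    ("base", ["ptr_arg"]),          # same in every thread
    ("scale", ["n_arg"]),           # same in every thread
    ("offset", ["tid", "scale"]),   # depends on the thread ID
    ("addr", ["base", "offset"]),   # depends on offset
    ("limit", ["scale", "n_arg"]),  # same in every thread
]

variant = variant_values(kernel, {"tid"})
scalar = [dest for dest, _ in kernel if dest not in variant]
print(sorted(variant))  # names that must stay per-thread
print(scalar)           # names that could be computed once
```

In this toy example, `base`, `scale`, and `limit` never depend on `tid`, so they are the kind of values a scalarizing compiler could factor out of the per-thread code.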
Amorphous Data parallelism has proven to be a suitable vehicle for implementing concurrent graph algorithms effectively on multi-core architectures. In view of the growing complexity of graph algorithms for informatio...
ISBN: 9781467355254; 9781467355247 (print)
General-purpose GPU-based systems are highly attractive as they offer potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This paper presents a compiler-based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. Such an approach brings together the benefits of a clear high-level language (OpenMP) and an emerging standard (OpenCL) for heterogeneous multi-cores. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses predictive modeling to automatically determine whether it is worthwhile running the OpenCL code on the GPU or the OpenMP code on the multi-core host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on two distinct GPU-based systems: Core i7/NVIDIA GeForce GTX 580 and Core i7/AMD Radeon 7970. We achieved average speedups of 4.51x and 4.20x (up to 143x and 67x), respectively, over a sequential baseline. This is, on average, 1.63 and 1.56 times faster than a hand-coded, GPU-specific OpenCL implementation developed by independent expert programmers.
ISBN: 9780769548982 (print)
The proceedings contain 44 papers. The topics discussed include: reduce data coherence cost with an area efficient double layer counting bloom filter; synchronization-aware dynamic thread scheduling for improving performance and saving energy in multi-core embedded systems; efficient and secure trust negotiation over the Internet; design a low-power scheduling mechanism for a multicore android system; energy-aware scheduling for weakly-hard real-time system with I/O device; sparse matrix-vector multiplication based on network-on-chip: on data mapping; monoecism watermarking algorithm; a new piecewise chaotic mapping and its application in image secure communication; formulistic detection of malicious fast-flux domains; task scheduling prediction algorithms for dynamic hardware/software partitioning; and triggering cascades on strongly connected directed graphs.
Debugging parallel and concurrent applications is well recognized as a time-consuming task that often takes up a significant part of the application development process. In the context of embedded systems, Multi-Processor System-on-Chip (MPSoC) architectures feature numerous multicore processors, which may be coupled with heterogeneous processors such as Digital Signal Processors (DSPs) and/or application-specific accelerators. In this situation, it is important that developers are provided with high-level programming environments able to efficiently exploit these architectures, as well as suitable debugging tools. Dataflow programming models were explicitly designed to program parallel architectures, and they can abstract away the complexity of heterogeneous computing. In addition, the stream-processing aspect of multimedia algorithms naturally exhibits data-dependency graphs, which simplifies application design and implementation. In this paper, we propose a new approach for interactive debugging of dataflow applications. Going beyond the long-established ability of interactive debuggers to support sequential programming languages, we describe the functionalities they should provide to debug embedded and parallel dataflow applications. We then demonstrate our solution to this problem with a proof-of-concept debugger targeting the dataflow framework used on an industrial MPSoC platform. We also explain the development challenges we faced during the implementation of this GDB-based debugger and illustrate its efficiency through a case study of a video decoder debugging session.
ISBN: 9780769548982; 9781467345668 (print)
In this paper, a parallel method for solving the generalized eigenvalue problem on a multi-core platform is presented, which can compute subsets of the eigenpairs in parallel. In contrast to traditional numerical methods, the parallel method in this paper uses numerical integration. Numerical experiments are carried out on a quad-core computer in the programming environment of the Matlab parallel toolbox. The problems of computing the frequencies of a plane wing and an aircraft pylon are taken as examples, which show the efficiency and applicability of our scheme.
ISBN: 9780769548982; 9781467345668 (print)
We discuss some performance issues of the tiled Cholesky factorization on non-uniform memory access-time (NUMA) shared-memory machines. We show how to optimize thread placement and data placement in order to achieve performance gains of up to 50% compared to state-of-the-art libraries such as PLASMA or MKL.
ISBN: 9780769548982; 9781467345668 (print)
In this study, we parallelize the D&C algorithm with CUDA. Instead of using recursive programming for D&C, the recursion stack is implemented on the host side (CPU) and the merge operation is executed on the GPU in parallel. Since the recursion stack forms a full binary tree in this algorithm, the merge operations on the nodes in each layer of the binary tree can be performed synchronously. In this data-parallel computation, with careful management of the data structure, the data of each node can be arranged in the same block with no need to share data between threads, so the parallelism is not broken.
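The layer-by-layer structure described above can be sketched on the host side as follows. This is only an illustration under stated assumptions: the abstract does not say which D&C problem is solved, so merge sort is used here purely as a stand-in, and the functions `merge` and `layered_merge_sort` are hypothetical names. The merges within one layer are run sequentially in this sketch; in a CUDA version each of them could be assigned to its own thread block.

```python
from typing import List


def merge(left: List[int], right: List[int]) -> List[int]:
    """Merge two sorted lists into one sorted list."""
    out: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def layered_merge_sort(data: List[int]) -> List[int]:
    """Sort without recursion by walking the merge tree bottom-up."""
    # Leaves of the binary tree: one single-element run per input item.
    layer = [[x] for x in data]
    while len(layer) > 1:
        # Every merge in this layer touches disjoint data, so the merges
        # are independent of each other and could run in parallel.
        next_layer = [
            merge(layer[k], layer[k + 1])
            for k in range(0, len(layer) - 1, 2)
        ]
        if len(layer) % 2 == 1:
            next_layer.append(layer[-1])  # odd node is carried upward
        layer = next_layer
    return layer[0] if layer else []


print(layered_merge_sort([5, 2, 9, 1, 5, 6, 3]))
```

The outer `while` loop plays the role of the host-managed traversal, and the list comprehension marks the point where one kernel launch per tree layer would go.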
ISBN: 9780769548982; 9781467345668 (print)
In this paper we present an approach to the parallel implementation of the state minimization problem for nondeterministic finite automata. This approach is based on the truncated branch and bound method and on the use of basis and COM automata for the given language. Minimum-state automata are sought among the sub-automata of the COM automaton. Some sufficient conditions for their equivalence to the given nondeterministic automaton are proved in terms of the loops of the basis automaton. We suggest exact and heuristic state minimization algorithms, discuss their implementation details, and provide some experimental results.