ISBN: (Print) 3540897399
The proceedings contain 24 papers. The topics discussed include: CUDA-lite: reducing GPU programming complexity; MCUDA: an efficient implementation of CUDA kernels for multi-core CPUs; on the scalability of an automatically parallelized irregular application; statistically analyzing execution variance for soft real-time applications; minimum lock assignment: a method for exploiting concurrency among critical sections; set-congruence dynamic analysis for thread-level speculation (TLS); thread safety through partition and effect agreements; a fully parallel LISP2 compactor with preservation of the sliding properties; design for interoperability in STAPL: pMatrices and linear algebra algorithms; implementation of sensitivity analysis for automatic parallelization; just-in-time locality and percolation for optimizing irregular applications on a manycore architecture; and exploring the optimization space of dense linear algebra kernels.
ISBN: (Print) 9783030346263
The proceedings contain 14 papers. The special focus in this conference is on languages and compilers for parallel computing. The topics include: GASNet-EX: a high-performance, portable communication library for exascale; nested parallelism with algorithmic skeletons; HDArray: parallel array interface for distributed heterogeneous devices; automating the exchangeability of shared data abstractions; design and performance analysis of real-time dynamic streaming applications; a similarity measure for GPU kernel subgraph matching; new opportunities for compilers in computer security; footmark: a new formulation for working set statistics; towards an achievable performance for the loop nests; extending index-array properties for data dependence analysis; optimized sound and complete data race detection in structured parallel programs; and compiler optimizations for parallel programs.
ISBN: (Print) 9783540897392
Dense linear algebra kernels such as matrix multiplication have been used as benchmarks to evaluate the effectiveness of many automated compiler optimizations. However, few studies have looked at collectively applying the transformations and parameterizing them for external search. In this paper, we take a detailed look at the optimization space of three dense linear algebra kernels. We use a transformation scripting language (POET) to implement each kernel-level optimization as applied by ATLAS. We then extensively parameterize these optimizations from the perspective of a general-purpose compiler and use a stand-alone empirical search engine to explore the optimization space using several different search strategies. Our exploration of the search space reveals key interactions among several transformations that must be considered by compilers to approach the level of efficiency obtained through manual tuning of kernels.
ISBN: (Print) 9780769528984
A study is presented in applying optimistic parallel discrete event simulation techniques using reverse execution to perform instruction-level simulations of distributed memory multi-processor systems. A static program analysis approach is described that optimizes pre-processed simulated applications in order to remove certain overheads associated with forward event execution and to enable reversible execution. Reverse execution of floating point operations is also considered. Preliminary performance measurements are presented indicating this approach offers promise in speeding up parallel multi-processor simulations.
ISBN: (Print) 9783540897392
Demand for instruction level parallelism calls for increasing register bandwidth without increasing the number of register ports. Emerging architectures address this need by partitioning registers into multiple distributed banks, which offers a technology scalable substrate but a challenging compilation target. This paper introduces a register allocator for spatially partitioned architectures. The allocator performs bank assignment together with allocation. It minimizes spill code and optimizes bank selection based on a priority function. This algorithm is unique because it must reason about multiple competing resource constraints and dependencies exposed by these architectures. We demonstrate an algorithm that uses critical path estimation, delays from registers to consuming functional units, and hardware resource constraints. We evaluate the algorithm on TRIPS, a functional, partitioned, tiled processor with register banks distributed on top of a 4 x 4 grid of ALUs. These results show that the priority banking algorithm implements a number of policies that improve performance, that performance is sensitive to bank assignment, and that the compiler manages this resource well.
ISBN: (Print) 9783540897392
As hardware systems move toward multicore and multi-threaded architectures, programmers increasingly rely on automated tools to help with both the parallelization of legacy codes and effective exploitation of all available hardware resources. Thread-level speculation (TLS) has been proposed as a technique to parallelize the execution of serial codes or serial sections of parallel codes. One of the key aspects of TLS is task selection for speculative execution. In this paper we propose a cost model for compiler-driven task selection for TLS. The model employs profile-based analysis of may-dependences to estimate the probability of successful speculation. We discuss two techniques to eliminate potential inter-task dependences, thereby improving the rate of successful speculation. We also present a profiling tool, DProf, that is used to provide run-time information about may-dependences to the compiler and map dynamic dependences to the source code. This information is also made available to the programmer to assist in code rewriting and/or algorithm redesign. We used DProf to quantify the potential of this approach and we present results on selected applications from the SPEC CPU2006 and SEQUOIA benchmarks.
ISBN: (Print) 9783540897392
CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared-memory, multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. Preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs fission to eliminate barrier synchronizations, and converts scalar references to thread-local data into replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
ISBN: (Print) 9783540897392
Programming paradigms are designed to express algorithms elegantly and efficiently. There are many parallel programming paradigms, each suited to a certain class of problems. Selecting the best parallel programming paradigm for a problem minimizes programming effort and maximizes performance. Given the increasing complexity of parallel applications, no one paradigm may be suitable for all components of an application. Today, most parallel scientific applications are programmed with a single paradigm and the challenge of multi-paradigm parallel programming remains unmet in the broader community. We believe that each component of a parallel program should be programmed using the most suitable paradigm. Furthermore, it is not sufficient to simply bolt modules together: programmers should be able to switch between paradigms easily, and resource management across paradigms should be automatic. We present a pre-existing adaptive run-time system (ARTS) and show how it can be used to meet these challenges by allowing the simultaneous use of multiple parallel programming paradigms and supporting resource management across all of them. We discuss the implementation of some common paradigms within the ARTS and demonstrate the use of multiple paradigms within our feature-rich unstructured mesh framework. We show how this approach boosts performance and productivity for an application developed using this framework.
ISBN: (Print) 9783540852605
Application performance is heavily dependent on compiler optimizations. Modern compilers rely largely on the information made available to them at the time of compilation. In this regard, specializing the code according to input values is an effective way to communicate necessary information to the compiler. However, static specialization suffers from possible code explosion, and dynamic specialization requires runtime compilation activities that may degrade the overall performance of the application. This article proposes an automated approach for specializing code that is able to address both the problem of code size increase and the overhead of runtime activities. We first obtain optimized code through specialization performed at static compile time and then generate a template that can work for a large set of values through runtime specialization. Our experiments show significant improvement for different SPEC benchmarks on Itanium-II (IA-64) and Pentium-IV processors using the icc and gcc compilers.