this paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. the model is designed to provide performance information to an auto-tuning compiler and assist ...
详细信息
ISBN:
(纸本)9781605587080
this paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. the model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on the NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but represent challenges to the analytical performance models. the proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from a kernel code.
throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. this level of concurrency pushes typical synchronization primitives (e.g., m...
详细信息
ISBN:
(纸本)9781450362252
throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. this level of concurrency pushes typical synchronization primitives (e.g., mutexes) over their scalability limits, creating significant performance bottlenecks in modules, such as memory allocators, that use them. In this paper, we develop concurrent programming techniques and synchronization primitives, in support of a dynamic memory allocator, that are efficient for use with very high levels of concurrency. We formulate resource allocation as a two-stage process, that decouples accounting for the number of available resources from the tracking of the available resources themselves. To facilitate the accounting stage, we introduce a novel bulk semaphore abstraction that extends traditional semaphore semantics by optimizing for the case where threads operate on the semaphore simultaneously. We also similarly design new collective synchronization primitives that enable groups of cooperating threads to enter critical sections together. Finally, we show that delegation of deferred reclamation to threads already blocked greatly improves efficiency. Using all these techniques, our throughput-oriented memory allocator delivers both high allocation rates and low memory fragmentation on modern GPUs. Our experiments demonstrate that it achieves allocation rates that are on average 16.56 times higher than the counterpart implementation in the CUDA 9 toolkit.
We have modified the C language to support a programming model based on a shared address space with physically distributed memory. Withthis model users can write programs in which the nodes of a massively parallel pr...
详细信息
We have modified the C language to support a programming model based on a shared address space with physically distributed memory. Withthis model users can write programs in which the nodes of a massively parallel processor can access remote memory without message passing. AC provides support for distributed arrays as well as pointers to distributed data. Simple array references and pointer dereferencing are sufficient to generate low-overhead remote reads and writes. We have implemented these ideas in a compiler based on the GNU C compiler and targeted at Cray Research's T3D. Initial performance measurements show that AC generates code for remote accesses which is considerably faster than that of the native compiler for structures up to about 16 words in size and virtually equivalent for larger transfers.
this paper describes Surge, a collection-oriented programming model that enables programmers to compose parallel computations using nested high-level data collections and operators. Surge exposes a code generation int...
详细信息
ISBN:
(纸本)9781450332057
this paper describes Surge, a collection-oriented programming model that enables programmers to compose parallel computations using nested high-level data collections and operators. Surge exposes a code generation interface, decoupled from the core computation, that enables programmers and autotuners to easily generate multiple implementations of the same computation on various parallel architectures such as multi-core CPUs and GPUs. By decoupling computations from architecture-specific implementation, programmers can target multiple architectures more easily, and generate a search space that facilitates optimization and customization for specific architectures. We express in Surge four real-world benchmarks from domains such as sparse linear-algebra and machine learning and from the same performance-portable specification, generate OpenMP and CUDA C++ implementations. Surge generates efficient, scalable code which achieves up to 1.32x speedup over handcrafted, well-optimized CUDA code.
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures (2) a data parallel im...
详细信息
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, (3) techniques for expressing irregular communication and nonuniform computations associated withthe elements of linearized data structures. We demonstrate and evaluate our formulation on a parallel, hierarchical N-body method for the evaluation of potentials and forces of nonuniform particle distributions. Our experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible withthe proper language/compiler/runtime support. Our data parallel N-body code provides a much needed 'benchmark' code for evaluating and improving HPF compilers.
作者:
Sulzmann, MartinLam, Edmund S. L.Programming
Logics and Semantics Group IT University of Copenhagen Rued Langgaards Vej 7 2300 Copenhagen S Denmark School of Computing
National University of Singapore S16 Level 5 3 Science Drive 2 Singapore 117543 Singapore
Multi-set constraint rewriting allows for a highly parallel computational model and has been used in a multitude of application domains such as constraint solving, agent specification etc. Rewriting steps can be appli...
详细信息
this paper presents a new parallel volume rendering algorithm and implementation, based on shear warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we...
详细信息
ISBN:
(纸本)9780897919067
this paper presents a new parallel volume rendering algorithm and implementation, based on shear warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we use increasingly detailed performance measurements on real machines and simulators to understand performance bottlenecks. this leads us to a new parallel implementation that substantially outperforms and out-scales the old one on a range of shared address space platforms, from bus-based centralized memory machine to hardware-coherent distributed memory machines to networks of computers connected by page-based shared virtual memory. the results demonstrate that real time volume rendering is promising on general purpose multiprocessors, and illustrate the utility of tool hierarchies in conjunction with algorithmic and application knowledge to understand memory system interactions and improve parallel algorithms.
Multimedia applications operate on downstreams. A large class of multimedia applications is described by the macro-dataflow graph model. this study attempted to examine how such multimedia applications can be compiled...
详细信息
ISBN:
(纸本)9780897919067
Multimedia applications operate on downstreams. A large class of multimedia applications is described by the macro-dataflow graph model. this study attempted to examine how such multimedia applications can be compiled to run efficiently on parallel machines, by optimizing boththroughput (T) and latency (L), using two techniques based on task speedup functions. the first step chooses an appropriate pipeline structure for the system while the second exploits the dataset parallelism intrinsic in the period datastream, and runs multiple datasets in parallel (task/cluster multiplicity) for each clustering. Both techniques were used to compile real-time image-processing problems on an NCUBE-2 multiprocessor. the two techniques showed substantial performance gains.
the proceedings contain 25 papers. the topics discussed include: order-sorted dependency pairs;macros for context-free grammars;inferring precise polymorphic type dependencies in logic programs;a type system for safe ...
ISBN:
(纸本)9781605581170
the proceedings contain 25 papers. the topics discussed include: order-sorted dependency pairs;macros for context-free grammars;inferring precise polymorphic type dependencies in logic programs;a type system for safe memory management and its proof of correctness;programming with proofs and explicit contexts;towards execution time estimation in abstract machine-based languages;similarity-based reasoning in qualified logic programming;classifying integrity checking methods with regard to inconsistency tolerance;comprehending finite maps for algorithmic debugging of higher-order functional programs;parallel execution of multi-set constraint rewrite rules;a rewriting framework for the composition of access control policies;global difference constraint propagation for finite domain solvers;and dynamic variable elimination during propagation solving.
this work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distr...
详细信息
暂无评论