ISBN (Print): 9781605587080
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.
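As a rough illustration of the kind of data transformation this abstract refers to (not the authors' mathematical model or their GPU targets), the sketch below converts an array-of-structures layout, whose strided field accesses hinder vectorization, into a structure-of-arrays layout with unit-stride streams. The `ParticleAoS`/`ParticlesSoA` names are invented for the example.

```cpp
// Minimal sketch: AoS -> SoA layout transformation to enable vectorization.
#include <vector>
#include <cstddef>

struct ParticleAoS { float x, y, z, w; };   // x fields sit 16 bytes apart: strided access

struct ParticlesSoA {                        // each field is a contiguous, unit-stride array
    std::vector<float> x, y, z, w;
};

// Copy-based layout transformation; its O(n) cost is amortized over many
// vectorized traversals of the SoA arrays (matching the amortization argument above).
ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size()); out.y.reserve(in.size());
    out.z.reserve(in.size()); out.w.reserve(in.size());
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x); out.y.push_back(p.y);
        out.z.push_back(p.z); out.w.push_back(p.w);
    }
    return out;
}

// A loop over the SoA arrays has unit-stride accesses and is a straightforward
// candidate for compiler auto-vectorization.
void scale_x(ParticlesSoA& p, float s) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] *= s;                         // contiguous loads/stores -> SIMD friendly
}
```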
ISBN (Print): 9798400704352
This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed, dynamic, ordered set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared with cache-optimized trees, PMAs were slower to update but faster to scan. The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3x faster batch-insert throughput and 4x faster range-query throughput compared with compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees. We further evaluate the CPMA against compressed PaC-trees and Aspen, a state-of-the-art system, on a real-world application of dynamic-graph processing. The CPMA is on average 1.2x faster on a suite of graph algorithms and 2x faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3x faster on graph algorithms and 2x faster on batch inserts compared with Aspen.
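To give a feel for the PMA idea the CPMA builds on, here is a toy, single-threaded, uncompressed sketch: keys stay sorted in one array with deliberate gaps, so inserts shift only a nearby run of cells while range scans walk contiguous memory. It is not the paper's batch-parallel or compressed structure; a real PMA maintains per-window density bounds and rebalances instead of the naive gap search used here.

```cpp
// Toy "gapped sorted array" illustrating the Packed Memory Array idea.
#include <cstdint>
#include <cstddef>
#include <vector>

class ToyPMA {
    static constexpr uint64_t EMPTY = UINT64_MAX;   // sentinel marking a gap
    std::vector<uint64_t> slots;

public:
    explicit ToyPMA(std::size_t capacity) : slots(capacity, EMPTY) {}

    bool insert(uint64_t key) {
        // Position of the first occupied slot holding a key >= the new key.
        std::size_t succ = slots.size();
        for (std::size_t i = 0; i < slots.size(); ++i)
            if (slots[i] != EMPTY && slots[i] >= key) { succ = i; break; }

        // Prefer a free slot just left of the successor.
        if (succ > 0 && slots[succ - 1] == EMPTY) { slots[succ - 1] = key; return true; }

        // Otherwise shift the successor run right into the nearest gap.
        std::size_t gap = succ;
        while (gap < slots.size() && slots[gap] != EMPTY) ++gap;
        if (gap < slots.size()) {
            for (std::size_t i = gap; i > succ; --i) slots[i] = slots[i - 1];
            slots[succ] = key;
            return true;
        }

        // No room to the right: shift the predecessor run left into an earlier gap.
        std::size_t g = succ;
        while (g > 0 && slots[g - 1] != EMPTY) --g;
        if (g == 0) return false;                   // genuinely full
        for (std::size_t i = g - 1; i + 1 < succ; ++i) slots[i] = slots[i + 1];
        slots[succ - 1] = key;
        return true;
    }

    // Range scans walk one contiguous array, skipping gaps -- the cache behavior
    // that makes PMA-based structures faster to scan than pointer-based trees.
    template <class F>
    void map_range(uint64_t lo, uint64_t hi, F f) const {
        for (uint64_t k : slots)
            if (k != EMPTY && k >= lo && k < hi) f(k);
    }
};
```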
ISBN (Print): 0897915895
A future is a language construct that allows programmers to expose parallelism in applicative languages such as MultiLisp [5] with minimal effort. In this paper we describe a technique for implementing futures, which we call leapfrogging, that reduces blocking due to load imbalance. The utility of leapfrogging is enhanced by the fact that it is completely platform-independent, is free from deadlock, and places a bound on stack sizes that is at most a small constant times the maximum stack size encountered during a sequential execution of the same computation. We demonstrate the performance of leapfrogging using a prototype implementation written in C++.
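For readers unfamiliar with the construct itself, the sketch below shows the future pattern using standard C++ `std::async`; it is not the paper's MultiLisp futures or its leapfrogging scheduler, only an illustration of the blocking point (the `get()` call) whose cost leapfrogging is designed to reduce.

```cpp
// Futures expose parallelism with minimal annotation: spawn one branch,
// keep computing the other, and block only when the value is actually needed.
#include <future>
#include <cstdint>
#include <iostream>

std::uint64_t fib(unsigned n) {
    if (n < 2) return n;
    if (n < 16) return fib(n - 1) + fib(n - 2);          // serial cutoff bounds task count
    // Spawn one branch as a future; the parent continues with the other branch.
    auto left = std::async(std::launch::async | std::launch::deferred, fib, n - 1);
    std::uint64_t right = fib(n - 2);
    return left.get() + right;   // may block under load imbalance -- the point where
                                 // leapfrogging would instead steal useful work
}

int main() {
    std::cout << fib(24) << "\n";   // prints 46368
}
```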
Large-scale PC clusters, such as ASCI Lightning with 2,816 processors and the AIST supercluster with more than 3,000 processors, have been constructed. Although software distributed shared memory (S-DSM) provides an a...
ISBN (Print): 9780897917001
Many applications in a variety of domains including digital signal processing, image processing, and computer vision are composed of a sequence of tasks that act on a stream of input data sets in a pipelined manner. Recent research has established that these applications are best mapped to a massively parallel machine by dividing the tasks into modules and assigning a subset of the available processors to each module. This paper addresses the problem of optimally mapping such applications onto a massively parallel machine. We formulate the problem of optimizing throughput in task pipelines and present two new solution algorithms. The formulation uses a general and realistic model for inter-task communication, takes memory constraints into account, and addresses the entire mapping problem, which includes clustering tasks into modules, assigning processors to modules, and possibly replicating modules. The first algorithm is based on dynamic programming and finds the optimal mapping of k tasks onto P processors in O(P^4 k^2) time. We also present a heuristic algorithm that is linear in the number of processors and establish with theoretical and practical results that the solutions obtained are optimal in practical situations. The entire framework is implemented as an automatic mapping tool for the Fx parallelizing compiler for High Performance Fortran. We present experimental results that demonstrate the importance of choosing a good mapping and show that the methods presented yield efficient mappings and predict optimal performance accurately.
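The sketch below is a deliberately simplified dynamic program in the spirit of this mapping problem: cluster a linear pipeline of k tasks into contiguous modules and distribute P processors so the slowest (bottleneck) module is as fast as possible. The cost model (perfect speedup, no communication, no memory constraints, no replication) is an assumption for illustration only; the paper's O(P^4 k^2) algorithm handles a much richer model.

```cpp
// Simplified pipeline-mapping DP: minimize the bottleneck module time.
#include <vector>
#include <limits>
#include <algorithm>
#include <iostream>

double best_bottleneck(const std::vector<double>& work, int P) {
    const int k = static_cast<int>(work.size());
    const double INF = std::numeric_limits<double>::infinity();

    // prefix[i] = total work of tasks 0..i-1, so module {j..i-1} has work prefix[i]-prefix[j].
    std::vector<double> prefix(k + 1, 0.0);
    for (int i = 0; i < k; ++i) prefix[i + 1] = prefix[i] + work[i];

    // dp[i][p] = minimum bottleneck time for the first i tasks using p processors.
    std::vector<std::vector<double>> dp(k + 1, std::vector<double>(P + 1, INF));
    for (int p = 0; p <= P; ++p) dp[0][p] = 0.0;

    for (int i = 1; i <= k; ++i)
        for (int p = 1; p <= P; ++p)
            for (int j = 0; j < i; ++j)              // last module holds tasks j..i-1
                for (int q = 1; q <= p; ++q) {       // processors given to that module
                    double module_time = (prefix[i] - prefix[j]) / q;  // perfect-speedup model
                    dp[i][p] = std::min(dp[i][p],
                                        std::max(dp[j][p - q], module_time));
                }
    return dp[k][P];   // pipeline throughput is 1 / this bottleneck time
}

int main() {
    std::vector<double> work = {4.0, 1.0, 6.0, 2.0};
    std::cout << best_bottleneck(work, 4) << "\n";   // prints 3.25 under this toy model
}
```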
ISBN (Print): 9781450332057
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA), and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software). This paper presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB; more importantly, optimized programs are portable across different NUMA architectures and preserve data locality even when composed with other parallel computations.
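TBB-NUMA's API is not reproduced here; the sketch below only illustrates the underlying idea of task affinity -- running a task near the memory it touches. The helper `pin_to_numa_node()` is a hypothetical placeholder (on Linux it could be implemented with libnuma or sched_setaffinity); everything else is standard C++.

```cpp
// Minimal NUMA-affinity illustration: one worker per node, each pinned to its node
// and operating only on that node's partition of the data.
#include <thread>
#include <vector>
#include <cstddef>
#include <numeric>
#include <iostream>

// Hypothetical helper: bind the calling thread to the given NUMA node.
void pin_to_numa_node(int /*node*/) { /* platform-specific; omitted */ }

double numa_aware_sum(const std::vector<std::vector<double>>& per_node_data) {
    std::vector<double> partial(per_node_data.size(), 0.0);
    std::vector<std::thread> workers;
    for (std::size_t node = 0; node < per_node_data.size(); ++node)
        workers.emplace_back([&, node] {
            pin_to_numa_node(static_cast<int>(node));   // task-to-resource affinity hint
            partial[node] = std::accumulate(per_node_data[node].begin(),
                                            per_node_data[node].end(), 0.0);
        });
    for (auto& t : workers) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

int main() {
    std::vector<std::vector<double>> data(2, std::vector<double>(1 << 20, 1.0));
    std::cout << numa_aware_sum(data) << "\n";   // sums 2 * 2^20 ones
}
```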
ISBN (Print): 9781450329477
The proceedings contain 27 papers. The topics discussed include: Lazier imperative programming; parametricity and proving free theorems for functional-logic languages; bijective collection encodings and Boolean operations with hereditarily binary natural numbers; design and implementation of a multithreaded virtual machine for executing linear logic programs; proofs in continuation-passing style: normalization of Gödel's system T extended with sums and delimited control operators; a type theoretic specification for partial evaluation; continuations, processes, and sharing; elimination of square roots and divisions by partial inlining; real-time matching of Antescofo temporal patterns; on the declarative structure of quantum concepts: states and observables; proving operational termination of declarative programs in general logics; theories of homomorphic encryption, unification, and the finite variant property; and BiFluX: a bidirectional functional update language for XML.
This paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code p...
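The sketch below shows the generic computation-communication overlap pattern with the non-blocking all-to-all, not the paper's auto-tuned 3-D FFT: start `MPI_Ialltoall`, do independent local work while the exchange is in flight, and wait before touching the received data. Buffer sizes and the "local work" are placeholders.

```cpp
// Overlapping independent local work with a non-blocking MPI all-to-all.
#include <mpi.h>
#include <vector>

void overlapped_exchange(std::vector<double>& send, std::vector<double>& recv,
                         std::vector<double>& local, MPI_Comm comm) {
    int nprocs = 0;
    MPI_Comm_size(comm, &nprocs);
    const int chunk = static_cast<int>(send.size()) / nprocs;

    MPI_Request req;
    MPI_Ialltoall(send.data(), chunk, MPI_DOUBLE,
                  recv.data(), chunk, MPI_DOUBLE, comm, &req);

    // Local work that does not depend on the incoming data (e.g., transforms of the
    // next batch) proceeds while the network transfers the buffers.
    for (double& x : local) x *= 2.0;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   // recv is only valid after completion
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int nprocs = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    std::vector<double> send(nprocs * 4, 1.0), recv(nprocs * 4, 0.0), local(1 << 16, 1.0);
    overlapped_exchange(send, recv, local, MPI_COMM_WORLD);
    MPI_Finalize();
}
```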
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.
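As a hand-written illustration of the kind of layout change such a compiler applies automatically (not the paper's transformation framework): if processors own columns of a row-major matrix, each processor's data is strided and interleaved with other processors' data; storing the matrix column-major makes every processor's partition contiguous, improving cache-line and page locality.

```cpp
// Row-major: element (i, j) at a[i * ncols + j]; a column is strided by ncols.
// Column-major: element (i, j) at a[j * nrows + i]; a column is contiguous.
#include <vector>
#include <cstddef>

std::vector<double> to_column_major(const std::vector<double>& row_major,
                                    std::size_t nrows, std::size_t ncols) {
    std::vector<double> col_major(nrows * ncols);
    for (std::size_t i = 0; i < nrows; ++i)
        for (std::size_t j = 0; j < ncols; ++j)
            col_major[j * nrows + i] = row_major[i * ncols + j];
    return col_major;
}

// After the transformation, "processor p owns column j" touches the contiguous range
// col_major[j * nrows .. j * nrows + nrows - 1], so processors no longer share cache
// lines or pages within their private partitions.
double sum_column(const std::vector<double>& col_major,
                  std::size_t nrows, std::size_t j) {
    double s = 0.0;
    for (std::size_t i = 0; i < nrows; ++i)
        s += col_major[j * nrows + i];
    return s;
}
```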
ISBN (Print): 9781605581170
The proceedings contain 25 papers. The topics discussed include: order-sorted dependency pairs; macros for context-free grammars; inferring precise polymorphic type dependencies in logic programs; a type system for safe memory management and its proof of correctness; programming with proofs and explicit contexts; towards execution time estimation in abstract machine-based languages; similarity-based reasoning in qualified logic programming; classifying integrity checking methods with regard to inconsistency tolerance; comprehending finite maps for algorithmic debugging of higher-order functional programs; parallel execution of multi-set constraint rewrite rules; a rewriting framework for the composition of access control policies; global difference constraint propagation for finite domain solvers; and dynamic variable elimination during propagation solving.