ISBN (print): 9781450301190
The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming infrastructure that extends C++ with support for parallelism. It includes a collection of distributed data structures called pContainers that are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. In this work, we present the STAPL Parallel Container Framework (PCF), which is designed to facilitate the development of generic parallel containers. We introduce a set of concepts and a methodology for assembling a pContainer from existing sequential or parallel containers, without requiring the programmer to deal with concurrency or data distribution issues. The PCF provides a large number of basic parallel data structures (e.g., pArray, pList, pVector, pMatrix, pGraph, pMap, pSet). It also provides a class hierarchy and a composition mechanism that allow users to extend and customize the current container base for improved application expressivity and performance. We evaluate STAPL pContainer performance on a CRAY XT4 massively parallel system and show that pContainer methods, generic pAlgorithms, and different applications provide good scalability on more than 16,000 processors.
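The composition idea described above, assembling a distributed container from existing sequential containers while hiding concurrency and distribution from the user, can be sketched roughly as follows. This is a minimal shared-memory illustration of the concept, not STAPL's actual pContainer API; the names DistributedArray, Partition, and apply_at are hypothetical.

```cpp
#include <vector>
#include <mutex>
#include <functional>
#include <cstddef>

// Hypothetical sketch: a "pContainer"-like array assembled from
// per-partition sequential std::vectors (a shared-memory stand-in for
// distributed base containers).
template <typename T>
class DistributedArray {
  struct Partition {
    std::vector<T> data;   // existing sequential container
    std::mutex     lock;   // makes concurrent method calls safe
  };
  std::vector<Partition> parts_;
  std::size_t block_;

public:
  DistributedArray(std::size_t n, std::size_t nparts)
      : parts_(nparts), block_((n + nparts - 1) / nparts) {
    for (auto& p : parts_) p.data.resize(block_);
  }

  // A "parallel method": safe to invoke concurrently from many threads;
  // the caller never deals with distribution or locking directly.
  void apply_at(std::size_t i, const std::function<void(T&)>& f) {
    Partition& p = parts_[i / block_];
    std::lock_guard<std::mutex> g(p.lock);
    f(p.data[i % block_]);
  }
};

int main() {
  DistributedArray<int> a(/*n=*/100, /*nparts=*/4);
  a.apply_at(42, [](int& x) { x += 1; });  // concurrency-safe element update
  return 0;
}
```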
ISBN (print): 9781450301190
This paper presents the open-source COREMU, a scalable and portable parallel emulation framework that decouples the complexity of parallelizing full-system emulators from building a mature sequential one. The key observation is that CPU cores and devices in current (and likely future) multiprocessors are loosely coupled and communicate through well-defined interfaces. Based on this observation, COREMU emulates multiple cores by creating multiple instances of existing sequential emulators, and uses a thin library layer to handle inter-core and device communication and synchronization and to maintain a consistent view of system resources. COREMU also incorporates lightweight memory transactions, feedback-directed scheduling, lazy code invalidation and adaptive signal control to provide scalable performance. To make COREMU useful in practice, we also provide some preliminary tools and APIs that can help programmers diagnose performance problems and (concurrency) bugs. A working prototype, which reuses the widely used QEMU as the sequential emulator, requires only about 2,500 lines of code (LOC) in changes to QEMU. It currently supports x64 and ARM platforms, and can emulate up to 255 cores running commodity OSes with practical performance, while QEMU cannot scale above 32 cores. Performance evaluation against QEMU indicates that COREMU has negligible uniprocessor emulation overhead and performs and scales significantly better than QEMU. We also show how COREMU can be used to diagnose performance problems and concurrency bugs in both OS kernels and parallel applications.
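The core structural idea, one sequential emulator instance per emulated core plus a thin layer for inter-core communication, might look roughly like the sketch below. It is not COREMU or QEMU code; Mailbox, core_loop, and the event encoding are invented for illustration.

```cpp
#include <thread>
#include <vector>
#include <queue>
#include <mutex>
#include <atomic>

// Rough sketch of the decoupling idea: each emulated core runs a sequential
// emulation loop in its own host thread, and a thin layer of per-core
// mailboxes carries inter-core events (e.g., inter-processor interrupts)
// between instances.
struct Mailbox {
  std::mutex m;
  std::queue<int> events;              // hypothetical event encoding
  void post(int e) { std::lock_guard<std::mutex> g(m); events.push(e); }
  bool poll(int& e) {
    std::lock_guard<std::mutex> g(m);
    if (events.empty()) return false;
    e = events.front(); events.pop(); return true;
  }
};

std::atomic<bool> running{true};

void core_loop(int core_id, std::vector<Mailbox>& boxes) {
  while (running.load()) {
    // ... run one chunk of the sequential emulator for this core ...
    int e;
    if (boxes[core_id].poll(e)) {
      // deliver the event, e.g., raise an emulated interrupt
    }
  }
}

int main() {
  const int ncores = 4;                // stand-in for a larger core count
  std::vector<Mailbox> boxes(ncores);
  std::vector<std::thread> cores;
  for (int i = 0; i < ncores; ++i)
    cores.emplace_back(core_loop, i, std::ref(boxes));
  boxes[1].post(42);                   // one core signaling another this way
  running = false;
  for (auto& t : cores) t.join();
  return 0;
}
```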
ISBN (print): 9781450301190
On shared-memory systems, Cilk-style work-stealing [5] has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS) [24, 28]. There are two main difficulties in extending this approach to distributed memory. First, in the shared-memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, and hence there is no work). This problem is well studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection. In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully-connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and itself shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed termination ...
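A rough sketch of the per-node control flow implied by the lifeline scheme is given below. It is not the authors' X10 implementation; the transport stubs try_steal_from and notify_lifeline, and the hypercube-based lifeline_out helper, are hypothetical stand-ins.

```cpp
#include <vector>
#include <cstdlib>

// Conceptual sketch of lifeline-based work-stealing. The message transport
// is reduced to stubs; lifeline_out() returns the outgoing neighbors of a
// hypercube-shaped lifeline graph. All names are hypothetical.
struct Task { int payload; };

static bool try_steal_from(int /*victim*/, std::vector<Task>& /*out*/) {
  return false;                       // stub: remote steal request
}
static void notify_lifeline(int /*neighbor*/, int /*idle_node*/) {}

static std::vector<int> lifeline_out(int node, int nnodes) {
  std::vector<int> out;               // flip one bit at a time: hypercube edges
  for (int bit = 1; bit < nnodes; bit <<= 1) out.push_back(node ^ bit);
  return out;
}

void worker(int me, int nnodes, int w, std::vector<Task> deque) {
  for (;;) {
    while (!deque.empty()) deque.pop_back();   // execute local tasks
    bool found = false;
    for (int i = 0; i < w && !found; ++i)      // at most w random steals
      found = try_steal_from(std::rand() % nnodes, deque);
    if (found) continue;
    for (int nb : lifeline_out(me, nnodes))    // tell lifelines we are idle
      notify_lifeline(nb, me);
    return;  // quiesce: reactivated only when a lifeline delivers work
  }
}

int main() { worker(0, 8, 2, {}); return 0; }
```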
Exploiting heterogeneous parallel hardware currently requires mapping application code to multiple disparate programming models. Unfortunately, general-purpose programming models available today can yield high performance but are too low-level to be accessible to the average programmer. We propose leveraging domain-specific languages (DSLs) to map high-level application code to heterogeneous devices. To demonstrate the potential of this approach we present OptiML, a DSL for machine learning. OptiML programs are implicitly parallel and can achieve high performance on heterogeneous hardware with no modification required to the source code. For such a DSL-based approach to be tractable at large scales, better tools are required for DSL authors to simplify language creation and parallelization. To address this concern, we introduce Delite, a system designed specifically for DSLs that is both a framework for creating an implicitly parallel DSL and a dynamic runtime providing automated targeting to heterogeneous parallel hardware. We show that OptiML running on Delite achieves single-threaded, parallel, and GPU performance superior to explicitly parallelized MATLAB code in nearly all cases.
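As a language-agnostic illustration of "automated targeting" (OptiML and Delite are Scala-based, so this is not their API), one can picture each DSL operation carrying device-specific variants that a small runtime selects at execution time; the Op and run names below are invented.

```cpp
#include <vector>
#include <functional>
#include <cstdio>

// Each operation registers device-specific variants; a tiny runtime picks
// one, so the application source never changes when the hardware changes.
struct Op {
  std::function<void()> cpu_variant;
  std::function<void()> gpu_variant;   // hypothetical: would launch a kernel
};

void run(const Op& op, bool gpu_available) {
  if (gpu_available && op.gpu_variant) op.gpu_variant();
  else op.cpu_variant();
}

int main() {
  std::vector<double> v(1000, 1.0);
  Op sum{
    [&] { double s = 0; for (double x : v) s += x; std::printf("cpu sum=%f\n", s); },
    nullptr  // a real runtime would register a GPU kernel here
  };
  run(sum, /*gpu_available=*/true);    // no GPU variant, falls back to CPU
  return 0;
}
```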
ISBN (print): 9781605587080
This poster is a case study on the application of a novel programming model, called Concurrent Collections (CnC), to the implementation of an asynchronous-parallel algorithm for computing the Cholesky factorization of dense matrices. In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. We demonstrate the performance potential of CnC by showing that our Cholesky implementation nearly matches or exceeds competing vendor-tuned codes and alternative programming models. We conclude that the CnC model is well suited for expressing asynchronous-parallel algorithms on emerging multicore systems.
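For reference, the tile-level structure of the Cholesky algorithm the poster parallelizes can be sketched as a sequential loop nest; in CnC, each kernel call below becomes a step and the reuse of tiles across iterations becomes the semantic ordering constraints, while everything not so ordered may run asynchronously. The kernels are stubs here; this is a structural sketch, not the poster's implementation, and a real code would call BLAS/LAPACK (dpotrf, dtrsm, dsyrk, dgemm) per tile.

```cpp
#include <vector>
#include <cstdio>

using Tile = std::vector<double>;

static void potrf(Tile&) {}                       // factor diagonal tile
static void trsm(const Tile&, Tile&) {}           // triangular solve of a panel tile
static void syrk(const Tile&, Tile&) {}           // symmetric rank-k update of a diagonal tile
static void gemm(const Tile&, const Tile&, Tile&) {}  // update of an off-diagonal tile

void tiled_cholesky(std::vector<std::vector<Tile>>& A, int T) {
  for (int k = 0; k < T; ++k) {
    potrf(A[k][k]);                               // depends on updates from iteration k-1
    for (int i = k + 1; i < T; ++i)
      trsm(A[k][k], A[i][k]);                     // depends only on potrf at step k
    for (int i = k + 1; i < T; ++i)
      for (int j = k + 1; j <= i; ++j)
        if (i == j) syrk(A[i][k], A[i][i]);
        else        gemm(A[i][k], A[j][k], A[i][j]);
  }
}

int main() {
  const int T = 4, B = 32;                        // 4x4 grid of 32x32 tiles
  std::vector<std::vector<Tile>> A(T, std::vector<Tile>(T, Tile(B * B, 0.0)));
  tiled_cholesky(A, T);
  std::printf("factored %dx%d tile grid\n", T, T);
  return 0;
}
```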
ISBN (print): 9781605587080
The Pilot library is a new method for programming MPI-enabled clusters in C, targeted at novice parallel programmers. Formal elements from Communicating Sequential Processes (CSP) are used to realize a process/channel model of parallel computation that reduces opportunities for deadlock and other communication errors. This simple model, plus an application programming interface (API) styled after C's formatted I/O, is designed to make the library easy to learn. The Pilot library exists as a thin layer on top of any standard Message Passing Interface (MPI) implementation, preserving MPI's portability and efficiency, with little performance overhead arising as a result of Pilot's additional features.
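The process/channel style Pilot promotes can be illustrated, in spirit only, with a small shared-memory channel; this is not Pilot's C-on-MPI API, just a sketch of how confining all interaction to point-to-point channels removes shared mutable state and many deadlock patterns.

```cpp
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>
#include <cstdio>

// A blocking, typed point-to-point channel: the only way the two
// "processes" below communicate.
template <typename T>
class Channel {
  std::mutex m;
  std::condition_variable cv;
  std::queue<T> q;
public:
  void write(T v) {                    // analogous to a formatted "write" call
    { std::lock_guard<std::mutex> g(m); q.push(std::move(v)); }
    cv.notify_one();
  }
  T read() {                           // blocks until the writer delivers
    std::unique_lock<std::mutex> g(m);
    cv.wait(g, [&] { return !q.empty(); });
    T v = std::move(q.front()); q.pop();
    return v;
  }
};

int main() {
  Channel<int> ch;
  std::thread producer([&] { for (int i = 0; i < 3; ++i) ch.write(i); });
  std::thread consumer([&] { for (int i = 0; i < 3; ++i) std::printf("got %d\n", ch.read()); });
  producer.join(); consumer.join();
  return 0;
}
```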
ISBN (print): 9781605587080
Irregular algorithms are organized around pointer-based data structures such as graphs and trees, and they are ubiquitous in applications. Recent work by the Galois project has provided a systematic approach for parallelizing irregular applications based on the idea of optimistic or speculative execution of programs. However, the overhead of optimistic parallel execution can be substantial. In this paper, we show that many irregular algorithms have structure that can be exploited and present three key optimizations that take advantage of algorithmic structure to reduce speculative overheads. We describe the implementation of these optimizations in the Galois system and present experimental results to demonstrate their benefits. To the best of our knowledge, this is the first system to exploit algorithmic structure to optimize the execution of irregular programs.
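The optimistic execution pattern the paper builds on can be sketched as follows; this is a conceptual shared-memory illustration, not the Galois system itself, and Node, process, and the try-lock-the-neighborhood discipline are simplified stand-ins for Galois's speculative iteration and rollback machinery.

```cpp
#include <vector>
#include <mutex>

// Speculative execution of one worklist iteration over an irregular graph:
// the iteration must acquire every node in its neighborhood; if any
// acquisition fails, it aborts (no partial update is made) and the work
// item is retried later.
struct Node { std::mutex lock; int value = 0; };

bool process(std::vector<Node>& g, const std::vector<int>& neighborhood) {
  std::vector<std::unique_lock<std::mutex>> held;
  for (int n : neighborhood) {
    std::unique_lock<std::mutex> l(g[n].lock, std::try_to_lock);
    if (!l.owns_lock()) return false;        // conflict detected: abort
    held.push_back(std::move(l));
  }
  for (int n : neighborhood) g[n].value += 1; // the "real" irregular update
  return true;                                // locks released on return
}

int main() {
  std::vector<Node> g(8);
  return process(g, {0, 3, 5}) ? 0 : 1;
}
```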
ISBN (print): 9781605587080
To fully exploit multicore processors, applications are expected to provide a large degree of thread-level parallelism. While adequate for low core counts and their typical workloads, the current load balancing support in operating systems may not be able to achieve efficient hardware utilization for parallel workloads. Balancing run queue length globally ignores the needs of parallel applications, where threads are required to make equal progress. In this paper we present a load balancing technique designed specifically for parallel applications running on multicore systems. Instead of balancing run queue length, our algorithm balances the time a thread has executed on "faster" and "slower" cores. We provide a user-level implementation of speed balancing on UMA and NUMA multi-socket architectures running Linux and discuss its behavior across a variety of workloads, usage scenarios and programming models. Our results indicate that speed balancing, when compared to the native Linux load balancing, improves performance and provides good performance isolation in all cases considered. Speed balancing is also able to provide comparable or better performance than DWRR, a fair multi-processor scheduling implementation inside the Linux kernel. Furthermore, parallel application performance is often determined by the implementation of synchronization operations, and speed balancing alleviates the need for tuning the implementations of such primitives.
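A user-level version of the idea, periodically sampling per-thread CPU time and migrating the thread that has made the least progress, could be sketched as below. This is an assumption-laden Linux illustration, not the paper's implementation; error handling is omitted, and balance_step, pin, and the round-robin rotation policy are invented.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1        // for pthread_setaffinity_np (glibc)
#endif
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <vector>

static double cpu_seconds(pthread_t t) {
  clockid_t cid;
  timespec ts{};
  pthread_getcpuclockid(t, &cid);          // per-thread CPU-time clock
  clock_gettime(cid, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void pin(pthread_t t, int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  pthread_setaffinity_np(t, sizeof(set), &set);
}

// Called periodically by a monitor thread: find the thread with the least
// accumulated CPU time (the one stuck on a "slow" core) and move it.
void balance_step(const std::vector<pthread_t>& workers, int ncpus) {
  int slowest = 0;
  for (size_t i = 1; i < workers.size(); ++i)
    if (cpu_seconds(workers[i]) < cpu_seconds(workers[slowest]))
      slowest = static_cast<int>(i);
  static int next_cpu = 0;                 // rotate the laggard onto a new core
  pin(workers[slowest], next_cpu);
  next_cpu = (next_cpu + 1) % ncpus;
}

static void* spin(void*) { for (volatile long i = 0; i < 100000000L; ++i) {} return nullptr; }

int main() {
  std::vector<pthread_t> workers(2);
  for (auto& w : workers) pthread_create(&w, nullptr, spin, nullptr);
  balance_step(workers, 4);                // one balancing round
  for (auto& w : workers) pthread_join(w, nullptr);
  return 0;
}
```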
ISBN (print): 9781605587080
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data-parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high-performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.
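A concrete instance of the kind of layout transformation the paper automates is the classic array-of-structures (AoS) to structure-of-arrays (SoA) conversion, which turns strided per-field accesses into unit-stride ones that vectorize cleanly. The particle fields below are illustrative; the paper's transformations are derived from its mathematical model rather than hand-written like this.

```cpp
#include <vector>
#include <cstddef>

struct ParticleAoS { float x, y, z, mass; };  // per-field access is strided

struct ParticlesSoA {                         // per-field access is unit-stride
  std::vector<float> x, y, z, mass;
};

ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
  ParticlesSoA out;
  out.x.reserve(in.size()); out.y.reserve(in.size());
  out.z.reserve(in.size()); out.mass.reserve(in.size());
  for (const auto& p : in) {                  // one-time layout transformation
    out.x.push_back(p.x);   out.y.push_back(p.y);
    out.z.push_back(p.z);   out.mass.push_back(p.mass);
  }
  return out;
}

// Each iteration touches consecutive memory, so the loop vectorizes cleanly
// on SIMD hardware (or maps well to a data-parallel GPU backend).
void scale_masses(ParticlesSoA& p, float s) {
  for (std::size_t i = 0; i < p.mass.size(); ++i)
    p.mass[i] *= s;
}

int main() {
  std::vector<ParticleAoS> aos(1000, {1.f, 2.f, 3.f, 4.f});
  ParticlesSoA soa = to_soa(aos);
  scale_masses(soa, 0.5f);
  return 0;
}
```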