ISBN (print): 9780897919067
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. The approach exploits the underlying system's cache coherence protocol to detect data sharing patterns that indicate potential performance bottlenecks and presents performance measurements in a data-centric manner. As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.
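As an illustration of the kind of data-sharing pattern such a coherence-protocol-based, data-centric tool can surface, the sketch below (not Paradyn or Blizzard code; struct and function names are ours) shows two threads updating adjacent counters that land in the same coherence block (false sharing), and the data-centric remedy of padding each counter onto its own block.

```cpp
// Minimal sketch (not Paradyn/Blizzard code) of a classic sharing bottleneck a
// coherence-protocol-based profiler can reveal: two threads increment adjacent
// counters that share a coherence block, so each increment invalidates the
// other thread's copy; padding gives each counter its own block.
#include <chrono>
#include <cstdio>
#include <thread>

constexpr int kIters = 50000000;

struct Unpadded { long a; long b; };                          // a and b share a block
struct Padded   { alignas(64) long a; alignas(64) long b; };  // one block each

template <typename Counters>
double run() {
    Counters c{};
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&c] { for (int i = 0; i < kIters; ++i) ++c.a; });
    std::thread t2([&c] { for (int i = 0; i < kIters; ++i) ++c.b; });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("unpadded (false sharing):  %.2f s\n", run<Unpadded>());
    std::printf("padded   (private blocks): %.2f s\n", run<Padded>());
}
```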
ISBN (print): 9781450368186
The use of futures can generate arbitrary dependences in the computation, making it difficult to detect races efficiently. Algorithms proposed by prior work to detect races on programs with futures all have to execute the program sequentially. We propose F-Order, the first known parallel race detection algorithm that detects races on programs that use futures. Given a computation with work $T_1$ and span $T_\infty$, our algorithm detects races in time $O((T_1 \lg \hat{k} + k^2)/P + T_\infty(k + \lg r \lg \hat{k}))$ on $P$ processors, where $k$ is the number of future operations, $r$ is the maximum number of readers per memory location, and $\hat{k}$ is the maximum number of future operations done by a single future task, which is typically small. We have also implemented a prototype system based on the proposed algorithm and empirically demonstrate its practical efficiency and scalability.
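For illustration only, the C++ snippet below (using std::async rather than the structured futures of the task-parallel runtimes the paper targets) shows the shape of race a future can introduce: the parent and the spawned task both write the same location, and only the later fut.get() orders the task with respect to the parent.

```cpp
// Illustrative only: a race created through a future. The spawned task and the
// parent both write `shared`, and nothing orders the parent's write with the
// task's write because the parent touches `shared` before calling fut.get().
#include <future>
#include <iostream>

int main() {
    int shared = 0;

    std::future<void> fut = std::async(std::launch::async, [&shared] {
        shared = 1;   // racing write in the future task
    });

    shared = 2;       // racing write in the parent
    fut.get();        // the join happens only here, after both writes

    std::cout << "shared = " << shared << "\n";  // outcome depends on the race
    return 0;
}
```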
ISBN (print): 9781605587080
To fully exploit multicore processors, applications are expected to provide a large degree of thread-level parallelism. While adequate for low core counts and their typical workloads, the current load balancing support in operating systems may not be able to achieve efficient hardware utilization for parallel workloads. Balancing run queue length globally ignores the needs of parallel applications, where threads are required to make equal progress. In this paper we present a load balancing technique designed specifically for parallel applications running on multicore systems. Instead of balancing run queue length, our algorithm balances the time a thread has executed on "faster" and "slower" cores. We provide a user-level implementation of speed balancing on UMA and NUMA multi-socket architectures running Linux and discuss its behavior across a variety of workloads, usage scenarios and programming models. Our results indicate that speed balancing, when compared to the native Linux load balancing, improves performance and provides good performance isolation in all cases considered. Speed balancing is also able to provide comparable or better performance than DWRR, a fair multi-processor scheduling implementation inside the Linux kernel. Furthermore, parallel application performance is often determined by the implementation of synchronization operations, and speed balancing alleviates the need for tuning the implementations of such primitives.
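The toy sketch below is not the paper's user-level implementation; the ThreadStat bookkeeping and the 1.1x imbalance threshold are our assumptions. It only shows the core decision speed balancing makes: compare the CPU time each thread accumulated over an interval and swap placements so slow and fast threads trade cores, rather than equalizing run-queue lengths.

```cpp
// Toy sketch of the speed-balancing idea (not the paper's implementation):
// balance the CPU time threads accumulate, not run-queue lengths.
// All names (ThreadStat, speed_balance) are hypothetical.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct ThreadStat {
    int tid;
    double cpu_time;      // CPU seconds accumulated in the last interval
    int core;             // core the thread currently runs on
};

// One balancing round: pair the slowest thread with the fastest one and swap
// their cores, so over time each thread alternates between fast and slow cores.
void speed_balance(std::vector<ThreadStat>& ts) {
    auto by_time = [](const ThreadStat& a, const ThreadStat& b) { return a.cpu_time < b.cpu_time; };
    auto slowest = std::min_element(ts.begin(), ts.end(), by_time);
    auto fastest = std::max_element(ts.begin(), ts.end(), by_time);
    if (slowest != fastest && fastest->cpu_time > 1.1 * slowest->cpu_time)
        std::swap(slowest->core, fastest->core);   // migrate: swap placements
}

int main() {
    std::vector<ThreadStat> ts = {{0, 0.9, 0}, {1, 0.5, 1}, {2, 0.95, 2}, {3, 0.55, 3}};
    speed_balance(ts);
    for (const auto& t : ts) std::printf("tid %d -> core %d\n", t.tid, t.core);
}
```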
Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation under memory constraints are addressed. The irregular parallelism is modeled by task dependence graphs with mixed granularities. The trade-off in achieving both time and space efficiency is investigated. The main difficulty of designing efficient run-time system support is caused by the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, and a theoretical analysis of correctness and performance is provided. This work is implemented in the context of the RAPID system [5], which provides run-time support for parallelizing irregular code on distributed memory machines, and the effectiveness of the proposed techniques is verified on sparse Cholesky and LU factorization with partial pivoting. The experimental results on the Cray T3D show that solvable problem sizes can be increased substantially under limited memory capacities and that the loss of execution efficiency caused by the extra memory management overhead is reasonable.
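The sketch below is not the RAPID runtime; the task sizes and the budget are made up. It shows only one ingredient the abstract describes: admitting ready tasks from the dependence graph when their memory footprint fits within a budget and deferring them otherwise, so memory utilization is controlled at scheduling time.

```cpp
// Simplified sketch (not RAPID): schedule ready tasks under a memory budget.
// A ready task is dispatched only if its footprint fits; otherwise it is
// deferred until completed tasks release memory.
#include <cstdio>
#include <queue>
#include <vector>

struct Task { int id; std::size_t mem; };   // memory footprint in bytes

int main() {
    std::queue<Task> ready;                 // tasks whose dependences are met
    for (int i = 0; i < 6; ++i) ready.push({i, (i % 3 + 1) * 100u});

    const std::size_t budget = 400;         // total memory available
    std::size_t in_use = 0;
    std::vector<Task> running, deferred;

    while (!ready.empty()) {
        Task t = ready.front(); ready.pop();
        if (in_use + t.mem <= budget) {     // fits: dispatch
            in_use += t.mem;
            running.push_back(t);
            std::printf("run   task %d (%zu B, in use %zu B)\n", t.id, t.mem, in_use);
        } else {                            // would exceed the budget: defer
            deferred.push_back(t);
            std::printf("defer task %d (%zu B)\n", t.id, t.mem);
        }
    }
    // In a real runtime, finished tasks would return memory and re-enqueue
    // deferred tasks; here we only show the admission decision.
}
```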
Many low-level optimizations for NVIDIA GPUs can only be implemented in native hardware assembly (SASS). However, programming in SASS is unproductive and not portable. To simplify low-level GPU programming, we present ...
For a long time, efficient use of parallel computers has been hindered by dependencies introduced in software through low-level implementation practice. In this paper we present a programming environment and language called Object-Math (Object-oriented Mathematical language for scientific computing), which aims at eliminating this problem by allowing the user to represent mathematical equation-based models directly in the system. The system performs analysis of mathematical models to extract parallelism and automatically generates parallel code for numerical solution. In the context of industrial applications in mechanical analysis, we have so far primarily explored generation of parallel code for solving systems of ordinary differential equations (ODEs), in addition to preliminary work on generating code for solving partial differential equations. Two approaches to extracting parallelism have been implemented and evaluated: extracting parallelism at the equation system level and at the single equation level, respectively. We found that for several applications the corresponding systems of equations do not partition well into subsystems. This means that the equation-system-level approach is of restricted general applicability. Thus, we focused on the equation-level approach, which yielded significant parallelism for the solution of ODE systems. For the bearing simulation applications we present here, the achieved speedup is, however, critically dependent on low communication latency of the parallel computer.
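A minimal sketch of equation-level parallelism for an ODE system (not Object-Math-generated code; the toy system and step count are ours): within one explicit Euler step, each equation's right-hand side is evaluated by its own worker, with the joins acting as a barrier before the state update.

```cpp
// Sketch of equation-level parallelism (not Object-Math output): evaluate each
// equation's right-hand side concurrently inside one explicit Euler step.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // dy0/dt = -y0 + y1,  dy1/dt = -y1 + y2,  dy2/dt = -y2  (a toy linear system)
    std::vector<double> y = {1.0, 0.5, 0.25};
    std::vector<double> dydt(y.size());
    const double h = 0.01;

    for (int step = 0; step < 100; ++step) {
        std::vector<std::thread> workers;
        // Equation-level parallelism: one worker per equation's RHS.
        workers.emplace_back([&] { dydt[0] = -y[0] + y[1]; });
        workers.emplace_back([&] { dydt[1] = -y[1] + y[2]; });
        workers.emplace_back([&] { dydt[2] = -y[2]; });
        for (auto& w : workers) w.join();           // barrier before the update

        for (std::size_t i = 0; i < y.size(); ++i)  // sequential state update
            y[i] += h * dydt[i];
    }
    std::printf("y = (%f, %f, %f)\n", y[0], y[1], y[2]);
}
```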
ISBN (print): 9781450332057
This paper proposes a novel SPMD programming model for OpenACC. Our model integrates the different granularities of parallelism, from vector-level parallelism to node-level parallelism, into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model, whether the accelerators are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance on a GPU-based supercomputer using three benchmark applications.
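The paper's unified SPMD model is not reproduced here; the sketch below only shows the baseline OpenACC directive style it builds on, a parallel loop offloaded to a single accelerator, and assumes an OpenACC-capable compiler such as nvc++ with -acc.

```cpp
// Baseline OpenACC example (single accelerator), not the paper's extended SPMD
// model. Build with an OpenACC-capable compiler, e.g. `nvc++ -acc`; without
// OpenACC support the pragma is ignored and the loop runs on the host.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    float* pa = a.data();
    float* pb = b.data();
    float* pc = c.data();

    // Offload the loop to the accelerator; data clauses copy arrays in and out.
    #pragma acc parallel loop copyin(pa[0:n], pb[0:n]) copyout(pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", c[0]);
}
```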
ISBN (print): 9781450392044
In this paper, we describe a parallel Branch-and-Bound (B&B) algorithm with a history-based domination technique, and we apply it to the Sequential Ordering Problem (SOP). To the best of our knowledge, the proposed algorithm is the first parallel B&B algorithm that includes a history-based domination technique and the first parallel B&B algorithm for solving the SOP using a pure B&B approach. The proposed algorithm takes a pool-based approach and employs a collection of novel techniques that we have developed to achieve effective parallel exploration of the solution space, including parallel history domination, history table memory management, and a thread restart technique. The proposed algorithm was experimentally evaluated using the SOPLIB and TSPLIB benchmarks. The results show that, using ten threads with a time limit of one hour on the medium-difficulty instances, the proposed algorithm gives a geometric-mean speedup of 19.9 on SOPLIB and 10.23 on TSPLIB, with super-linear speedups of up to 65x seen on 17 instances.
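The sketch below is a simplified, sequential illustration of history-based domination, not the paper's parallel algorithm or its memory-managed history table; the state encoding and names are ours. A search state (visited-set bitmask plus current endpoint) maps to the best cost recorded so far, and a newly generated node is pruned when a recorded entry dominates it.

```cpp
// Simplified, sequential sketch of history-based domination (not the paper's
// parallel algorithm). A state is (visited set, current endpoint); a node is
// pruned if the table already holds an equal-or-better cost for that state.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct StateKey {
    std::uint32_t visited;   // bitmask of visited nodes
    int last;                // current endpoint of the partial path
    bool operator==(const StateKey& o) const { return visited == o.visited && last == o.last; }
};
struct StateHash {
    std::size_t operator()(const StateKey& k) const {
        return std::hash<std::uint64_t>{}((std::uint64_t(k.visited) << 8) ^ std::uint64_t(k.last));
    }
};

std::unordered_map<StateKey, int, StateHash> history;  // state -> best known cost

// Returns true if the node should be pruned (dominated by an earlier node).
bool dominated(StateKey key, int cost) {
    auto it = history.find(key);
    if (it != history.end() && it->second <= cost) return true;  // dominated: prune
    history[key] = cost;                                         // record or improve
    return false;
}

int main() {
    std::printf("%d\n", dominated({0b1011, 3}, 42));  // 0: first time this state is seen
    std::printf("%d\n", dominated({0b1011, 3}, 50));  // 1: cost 42 dominates 50
    std::printf("%d\n", dominated({0b1011, 3}, 40));  // 0: improves the entry to 40
}
```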
ISBN (print): 9781450319225
JavaScript, the most popular language on the Web, is rapidly moving to the server side, becoming even more pervasive. Still, JavaScript lacks support for shared memory parallelism, making it challenging for developers to exploit the multicores present in both servers and clients. In this paper we present TigerQuoll, a novel API and runtime for parallel programming in JavaScript. TigerQuoll features an event-based API and a parallel runtime allowing applications to exploit a mutable shared memory space. The programming model of TigerQuoll features automatic consistency and concurrency management, such that developers do not have to deal with shared-data synchronization. TigerQuoll supports an innovative transaction model that allows for eventual consistency to speed up high-contention workloads. Experiments show that TigerQuoll applications scale well, allowing one to implement common parallelism patterns in JavaScript.
Chase and Lev's concurrent deque is a key data structure in shared-memory parallel programming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implem...
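The proof and the optimized, weakly-ordered implementation the paper verifies are not reproduced here; the sketch below is a fixed-capacity Chase-Lev deque using sequentially consistent atomics (the defaults), ignoring growth and overflow, just to show the three operations: the owner's push and take at the bottom and a thief's steal at the top.

```cpp
// Fixed-capacity Chase-Lev deque sketch with seq_cst atomics (the defaults);
// an optimized implementation relaxes these orderings. Buffer growth, overflow,
// and non-trivially-copyable element types are not handled.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <optional>

template <typename T, std::size_t Cap = 1024>  // Cap: a power of two (not enforced here)
class WorkStealingDeque {
    std::atomic<std::int64_t> top_{0}, bottom_{0};
    T buf_[Cap];
public:
    void push(T x) {                           // owner only: push at the bottom
        std::int64_t b = bottom_.load();
        buf_[b % Cap] = x;
        bottom_.store(b + 1);
    }
    std::optional<T> take() {                  // owner only: pop from the bottom
        std::int64_t b = bottom_.load() - 1;
        bottom_.store(b);
        std::int64_t t = top_.load();
        if (t > b) {                           // deque was empty: restore bottom
            bottom_.store(b + 1);
            return std::nullopt;
        }
        T x = buf_[b % Cap];
        if (t < b) return x;                   // more than one element remained
        // t == b: last element, may race with a concurrent steal.
        bool won = top_.compare_exchange_strong(t, t + 1);
        bottom_.store(b + 1);
        return won ? std::optional<T>(x) : std::nullopt;
    }
    std::optional<T> steal() {                 // any thief: steal from the top
        std::int64_t t = top_.load();
        std::int64_t b = bottom_.load();
        if (t >= b) return std::nullopt;       // empty
        T x = buf_[t % Cap];
        if (!top_.compare_exchange_strong(t, t + 1)) return std::nullopt;  // lost the race
        return x;
    }
};

int main() {
    WorkStealingDeque<int> dq;
    for (int i = 0; i < 4; ++i) dq.push(i);
    std::printf("steal=%d take=%d\n", dq.steal().value_or(-1), dq.take().value_or(-1));
}
```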