With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computer systems. Many issues of parallelism management boil down to discerning which running thr...
详细信息
ISBN:
(纸本)9781605585260
With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computer systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical If one can accurately predict critical threads in a parallel program, then one can respond in a variety of ways. Possibilities include running the critical thread at a faster clock rate, performing load balancing techniques to offload work onto currently non-critical threads, or giving the critical thread more on-chip resources to execute faster. This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications. We show that accurate predictors can be built using counters that are typically already available on-chip. Our predictor, based on memory hierarchy statistics, identifies thread criticality with an average accuracy of 93% across a range of architectures. We also demonstrate two applications of our predictor. First, we show how Intel's Threading Building Blocks (TBB) parallel runtime system can benefit from task stealing techniques that use our criticality predictor to reduce load imbalance. Using criticality prediction to guide TBB's task-stealing decisions improves performance by 13-32% for TBB-based PARSEC benchmarks running on a 32-core CMP. As a second application, criticality prediction guides dynamic energy optimizations in barrier-based applications. By running the predicted critical thread at the full clock rate and frequency-scaling non-critical threads, this approach achieves average energy savings of 15% while negligibly degrading performance for SPLASH-2 and PARSEC benchmarks.
Tracing algorithms visit reachable nodes in a graph and are central to activities such as garbage collection, marshalling etc. Traditional sequential algorithms use a worklist, replacing a nodes with their unvisited c...
详细信息
ISBN:
(纸本)9781605583471
Tracing algorithms visit reachable nodes in a graph and are central to activities such as garbage collection, marshalling etc. Traditional sequential algorithms use a worklist, replacing a nodes with their unvisited children. Previous work on parallel tracing is processor-oriented in associating one worklist per processor: worklist insertion and removal requires no locking, and load balancing requires only occasional locking. However, since multiple queues may contain the same node, significant locking is necessary to avoid concurrent visits by competing processors. This paper presents a memory-oriented solution: memory is partitioned into segments and each segment has its own worklist containing only nodes in that segment. At a given time at most one processor owns a given worklist. By arranging separate single-readersingle-writer forwarding queues to pass nodes from processor i to processor j we can process objects in an order that gives lock-free mainline code and improved locality of reference. This refactoring is analogous to the way in which a compiler changes an iteration space to eliminate data dependencies. While it is clear that our solution can be more effective on NUMA systems, and even necessary when processor-local memory may not be addressed from other processors, slightly surprisingly, it often gives significantly better speed-up on modem multi-cores architectures too. Using caches to hide memory latency loses much of its effectiveness when there is significant cross-processor memory contention or when locking is necessary.
Deadlock in multithreaded programs is an increasingly important problem as ubiquitous multicore architectures force parallelization upon an ever wider range of software. This paper presents a theoretical foundation fo...
详细信息
ISBN:
(纸本)9781605583792
Deadlock in multithreaded programs is an increasingly important problem as ubiquitous multicore architectures force parallelization upon an ever wider range of software. This paper presents a theoretical foundation for dynamic deadlock avoidance in concurrent programs that employ conventional mutual exclusion and synchronization primitives (e. g., multithreaded C/Pthreads programs). Beginning with control flow graphs extracted from program source code, we construct a formal model of the program and then apply Discrete Control Theory to automatically synthesize deadlock-avoidance control logic that is implemented by program instrumentation. At run time, the control logic avoids deadlocks by postponing lock acquisitions. Discrete Control Theory guarantees that the program instrumented with our synthesized control logic cannot deadlock. Our method furthermore guarantees that the control logic is maximally permissive: it postpones lock acquisitions only when necessary to prevent deadlocks, and therefore permits maximal run-time concurrency. Our prototype for C/Pthreads scales to real software including Apache, OpenLDAP, and two kinds of benchmarks, automatically avoiding both injected and naturally occurring deadlocks while imposing modest runtime overheads.
Saturated arithmetic is a typical operation in multimedia applications, most multimedia extensions in the instruction set architecture (ISA) of modern processors provide saturation instructions for such operation. The...
详细信息
ISBN:
(纸本)9781605581668
Saturated arithmetic is a typical operation in multimedia applications, most multimedia extensions in the instruction set architecture (ISA) of modern processors provide saturation instructions for such operation. Therefore, extensive researches have focused on how to utilize saturation instructions to optimize programs. Previous algorithms mainly focus on purely saturated arithmetic, however saturated arithmetic is often mingled with first-order linear recurrence (FOLR) in real life applications. When FLOR pattern appears in the program, previous algorithms can not identify the saturated arithmetic as well. In fact, the saturated arithmetic with FOLR (SAWF) is a new and significant pattern, especially, SAWF with one as coefficient is frequently used in multimedia applications. Hence, it is necessary to explore a method with which such pattern can be efficiently vectorized. This paper discusses how to vectorize SAWF, explores the efficient method to vectorize SAWF with one as coefficient and gives its evaluation and implement a library for the optimizing technique. Such an implementation manner can make compilers are able to exploit it more easily. The experimental results shows the optimizing technique can achieve a speedup of 1.19 to 1.46 on Pentium IV processor. At the same time, the optimizing techniques in this paper can also be used to develop a library for SAWF so a programmer can benefit even without changing the compiler. Copyright 2009 acm.
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on may-core architectures. We show that the NB-FEB primitive is universal, s...
详细信息
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on may-core architectures. We show that the NB-FEB primitive is universal, scalable and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).
We study two closely related problems in non-preemptive scheduling of sequential jobs on identical parallel machines. In these two settings there are either fixed jobs or nonavailability intervals during which the mac...
详细信息
We study two closely related problems in non-preemptive scheduling of sequential jobs on identical parallel machines. In these two settings there are either fixed jobs or nonavailability intervals during which the machines are not available; in either case, the objective is to minimize the makespan. Both formulations have different applications, e.g. in turnaround scheduling or overlay computing. For both problems we contribute approximation algorithms with an improved ratio of 3/2 + ε, respectively. For scheduling with fixed jobs, a lower bound of 3/2 on the approximation ratio has been obtained by Scharbrodt, Steger&Weisser; for scheduling with non-availability we provide the same lower bound. In total, our approximation ratio for both problems is essentially tight via suitable inapproximability results. We use dual approximation, creation of a gap structure and job configurations, and a PTAS for the multiple subset sum problem. However, the main feature of our algorithms is a new technique for the assignment of large jobs via flexible rounding. Our new technique is based on an interesting cyclic shifting argument in combination with a network flow model for the assignment of jobs to large gaps.
Despite decades of activity, parallel computing remains immature. Like much of computer science, advances in the field are driven by a mixture of theoretical insights and technological advances. But in parallel comput...
详细信息
ISBN:
(纸本)9781605586069
Despite decades of activity, parallel computing remains immature. Like much of computer science, advances in the field are driven by a mixture of theoretical insights and technological advances. But in parallel computing, the gap between theory and practice remains disconcertingly *** theoretical concepts in parallel computing were developed in the seventies including P-completeness, PRAMs, boolean circuits, and more. A rich set of algorithms and complexity classes was built on top of these models, and these ideas influenced early thinking about parallel machines. But applied parallel computing today is little in uenced by these *** the past twenty years computational scientists and engineers have embraced parallelism as an essential computational tool for solving ever larger and more complex problems. By the mid-nineties, the computational science community had settled on a computing paradigm involving distributed memory machines programmed with explicit message passing. Although not elegant, this approach allows for portability and the use of inexpensive commodity hardware. This hardware is sufficiently different from the abstract models proposed in the seventies that insights from the theory community have had minimal impact on mainstream parallel computing practice.A variety of alternative theoretical models have been proposed in the interim that are more reflective of hardware realities than their counterparts from the seventies. These include the Bulk Synchronous Processing and LogP complexity models as well as computational abstractions like Active Messages partitioned global address space languages. But these concepts and tools have been consigned to niche roles within the larger MPI-dominated parallel computing ***, several forces are rearranging the parallel computing landscape and will raise the importance of ideas and techniques from the theory community. The most familiar change is the rise of multi-core processors. The need for n
暂无评论