ISBN: 9781605587080 (print)
Exploiting the emerging reality of affordable multi-core architectures requires providing programmers with simple abstractions that enable them to easily turn their sequential programs into concurrent ones that expose as much parallelism as possible. While transactional memory promises to make concurrent programming easy for a wide programmer community, current implementations either disallow nested transactions from running in parallel or do not scale to arbitrary parallel nesting depths. This is an important obstacle to the central goal of transactional memory, as programmers can only start parallel threads in restricted parts of their code. This paper addresses the intrinsic difficulty behind the support for parallel nesting in transactional memory, and proposes a novel solution that, to the best of our knowledge, is the first practical solution to meet the lowest theoretical upper bound known for the problem. Using a synthetic workload configured to test parallel transactions on a multi-core machine, a practical implementation of our algorithm yields substantial speed-ups (up to 22x with 33 threads) relative to serial nesting, and shows that the time to start and commit transactions, as well as to detect conflicts, is independent of nesting depth.
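As a rough illustration of the programming pattern this abstract is concerned with (not the paper's algorithm), the following C++ sketch shows a transaction that forks worker threads, each running its own nested transaction against shared state. The tm::atomic helper is hypothetical and performs no real transactional bookkeeping.

```cpp
#include <atomic>
#include <thread>
#include <vector>

namespace tm {
// Hypothetical helper: runs `body` as a (possibly nested) transaction.
// Placeholder only; it provides no real transactional semantics.
template <class F> void atomic(F body) { body(); }
}

std::atomic<int> shared_total{0};

void outer() {
  tm::atomic([] {
    std::vector<std::thread> workers;
    // Parallel nesting: threads started *inside* a transaction, each running
    // its own nested transaction that touches shared state.
    for (int i = 0; i < 4; ++i)
      workers.emplace_back([] { tm::atomic([] { shared_total += 1; }); });
    for (auto& w : workers) w.join();
  });
}

int main() { outer(); }
```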
ISBN: 9781605587080 (print)
In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high-performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, and an inflation of the high-order flop count, and still do not perform as well. By demonstrating a straightforward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.
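A minimal sketch of the general parallel-cache-assignment idea, under the assumption that each worker permanently owns a contiguous block of panel rows so that block stays resident in its private cache across all column steps. This is a simplified LU-style panel update without pivoting, not the paper's code; it needs C++20 for std::barrier.

```cpp
#include <algorithm>
#include <barrier>
#include <thread>
#include <vector>

int main() {
  const int m = 1024, nb = 64, p = 4;      // panel is m x nb, p workers
  std::vector<double> A(m * nb, 1.0);      // column-major, leading dimension m
  auto a = [&](int i, int j) -> double& { return A[j * m + i]; };
  for (int d = 0; d < nb; ++d) a(d, d) = m;  // keep pivots well away from zero

  std::barrier sync(p);
  auto worker = [&](int t) {
    const int rows = m / p, r0 = t * rows, r1 = r0 + rows;  // rows owned by t
    for (int k = 0; k < nb; ++k) {
      sync.arrive_and_wait();              // column step k-1 is finished everywhere
      const double pivot = a(k, k);
      for (int i = std::max(r0, k + 1); i < r1; ++i) {
        a(i, k) /= pivot;                  // scale the owned part of column k
        for (int j = k + 1; j < nb; ++j)   // rank-1 update of owned rows only;
          a(i, j) -= a(i, k) * a(k, j);    // these rows stay in this core's cache
      }
    }
  };

  std::vector<std::thread> ts;
  for (int t = 0; t < p; ++t) ts.emplace_back(worker, t);
  for (auto& th : ts) th.join();
}
```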
ISBN: 9781605587080 (print)
Helper locks allow programs with large parallel critical sections, called parallel regions, to execute more efficiently by enlisting processors that might otherwise be waiting on the helper lock to aid in the execution of the parallel region. Suppose that a processor p is executing a parallel region A after having acquired the lock L protecting A. If another processor p' tries to acquire L, then instead of blocking and waiting for p to complete A, processor p' joins p to help it complete A. Additional processors not blocked on L may also help to execute A. The HELPER runtime system can execute fork-join computations augmented with helper locks and parallel regions. HELPER supports the unbounded nesting of parallel regions. We provide theoretical completion-time and space-usage bounds for a design of HELPER based on work stealing. Specifically, let V be the number of parallel regions in a computation, let $T_1$ be its work, and let $\tilde{T}_\infty$ be its "aggregate span": the sum of the spans (critical-path lengths) of all its parallel regions. We prove that HELPER completes the computation in expected time $O(T_1/P + \tilde{T}_\infty + PV)$ on P processors. This bound indicates that programs with a small number of highly parallel critical sections can attain linear speedup. For the space bound, we prove that HELPER completes a program using only $O(P \tilde{S}_1)$ stack space, where $\tilde{S}_1$ is the sum, over all regions, of the stack space used by each region in a serial execution. Finally, we describe a prototype of HELPER implemented by modifying the Cilk multithreaded runtime system. We used this prototype to implement a concurrent hash table with a resize operation protected by a helper lock.
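The helper-lock discipline described above can be sketched as follows: a thread that fails to acquire the lock contributes cycles to the region's pending work instead of blocking. The Region type and its help() method are hypothetical names for illustration; the sketch omits nesting, work stealing, and everything behind the stated bounds.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Region {                        // a parallel critical section: a bag of tasks
  std::vector<std::function<void()>> tasks;
  std::atomic<std::size_t> next{0};
  std::atomic<std::size_t> done{0};
  void help() {                        // execute one pending task, if any remain
    std::size_t i = next.fetch_add(1);
    if (i < tasks.size()) { tasks[i](); done.fetch_add(1); }
  }
  bool finished() const { return done.load() == tasks.size(); }
};

int main() {
  std::mutex lock;                     // the helper lock protecting the region
  Region r;
  for (int i = 0; i < 64; ++i) r.tasks.push_back([] { /* a piece of the region */ });

  auto participant = [&] {
    std::unique_lock<std::mutex> owner(lock, std::try_to_lock);
    // Whether or not we acquired the lock, we spend our cycles executing the
    // region's work instead of blocking; the acquirer releases the lock when
    // its unique_lock goes out of scope.
    while (!r.finished()) r.help();
  };
  std::vector<std::thread> ts;
  for (int t = 0; t < 4; ++t) ts.emplace_back(participant);
  for (auto& th : ts) th.join();
}
```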
ISBN: 9781605587080 (print)
To achieve high performance on multicore systems, shared-memory parallel languages must efficiently implement atomic operations. The commonly used and studied paradigms for atomicity are fine-grained locking, which is both difficult to program and error-prone; optimistic software transactions, which require substantial overhead to detect and recover from atomicity violations; and compiler generation of locks from programmer-specified atomic sections, which leads to serialization whenever imprecise pointer analysis suggests the mere possibility of a conflicting operation. This paper presents a new strategy for compiler-generated locking that uses data structure knowledge to facilitate more precise alias and lock generation analyses and reduce unnecessary serialization. Implementing and evaluating these ideas in the Java language shows that the new strategy achieves eight-thread speedups of 0.83 to 5.9 for the five STAMP benchmarks studied, outperforming software transactions on all but one benchmark, and nearly matching programmer-specified fine-grained locks on all but one benchmark. The results also indicate that compiler knowledge of data structures improves the effectiveness of compiler analysis, boosting eight-thread performance by up to 300%. Further, the new analysis allows for software support of strong atomicity with less than 1% overhead for two benchmarks and less than 20% for three others. The strategy also nearly matches the performance of programmer-specified fine-grained locks for the SPECjbb2000 benchmark, which has traditionally not been amenable to static analyses.
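The contrast the abstract draws can be illustrated by hand-writing the two lock placements a compiler might emit: one global lock when alias analysis is imprecise, versus a lock tied to the specific container instance that the atomic section actually touches. This C++ sketch shows the target of such a translation, not the paper's generated code (the paper's setting is Java).

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// (a) Conservative output: imprecise alias analysis forces every atomic
//     section onto one global lock, serializing unrelated updates.
std::mutex global_atomic_lock;

// (b) Data-structure-aware output: the lock is tied to the container instance
//     the section touches, so distinct maps never contend with each other.
struct Counters {
  std::unordered_map<std::string, long> table;
  std::mutex lock;
  void add(const std::string& key, long delta) {
    // source:  atomic { table[key] += delta; }
    std::lock_guard<std::mutex> g(lock);   // scheme (b); scheme (a) would take
    table[key] += delta;                   // global_atomic_lock here instead
  }
};

int main() {
  Counters a, b;          // under (b), updates to a and b can proceed in parallel
  a.add("x", 1);
  b.add("y", 2);
}
```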
ISBN: 9781605587080 (print)
This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data-parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but pose challenges to analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from kernel code.
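As a toy illustration of this style of prediction (not the paper's model), the sketch below estimates per-SM cycles from a per-warp compute/memory characterization, taking the worse of an issue-bound and a memory-bound estimate. Every constant and the formula itself are invented for illustration only.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  // Per-warp characterization of a hypothetical kernel (all numbers invented).
  const double compute_cycles = 4000;   // arithmetic issue cycles per warp
  const double mem_requests   = 120;    // global memory transactions per warp
  const double mem_latency    = 400;    // cycles per memory transaction
  const int    warps_per_sm   = 24;     // warps available to hide latency

  const double mem_cycles  = mem_requests * mem_latency;     // one warp's memory time
  const double issue_bound = warps_per_sm * compute_cycles;  // all warps through the pipeline
  const double mem_bound   = compute_cycles + mem_cycles;    // one warp, latency exposed
  const double estimate    = std::max(issue_bound, mem_bound);

  std::printf("estimated cycles per SM: %.0f\n", estimate);
}
```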
ISBN: 9781605587080 (print)
As multicore chips become the main building blocks for high-performance computers, many numerical applications face a performance impediment due to the limited hardware capacity to move data between the CPU and the off-chip memory. This is especially true for large computing problems solved by iterative algorithms because of the large data sets typically used. Loop tiling, also known as loop blocking, was shown previously to be an effective way to enhance data locality, and hence to reduce the memory bandwidth pressure, for a class of iterative algorithms executed on a single processor. Unfortunately, tiled programs suffer from reduced parallelism because only the loop iterations within a single tile can be easily parallelized. In this work, we propose to use the asynchronous model to enable effective loop tiling such that both parallelism and locality can be attained simultaneously. Asynchronous algorithms were previously proposed to reduce the communication cost and synchronization overhead between processors. Our new discovery is that carefully controlled asynchrony and loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors due to simultaneously attained data locality and loop-level parallelism. We present supporting evidence from experiments with three well-known numerical kernels.
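A minimal sketch of combining tiling with asynchrony for a 1-D Jacobi-style relaxation, assuming each thread owns one tile and sweeps it repeatedly without a per-iteration barrier, reading possibly stale boundary values from neighboring tiles. It illustrates the asynchronous model only, not the paper's kernels or their convergence conditions.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

int main() {
  const int n = 4096, tiles = 4, sweeps = 100;
  std::vector<std::atomic<double>> x(n);
  for (int i = 0; i < n; ++i) x[i].store(i == 0 || i == n - 1 ? 1.0 : 0.0);

  auto owner = [&](int t) {
    const int w = n / tiles, lo = t * w, hi = (t == tiles - 1) ? n : lo + w;
    for (int s = 0; s < sweeps; ++s)          // no barrier between sweeps: the tile
      for (int i = std::max(lo, 1); i < std::min(hi, n - 1); ++i) {
        double left  = x[i - 1].load(std::memory_order_relaxed);  // may be stale
        double right = x[i + 1].load(std::memory_order_relaxed);  // at tile edges
        x[i].store(0.5 * (left + right), std::memory_order_relaxed);
      }                                        // stays hot in this core's cache
  };

  std::vector<std::thread> ts;
  for (int t = 0; t < tiles; ++t) ts.emplace_back(owner, t);
  for (auto& th : ts) th.join();
}
```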
ISBN: 9781605587080 (print)
For designers of large-scale parallel computers, it is greatly desired that the performance of parallel applications can be predicted at the design phase. However, this is difficult because the execution time of parallel applications is determined by several factors, including the sequential computation time in each process, the communication time, and their convolution. Despite previous efforts, it remains an open problem to estimate the sequential computation time in each process accurately and efficiently for large-scale parallel applications on non-existing target machines. This paper proposes a novel approach to predict the sequential computation time accurately and efficiently. We assume that at least one node of the target platform is available, but the whole target system need not be. We make two main technical contributions. First, we employ deterministic replay techniques to execute any process of a parallel application on a single node at real speed. As a result, we can simply measure the real sequential computation time on a target node for each process one by one. Second, we observe that the computation behavior of processes in parallel applications can be clustered into a few groups, where processes in each group have similar computation behavior. This observation helps us reduce measurement time significantly because we only need to execute representative parallel processes instead of all of them. We have implemented a performance prediction framework, called PHANTOM, which integrates the above computation-time acquisition approach with a trace-driven network simulator. We validate our approach on several platforms. For ASCI Sweep3D, the error of our approach is less than 5% on 1024 processor cores. Compared to a recent regression-based prediction approach, PHANTOM presents better prediction accuracy across different platforms.
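The representative-selection step can be sketched as follows, assuming each process is summarized by a small feature vector (say, per-phase instruction counts) and grouped by a simple distance threshold. The features and the threshold are illustrative assumptions, not PHANTOM's actual clustering.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // One feature vector per MPI-style process (invented numbers).
  std::vector<std::vector<double>> feat = {
      {1.00, 2.00, 0.5}, {1.02, 1.98, 0.5},   // similar computation behavior
      {4.00, 0.10, 3.0}, {3.95, 0.12, 3.1}};  // a second behavior class

  auto dist = [](const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (std::size_t k = 0; k < a.size(); ++k) s += (a[k] - b[k]) * (a[k] - b[k]);
    return std::sqrt(s);
  };

  const double threshold = 0.5;
  std::vector<std::size_t> representative;            // one process index per group
  for (std::size_t p = 0; p < feat.size(); ++p) {
    bool covered = false;
    for (std::size_t r : representative)
      if (dist(feat[p], feat[r]) < threshold) { covered = true; break; }
    if (!covered) representative.push_back(p);        // new group: measure this one
  }
  for (std::size_t r : representative) std::printf("replay and time process %zu\n", r);
}
```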
ISBN: 9781605587080 (print)
Most modern Chip Multiprocessors (CMP) feature a shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared caches, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on the program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.
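One concrete form of "placing threads on cores appropriately" is pinning threads that share a working set onto logical CPUs that share a cache. The sketch below uses the Linux-specific pthread affinity call; which CPU IDs actually share a cache is machine-specific and assumed here for illustration.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin an already-running std::thread to one logical CPU (Linux only).
void pin_to_cpu(std::thread& t, int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
  // Assumption: logical CPUs 0 and 1 share a cache on this machine, so two
  // threads operating on the same data are placed there rather than scattered.
  std::vector<std::thread> ts;
  ts.emplace_back([] { /* consumer of the shared working set */ });
  ts.emplace_back([] { /* producer of the shared working set */ });
  pin_to_cpu(ts[0], 0);
  pin_to_cpu(ts[1], 1);
  for (auto& t : ts) t.join();
}
```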
ISBN: 9781605587080 (print)
In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (the workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the Thread to Strand Binding, or TSB. In this paper, we show that the impact of the TSB on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor, which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing-aware TSB algorithm (TSBSched) that significantly facilitates the problem of thread to strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both the characteristics of the multithreaded processors under study and the structure of the software-pipelined applications. Once calibrated for a given processor architecture, our proposal requires neither hardware knowledge on the part of the programmer nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications, on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.
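A toy version of sharing-aware binding, assuming a traffic matrix describing how much the pipeline stages communicate and a cost that penalizes off-core communication. The brute-force enumeration, the topology, and the cost model are illustrative only, not the TSBSched algorithm.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  const int threads = 4, strands_per_core = 2;      // toy machine: 2 cores x 2 strands
  // traffic[i][j]: how heavily pipeline stage i communicates with stage j.
  int traffic[4][4] = {{0, 9, 0, 0}, {9, 0, 1, 0}, {0, 1, 0, 9}, {0, 0, 9, 0}};

  std::vector<int> strand(threads);                 // strand[t] = strand assigned to t
  std::iota(strand.begin(), strand.end(), 0);

  std::vector<int> best = strand;
  int best_cost = 1 << 30;
  do {                                              // brute force is fine at this scale
    int cost = 0;
    for (int i = 0; i < threads; ++i)
      for (int j = i + 1; j < threads; ++j) {
        bool same_core = strand[i] / strands_per_core == strand[j] / strands_per_core;
        cost += traffic[i][j] * (same_core ? 1 : 4); // off-core communication is pricier
      }
    if (cost < best_cost) { best_cost = cost; best = strand; }
  } while (std::next_permutation(strand.begin(), strand.end()));

  for (int t = 0; t < threads; ++t)
    std::printf("thread %d -> core %d, strand %d\n",
                t, best[t] / strands_per_core, best[t] % strands_per_core);
}
```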
ISBN: 9783642153952 (print)
In this paper, we explore the two well-known principles of diversification and intensification in portfolio-based parallel SAT solving. These dual concepts play an important role in several search algorithms, including local search, and appear to be a key point in modern parallel SAT solvers. To study their trade-off, we define two roles for the computational units. Some of them, classified as Masters, perform an original search strategy, ensuring diversification. The remaining units, classified as Slaves, are there to intensify their master's strategy. Several important questions have to be answered. The first is: what information should be given to a slave in order to intensify a given search effort? The second is: how often should a subordinate unit receive such information? Finally, the number of subordinate units and their connections with the search efforts has to be determined. Our results lead to an original intensification strategy which outperforms the best parallel SAT solver, ManySAT, and solves some open SAT instances.
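The master/slave organization can be sketched structurally as follows. The Guidance type, the phase vector, and the search steps are placeholders (no actual SAT solving happens); the sketch only shows where diversification (each master runs its own configuration) and intensification (slaves periodically pulling their master's guidance) enter, not the paper's ManySAT-based implementation.

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct Guidance {                       // what a master exposes to its slaves
  std::mutex m;
  std::vector<int> phases;              // e.g., preferred polarity per variable
  void publish(const std::vector<int>& p) { std::lock_guard<std::mutex> g(m); phases = p; }
  std::vector<int> snapshot() { std::lock_guard<std::mutex> g(m); return phases; }
};

void master(Guidance& out, int seed) {
  std::vector<int> phases(100, seed % 2);       // placeholder for real search state
  for (int round = 0; round < 50; ++round) {
    // ... run this master's own, deliberately different, search for a while ...
    out.publish(phases);                        // make the current direction visible
  }
}

void slave(Guidance& in) {
  for (int round = 0; round < 50; ++round) {
    std::vector<int> hint = in.snapshot();      // how often to pull is a key knob
    // ... restart a search biased toward `hint`, intensifying the master ...
    (void)hint;
  }
}

int main() {
  Guidance g0, g1;
  std::vector<std::thread> units;
  units.emplace_back(master, std::ref(g0), 0);  // two masters: diversification
  units.emplace_back(master, std::ref(g1), 1);
  units.emplace_back(slave, std::ref(g0));      // slaves intensify their master
  units.emplace_back(slave, std::ref(g1));
  for (auto& t : units) t.join();
}
```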