The era of parallel computing for the masses is here, but writing correct parallel programs remains a challenge--popular parallel environments offer no analogs for the concepts of structured and safe sequential progra...
详细信息
ISBN:
(纸本)9781605586069
The era of parallel computing for the masses is here, but writing correct parallel programs remains a challenge--popular parallel environments offer no analogs for the concepts of structured and safe sequential programming. The memory model forms the heart of the concurrency semantics of any (shared-memory) parallel language or hardware. Unfortunately, it has involved a tradeoff between programmability and performance, and has arguably been one of the most challenging and contentious areas in shared-memory specification. Recent broad community-scale efforts have finally led to a convergence in this debate, with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings in current popular languages and systems that prevent achieving the vision of structured and safe parallel programming.I will discuss the path to the above convergence, the hard lessons learned, and their implications. A cornerstone of this convergence has been the view that the memory model should be a contract between the programmer and the system--if the programmer writes well-structured (data-race-free) programs, the system will provide high programmability (sequential consistency) and performance. I will discuss why this view is the best we can do with current popular languages, and why it should be unacceptable moving forward. I will then discuss research directions that eliminate worrying about the memory model, but require rethinking popular parallel languages and hardware. In particular, I will argue for languages that eliminate data races by design and provide determinism by default, while retaining the advantages of modern object-oriented programming. I will also argue for hardware that takes advantage of such disciplined programming models to enable energy-efficient performance scalability. I will use the Deterministic parallel Java language
Over the last five years, major microprocessor manufacturers have released plans for a rapidly increasing number of cores per microprossesor, with upwards of 64 cores by 2015. In this setting, a sequential RAM compute...
详细信息
ISBN:
(纸本)9781595939739
Over the last five years, major microprocessor manufacturers have released plans for a rapidly increasing number of cores per microprossesor, with upwards of 64 cores by 2015. In this setting, a sequential RAM computer will no longer accurately reflect the architecture on which algorithms are being executed. In this paper we propose a model of low degree parallelism (LoPRAM) which builds upon the RAM and PRAM models yet better reflects recent advances in parallel (multi-core) architectures. This model supports a high level of abstraction that simplifies the design and analysis of parallel programs. More importantly we show that in many instances it naturally leads to work-optimal parallelalgorithms via simple modifications to sequential algorithms.
In this paper, we study parallelalgorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, w...
详细信息
ISBN:
(纸本)9781595939739
In this paper, we study parallelalgorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallelalgorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs.
In this paper we consider the problem of finding perfect matchings in parallel. We present a RNC algorithm with almost optimal work with respect to sequential algorithms, i.e., it uses O(n(omega)) processors, where om...
详细信息
In this paper we consider the problem of finding perfect matchings in parallel. We present a RNC algorithm with almost optimal work with respect to sequential algorithms, i.e., it uses O(n(omega)) processors, where omega is the matrix multiplication exponent. Our algorithm is based on an RNC algorithm for computing determinant of a degree one polynomial matrix which is of independent interest.
There are many algorithms to solve large sparse linear systems in parallel;however, most of them acquire synchronization and thus are lack of scalability. In this paper, we propose a new distributed numerical algorith...
详细信息
ISBN:
(纸本)9781595939739
There are many algorithms to solve large sparse linear systems in parallel;however, most of them acquire synchronization and thus are lack of scalability. In this paper, we propose a new distributed numerical algorithm, called Directed Transmission Method (DTM). DTM is a fully asynchronous, scalable and continuous-time iterative algorithm to solve the arbitrarily-large sparse linear system whose coefficient matrix is symmetric-positive-definite (SPD). DTM is able to be freely running on the heterogeneous parallel computer with arbitrary number of processors, which might be manycore microprocessors, clusters, grids, clouds, and the Internet. We proved that DTM is convergent by making use of the final value theorem of Laplacian Transformation. Numerical experiments show that DTM is efficient.
We give a randomized (Las-Vegas) parallel algorithm for computing strongly connected components of a graph with n vertices and m edges. The runtime is dominated by O(log(2) n) multi-source parallel reachability querie...
详细信息
ISBN:
(纸本)9781595939739
We give a randomized (Las-Vegas) parallel algorithm for computing strongly connected components of a graph with n vertices and m edges. The runtime is dominated by O(log(2) n) multi-source parallel reachability queries;i.e. O(log(2) n) calls to a subroutine that computes the union of the descendants of a given set of vertices in a given digraph. Our algorithm also topologically sorts the strongly connected components. Using Ullman and Yannakakis's [22] techniques for the reachability subroutine gives our algorithm runtime (O) over tilde (t) using mn/t(2) processors for any (n(2)/m)(1/3) <= t <= n. On sparse graphs, this improves the number of processors needed to compute strongly connected components and topological sort within time n(1/3) <= t <= n from the previously best known (n/t)(3) [20] to (n/t)(2).
We present Dreadlocks. an efficient new shared-memory spin lock that actively detects deadlocks. Instead of spinning on a Boolean value, each thread spins on the lock owner's per-thread digest, a compact represent...
详细信息
ISBN:
(纸本)9781595939739
We present Dreadlocks. an efficient new shared-memory spin lock that actively detects deadlocks. Instead of spinning on a Boolean value, each thread spins on the lock owner's per-thread digest, a compact representation of a portion of the lock's waits-for graph. Digests can be implemented either as bit vectors (for small numbers of threads) or as Bloom filters (for larger numbers of threads). Updates to digests are propagated dynamically as locks are acquired and released. Dreadlocks can be applied to any spin lock algorithm that allows threads to time out. Experimental results show that Dreadlocks outperform timeouts under many circumstances, and almost never do worse.
We present a mechanism for partially aborting transactions through the use of data structure checkpoints and control-flow continuations. In particular, we show that boosted transactions [9] already have built-in resto...
详细信息
ISBN:
(纸本)9781595939739
We present a mechanism for partially aborting transactions through the use of data structure checkpoints and control-flow continuations. In particular, we show that boosted transactions [9] already have built-in restoration points and afford a simple, efficient implementation. Our mechanism is far simpler than previous work, which relied on complex nesting schemes to establish checkpoints. We demonstrate syntactic advantages and we quantify the overhead of checkpoints and explore several examples, illustrating the utility of partially aborting transactions. We additionally present a novel queue-based spin lock which allows threads to timeout and differ in priority. Unlike the known lock due to Craig [5], our lock is more efficient for priority schemes of few levels.
Dynamic information flow tracking (DIFT) is an important tool for detecting common security attacks and memory bugs. A DIFT tool tracks the flow of information through a monitored program's registers and memory lo...
详细信息
ISBN:
(纸本)9781595939739
Dynamic information flow tracking (DIFT) is an important tool for detecting common security attacks and memory bugs. A DIFT tool tracks the flow of information through a monitored program's registers and memory locations as the program executes, detecting and containing/fixing problems on-the-fly. Unfortunately, sequential DIFT tools are quite slow, and DIFT is quite challenging to parallelize. In this paper, we present a new approach to parallelizing DIFT-like functionality. Extending our recent work on accelerating sequential DIFT, we consider a variant of DIFT that tracks the information flow only through unary operations (relaxed DIFT), and yet makes sense for detecting security attacks and memory bugs. We present a parallel algorithm for relaxed DIFT, based on symbolic inheritance tracking, which achieves linear speed-up asymptotically. Moreover, we describe techniques for reducing the constant factors, so that speed-ups can be obtained even with just a few processors. We implemented the algorithm in the context of a Log-Based architectures (LBA) system, which provides hardware support for logging a program trace and delivering it to other (monitoring) processors. Our simulation results on SPEC benchmarks and a video player show that our parallel relaxed DIFT reduces the overhead to as low as 1.2X using 9 monitoring cores on a 16-core chip multiprocessor.
暂无评论