ISBN:
(Print) 9781450312134
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms. Copyright 2012 ACM.
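For reference, the seven-product recursion at the heart of Strassen's method can be sketched serially as follows. This is a minimal single-node illustration of the underlying fast multiplication, not the communication-optimal parallel scheme the abstract describes; the `leaf` cutoff and power-of-two assumption are simplifications for the sketch.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Multiply square matrices via Strassen's recursion.

    Falls back to the classical product below the `leaf` cutoff;
    assumes n is a power of two for simplicity.
    """
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products replace the classical eight,
    # giving the O(n^{log2 7}) arithmetic count.
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The parallel algorithm in the paper distributes these seven subproblems across processors so that the data each processor touches matches the communication lower bound.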
The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in an arbitrary order, adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In this loop, as in many sequential loops, each iterate will only depend on a subset of the previous iterates (i.e., knowing that any one of a vertex's previous neighbors is in the MIS, or knowing that it has no previous neighbors, is sufficient to decide its fate one way or the other). This leads to a dependence structure among the iterates. If this structure is shallow, then running the iterates in parallel while respecting the dependencies can lead to an efficient parallel implementation mimicking the sequential algorithm. In this paper, we show that for any graph, and for a random ordering of the vertices, the dependence length of the sequential greedy MIS algorithm is polylogarithmic (O(log^2 n) with high probability). Our results extend previous results that show polylogarithmic bounds only for random graphs. We show similar results for greedy maximal matching (MM). For both problems we describe simple linear-work parallel algorithms based on the approach. The algorithms allow for a smooth tradeoff between more parallelism and reduced work, but always return the same result as the sequential greedy algorithms. We present experimental results that demonstrate efficiency and the tradeoff between work and parallelism. Copyright 2012 ACM.
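The sequential loop the abstract analyzes can be sketched directly. This is a minimal serial sketch (the graph representation and `seed` are illustrative choices, not from the paper); the paper's contribution is that, under a random order, the chains of dependencies this loop creates have length O(log^2 n) with high probability, so the iterates can be run in parallel rounds.

```python
import random

def greedy_mis(adj, order=None, seed=0):
    """Greedy maximal independent set over vertices 0..n-1.

    `adj` maps each vertex to a set of its neighbors. When no order
    is given, a random one is used -- the regime in which the paper
    proves a polylogarithmic dependence length.
    """
    n = len(adj)
    if order is None:
        rng = random.Random(seed)
        order = list(range(n))
        rng.shuffle(order)
    in_mis = set()
    for v in order:
        # v joins iff no earlier-processed neighbor already joined;
        # this is the only dependence between iterates.
        if not any(u in in_mis for u in adj[v]):
            in_mis.add(v)
    return in_mis
```

A vertex's fate is settled as soon as any earlier neighbor joins the set (or all earlier neighbors are rejected), which is why the dependence structure is typically much shallower than the loop itself.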
This paper studies the problem of finding a (1+ε)-approximate solution to positive semidefinite programs. These are semidefinite programs in which all matrices in the constraints and objective are positive semidefinite and all scalars are nonnegative. At FOCS '11, Jain and Yao gave an NC algorithm that requires O(1/ε^13 log^13 m log n) iterations on input n constraint matrices of dimension m-by-m, where each iteration performs at least Ω(m^ω) work since it involves computing the spectral decomposition. We present a simpler NC parallel algorithm that, on input with n constraint matrices, requires O(1/ε^4 log^4 n log(1/ε)) iterations, each of which involves only simple matrix operations and computing the trace of the product of a matrix exponential and a positive semidefinite matrix. Further, given a positive SDP in a factorized form, the total work of our algorithm is nearly linear in the number of non-zero entries in the factorization. Our algorithm can be viewed as a generalization of Young's algorithm and analysis techniques for positive linear programs (Young, FOCS '01) to the semidefinite programming setting. Copyright 2012 ACM.
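The per-iteration primitive mentioned above, the trace of a matrix exponential times a PSD matrix, can be sketched for a symmetric input via eigendecomposition. This is a dense illustrative computation, not the paper's nearly-linear-work procedure (which exploits the factorized form).

```python
import numpy as np

def trace_expm_product(A, B):
    """Tr(exp(A) @ B) for symmetric A.

    Uses A = V diag(w) V^T, so exp(A) = V diag(exp(w)) V^T;
    a dense sketch of the quantity each iteration needs.
    """
    w, V = np.linalg.eigh(A)
    expA = (V * np.exp(w)) @ V.T  # scale columns of V by exp(w)
    return float(np.trace(expA @ B))
```

For A = diag(0, 1) and B = I this returns e^0 + e^1, i.e. the trace of exp(A), matching the definition directly.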
Programs written for an architecture with n processors require a rewrite when migrated to an m-processor architecture (m > n) to benefit from the additional resources. Compiler-based solutions do not match manual op...
Performance matters. But how we improve it is changing. Historically, transparent hardware improvements meant software just ran faster. That may not necessarily be true in the future. To continue the pace of innovation, the future will need to be increasingly parallel, involving parallelism across data, threads, cores, and nodes. This talk will explore some of the dimensions of parallelism, and the opportunities and challenges they pose.
Creating components based on concurrent and parallel algorithms and data structures often requires more attention to "engineering" issues not seen with most other libraries. Components created in the "obvious" way sometimes turn out to be wrong, to perform poorly, or to be unusable in most applications, because the abstractions in which they are expressed are leaky, imprecise, or incorrectly specified. While many of these issues are encountered in other aspects of concurrent programming, the impact is accentuated when they continue to leak through to APIs provided by library components. This presentation surveys some examples and lessons learned from the design, implementation, and applications of the *** library, including those surrounding memory models, resource management and garbage collection, thread management, optimization, and code generation.
ISBN:
(Print) 9781450307437
We present a parallel solution to the Maximum-Flow (Max-Flow) problem, suitable for a modern many-core architecture. We show that by starting from a PRAM algorithm, following an established "programmer's workflow" and targeting XMT, a PRAM-inspired many-core architecture, we achieve significantly higher speed-ups than previous approaches. Comparison with the fastest known serial max-flow implementation on a modern CPU demonstrates for the first time potential for orders-of-magnitude performance improvement for Max-Flow. Using XMT, the PRAM Max-Flow algorithm is also much easier to program than for other parallel platforms, contributing a powerful example toward dual validation of both PRAM algorithmics and XMT.
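As context for the problem being parallelized, Max-Flow asks for the largest total flow from a source s to a sink t subject to edge capacities. A minimal serial baseline, in the spirit of the fast serial implementations the paper compares against (though the paper does not specify this particular algorithm), is Edmonds-Karp: repeatedly augment along shortest residual paths.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow on a dense capacity matrix cap[u][v].

    A simple serial sketch, not the PRAM/XMT algorithm from the
    paper; augments along BFS-shortest paths in the residual graph.
    """
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return total  # no augmenting path left: flow is maximum
        # Find the bottleneck residual capacity along the path.
        bottleneck = float("inf")
        v = t
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v] - flow[u][v])
            v = u
        # Push the bottleneck amount along the path.
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck
```

The BFS layers in each phase are one natural source of the fine-grained parallelism that PRAM-style algorithms and the XMT platform are designed to exploit.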