This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP);that is, Explicit Multi-Threading (X...
详细信息
ISBN:
(纸本)9780897919890
This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP);that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation is introduced;new elements are added where needed.
The proceedings contain 49 papers. The topics discussed include: fast stencil computations using fast Fourier transforms;low-span parallelalgorithms for the binary-forking model;provable advantages for graph algorith...
ISBN:
(纸本)9781450380706
The proceedings contain 49 papers. The topics discussed include: fast stencil computations using fast Fourier transforms;low-span parallelalgorithms for the binary-forking model;provable advantages for graph algorithms in spiking neural networks;algorithms for right-sizing heterogeneous data centers;efficient parallel self-adjusting computation;speed scaling with explorable uncertainty;efficient online weighted multi-level paging;paging and the address-translation problem;massively parallelalgorithms for distance approximation and spanners;efficient load-balancing through distributed token dropping;finding subgraphs in highly dynamic networks;near-optimal time-energy trade-offs for deterministic leader election;and efficient stepping algorithms and implementations for parallel shortest paths.
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. For certain large scale applications this ...
详细信息
ISBN:
(纸本)9780897918909
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. For certain large scale applications this is necessarily true. Typically, the cost models proposed for external memory algorithms have measured only the number of I/O operations, and the algorithms have been specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work based on parallelalgorithms to EM, but with limited success. In this paper we provide simulation techniques which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. In addition to the main simulation result we obtain a more comprehensive cost model for EM algorithms, which considers the total costs incurred by the algorithm including computation, I/O and communication costs.
This paper presents a work-optimal CGM algorithm that solves the Longest Increasing Subsequence Problem. It can be implemented in the CGM with P processors in O(N2/P) time and O(P) communication steps. It is the first...
详细信息
This paper presents a work-optimal CGM algorithm that solves the Longest Increasing Subsequence Problem. It can be implemented in the CGM with P processors in O(N2/P) time and O(P) communication steps. It is the first CGM algorithm for this problem and it is work-optimal since the sequential algorithm has a complexity of O(N2).
We discuss parallel sorting algorithms and their implementations suitable for cluster architectures in order to optimize cluster resources. We focus on the time spent in computation and the load balancing properties w...
详细信息
We discuss parallel sorting algorithms and their implementations suitable for cluster architectures in order to optimize cluster resources. We focus on the time spent in computation and the load balancing properties when processors are running at different speeds, i.e. correlated by a multiplicative constant factor (our weak definition of heterogeneous platform). One scheme is under study: parallel sorting by sampling (either regular sampling technique introduced by Shi and Schaeffer [J. parallel Distrib. Comput. 14 (4) (1992) 361] or the over-partitioning scheme introduced by Li and Seveik [parallel sorting by over-partitioning, in: Proceedings of the Sixth annualsymposium on parallelalgorithms and architectures, acm Press, New York, June 1994]). What is important in the paper is mainly the load balance factor and not necessary the execution time. It is clear that improved load balance leads to improved execution titre. The results presented in the paper demonstrate that load balancing for the case of computers with heterogeneous processing capacity is more challenging than for the homogeneous case. The survey, through the sorting case study, allow us to identify some algorithmic issues and software challenges to master heterogeneous cluster platforms in order to better utilize theta: data decomposition techniques, scheduling and load balancing methods. (C) 2002 Elsevier Science B.V. All rights reserved.
We present an embedding of the complete binary tree with n leaves in the √n × √n mesh, for any n = 22m where m is a positive integer. The embedding has the following properties: at most two tree nodes (one of w...
详细信息
ISBN:
(纸本)089791483X
We present an embedding of the complete binary tree with n leaves in the √n × √n mesh, for any n = 22m where m is a positive integer. The embedding has the following properties: at most two tree nodes (one of which is a leaf) are mapped onto each mesh node, paths of the tree are mapped onto edge-disjoint paths in the mesh (each mesh edge being considered as two anti-parallel directed edges) and the maximum distance from a leaf to the root of the tree is √n + O(log n) mesh steps. This embedding facilitates efficient implementation of many P-RAM algorithms on the mesh, particularly those using the balanced binary tree technique. Such an embedding offers greater flexibility of use and improves the time complexity of these implementations by a constant factor compared with previously described embeddings.
The use of efficient Galois Field Arithmetic on SIMD architecture was presented. SIMD architectures were used for obtaining high speed implementation in the fields where data parallelism was encountered. In regard wit...
详细信息
ISBN:
(纸本)9781581136616
The use of efficient Galois Field Arithmetic on SIMD architecture was presented. SIMD architectures were used for obtaining high speed implementation in the fields where data parallelism was encountered. In regard with it, the role played by the bit-slicing procedure in the computations was also discussed.
We introduce a new preconditioner for solving a symmetric Toeplitz system of equations by the conjugate gradient method. This choice leads to an algorithm which is particularly suitable for parallel computations and, ...
详细信息
ISBN:
(纸本)0897913701
We introduce a new preconditioner for solving a symmetric Toeplitz system of equations by the conjugate gradient method. This choice leads to an algorithm which is particularly suitable for parallel computations and, compared to the circulant preconditioner of [C3], has a better asymptotic convergence rate and a lower arithmetic cost per iteration.
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new ana...
详细信息
ISBN:
(纸本)9780897918909
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new analyses for several such dynamic randomized load balancing schemes. Unlike previous analyses, we do not assume that in equilibrium each server is stochastically independent from other servers. Our work extends a previous analysis of the supermarket model, a model that abstracts a simple, efficient load balancing scheme in the setting where jobs arrive at a large system of parallel processors. In this model, customers arrive at a system of n servers as a Poisson stream of rate λn, λ
Most high-level parallel programming languages allow for fine-grained parallelism. Programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to ...
详细信息
ISBN:
(纸本)9780897917179
Most high-level parallel programming languages allow for fine-grained parallelism. Programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. When executing such programs, the major concern is to dynamically schedule tasks to processors in order to minimize execution time and the amount of memory needed. In this paper, a class of parallel schedules that are provably efficient in both time and space, even for programs whose task structure is revealed only during execution are identified. Following this, an efficient dynamic scheduling algorithm that generates schedules in this class, for languages with nested fine-grained parallelism is described.
暂无评论