Gang scheduling is a resource management scheme for parallel and distributed systems that combines time-sharing with space-sharing to ensure short response times for interactive tasks and high overall system throughput. In this paper, we present and analyze a queueing theoretic model for a general gang scheduling scheme that forms the basis of a multiprogramming environment currently being developed for IBM's SP2 parallel system and for clusters of workstations. Our model and analysis can be used to tune our scheduler in order to maximize its performance on each hardware platform.
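To make the time-sharing/space-sharing combination concrete, the sketch below (a hypothetical illustration, not the paper's queueing model or the SP2 scheduler) packs fixed-size gangs into the rows of an Ousterhout-style scheduling matrix and then cycles through the rows round-robin. NUM_PROCESSORS, the job names, and the gang sizes are invented for the example.

    NUM_PROCESSORS = 8

    def pack_gangs(gangs):
        """First-fit packing of (name, size) gangs into time slots of width NUM_PROCESSORS."""
        slots = []
        for name, size in gangs:
            for slot in slots:
                if slot["free"] >= size:        # space-sharing: co-schedule within a slot
                    slot["free"] -= size
                    slot["gangs"].append(name)
                    break
            else:
                slots.append({"free": NUM_PROCESSORS - size, "gangs": [name]})
        return slots

    def run_round_robin(slots, quanta=2):
        """Time-sharing: cycle through the slots; all gangs in a slot run simultaneously."""
        for q in range(quanta):
            for i, slot in enumerate(slots):
                print(f"quantum {q}, slot {i}: gangs {slot['gangs']} run together")

    # Hypothetical gangs: (job name, number of tasks that must run simultaneously).
    run_round_robin(pack_gangs([("A", 4), ("B", 4), ("C", 6), ("D", 2)]))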
The current paper provides preliminary statements of the panelists ahead of a panel discussion at the ACM SPAA 2021 conference on the topic: algorithm-friendly architecture versus architecture-friendly algorithms. ...
ISBN: 9780897918909 (print)
This paper describes the derivation of an empirically efficient parallel two-dimensional Delaunay triangulation program from a theoretically efficient CREW PRAM algorithm. Compared to previous work, the resulting implementation is not limited to datasets with a uniform distribution of points, achieves significantly better speedups over good serial code, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for a loosely-coupled cluster of workstations, a distributed-memory multicomputer, and a shared-memory multiprocessor. The Machiavelli toolkit used to transform the nested data parallelism inherent in the divide-and-conquer algorithm into achievable task and data parallelism is also described and compared to previous techniques.
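As a rough illustration of how a divide-and-conquer algorithm of this kind maps nested data parallelism onto task parallelism, the sketch below (not the actual Delaunay code) recursively halves both the point set and the processor group, dropping to serial code once a group holds a single processor. The helpers serial_triangulate, split_points, and merge are hypothetical stubs; in the real Machiavelli/MPI implementation the two recursive calls run concurrently on disjoint sub-communicators.

    def serial_triangulate(points):
        # Placeholder: the real implementation calls an efficient serial Delaunay routine.
        return [tuple(points)]

    def split_points(points):
        pts = sorted(points)          # median cut along the x-axis
        mid = len(pts) // 2
        return pts[:mid], pts[mid:]

    def merge(left, right):
        # Placeholder: the real implementation stitches the two triangulations along the cut.
        return left + right

    def solve(points, processor_group):
        if len(processor_group) == 1 or len(points) <= 3:
            return serial_triangulate(points)          # good serial code at the leaves
        left_pts, right_pts = split_points(points)
        half = len(processor_group) // 2
        # In the MPI version these two calls run concurrently on disjoint sub-groups;
        # here they are shown sequentially for clarity.
        left = solve(left_pts, processor_group[:half])
        right = solve(right_pts, processor_group[half:])
        return merge(left, right)

    print(solve([(3, 1), (0, 0), (2, 2), (1, 4), (5, 3), (4, 0)], list(range(4))))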
In this paper we consider the problem of finding perfect matchings in parallel. We present an RNC algorithm with almost optimal work with respect to sequential algorithms, i.e., it uses O(n^ω) processors, where ω is the matrix multiplication exponent. Our algorithm is based on an RNC algorithm for computing the determinant of a degree-one polynomial matrix, which is of independent interest.
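For background, the randomized ingredient typically underlying RNC matching algorithms is Lovász's test: substitute random field elements for the indeterminates of the graph's Tutte matrix and check whether the determinant is nonzero. The sequential sketch below illustrates only this test (the prime P and the 4-cycle example are arbitrary choices); it is not the paper's processor-efficient algorithm.

    import random

    P = 1_000_003  # a prime large enough to keep the failure probability small

    def det_mod_p(a, p=P):
        """Determinant of an integer matrix modulo prime p via Gaussian elimination."""
        n = len(a)
        a = [row[:] for row in a]
        det = 1
        for col in range(n):
            pivot = next((r for r in range(col, n) if a[r][col] % p), None)
            if pivot is None:
                return 0
            if pivot != col:
                a[col], a[pivot] = a[pivot], a[col]
                det = -det
            det = det * a[col][col] % p
            inv = pow(a[col][col], p - 2, p)
            for r in range(col + 1, n):
                factor = a[r][col] * inv % p
                for c in range(col, n):
                    a[r][c] = (a[r][c] - factor * a[col][c]) % p
        return det % p

    def probably_has_perfect_matching(n, edges):
        """Build a random Tutte matrix of an n-vertex graph and test its determinant."""
        t = [[0] * n for _ in range(n)]
        for u, v in edges:
            x = random.randrange(1, P)
            t[u][v], t[v][u] = x, (-x) % P   # skew-symmetric Tutte matrix
        return det_mod_p(t) != 0

    # Almost surely prints True: the 4-cycle has a perfect matching.
    print(probably_has_perfect_matching(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))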
ISBN: 9780897918909 (print)
An earlier parallel list ranking algorithm performs well for problem sizes N that are extremely large in comparison to the number of PUs P. However, no existing algorithm gives good performance for reasonable loads. We present a novel family of algorithms that achieves a better trade-off between the number of start-ups and the routing volume. We have implemented them on an Intel Paragon, and they turn out to considerably outperform all earlier algorithms: with P = 2 the sequential algorithm is already beaten for N = 25,000; for P = 100 and N = 10^7, the speed-up is 21, and for N = 10^8 it even reaches 30. A modification of one of our algorithms settles a theoretical question: we show that on one-dimensional processor arrays, list ranking can be solved with a number of steps equal to the diameter of the network.
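For readers unfamiliar with the problem, list ranking asks for each node's distance from the end of a linked list. The sketch below simulates the classic pointer-jumping approach (Wyllie) sequentially; it only defines the problem and the basic parallel primitive, and is not one of the start-up/routing-volume trade-off algorithms introduced in the paper.

    import math

    def list_rank(succ):
        """succ[i] is the successor of node i; the last node points to itself."""
        n = len(succ)
        rank = [0 if succ[i] == i else 1 for i in range(n)]
        nxt = succ[:]
        # ceil(log2 n) synchronous pointer-jumping rounds; on a PRAM each round
        # is a constant number of parallel steps.
        for _ in range(max(1, math.ceil(math.log2(n)))):
            rank = [rank[i] + rank[nxt[i]] for i in range(n)]
            nxt = [nxt[nxt[i]] for i in range(n)]
        return rank

    # List 0 -> 1 -> 2 -> 3: the ranks are the distances to the tail.
    print(list_rank([1, 2, 3, 3]))   # [3, 2, 1, 0]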
ISBN: 9781595939739 (print)
There are many algorithms for solving large sparse linear systems in parallel; however, most of them require synchronization and thus lack scalability. In this paper, we propose a new distributed numerical algorithm, called the Directed Transmission Method (DTM). DTM is a fully asynchronous, scalable, continuous-time iterative algorithm for solving arbitrarily large sparse linear systems whose coefficient matrix is symmetric positive definite (SPD). DTM can run freely on heterogeneous parallel computers with an arbitrary number of processors, which might be many-core microprocessors, clusters, grids, clouds, or the Internet. We prove that DTM is convergent by making use of the final value theorem of the Laplace transform. Numerical experiments show that DTM is efficient.
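The DTM algorithm itself is not reproduced here; as a hedged stand-in, the sketch below shows the much simpler classical idea it generalizes, namely an iterative scheme in which each unknown repeatedly re-solves its own equation from its neighbours' current values (a Jacobi sweep), on a small diagonally dominant SPD system. DTM, by contrast, is asynchronous and continuous-time.

    def jacobi(a, b, iters=200):
        """Plain synchronous Jacobi iteration; a stand-in, not DTM."""
        n = len(b)
        x = [0.0] * n
        for _ in range(iters):
            x = [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
                 for i in range(n)]
        return x

    # Small tridiagonal SPD example; the exact solution is [1, 1, 1].
    a = [[3.0, -1.0, 0.0],
         [-1.0, 3.0, -1.0],
         [0.0, -1.0, 3.0]]
    b = [2.0, 1.0, 2.0]
    print(jacobi(a, b))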
We discuss parallel sorting algorithms and their implementations suitable for cluster architectures, with the goal of making better use of cluster resources. We focus on the time spent in computation and on the load-balancing properties when processors run at different speeds, i.e., speeds correlated by a multiplicative constant factor (our weak definition of a heterogeneous platform). One scheme is under study: parallel sorting by sampling, using either the regular sampling technique introduced by Shi and Schaeffer [J. Parallel Distrib. Comput. 14 (4) (1992) 361] or the over-partitioning scheme introduced by Li and Sevcik [Parallel sorting by over-partitioning, in: Proceedings of the Sixth Annual Symposium on Parallel Algorithms and Architectures, ACM Press, New York, June 1994]. What matters most in the paper is the load-balance factor rather than the execution time; it is clear that improved load balance leads to improved execution time. The results presented in the paper demonstrate that load balancing on computers with heterogeneous processing capacity is more challenging than in the homogeneous case. The survey, through the sorting case study, allows us to identify algorithmic issues and software challenges that must be mastered on heterogeneous cluster platforms in order to use them better: data decomposition techniques, scheduling, and load-balancing methods. (C) 2002 Elsevier Science B.V. All rights reserved.
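As a reference point for the regular-sampling scheme mentioned above, the sketch below simulates parallel sorting by regular sampling sequentially, treating each "processor" as a list slice. The sampling positions are simplified, so it illustrates the pivot-selection and exchange structure rather than the exact scheme of Shi and Schaeffer.

    import bisect
    import random

    def psrs(data, p):
        # 1. Split into p blocks and sort each block locally.
        n = len(data)
        blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
        # 2. Each block contributes p regularly spaced samples.
        samples = sorted(s for b in blocks for s in b[::max(1, len(b) // p)][:p])
        # 3. Pick p - 1 pivots from the gathered samples.
        pivots = samples[p::p][:p - 1]
        # 4. Partition every block by the pivots; bucket i collects the i-th
        #    partition of every block (the "all-to-all exchange").
        buckets = [[] for _ in range(p)]
        for b in blocks:
            cuts = [0] + [bisect.bisect_right(b, piv) for piv in pivots] + [len(b)]
            for i in range(p):
                buckets[i].extend(b[cuts[i]:cuts[i + 1]])
        # 5. Each bucket is sorted locally; concatenation gives the global order.
        return [x for bucket in buckets for x in sorted(bucket)]

    data = [random.randrange(1000) for _ in range(100)]
    assert psrs(data, 4) == sorted(data)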
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LOGP) account for bandwidth limitations using a per-processor parameter g>1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter mΩ(√lg p) separation known previously.
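As a small worked illustration of the per-processor bandwidth parameter described above (the numbers are hypothetical): a processor that sends or receives h messages in a communication phase is charged g·h time.

    # Hypothetical numbers illustrating the per-processor bandwidth parameter g:
    # sending or receiving h messages costs g * h time.
    def comm_time(g, h):
        return g * h

    print(comm_time(g=4, h=10))   # 40 time units for a 10-message exchange at g = 4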
ISBN: 9780897919890 (print)
In this paper we develop models for and analyze several randomized work stealing algorithms in a dynamic setting. Our models represent the limiting behavior of systems as the number of processors grows to infinity using differential equations. The advantages of this approach include the ability to model a large variety of systems and to provide accurate numerical approximations of system behavior even when the number of processors is relatively small. We show how this approach can yield significant intuition about the behavior of work stealing algorithms in realistic settings.
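To illustrate the methodology (though not the paper's specific systems), the sketch below Euler-integrates the standard limiting differential equations for the baseline case of independent queues with Poisson arrivals at rate lam and unit-rate service, where s_i(t) is the fraction of queues holding at least i tasks; work-stealing models add extra transfer terms to equations of this form. All parameter values below are arbitrary.

    def integrate(lam=0.8, max_len=50, dt=0.01, t_end=500.0):
        """Euler-integrate ds_i/dt = lam*(s[i-1]-s[i]) - (s[i]-s[i+1]), truncated at max_len."""
        s = [1.0] + [0.0] * max_len       # s[0] = 1 by convention; queues start empty
        for _ in range(int(t_end / dt)):
            new = s[:]
            for i in range(1, max_len):
                new[i] += dt * (lam * (s[i - 1] - s[i]) - (s[i] - s[i + 1]))
            s = new
        return s

    s = integrate()
    mean_queue_length = sum(s[1:])        # E[length] = sum_i P(length >= i)
    print(round(mean_queue_length, 2))    # approaches lam / (1 - lam) = 4 as t_end grows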
ISBN: 9780897918091 (print)
The effectiveness of private caches for processors is studied. Since the time for all processors to access the shared memory simultaneously is usually much longer than the time for a processor to access its own private cache, scheduling with private caches falls into the distributed-memory model, where the lower bound applies. The effectiveness of private caches is shown by proving that a version of the Dynamic Equipartition Scheduling Policy (DEQ) achieves a mean response time within five times the optimal mean response time, measured in cache clock time, for a large class of parallel jobs well accepted in the parallel-scheduling community. This demonstrates an improvement in system performance from using private caches over a purely shared memory.
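As a hedged sketch of the equipartitioning idea behind DEQ (the allocation rule as commonly described in the parallel-scheduling literature, not this paper's cache-aware analysis), the code below divides P processors equally among the current jobs while never giving a job more processors than it can use, redistributing any surplus. The job names and parallelism figures are hypothetical.

    def equipartition(P, demands):
        """demands: {job: max useful processors}; returns {job: processors allocated}."""
        alloc = {j: 0 for j in demands}
        need = dict(demands)                    # processors each job could still absorb
        while need and P > 0:
            share = max(1, P // len(need))      # equal share of what remains
            for j in list(need):
                give = min(share, need[j], P)   # never exceed a job's parallelism
                alloc[j] += give
                need[j] -= give
                P -= give
                if need[j] == 0:
                    del need[j]
        return alloc

    # Hypothetical workload: 16 processors, four jobs with different parallelism.
    print(equipartition(16, {"A": 2, "B": 20, "C": 5, "D": 9}))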