ISBN (Print): 0818684038
Experimental data validating some of the proposed parallel computation models on the Intel Paragon are presented. This architecture is characterized by high bandwidth and a relatively high startup cost per message transmission, which makes it extremely important to employ bulk transfers. The models considered are the BSP model, in which all messages are assumed to have a fixed short size, and the BPRAM, in which block transfers are rewarded.
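As a concrete illustration of why bulk transfers matter under such a cost structure, the following Python sketch compares the cost of sending data as many fixed-size messages (BSP-style) versus one block transfer (BPRAM-style). The functions and all parameter values are assumptions chosen for demonstration; they are not measurements from the Paragon experiments reported here.

# Toy comparison: fixed-size messages (BSP-style) vs. one block transfer
# (BPRAM-style).  Parameter values are illustrative assumptions, not
# measured Intel Paragon figures.

def bsp_cost(total_bytes, msg_size, per_byte, startup):
    """Cost of sending total_bytes as fixed-size messages, paying the startup per message."""
    num_msgs = -(-total_bytes // msg_size)        # ceiling division
    return num_msgs * (startup + per_byte * msg_size)

def bpram_cost(total_bytes, per_byte, startup):
    """Cost of sending total_bytes as a single block transfer, paying the startup once."""
    return startup + per_byte * total_bytes

if __name__ == "__main__":
    kb = 1 << 10
    for size in (4 * kb, 64 * kb, 1024 * kb):
        print(size,
              bsp_cost(size, msg_size=256, per_byte=0.01, startup=50.0),
              bpram_cost(size, per_byte=0.01, startup=50.0))
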
ISBN (Print): 0818684038
Processor scheduling in distributed-memory systems has received considerable attention in recent years. Several commercial distributed-memory systems use space-sharing processor scheduling. In space-sharing, the set of processors in a system is partitioned and each partition is assigned for the exclusive use of a job. Space-sharing policies can be divided into fixed, static, and dynamic categories. For distributed-memory systems, dynamic policies incur high overhead; thus, static policies are considered here, as they provide better performance than fixed policies. Several static policies have been proposed in the literature. In a previously proposed adaptive static policy, the partition size is a function of the number of queued jobs. This policy, however, tends to underutilize the system resources. To improve its performance, we propose a new policy in which the partition size is a function of the total number of jobs in the system, rather than only the queued jobs. The results presented here demonstrate that the new policy performs substantially better than the original policy across the various workload and system parameters considered. Another major contribution is the evaluation of the performance sensitivity to job structure, variances in inter-arrival times and job service times, and network topology.
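A minimal sketch of the difference between the two adaptive rules may help; the equal-division form used below is an assumption for illustration only, since the abstract does not give the exact formula of either policy.

# Illustrative comparison of the two adaptive partition-sizing rules.  The
# equal-division form is an assumed stand-in; the actual functional forms
# are defined in the respective papers.

def partition_size(total_processors, jobs_considered):
    """Partition size as an equal share of the processors among the jobs considered."""
    return max(1, total_processors // max(1, jobs_considered))

P, queued_jobs, running_jobs = 64, 2, 6
old_rule = partition_size(P, queued_jobs)                 # queued jobs only
new_rule = partition_size(P, queued_jobs + running_jobs)  # all jobs in the system
print(old_rule, new_rule)   # the new rule yields smaller partitions when many jobs are already running
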
ISBN (Print): 0818684038
Existing approaches to debugging distributed systems involve a cycle of passive observation followed by computation replaying. We propose predicate control as an active approach to debugging such systems. The predicate control approach involves a cycle of observation followed by controlled replaying of computations, based on that observation. We formalize the predicate control problem for both off-line and on-line scenarios. We prove that off-line predicate control for general boolean predicates is NP-hard. However, we provide an efficient solution for off-line predicate control for the class of disjunctive predicates. We further solve on-line predicate control for disjunctive predicates under certain restrictions on the system. Lastly, we demonstrate how both off-line and on-line predicate control facilitate distributed debugging by allowing the programmer to control computations to maintain global safety properties.
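To make the interval-chaining intuition behind maintaining a disjunctive predicate concrete, here is a toy Python sketch. It assumes, for simplicity, a global clock in which each process reports the intervals during which its local predicate is true; the actual algorithms operate on the causal partial order of events and insert control (synchronization) messages, which is not modeled here.

# Toy sketch: can a disjunctive predicate l1 v ... v ln be kept true
# throughout an execution of length `horizon`?  Under the simplifying
# global-clock assumption, this reduces to chaining per-process "true"
# intervals so that their union covers [0, horizon).  A gap marks a point
# where control (delaying some process) would be required.

def chain_true_intervals(intervals, horizon):
    """Greedily chain true intervals to cover [0, horizon); None if impossible."""
    todo = sorted(intervals)          # sort by start time
    covered, chain, i = 0, [], 0
    while covered < horizon:
        best, best_end = None, covered
        while i < len(todo) and todo[i][0] <= covered:
            if todo[i][1] > best_end:
                best, best_end = todo[i], todo[i][1]
            i += 1
        if best is None:              # uncovered gap: the disjunction must go false
            return None
        chain.append(best)
        covered = best_end
    return chain

# Process 1's predicate is true on [0,4) and [9,12); process 2's on [3,10).
print(chain_true_intervals([(0, 4), (9, 12), (3, 10)], horizon=12))
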
ISBN (Print): 0818684038
Advances in optical technology have increased interest in multiprocessor architectures based on lightwave networks because of the vast bandwidth available. In this paper we propose a passive star multi-hop lightwave network called stack-Kautz, based on the Kautz graph. We show that this architecture is very cost-effective with respect to its resource requirements. We also propose control protocols for accessing the optical passive star couplers, which improve on the bit complexity of the control sequence proposed in the literature for the Partitioned Optical Passive Star network. Finally, we show through simulation that these control protocols efficiently implement shortest-path routing on the stack-Kautz network.
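For reference, the Kautz graph on which the stack-Kautz topology is built can be generated directly from its string definition. The sketch below constructs only the underlying Kautz digraph; the stacking of processors on optical passive star couplers described in the paper is not modeled.

# Kautz digraph K(d, k): vertices are strings of length k+1 over an alphabet
# of d+1 symbols with no two consecutive symbols equal; there is an arc from
# u = (s1,...,s_{k+1}) to every v = (s2,...,s_{k+1}, x) with x != s_{k+1}.

from itertools import product

def kautz_graph(d, k):
    alphabet = range(d + 1)
    vertices = [v for v in product(alphabet, repeat=k + 1)
                if all(a != b for a, b in zip(v, v[1:]))]
    edges = [(u, u[1:] + (x,)) for u in vertices
             for x in alphabet if x != u[-1]]
    return vertices, edges

V, E = kautz_graph(d=2, k=1)
print(len(V), len(E))   # (d+1)*d**k = 6 vertices, each with out-degree d = 2
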
ISBN (Print): 0818684038
We study the problem of mapping the N nodes of a complete t-ary tree onto M memory modules so that they can be accessed in parallel by templates, i.e. distinct sets of nodes. Typical templates for accessing trees are subtrees, root-to-leaf paths, or levels, which will be referred to as elementary templates. In this paper, we first propose a new mapping algorithm for accessing both paths and subtrees of size M with an optimal number of conflicts (i.e., only one conflict) when the number of memory modules is limited to M. We also propose another mapping algorithm for a composite template, say V (as versatile), such that its size is not fixed and an instance of V is composed of any combination of c instances of elementary templates. The number of conflicts for accessing an s-node instance of template V is O((s/√M) log M + c), and the memory load is 1 + o(1), where the load is defined as the ratio between the maximum and minimum number of data items mapped onto each memory module.
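The quantities involved (conflicts and load) can be illustrated with a deliberately naive mapping; the sketch below is not the mapping proposed in the paper, it only shows how a poor placement of a complete binary tree produces many conflicts when a root-to-leaf path template is accessed.

# Naive mapping of a complete binary tree (nodes numbered 1..N in heap
# order) onto M memory modules: node i goes to module i mod M.  This is NOT
# the paper's mapping; it only illustrates how conflicts and load are counted.

from collections import Counter

def module_of(node, M):
    return node % M

def path_conflicts(leaf, M):
    """Conflicts (accesses beyond the first per module) on the root-to-leaf path to `leaf`."""
    modules = []
    node = leaf
    while node >= 1:
        modules.append(module_of(node, M))
        node //= 2
    return sum(c - 1 for c in Counter(modules).values())

def memory_load(N, M):
    """Ratio between the most and least loaded modules for an N-node tree."""
    counts = Counter(module_of(i, M) for i in range(1, N + 1))
    return max(counts.values()) / min(counts.values())

M, depth = 8, 10
leftmost_leaf = 1 << depth
print(path_conflicts(leftmost_leaf, M), memory_load((1 << (depth + 1)) - 1, M))
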
ISBN (Print): 0818684038
Thread migration is established as a mechanism for achieving dynamic load sharing. However, fine-grained migration has not been used, due to high thread and messaging overheads. This paper describes a fine-grained thread migration system whose extensible event mechanism permits an efficient interface between threads and communications, without compromising the modularity and performance of either. Migration is supported by user-level primitives, on top of which applications may implement different migration policies. The system is portable and can be used directly or serve as a compilation target for parallel languages. The system runs on a cluster of SMPs, and the observed performance is orders of magnitude better than other reported measurements.
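As an illustration of the kind of user-level migration policy such primitives enable, here is a hypothetical sketch; the function name, the threshold, and the cost model are assumptions made for this example and are not part of the system described in the paper.

# Hypothetical user-level migration policy: move a thread only when the load
# imbalance outweighs the (assumed) cost of migrating it.  Names, threshold,
# and cost model are illustrative assumptions, not the system's actual API.

THRESHOLD = 2   # migrate only if the local node is at least this much busier

def should_migrate(local_queue_len, remote_queue_len, migration_cost, avg_task_cost):
    imbalance = local_queue_len - remote_queue_len
    return imbalance >= THRESHOLD and imbalance * avg_task_cost > migration_cost

print(should_migrate(local_queue_len=6, remote_queue_len=1,
                     migration_cost=3.0, avg_task_cost=1.0))   # True: worth moving a thread
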
ISBN (Print): 0818684038
The Loop-Level Process Control (LLPC) policy [9] dynamically adjusts the number of threads an application is allowed to execute, based on the application's available parallelism and the overall system load. This study demonstrates the feasibility of incorporating the LLPC strategy into an existing commercial operating system and parallelizing compiler, and provides further evidence of the performance improvement that is possible using this dynamic allocation strategy. In this implementation, applications are automatically parallelized and enhanced with the appropriate LLPC hooks so that each application interacts with the modified version of the Solaris operating system. The parallelism of the applications is then dynamically and automatically adjusted when they are executed in a multiprogrammed environment, so that all applications obtain a fair share of the total processing resources.
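A minimal sketch of an LLPC-style allocation rule follows; the min-of-parallelism-and-fair-share form is an assumption made for illustration, and the precise rule is the one defined in the cited work.

# LLPC-style thread allocation (assumed form): a job gets no more threads
# than both its usable loop-level parallelism and its fair share of the
# processors, so that multiprogrammed jobs do not oversubscribe the machine.

def threads_allowed(loop_parallelism, total_processors, running_jobs):
    fair_share = max(1, total_processors // max(1, running_jobs))
    return max(1, min(loop_parallelism, fair_share))

# A 16-processor machine running 3 parallel jobs:
for p in (2, 8, 32):     # a job's available loop-level parallelism
    print(p, threads_allowed(p, total_processors=16, running_jobs=3))
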
ISBN (Print): 0818684038
Predicting the running time of a parallel program is useful for determining the optimal values for the parameters of the implementation and the optimal mapping of data on processors. However, deriving an explicit formula for the running time of a given parallel program is a difficult task. We present a new method for the analysis of parallel programs: simulating the execution of parallel programs by following their control flow and by determining, for each processor, the sequence of send and receive operations according to the LogGP model. We developed two algorithms to simulate the LogGP communication between processors, and we tested them on the blocked parallel version of the Gaussian Elimination algorithm on the Meiko CS-2 parallel machine. Our implementation showed that the LogGP simulation is able to detect the nonlinear behavior of the program running times, to indicate the differences in running times for different data layouts, and to find the locally optimal value of the block size with acceptable precision.
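For reference, the per-message cost charged by the LogGP model can be written down directly; the sketch below uses the standard end-to-end estimate for a single long message, with placeholder parameter values rather than Meiko CS-2 measurements.

# LogGP parameters: L (latency), o (per-message send/receive overhead),
# g (gap between small messages, unused here), G (gap per byte for long
# messages).  End-to-end time of one k-byte message:
#     T(k) = o_send + (k - 1) * G + L + o_recv
# Parameter values below are placeholders, not Meiko CS-2 measurements.

def loggp_message_time(k_bytes, L, o, G):
    return o + (k_bytes - 1) * G + L + o   # assumes o_send = o_recv = o

L, o, G = 10.0, 2.0, 0.05   # microseconds and microseconds per byte (assumed)
for k in (8, 1024, 65536):
    print(k, loggp_message_time(k, L, o, G))
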
ISBN (Print): 0818684038
Using a directed acyclic graph (dag) model of algorithms, we solve a problem related to precedence-constrained multiprocessor schedules for array computations: given a sequence of dags and linear schedules parametrized by n, compute a lower bound on the number of processors required by the schedule as a function of n. In our formulation, the number of tasks that are scheduled for execution during any fixed time step is the number of non-negative integer solutions d(n) to a set of parametric linear Diophantine equations. We illustrate an algorithm based on generating functions for constructing a formula for these numbers d(n). The algorithm has been implemented as a Mathematica program. An example run and the symbolic formula for processor lower bounds automatically produced by the algorithm for Gaussian Elimination are presented.
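The counting problem at the core of this approach is easy to state concretely. The sketch below evaluates d(n) numerically for a single equation by dynamic programming; it is only an illustration of what is being counted, not the paper's generating-function algorithm, which produces a symbolic formula in n.

# d(n) = number of non-negative integer solutions of
#     a1*x1 + a2*x2 + ... + am*xm = n,
# i.e. the coefficient of t^n in prod_i 1/(1 - t^{a_i}).  A single equation
# is used here for simplicity; the paper handles sets of parametric equations.

def count_solutions(coeffs, n):
    d = [1] + [0] * n                 # d[v] = number of solutions summing to v
    for a in coeffs:
        for v in range(a, n + 1):
            d[v] += d[v - a]
    return d[n]

# x1 + 2*x2 + 3*x3 = n
print([count_solutions([1, 2, 3], n) for n in range(8)])   # 1, 1, 2, 3, 4, 5, 7, 8
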
ISBN (Print): 0818684038
We compare the performance of systems consisting of one large cluster containing q processors with systems where processors are grouped into k clusters containing u processors each. A parallel program, consisting of n processes, is executed on this system. Processes may be relocated between the processors within a cluster. They may, however, not be relocated from one cluster to another. The performance criterion is the completion time of the parallel program. We present two functions: g(n, k, u, q) and G(k, u, q). Provided that we can find optimal or near-optimal schedules, these functions provide optimal upper bounds on the gain of using one cluster containing q processors compared to using k clusters containing u processors each. The function g(n, k, u, q) is valid for programs with n processes, whereas G(k, u, q) depends only on the two multiprocessor architectures. By evaluating g(n, k, u, q) and G(k, u, q), we show that the gain of increasing the cluster size from 1 to 2 and from 2 to 4 is relatively large. However, the gain of using clusters larger than 4 is very limited.
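A toy completion-time comparison along these lines can be set up in a few lines; it is not the paper's bounds g and G, only a simulation-style illustration of the restriction that processes cannot leave their cluster. LPT list scheduling stands in for an optimal or near-optimal schedule.

# Compare makespans: (a) one cluster of q processors vs. (b) k clusters of u
# processors each, with processes assigned to clusters round-robin and never
# relocated across clusters.  LPT list scheduling approximates an optimal
# schedule inside each (sub)system.  This is an illustration only, not the
# bounds g(n,k,u,q) and G(k,u,q) derived in the paper.

import heapq

def lpt_makespan(times, processors):
    finish = [0.0] * processors
    heapq.heapify(finish)
    for t in sorted(times, reverse=True):          # longest processing time first
        heapq.heappush(finish, heapq.heappop(finish) + t)
    return max(finish)

def clustered_makespan(times, k, u):
    buckets = [[] for _ in range(k)]
    for i, t in enumerate(sorted(times, reverse=True)):
        buckets[i % k].append(t)                   # fixed process-to-cluster assignment
    return max(lpt_makespan(b, u) for b in buckets)

times = [7, 7, 5, 4, 4, 3, 2, 2, 1, 1, 1, 1]       # process run times (arbitrary)
q, k, u = 8, 4, 2                                  # q = k * u processors in both systems
print(lpt_makespan(times, q), clustered_makespan(times, k, u))
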