In this paper we investigate the problem of parallel evaluation of functional programs. We have developed a novel approach to dealing with sharing in graph reduction: share nodes are introduced to handle sharing explicitly. Using share nodes, we obtain a garbage collection method that is on-the-fly (real-time), parallel, distributed, and incremental. In our parallel graph reducer, copying can be done in parallel, allowing the number of program tree nodes to grow exponentially. Since each node represents a simple operation, the growth of the program trees is a natural distribution of the computation work; load balancing becomes very easy and can be done automatically. A simulator has been implemented to simulate a tree-structured parallel computer running our graph reducer. Examples such as parallel matrix addition and multiplication have been tested.
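To make the role of share nodes concrete, here is a minimal sketch (not the authors' implementation; the node types and the share/release helpers are hypothetical) of a reduction graph in which sharing is represented by explicit share nodes carrying a reference count, so that subgraphs can be reclaimed incrementally and locally:

```python
class App:                            # application node: (fun arg)
    def __init__(self, fun, arg):
        self.fun, self.arg = fun, arg

class Leaf:                           # constant or built-in operator
    def __init__(self, value):
        self.value = value

class Share:                          # explicit sharing: one child, many parents
    def __init__(self, child):
        self.child = child
        self.refs = 0                 # number of parents currently pointing here

def share(node):
    """Route a new parent through a Share node (creating one if needed)."""
    s = node if isinstance(node, Share) else Share(node)
    s.refs += 1
    return s

def release(node):
    """Drop one reference; reclaim the subgraph as soon as nobody shares it."""
    if isinstance(node, Share):
        node.refs -= 1
        if node.refs == 0:            # incremental, local reclamation
            release(node.child)
            node.child = None
    elif isinstance(node, App):
        release(node.fun)
        release(node.arg)

# Example: in (* x x) the argument x is shared by both operands.
x = Leaf(7)
sx = share(x)
sx = share(sx)                        # second parent registers its interest
expr = App(App(Leaf('*'), sx), sx)
```

Because a share node records how many parents currently reference it, reclamation is a purely local decision and needs no global pause, which is the property the abstract describes as on-the-fly and incremental.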
Assignment of the tasks of a parallel program onto the processors of a distributed-memory system is critical to obtaining minimal program completion time by minimizing communication overhead. This paper presents task assignment heuristics for wormhole-routed systems. A Temporal Communication Graph is used to model task graphs and to identify spatial and temporal link contention. The interplay between the degree of routing adaptivity, topology, application characteristics, and task assignment is studied by evaluating random task graphs with an event-driven simulator. The study indicates that, even for systems supporting fully adaptive routing, efficient task assignment is necessary to reduce program completion time, especially for communication-bound applications.
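As a rough illustration of the kind of heuristic in play (a generic greedy placement, not the heuristics proposed in the paper; greedy_assign and its parameters are made up for the example), the sketch below places the tasks of a weighted task graph on a 2-D mesh, charging each candidate processor the communication volume times the hop count to already placed neighbours as a crude proxy for spatial link contention:

```python
def hops(p, q):
    """Manhattan distance, matching dimension-order (XY) wormhole routing."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def greedy_assign(tasks, edges, mesh_w, mesh_h):
    """tasks: task ids in placement order; edges: {(t1, t2): bytes exchanged}."""
    free = [(x, y) for x in range(mesh_w) for y in range(mesh_h)]
    placement = {}
    for t in tasks:
        def cost(p):
            total = 0
            for (a, b), vol in edges.items():
                if t not in (a, b):
                    continue
                other = b if a == t else a
                if other in placement:
                    total += vol * hops(p, placement[other])
            return total
        best = min(free, key=cost)        # cheapest processor for this task
        free.remove(best)
        placement[t] = best
    return placement

# Toy usage: a four-task chain on a 2 x 2 mesh.
print(greedy_assign([0, 1, 2, 3], {(0, 1): 10, (1, 2): 5, (2, 3): 10}, 2, 2))
```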
Most of the existing loop partitioning schemes require the dependence distances between the iterations of the loops to be constants which means that these schemes will not be applicable when the dependence distance is variable. When each iteration of a loop is treated as a sequential instruction stream, parallel execution of these loops is possible if the data dependency between each iteration can be resolved during run-time. In this paper, we proposed a parallel architecture which is able to resolve these kind of undetermined data dependencies. A notable feature of the proposed architecture is that it will initiate executing multiple instruction streams before the data dependency between the instruction streams, if any, has been resolved and will restart the execution if necessary.
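The restart-on-misspeculation idea can be illustrated with a small software analogue (a sketch of the general technique, not the proposed hardware): all iterations are started before their cross-iteration dependences are known, and any iteration that turns out to have read a location written by an earlier iteration is re-executed on the up-to-date state.

```python
def speculative_loop(n, body, reads, writes, mem):
    """mem: shared memory as a dict.  body(i, view) returns {loc: value} for the
    locations it writes; reads(i)/writes(i) are the run-time access sets."""
    speculative = [body(i, dict(mem)) for i in range(n)]  # start all iterations at once
    written_so_far = set()
    for i in range(n):                                    # commit in original loop order
        if reads(i) & written_so_far:                     # cross-iteration dependence hit
            speculative[i] = body(i, dict(mem))           # restart iteration i on real state
        mem.update(speculative[i])
        written_so_far |= writes(i)
    return mem

# Toy use: a[idx[i]] += 1, where idx is only known at run time.
idx = [0, 2, 0, 1]
mem = {0: 0, 1: 0, 2: 0}
speculative_loop(4,
                 lambda i, v: {idx[i]: v[idx[i]] + 1},
                 lambda i: {idx[i]}, lambda i: {idx[i]}, mem)
print(mem)   # {0: 2, 1: 1, 2: 1}
```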
The Parallel State Processor (PSP) is intended as a design concept for the basic engine in large-scale multiprocessors. Rather than switching between threads, the PSP maintains a basic processor state which is itself a parallel conjunction of fetch and decode along multiple threads, as well as the synchronization that occurs when operands are passed between them. This view gives rise to the possibility of an intelligent parallel state, i.e., one which dynamically maximizes the lifetime of the threads it comprises by having them provide one another with the arguments they require over time. We present experimental data verifying that such behavior can be observed in actual code.
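A toy software model of such a parallel state (using a made-up three-instruction toy ISA, not the PSP design itself) can show threads keeping one another alive by passing operands: each cycle fetches and decodes one instruction per live thread, and a thread waiting for an operand stalls until a peer sends one.

```python
from collections import deque

def run_psp(threads):
    """threads: {tid: deque of instructions}.  Toy instruction forms:
       ('send', dst, value) -- pass an operand to thread dst
       ('recv',)            -- consume an operand, stalling until one arrives
       ('op',)              -- plain local work."""
    mailbox = {tid: deque() for tid in threads}
    live = dict(threads)
    while any(live.values()):
        progressed = False
        for tid, stream in live.items():          # one fetch/decode per live thread
            if not stream:
                continue
            op = stream[0]
            if op[0] == 'recv' and not mailbox[tid]:
                continue                           # thread stalls this cycle
            stream.popleft()
            progressed = True
            if op[0] == 'send':
                mailbox[op[1]].append(op[2])       # operand passed between threads
            elif op[0] == 'recv':
                mailbox[tid].popleft()
        if not progressed:
            raise RuntimeError("all threads stalled waiting for operands")

# Two threads keeping each other alive by exchanging operands.
run_psp({
    'A': deque([('send', 'B', 1), ('recv',), ('op',)]),
    'B': deque([('recv',), ('send', 'A', 2), ('op',)]),
})
```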
Load balancing is a task scheduling scheme for distributed computing systems that transfers tasks to idle processors to balance the workload so that the average response time can be reduced. Workload information must be maintained to make task migration decisions; however, doing so may incur a large message overhead. In this paper, we adopt a strategy that assigns a sending set to each processor so that the load information of a processor is sent only to the processors in its sending set. As a result, the number of messages required for each update can be reduced to the square root of the number of processors in the system. An evaluation of the load balancing scheme is also presented in this paper.
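One natural construction that yields a sending set of the stated size (a guess for illustration; the paper's construction may differ) is to arrange the P processors in a logical sqrt(P) x sqrt(P) grid and let each processor report its load only to the other members of its row:

```python
import math

def sending_set(p, num_procs):
    """Processors in p's row of a sqrt(P) x sqrt(P) logical grid (P a perfect square)."""
    side = math.isqrt(num_procs)
    row = p // side
    return [row * side + c for c in range(side) if row * side + c != p]

def broadcast_load(p, load, views, num_procs):
    """Send processor p's current load only to its sending set."""
    for q in sending_set(p, num_procs):
        views[q][p] = load                   # q's (partial) view of the system load

# 16 processors: each update touches only 3 peers instead of 15.
views = {q: {} for q in range(16)}
broadcast_load(5, load=7, views=views, num_procs=16)
print(sending_set(5, 16))                    # [4, 6, 7]
```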
This paper presents a solution to the (processor) group membership problem. The methodology followed in designing the algorithm is to optimize its performance under the assumption that no failure occurs during a run, and to add the appropriate level of fault tolerance by making the algorithm periodic and self-stabilizing. The performance indicators we consider are the number of message exchanges (between neighbor units) and the time needed to reach agreement. The algorithm relies on the existence of three basic distributed services: clock synchronization, reliable diffusion, and local agreement between two neighbors. We do not assume the presence of a reliable datagram service, which is more complex to implement than the former. The presence of a privileged unit (initiator) is avoided, so elections are not needed.
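For orientation, the sketch below shows a deliberately generic periodic membership round built only from the services named in the abstract (local agreement and reliable diffusion appear as stubbed-in callables, and clock synchronization is assumed to trigger the round); it is not the paper's protocol, only the general shape of one.

```python
def membership_round(me, neighbors, locally_agree, diffuse):
    """locally_agree(me, n) -> True if me and neighbor n agree both are alive;
    diffuse(me, msg) -> reliably delivers msg to every non-faulty unit and
    returns the set of views delivered so far this round."""
    alive_links = {n for n in neighbors if locally_agree(me, n)}
    echoes = diffuse(me, ('view', me, frozenset(alive_links)))

    # The new group is every unit reachable through the agreed-alive links.
    edges = {unit: links for (_tag, unit, links) in echoes}
    group, frontier = {me}, [me]
    while frontier:
        u = frontier.pop()
        for v in edges.get(u, ()):
            if v not in group:
                group.add(v)
                frontier.append(v)
    return group

# Stand-alone use with perfect, failure-free stubs.
views = {}
def agree(a, b):          # local agreement stub: every neighbor answers
    return True
def diffuse(src, msg):    # reliable diffusion stub: collect this round's views
    views[src] = msg
    return list(views.values())
print(membership_round('n1', ['n2', 'n3'], agree, diffuse))
```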
The Recursive Diagonal Torus (RDT), a class of interconnection networks, is proposed for massively parallel computers with up to 2^16 nodes. By recursively adding remote links in the diagonal directions of the torus network, the RDT achieves a smaller diameter than the hypercube (e.g., 11 for 2^16 nodes) with a small number of links per node (8 links per node). A simple routing algorithm called vector routing, which is near-optimal and easy to implement, is also proposed. The RDT includes the mesh structure and easily emulates hypercube and tree structures. FFT and the bitonic sorting algorithm are also easy to implement.
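As a simplified stand-in for the construction (not the exact RDT definition: the real network assigns diagonal ranks recursively and routes with the proposed vector routing), the sketch below gives every node its four torus links plus four diagonal links of a single rank r, which jump 2^r steps along both axes at once and account for the 8 links per node mentioned above.

```python
def neighbors(x, y, side, rank):
    """Neighbors of node (x, y) on a side x side torus with rank-r diagonal links."""
    step = 2 ** rank
    torus = [((x + dx) % side, y) for dx in (1, -1)] + \
            [(x, (y + dy) % side) for dy in (1, -1)]
    diagonal = [((x + sx * step) % side, (y + sy * step) % side)
                for sx in (1, -1) for sy in (1, -1)]
    return torus + diagonal            # 8 links per node, as in the abstract

# 256 x 256 = 2^16 nodes; rank-4 diagonals jump 16 steps along each axis.
print(neighbors(0, 0, 256, rank=4))
```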
Simulated annealing is known to be highly sequential due to loop-carried dependencies. This report presents a new approach to parallel simulated annealing, called Generalized Speculative Computation (GSC). While conventional binary speculative computation can execute at most log n iterations in parallel on n processors, GSC can execute n iterations in parallel while maintaining the same decision sequence as a sequential version. To demonstrate the performance of GSC, we implement 100- to 500-city Traveling Salesman Problems on the AP1000 massively parallel machine. Experimental results demonstrate that GSC is highly effective for this problem, giving a nearly 20-fold speedup for an initial temperature of 1 on 100 processors.
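One way to read the speculation pattern (my reading, phrased as a sketch rather than the paper's code) is that each of the n workers evaluates one future annealing step under the assumption that every earlier speculative move was rejected; the steps are then committed in sequential order and speculation is discarded from the first accepted move onward, so the decision sequence matches a purely sequential run.

```python
import math, random

def gsc_anneal(state, energy, propose, temperature, n_workers, rounds, rng):
    for _ in range(rounds):
        # "Parallel" speculation: all workers start from the same current state.
        proposals = [propose(state, rng) for _ in range(n_workers)]
        for cand in proposals:                       # commit in sequential order
            delta = energy(cand) - energy(state)
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                state = cand                         # accept: later speculation is now
                break                                #         stale and is discarded
            # reject: the next worker's assumption still holds, keep committing
    return state

# Toy use: anneal a number toward 0 with unit-step proposals.
rng = random.Random(0)
print(gsc_anneal(10.0, abs, lambda s, r: s + r.choice((-1, 1)),
                 temperature=1.0, n_workers=8, rounds=50, rng=rng))
```

Under this reading, each round costs roughly one iteration of wall-clock work but commits between one and n iterations of the sequential chain, which is where a large speedup on many processors would come from.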
Meerkat is a distributed-memory multicomputer architecture that scales to hundreds of processors. Meerkat uses a two-dimensional passive backplane to connect nodes composed of processors, memory, and I/O devices. The interconnect is conceptually simple, inexpensive to design and build, has low latency, and provides high bandwidth on long messages. However, it does not scale to thousands of processors, does not provide high bandwidth on short messages, and does not provide cache-coherent shared memory. Our hypothesis is that many general-purpose, database, and parallel numerical workloads work well on systems with Meerkat's characteristics. We describe the Meerkat architecture, the niche that Meerkat fills, and the motivation behind our design choices, and give performance results obtained from our hardware prototype and a calibrated simulator.
In this paper we consider three fundamental communication problems on the star interconnection network, namely the multinode broadcast, the single node scattering, and the total exchange. All of these problems are studied under two different assumptions: the assumption that each node can exchange messages of fixed length with one of its neighbors at each time step, or single link availability (SLA), and the assumption that each node can exchange messages of fixed length with all of its neighbors at each time step, or multiple link availability (MLA). All the communication algorithms presented in this paper are based on the construction of spanning trees with special properties on the star network to fit different communication needs.
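For concreteness, recall the standard definition of the star graph used in such communication algorithms (textbook material, not something specific to this paper): nodes are the permutations of n symbols, and two nodes are adjacent exactly when one is obtained from the other by swapping the first symbol with the symbol in position i, for some i between 2 and n. Under SLA a node may use one of these n-1 links per time step; under MLA it may use all of them.

```python
from itertools import permutations

def star_neighbors(node):
    """node: a tuple permutation; returns its n-1 neighbors in the star graph."""
    out = []
    for i in range(1, len(node)):
        swapped = list(node)
        swapped[0], swapped[i] = swapped[i], swapped[0]
        out.append(tuple(swapped))
    return out

# The 4-star: 4! = 24 nodes, each of degree 3.
nodes = list(permutations((1, 2, 3, 4)))
print(len(nodes), ["".join(map(str, v)) for v in star_neighbors((1, 2, 3, 4))])
# 24 ['2134', '3214', '4231']
```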