The main contribution of this work is to fathom the power and flexibility of the Mesh with Hybrid Buses via simulation. We propose two algorithms that perform an O(1) time stepwise simulation of an N-processor dynamic...
详细信息
The main contribution of this work is to fathom the power and flexibility of the Mesh with Hybrid Buses via simulation. We propose two algorithms that perform an O(1) time stepwise simulation of an N-processor dynamic Priority CRCW-PRAM endowed with M memory cells. Our first algorithm uses a Mesh with Hybrid Buses of size max{N, MN/sup /spl epsiv//2/}/spl times/MN/sup /spl epsiv//2/ for some fixed constant /spl epsiv/, 0
This paper proposes a parallel branch and bound algorithm designed for solving the Vehicle Routing Problem (VRP) on NOWs (Networks of Workstations). Our objective is to minimize the execution time by considering paral...
详细信息
This paper proposes a parallel branch and bound algorithm designed for solving the Vehicle Routing Problem (VRP) on NOWs (Networks of Workstations). Our objective is to minimize the execution time by considering parallel implementation to find an exact solution to the VRP in real time. Our experimental studies reveal that the proposed parallel branch and bound algorithm can achieve super-linear speedup for large problem sizes. Dynamic load balancing techniques for solving the VRP on NOWs are also discussed.
The thesis of this research is that the task of exposing the parallelism in a given application should be left to the algorithm designer, who has intimate knowledge of the application characteristics. On the other han...
详细信息
The thesis of this research is that the task of exposing the parallelism in a given application should be left to the algorithm designer, who has intimate knowledge of the application characteristics. On the other hand, the task of limiting the parallelism in a chosen parallel algorithm is best handled by the compiler or operating system for the target MPP machine. Toward this end, we have developed CASS (for Clustering And Scheduling System), a task management system that provides facilities for automatic granularity optimization and task scheduling of parallel programs on distributed memory parallelarchitectures. Our tool environment, CASS, consists of a two-phase method of compiler-lime scheduling in which task clustering is performed prior to the actual scheduling process. The clustering module identifies the optimal number of processing nodes that the program will require to obtain maximum performance on the target parallel machine. The scheduling module maps the clusters onto a fixed number of processors and determines the order of execution of tasks in each processor.
In this paper a new method is proposed to synchronize massively parallel processes in distributed multiprocessor systems. The method is an extension of that used in arbitration systems like Futurebus+. It also uses th...
详细信息
In this paper a new method is proposed to synchronize massively parallel processes in distributed multiprocessor systems. The method is an extension of that used in arbitration systems like Futurebus+. It also uses three global synchronization lines and a distributed synchronizer, and can be applied without changes to the existing hardware. The method allows to carry out two alternative synchronization protocols where the end of the operation is either forced by any processor or individual a moment when all processors completed their individual parts. Application of this method, e.g., to the arbitration process allows to reduce the arbitration time in average by a factor of 2.
Consider a pyramid with n levels and a k-dimensional hypercube, 0/spl les/k/spl les/2n-2. The paper presents a parallel algorithm for embedding large pyramids into smaller hypercubes with load balancing. With dilation...
详细信息
Consider a pyramid with n levels and a k-dimensional hypercube, 0/spl les/k/spl les/2n-2. The paper presents a parallel algorithm for embedding large pyramids into smaller hypercubes with load balancing. With dilation 4, congestion at most 2/sup n-k/2/+4, and load [2/sup 2n-k//3] when k is even, our algorithm embeds the pyramid into the hypercube, otherwise, with the same dilation and load, it has congestion 2/sup n-(k+1)///sup 2+1/+6 when k is odd. The algorithm can be performed in O(k)-bit time.
In this paper we propose a parallel and distributed genetic algorithms (PDGA) on fixed network topology multiprocessor systems in which each processor element carries out genetic operations on its own chromosome set a...
详细信息
In this paper we propose a parallel and distributed genetic algorithms (PDGA) on fixed network topology multiprocessor systems in which each processor element carries out genetic operations on its own chromosome set and communicates with only the neighbors (we say chromosome migration). We execute the proposed method to investigate effects of chromosome migration on the multiprocessor systems with ring, torus, and hypercube topology for benchmark problem instances. From the results, we find that the ring topology is more suitable far our proposed parallel and distributed execution since it avoids immature convergence for its topological feature. We show its effectiveness by experimental evaluation.
With the move towards parallel processing on Clusters of Workstations (COWs) the ability to fully utilise the computational resources of all workstations through the initial placement and movement of workload is desir...
详细信息
With the move towards parallel processing on Clusters of Workstations (COWs) the ability to fully utilise the computational resources of all workstations through the initial placement and movement of workload is desirable by the user of the system, to improve the overall performance. This is a critical task when many users run their parallel programs simultaneously. The ability to have an even spread of load over an entire COW can be achieved through the employment of Global Scheduling. This paper introduces the concept of global scheduling exploiting static allocation and dynamic load balancing along with the implementation of such a facility and support services, remote process creation and process migration, respectively, on the RHODOS distributed operating system. The suitability of the RHODOS implementation is demonstrated by results obtained from initial test runs of a SPMD parallel application.
Non-blocking multithreaded architectures have been proposed as an effective means to overlap computation and communication in distributed memory systems. However, in these models, communication latency could only be h...
详细信息
Non-blocking multithreaded architectures have been proposed as an effective means to overlap computation and communication in distributed memory systems. However, in these models, communication latency could only be hidden from computation as long as there are enough ready threads. We describe our I-Structure Software Cache (ISSC) run-time system, which takes advantage of the global data locality in these models without adding any specific hardware support. Our ISSC provides a communication latency reduction technique in non-blocking multithreaded architectures while maintaining the ability of tolerating communication latency. Our simulation results show that our ISSC run-time system dramatically decreases network traffic by caching remote requests and also achieves 70% to 95% improvement on the system performance (speedup).
High bandwidth and low latency switches are commercially available. Using these switches, it has become a popular and cost-effective way to interconnect workstations together as a parallel computing platform. The inte...
详细信息
ISBN:
(纸本)0818682596
High bandwidth and low latency switches are commercially available. Using these switches, it has become a popular and cost-effective way to interconnect workstations together as a parallel computing platform. The interconnection topology is usually irregular. On such systems, multicast is an important collective communication operation. In this paper, we study how to perform multicast on irregular switch-based networks assuming no special hardware support. Our proposed multicast algorithm is implemented completely at the software level by sending multiple unicast messages. We also prove that it is optimal in terms of the number of communication steps to accomplish the multicast without channel contention in each step. Performance of the proposed algorithm is evaluated by simulation.
In this paper we show how to obtain optimal speedup in a master-slave model for solving data-parallel problems. Given the number of homogeneous workstations, their computation time for solving a basic sub-task of the ...
详细信息
ISBN:
(纸本)0818682596
In this paper we show how to obtain optimal speedup in a master-slave model for solving data-parallel problems. Given the number of homogeneous workstations, their computation time for solving a basic sub-task of the problem, network transmission bandwidth and data volume per basic sub-task, the per-distribution number of basic sub-tasks sent to a slave for attaining the optimal speedup can be decided. The effectiveness of the proposed theory has been tested using a parallel computing experiment involving the Hough transformation.
暂无评论