ISBN: 0769511538 (print)
In this paper, a Parallel Object Collection (POC) model is introduced to support data parallelism in a parallel object-oriented system. The model is based on data partitioning and method replication. To achieve load balancing, partition objects are migrated dynamically at runtime according to the system load. A threshold-based strategy is used for the dynamic load balancing. To avoid over-convergence of load during partition object migration, a new destination node selection algorithm is proposed. The threshold values used in the algorithm are also adjusted adaptively to better reflect load fluctuations during execution. Simulation experiments are conducted to evaluate the performance of the dynamic load-balancing algorithm, and the results are reported and discussed in the paper.
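The abstract does not spell out the selection algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: load is approximated by the number of partition objects on a node, the thresholds are placed a fixed `margin` around the current mean load, and over-convergence is avoided by choosing destinations from projected loads that already include planned migrations. The names `adapt_thresholds`, `plan_migrations`, and `margin` are hypothetical and do not come from the paper.

```python
def adapt_thresholds(loads, margin=0.25):
    """Recompute (low, high) thresholds around the current mean load.

    Assumption: thresholds track the mean; the paper's actual rule is not given.
    """
    avg = sum(loads.values()) / len(loads)
    return avg * (1 - margin), avg * (1 + margin)


def plan_migrations(loads, margin=0.25):
    """Return a list of (source, destination) moves, one partition object each.

    `loads` maps a node name to its number of partition objects.
    """
    low, high = adapt_thresholds(loads, margin)
    projected = dict(loads)  # loads including migrations planned so far
    moves = []
    # Visit the most heavily loaded nodes first.
    for src in sorted(loads, key=loads.get, reverse=True):
        while projected[src] > high:
            # Only nodes whose projected load is still below the low
            # threshold qualify, so one idle node is not flooded.
            candidates = [n for n in projected
                          if n != src and projected[n] < low]
            if not candidates:
                break
            dst = min(candidates, key=projected.get)
            projected[src] -= 1
            projected[dst] += 1
            moves.append((src, dst))
    return moves


if __name__ == "__main__":
    print(plan_migrations({"a": 10, "b": 2, "c": 3, "d": 5}))
```

Using projected rather than last-reported loads is the design choice that matters here: once a lightly loaded node has been promised enough objects to leave the underloaded range, it stops being a candidate.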
User-level communication has been investigated by many researchers as a way to resolve the performance degradation of cluster systems caused by inefficient communication protocols. It removes kernel intervention from the critical communication path. Recently, Intel, Microsoft, and Compaq introduced the Virtual Interface Architecture (VIA), a standard for user-level communication. However, the existing VIA implementation performs poorly when transferring small messages, because it uses a single transfer mechanism regardless of message size. In this paper, we implement a high-performance VIA, KVIA (Kaist VIA). KVIA dynamically selects a suitable transfer mechanism based on the descriptor and the message size. This implementation handles both large and small messages effectively. Thus, it is better suited to systems that frequently use small messages (e.g., lock protocols for software distributed shared memory). The performance of KVIA is reported in terms of round-trip latency and one-way bandwidth. Our results show a round-trip latency of 40 microseconds and a maximum one-way bandwidth of 950 Mbit/s, which is about 74% of the peak bandwidth of a Myrinet link.
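Size-based selection of a transfer mechanism can be sketched as follows. This is an illustration only: the abstract does not name the mechanisms KVIA chooses between, so the two options (`INLINE_COPY`, `DMA`) and the 256-byte cut-off are assumptions made for the example, not details from the paper.

```python
from enum import Enum


class Transfer(Enum):
    # Hypothetical mechanisms, chosen only to illustrate the idea.
    INLINE_COPY = "inline_copy"  # payload travels together with the descriptor
    DMA = "dma"                  # payload is fetched separately by the NIC


SMALL_MESSAGE_LIMIT = 256  # bytes; assumed cut-off, not a figure from the paper


def select_transfer(message_size, limit=SMALL_MESSAGE_LIMIT):
    """Pick a transfer mechanism from the message size alone."""
    if message_size < 0:
        raise ValueError("message size must be non-negative")
    return Transfer.INLINE_COPY if message_size <= limit else Transfer.DMA


if __name__ == "__main__":
    for size in (8, 256, 4096):
        print(size, select_transfer(size).value)
```

In practice the cut-off would be tuned by measuring where the per-message setup cost of the large-message path stops dominating.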
Sorting can be sped up on parallel computers by dividing the data and processing the parts in parallel. Bitonic sorting can be parallelized; however, a large portion of the execution time is consumed by O(log^2 P) exchanges of N/P keys, where P and N are the number of processors and keys, respectively. This paper presents an efficient way of communicating data in bitonic sort that minimizes interprocessor communication and computation time. Before any data movement, each pair of processors exchanges the minimum and maximum of their key lists to determine which keys must be sent to the partner. Very often no keys need to be exchanged, or only a fraction of them. In our experiments on a T3E computer, execution time was reduced by 20% or more. We believe the scheme is a good way to shorten communication time in similar applications.
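The min/max pre-exchange can be illustrated with a single compare-split step between two partners. This is a sketch, not the paper's implementation: it assumes each partner's list is already sorted locally, handles only the ascending direction, and simulates both partners in one process. The function name `compare_split` and the returned `keys_sent` counter are hypothetical.

```python
from bisect import bisect_left, bisect_right
from heapq import merge


def compare_split(low_keys, high_keys):
    """One compare-split step between two partners holding sorted key lists.

    Afterwards the first list holds the smaller keys and the second the
    larger ones, with list sizes preserved. Returns
    (new_low, new_high, keys_sent), where keys_sent counts the keys that
    would cross the link in either direction.
    """
    # Stand-in for the min/max exchange: if max(low) <= min(high), the
    # two lists are already in order and nothing needs to be sent.
    if not low_keys or not high_keys or low_keys[-1] <= high_keys[0]:
        return list(low_keys), list(high_keys), 0

    # Only keys inside the overlapping value range can change owner.
    cut_low = bisect_right(low_keys, high_keys[0])   # keys > min(high)
    cut_high = bisect_left(high_keys, low_keys[-1])  # keys < max(low)
    from_low = low_keys[cut_low:]
    from_high = high_keys[:cut_high]

    merged = list(merge(from_low, from_high))
    k = len(from_low)  # the low partner keeps as many keys as it gave up
    new_low = low_keys[:cut_low] + merged[:k]
    new_high = merged[k:] + high_keys[cut_high:]
    return new_low, new_high, len(from_low) + len(from_high)


if __name__ == "__main__":
    print(compare_split([1, 2, 3], [4, 5, 6]))
    print(compare_split([1, 5, 7], [5, 6, 9]))
```

When the partners' value ranges barely overlap, `keys_sent` is far smaller than the full list length, which is where the communication saving comes from.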
Races can cause unintended nondeterministic execution of parallel programs, so race detection is one of the critical issues in debugging shared-memory parallel programs. On-the-fly race detection techniques have been developed as one approach to the problem. However, they suffer from a large run-time overhead, because the whole execution behavior of the program being debugged must be monitored at run time. In this paper we present a practical loop transformation technique that can significantly reduce the monitoring overhead required for detecting races on the fly in parallel programs. The technique achieves this by transforming each original parallel loop so that the number of iterations to be monitored is minimized. An experimental performance measurement of our technique shows a dramatic improvement in monitoring overhead, and the technique detects more races than traditional on-the-fly techniques do.
We consider two interrelated tasks in a synchronous n-node ring: distributed constant coloring and local communication. Every node knows the labels of nodes up to a distance r from it, called the knowledge radius. In ...
In this paper, we show how to construct n+1 node-disjoint parallel paths between any two nodes of HCN(n,n), which has a better network cost than the hypercube, and as a result prove that the fault diameter of HCN(n,n) is dia(HCN(n,n))+4. These parallel paths can reduce the time needed to transmit messages between nodes, and they mean that even if some nodes of HCN(n,n) fail, there is still no communication delay. Also, by analyzing the fault tolerance of the interconnection network HCN(n,n), we prove that it is maximally fault tolerant.
An n-dimensional hierarchical cubic network (denoted by HCN(n)) contains 2^n n-dimensional hypercubes. The diameter of an HCN(n), which is equal to n+⌊(n+1)/3⌋+1, is about two-thirds the diameter of a comparable hypercube, although it uses about half as many links per node. In this paper, a maximal number of node-disjoint paths are constructed between every two distinct nodes of an HCN(n). Their maximal length has an upper bound of n+⌊n/3⌋+4, which is nearly optimal. The (n+1)-wide diameter and n-fault diameter of an HCN(n) are shown to be n+⌊n/3⌋+3 or n+⌊n/3⌋+4, which are about two-thirds those of a comparable hypercube. Our results reveal that an HCN(n) has shorter node-disjoint paths, wide diameter, and fault diameter than a comparable hypercube.
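The "about two-thirds" comparison can be checked numerically. The sketch below takes the rounding in the formulas as floor (integer division) and assumes that the comparable hypercube is the 2n-dimensional one, since an HCN(n) built from 2^n hypercubes of 2^n nodes has 2^(2n) nodes; neither reading is stated explicitly in the abstract. The function names are hypothetical.

```python
def hcn_diameter(n):
    """Diameter of HCN(n): n + floor((n + 1) / 3) + 1."""
    return n + (n + 1) // 3 + 1


def hcn_path_length_bound(n):
    """Upper bound on node-disjoint path length: n + floor(n / 3) + 4."""
    return n + n // 3 + 4


def hypercube_diameter(dim):
    """Diameter of a dim-dimensional hypercube."""
    return dim


if __name__ == "__main__":
    for n in (6, 30, 300):
        ratio = hcn_diameter(n) / hypercube_diameter(2 * n)
        print(n, hcn_diameter(n), hypercube_diameter(2 * n), round(ratio, 3))
```

The ratio starts above two-thirds for small n because of the additive constant and approaches 2/3 as n grows.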
At this turn of the century the object-oriented (OO) distributed real-time (RT) programming movement is growing rapidly along with the networked embedded systems market. The motivations are reviewed and then a brief ov...
One of the factors that can influence the performance of a DSM system is the efficiency of multipoint access on the interconnection network. This work presents an evaluation of the Brazos System, a software-implemented distributed shared memory (DSM) system that is designed for x86 SMP nodes running Windows NT and takes advantage of multicast communication, under different communication network configurations: broadcast Fast Ethernet, switched Fast Ethernet, and LANE-ATM. In this evaluation, although the bandwidth of ATM is larger than that of Fast Ethernet, the Brazos System performed worse on LANE-ATM than on broadcast or switched Fast Ethernet. This shows the importance of the communication subsystem and the need for improvements in aspects such as multipoint access, which has not been much explored on networks like ATM.
This paper reports on efficient parallel implementations of two-dimensional Delaunay triangulation in High Performance Fortran (HPF) and in Message Passing Interface (MPI). Our parallelization algorithm performs subbl...