Long-lived parallel applications running on workstation clusters are vulnerable to single-node or multiple-node failures. Fault recovery is therefore required to prevent premature program termination. However, much of the runtime overhead imposed by fault tolerance schemes is generally due to the cost of transferring the checkpoint states of applications by disk I/O operations. In this paper, we propose a fault-tolerant model in which checkpoint states are transferred between replicated parallel applications. We also describe how the resource consumption of the replicated applications can be minimized. The fault-tolerant model has been implemented and tested on a workstation cluster and a Fujitsu AP3000 multi-processor machine. Our experimental measurements show that efficient fault tolerance can be achieved by replicating parallel applications on clusters of computers.
ISBN: 3540633715 (print)
The proceedings contain 53 papers. The special focus in this conference is on Parallel Computing Technologies. The topics include: parallel computations on finite partially ordered sets; tight lower bounds for computing shortest paths on proper interval and bipartite permutation graphs; using run-time uncertainty to robustly schedule parallel computation; a tuple-based data structure for distributed parallel processing of 3D dynamic meshes; the application of parallel computation techniques to the solution of certain hydrodynamic stability problems; a formal framework for the analysis of recursive-parallel programs; systematic design of 3-dimensional fixed-size array processors; on proving large distributed systems; influence of self-connection weights on cellular-neural network stability; estimating the parallel start-up overhead for parallelizing compilers; parallel and distributed evolutionary computation with MANIFOLD; and parallel computation of fractal sets with the help of neural networks and cellular automata.
In a distributed-computing environment, it is important to ensure that the processor workloads are adequately balanced. Among numerous load-balancing algorithms, a unique approach due to Das and Prasad defines a symmetric broadcast network (SBN) that provides a robust communication pattern among the processors in a topology-independent manner. In this paper, we propose and analyze three SBN-based load-balancing algorithms, and implement them on an SP2. A thorough experimental study with Poisson-distributed synthetic loads demonstrates that these algorithms are very effective in balancing system load while minimizing processor idle time. They also compare favorably with several existing techniques.
Data parallel languages designed for distributed memory computing environments provide a single global address space to the programmer. The mapping from this global address space to the distributed local address space is performed by a compiler, which bases the mapping on the array distribution format. Thus, each array in a data parallel language program has its own distribution format. The Reshape function changes the array distribution format as well as the array shape. However, the changed distribution format cannot be represented by any distribution format supported in current languages. Because there is no suitable distribution format, the reshaped distribution format must be converted to an existing one, which incurs heavy overhead from the redistribution function. To eliminate the redistribution step in the Reshape function, we have proposed a new distribution format, HIER-CYCLIC, which can represent the reshaped distribution format. We have also proposed a language syntax to use HIER-CYCLIC and a compiling mechanism. Finally, we performed an experiment on an IBM-SP2 machine using a shift function.
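The global-to-local mapping described in this abstract can be illustrated with the standard BLOCK and CYCLIC(k) formats. The sketch below is not the HIER-CYCLIC format, whose definition the abstract does not give. The function names, the row-major element order, and the toy sizes are assumptions made for illustration only.

```python
def block_owner(i, n, p):
    """BLOCK: n elements split into contiguous chunks of ceil(n/p) per processor.

    Returns (processor, local index) for global index i.
    """
    chunk = -(-n // p)  # ceiling division
    return i // chunk, i % chunk


def cyclic_owner(i, p, k=1):
    """CYCLIC(k): blocks of k elements dealt round-robin to p processors.

    Returns (processor, local index) for global index i.
    """
    block = i // k
    proc = block % p
    local = (block // p) * k + i % k
    return proc, local


def reshaped_owner_grid(shape, p):
    """Owner of each element when a 1-D CYCLIC array is viewed as a 2-D array.

    Assumes row-major order (an assumption; Fortran-style languages are
    column-major), so element (r, c) is global index r * cols + c.
    """
    rows, cols = shape
    return [[cyclic_owner(r * cols + c, p)[0] for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    # 12 elements, CYCLIC over 3 processors, reshaped to 3 x 4.
    for row in reshaped_owner_grid((3, 4), 3):
        print(row)
```

In this toy case the owner of element (r, c) is (r + c) mod 3, a diagonal pattern that depends on both indices. A per-dimension BLOCK or CYCLIC layout over three processors cannot express it, which gives one way to see why a reshaped array may need either redistribution or a richer format.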
ISBN: 0818680679 (print)
This paper describes a new protocol that helps the user build reliable distributed applications with file operations. Our file checkpointing and recovery protocol is designed to consistently checkpoint and recover user files with respect to the volatile state of the distributed program. Based on the protocol, a file I/O interface has been implemented as part of our Libra library for supporting fault tolerance in distributed applications. File operations are done using this interface, while the complexity of checkpointing and recovering user files is hidden from the application level: the checkpointing and recovery of user files are done automatically.
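The idea of keeping a user file consistent with the program's volatile state can be sketched with a plain shadow-copy scheme. This is a generic technique chosen for illustration, not the protocol or the Libra interface described in the abstract. The class name `CheckpointedFile` and all of its methods are hypothetical.

```python
import os
import shutil
import tempfile


class CheckpointedFile:
    """Hypothetical wrapper that snapshots a file together with program state."""

    def __init__(self, path):
        self.path = path
        self.shadow = path + ".ckpt"
        self.state = {"writes": 0}        # stands in for volatile program state
        self._saved_state = dict(self.state)
        open(path, "a").close()           # make sure the file exists
        self.checkpoint()

    def append(self, text):
        with open(self.path, "a") as f:
            f.write(text)
        self.state["writes"] += 1

    def read(self):
        with open(self.path) as f:
            return f.read()

    def checkpoint(self):
        # Copy to a temporary name first, then rename, so a crash during
        # the copy never leaves a half-written shadow file.
        tmp = self.shadow + ".tmp"
        shutil.copyfile(self.path, tmp)
        os.replace(tmp, self.shadow)
        self._saved_state = dict(self.state)

    def recover(self):
        # Roll the file and the in-memory state back to the same checkpoint.
        shutil.copyfile(self.shadow, self.path)
        self.state = dict(self._saved_state)


def demo():
    with tempfile.TemporaryDirectory() as d:
        f = CheckpointedFile(os.path.join(d, "log.txt"))
        f.append("a")
        f.checkpoint()
        f.append("b")                     # work done after the checkpoint
        f.recover()                       # simulated failure and rollback
        return f.read(), f.state["writes"]
```

Calling `demo()` rolls back the write made after the checkpoint, so the file content and the write counter agree again. A real protocol would also have to coordinate checkpoints across processes, which this single-process sketch leaves out.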
With recent advances in communication technology and the availability of powerful desktop computers, networking has gained popularity and many applications are being moved onto the Internet. To ease the development of distributed applications, software support to facilitate coordination and communication is needed. This paper describes an object-oriented system for structured design and development of distributed applications. The basic system consists of a set of multithreaded servers, one server for each site in the network, which provide some basic communication facilities. The system has been developed in Java. We use the software support provided by this basic system to define commonly used patterns of interaction in distributed applications. We also identify several different techniques for systematic composition of patterns to develop different applications. We illustrate the use of our system by defining some patterns and using them to build a sample application.
The proceedings contain 137 papers on High Performance Computing on the Information Superhighway. Topics discussed include: stock processors; multithreaded parallel machines; Hamiltonian cycles; hypercube graphs; distributed shared memory; doubly chordal graphs; three-dimensional virtual space; Voronoi diagram; computed tomography; hierarchical bus-based systems; homogenization method; task parallel language; single program multiple data; process communication graph; visual programming; tracing systems; workstation clustering; Crout factorization; hybrid full map directory schemes; geostationary satellites; and multistage interconnection networks.
The use of modular visualization and immersive virtual environments to inspect and analyze multiple geostatistical descriptions of a reservoir is presented. Constrained geostatistical simulations are performed in parallel on an IBM SP2 and displayed in real time either on a graphics workstation or in an immersive environment. The rendering and volume manipulation are performed using a modular visualization environment. Realizations are rendered in a distributed computing environment by defining execution groups of visualization tasks, each assigned to a set of workstations, that can be created or modified interactively through a visual modular program. Geostatistical realizations are transferred to the execution groups of the visualization environment by message passing.
The reliability of a distributed database system is the probability that a program which runs on multiple processing elements and needs to communicate with other processing elements to access remote databases will be executed successfully. This reliability varies according to 1) the topology of the distributed database system, 2) the reliability of the communication links, 3) the distribution of databases and programs among processing elements, and 4) the databases required to execute a program. This paper shows that solving this reliability problem is NP-hard even when the distributed database system is restricted to a series-parallel, a 2-tree, a tree, or a star structure. Two polynomial-time algorithms are proposed for computing the reliability of a distributed program which runs on a linear and on a ring distributed database system, respectively.
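The reliability measure defined in this abstract can be made concrete with a brute-force calculation over all link states. The sketch below is not one of the polynomial-time algorithms the paper proposes. It assumes that nodes never fail, that links fail independently, and that the program runs on a single node. The function names and the data layout are hypothetical.

```python
from itertools import product


def reachable(n, up_edges, start):
    """Set of nodes reachable from start over the links that are up."""
    adj = {v: [] for v in range(n)}
    for u, v in up_edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def program_reliability(n, links, program_node, db_copies):
    """Probability that program_node can reach a copy of every required database.

    links:     {(u, v): probability that the link is up}
    db_copies: {database name: set of nodes holding a copy}
    Enumerates all 2**len(links) link states, so it only suits tiny networks.
    """
    edges = list(links)
    total = 0.0
    for state in product([True, False], repeat=len(edges)):
        prob = 1.0
        up = []
        for edge, ok in zip(edges, state):
            prob *= links[edge] if ok else 1.0 - links[edge]
            if ok:
                up.append(edge)
        seen = reachable(n, up, program_node)
        if all(seen & nodes for nodes in db_copies.values()):
            total += prob
    return total


if __name__ == "__main__":
    linear = {(0, 1): 0.9, (1, 2): 0.9}
    ring = {(0, 1): 0.9, (1, 2): 0.9, (2, 0): 0.9}
    print(program_reliability(3, linear, 0, {"A": {2}}))
    print(program_reliability(3, ring, 0, {"A": {2}}))
```

On the three-node linear network the program needs both links, giving 0.9 × 0.9 = 0.81. The ring adds a direct link as a second route, which raises the value to 1 − 0.1 × 0.19 = 0.981. The enumeration grows exponentially with the number of links, which is why algorithms specialized to a given topology matter.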
ISBN: 078034278X (print)
Creating comprehensive simulation models can be expensive and time consuming. This paper discusses our efforts to develop a general methodology that will allow users to quickly and efficiently create high-fidelity simulation models by linking independent model objects distributed across the Internet or enterprise intranets. The result of linking these models is a model network that can be used to evaluate the aggregate performance of the system as well as investigate the interactions and performance of the individual component models. Our approach for creating a plug-and-play model integration environment is based on the principles of object-oriented programming and distributed object computing. Drawing on advances in language and network communication technology, we continue to refine an early proof-of-concept prototype called ENVISION (ENVironment for Integrating Simulation models Interactively Over Networks). The primary objective is to create a testbed system that will help us better understand how manufacturers might actually use this type of modeling facility if it were available.