Results are presented showing parallel implementations of domain-based preconditioners used in conjunction with a Newton-Krylov solver for calculating natural convection in a square cavity. Newton-Krylov techniques are based on the use of Newton's method to linearize the discrete equations and a Krylov projection method to solve the resulting linear systems. The calculations are based on a finite volume discretization of the incompressible Navier-Stokes equations and an energy equation in primitive-variable form on a staggered grid. Viability of the Newton-Krylov technique often depends on the effectiveness of the preconditioner. Consequently, effective preconditioning can be the most CPU- and memory-intensive operation within the solution algorithm. For these reasons, domain-decomposition-based preconditioners are used because of their inherent parallelism. Results are presented for strip-wise, domain-based preconditioners on two different computational architectures: a single CPU and a distributed-computing cluster. These parallel results are compared and contrasted with the use of global, incomplete lower-upper (ILU) factorization-type preconditioners in a serial implementation.
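As a concrete illustration of the solver structure described in this abstract, the sketch below shows a Newton iteration whose linear systems are solved by preconditioned GMRES. It is a minimal Python sketch: the residual function, Jacobian-vector product, and preconditioner solve are placeholder callables supplied by the caller, not the paper's finite-volume discretization or its domain-based preconditioners.

    # Minimal Newton-Krylov sketch: Newton linearization + preconditioned GMRES.
    # F, J_matvec, and M_solve are illustrative placeholders, not the paper's code.
    import numpy as np
    from scipy.sparse.linalg import gmres, LinearOperator

    def newton_krylov(F, J_matvec, M_solve, u0, tol=1e-8, max_newton=20):
        u = u0.copy()
        for _ in range(max_newton):
            r = F(u)
            if np.linalg.norm(r) < tol:
                break
            n = u.size
            J = LinearOperator((n, n), matvec=lambda v, u=u: J_matvec(u, v))
            M = LinearOperator((n, n), matvec=M_solve)   # e.g. ILU or domain-based solve
            du, _ = gmres(J, -r, M=M)                    # Krylov projection step
            u = u + du
        return u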
Parallel computing on clusters of workstations is receiving much attention from the research community. Unfortunately, many aspects of parallel computing on this kind of computing engine are not very well understood. Some of these issues include the workstation architectures, the network protocols, the communication-to-computation ratio, the load-balancing strategies, and the data-partitioning schemes. The aim of this paper is to assess the strengths and limitations of a cluster of workstations by capturing the effects of the above issues. This has been achieved by evaluating the performance of this computing environment in the execution of a parallel ray tracing application, through analytical modeling and extensive experimentation.
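The communication-to-computation ratio mentioned above can be captured with a back-of-the-envelope model of the kind sketched below. The parameters (per-pixel render time, per-tile message cost) and the formula are illustrative assumptions, not the authors' analytical model.

    # Toy speedup model for image-tile-parallel ray tracing on a cluster.
    def predicted_speedup(n_workers, n_pixels, t_pixel, n_tiles, t_msg):
        serial = n_pixels * t_pixel                  # single-workstation render time
        compute = serial / n_workers                 # perfectly balanced tiles
        comm = (n_tiles / n_workers) * t_msg         # one result message per tile
        return serial / (compute + comm)

    # Example: 16 workstations, 1e6 pixels at 5 us each, 512 tiles, 2 ms per message.
    print(predicted_speedup(16, 1_000_000, 5e-6, 512, 2e-3))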
For a parallelizing compiler, mainly based on loop transformations, dependence information that is as complete and precise as possible is required. In this paper, we propose a generalized method for computing, in any multi-dimensional loop, information which has proved to be useful in the case of irregular dependences. First, we solve the basic problem of the existence of a dependence with an algorithm composed of a preprocessing phase of reduction and an integer simplex resolution. If a solution exists, we compute by integer simplex the bounds of the distances associated with the loop indices. Depending on the values of these bounds, we finally define problems consisting of evaluating the bounds of the slopes of the dependence vectors, which we solve by integer linear fractional programming. The amount of computation for each new problem is very low. This algorithm has been implemented as an extension of the Janus Test, which was presented in a previous work.
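To make the dependence-existence question concrete, the toy example below checks it for a made-up pair of array subscripts by brute-force enumeration and also reports the bounds of the distances along each loop index. The subscripts and bounds are illustrative only; the paper answers these questions exactly with an integer simplex and integer linear fractional programming rather than enumeration.

    # Do a write to A[2*i1 + j1] and a read of A[i2 + 2*j2] ever touch the
    # same element within 0 <= i, j < N?  (Illustrative subscripts only.)
    N = 10
    hits = [(i1, j1, i2, j2)
            for i1 in range(N) for j1 in range(N)
            for i2 in range(N) for j2 in range(N)
            if 2 * i1 + j1 == i2 + 2 * j2]
    if hits:
        di = [i2 - i1 for i1, _, i2, _ in hits]   # distance along the i loop
        dj = [j2 - j1 for _, j1, _, j2 in hits]   # distance along the j loop
        print("dependence exists, di in", (min(di), max(di)),
              "dj in", (min(dj), max(dj)))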
Recent advances in Internet connectivity and implementations of safer distributed computing through languages such as Java provide the foundation for transforming computing resources into tradable commodities. We have developed Javelin, a Java-based prototype of a globally distributed, heterogeneous, high-performance computational infrastructure that conveniently enables rapid execution of massively parallel applications. Our infrastructure consists of three entities: hosts, clients, and brokers. Our goal is to allow users to buy and sell computational power, using supply-and-demand market mechanisms to marshal computational power far beyond what can be achieved via conventional techniques. Several research issues must be worked out to make this vision a reality: allocating resources between computational objects via market mechanisms; expressing and enforcing scheduling and quality-of-service constraints; modeling programming in a global computing ecosystem; supporting heterogeneous execution without sacrificing computational speed; ensuring host security; global naming and communication; and client privacy.
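A schematic sketch of the three-role structure (hosts offering cycles, clients submitting work, a broker matching the two) is given below. It illustrates the roles only, with hypothetical names, and is not Javelin's actual Java protocol or market mechanism.

    # Roles only: hosts register idle capacity, clients submit tasks,
    # the broker pairs them first-come first-served.
    from collections import deque

    class Broker:
        def __init__(self):
            self.idle_hosts = deque()
            self.pending_tasks = deque()

        def register_host(self, host_id):
            self.idle_hosts.append(host_id)
            self._match()

        def submit_task(self, task):
            self.pending_tasks.append(task)
            self._match()

        def _match(self):
            while self.idle_hosts and self.pending_tasks:
                host = self.idle_hosts.popleft()
                task = self.pending_tasks.popleft()
                print(f"dispatching {task!r} to {host}")

    broker = Broker()
    broker.register_host("host-A")
    broker.submit_task("piecework job 17")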
The molecular mechanical potential function is widely used in molecular modeling and simulation research. The most CPU-time-consuming parts of the molecular mechanical potential are the nonbonding interaction terms. An efficient parallel algorithm for nonbonding energy calculation is outlined and its implementation is tested on a variety of parallel and distributed processing elements. Because only minimal parallel constructs are added, the current implementation neither modifies nor slows down the serial algorithm. Load balancing flexible enough to accommodate the local load of each PE, and optimization of the list-updating procedure, are desired and under development.
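The kind of decomposition described above can be pictured as in the sketch below: the nonbonded pair list is split across processing elements and the partial energies are summed. The reduced-unit Lennard-Jones term and the process-pool "PEs" are illustrative assumptions, not the paper's potential function or hardware.

    import math
    from multiprocessing import Pool

    def block_energy(args):
        coords, pairs = args
        e = 0.0
        for i, j in pairs:
            r = math.dist(coords[i], coords[j])
            e += 4.0 * ((1.0 / r) ** 12 - (1.0 / r) ** 6)   # reduced-unit LJ term
        return e

    def nonbonded_energy(coords, pairs, n_pe=4):
        # Round-robin split of the pair list; a real code would balance per-PE load.
        blocks = [pairs[k::n_pe] for k in range(n_pe)]
        with Pool(n_pe) as pool:
            return sum(pool.map(block_energy, [(coords, b) for b in blocks]))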
The multithreaded processor, called Rhamma, uses a fast context switch to bridge latencies caused by memory accesses or by synchronization operations. Load/store, synchronization, and execution operations of different threads of control are executed simultaneously by appropriate functional units. A fast context switch is performed whenever a functional unit comes across an operation that is destined for another unit. The overall performance depends on the speed of the context switch. We present two techniques to reduce the context switch cost to at most one processor cycle: a context switch is explicitly coded in the opcode, and a context switch buffer is used. The load/store unit shows up as the principal bottleneck. We evaluate four implementation alternatives of the load/store unit to increase processor performance.
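A toy cycle-count model of the switch-on-operation idea is sketched below; the latency values and the optimistic overlap assumption are illustrative, not Rhamma's microarchitecture, but they show why a one-cycle context switch pays off when memory operations can be handed to the load/store unit.

    MEM_LATENCY = 20   # cycles a load/store occupies the load/store unit (assumed)
    SWITCH_COST = 1    # context switch cost targeted by the two techniques above

    def cycles_blocking(trace):
        # One thread that stalls on every memory operation.
        return sum(MEM_LATENCY if op == "mem" else 1 for op in trace)

    def cycles_switching(traces):
        # Optimistic bound: memory latency fully overlapped with other threads'
        # execution operations, at the price of one switch per memory operation.
        exec_ops = sum(op == "exec" for t in traces for op in t)
        mem_ops = sum(op == "mem" for t in traces for op in t)
        return exec_ops + mem_ops * SWITCH_COST

    trace = ["exec"] * 10 + ["mem"] + ["exec"] * 10 + ["mem"]
    print(cycles_blocking(trace) * 4, cycles_switching([trace] * 4))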
Fast and efficient communication is one of the major design goals not only for parallel systems but also for clusters of workstations. The proposed model of the high-performance communication device ATOLL features very low latency for the start of communication operations and reduces the software overhead for communication-specific functions. To close the gap between off-the-shelf microprocessors and the communication system, a highly sophisticated processor interface implements atomic start of communication, MMU support, and a flexible event-scheduling scheme. The interconnectivity of ATOLL, provided by four independent network ports combined with cut-through routing, allows the configuration of a large variety of network topologies. A software-transparent error correction mechanism significantly reduces the required protocol overhead. The presented simulation results promise high-performance and low-latency communication.
Armstrong III is a multi-node multicomputer designed and built at the Laboratory for Engineering Man/Machine Systems (LEMS) at Brown University. Each node contains a RISC processor and reconfigurable resources implemented with FPGAs. The primary benefit of using FPGAs is that the resulting hardware is neither rigid nor permanent but is in-circuit reprogrammable. This allows each node to be tailored to the computational requirements of an application. This paper describes the Armstrong III architecture and concludes with a substantive example application that performs HMM training for speech recognition on the reconfigurable platform.
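The abstract does not spell out the training kernel; as an illustration of the kind of computation that such an application maps onto reconfigurable hardware, the sketch below gives the forward recursion that dominates Baum-Welch HMM training. It is a plain software version, not Armstrong III's hardware mapping.

    import numpy as np

    def forward(pi, A, B, obs):
        """alpha[t, s] = P(o_1..o_t, state_t = s) for an HMM with start
        probabilities pi (S,), transitions A (S, S), emissions B (S, V)."""
        T, S = len(obs), len(pi)
        alpha = np.zeros((T, S))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return alpha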
ISBN (print): 0818682272
Two parallel sorting algorithms, GENERAL-BS and MINIMIZING-BS, which are implemented on shared-memory parallel computers, are presented in this paper. A parity strategy is introduced which gives an idea for the efficient usage of the local memory associated with each processor. The number of network accesses (or communications) of the algorithm MINIMIZING-BS is reduced by approximately one half compared with the algorithm GENERAL-BS. By decreasing communication in this way, the algorithm MINIMIZING-BS achieves a significant improvement in performance.
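Assuming "BS" abbreviates bitonic sort, the sketch below shows the compare-exchange network that such shared-memory algorithms parallelize: every pass of the inner loop is a round of independent compare-exchanges that can be distributed over processors. The parity strategy and the local-memory optimization of MINIMIZING-BS are not reproduced here.

    def bitonic_sort(a):
        """In-place bitonic sorting network; len(a) must be a power of two."""
        n = len(a)
        k = 2
        while k <= n:
            j = k // 2
            while j > 0:
                # Each iteration over i below is an independent compare-exchange.
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        if (a[i] > a[partner]) == ascending:
                            a[i], a[partner] = a[partner], a[i]
                j //= 2
            k *= 2
        return a

    print(bitonic_sort([7, 3, 6, 0, 5, 1, 4, 2]))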
This paper describes the integration of nested data parallelism into Fortran 90. Unlike flat data parallelism, nested data parallelism directly provides means for handling irregular data structures and certain forms of control parallelism, such as divide-and-conquer algorithms, thus enabling the programmer to express such algorithms far more naturally. Existing work deals with nested data parallelism in a functional setting, which helps avoid a set of problems but makes efficient implementation more complicated. Moreover, functional languages are not readily accepted by programmers used to languages such as Fortran and C, which are currently predominant in programming parallel machines. In this paper, we introduce the imperative data-parallel language Fortran 90V and give an overview of its implementation.
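The divide-and-conquer pattern mentioned above can be illustrated as follows. This is a Python sketch of the nested-parallel semantics only (an elementwise operation whose body contains further parallel work on irregularly sized pieces), not Fortran 90V syntax.

    def quicksort(xs):
        # Each comprehension is an elementwise ("apply-to-each") operation, and the
        # map over the two recursive calls could itself run in parallel in a nested
        # data-parallel language, despite the sublists' irregular sizes.
        if len(xs) <= 1:
            return xs
        pivot = xs[len(xs) // 2]
        lesser = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        sorted_parts = [quicksort(part) for part in (lesser, greater)]
        return sorted_parts[0] + equal + sorted_parts[1]

    print(quicksort([5, 3, 8, 1, 9, 2, 3]))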