The exploitation of the effectiveness of parallel computers for computational fluid dynamics (CFD) requires an efficient parallel algorithm for numerical solution of the Navier-Stokes equations. This paper discusses t...
详细信息
The exploitation of the effectiveness of parallel computers for computational fluid dynamics (CFD) requires an efficient parallel algorithm for numerical solution of the Navier-Stokes equations. This paper discusses the parallelization of an incompressible Navier-Stokes equation solver, and its implementation to the Cenju-3 parallel computer. We mainly focus our attention to the parallel solution of the discrete Poisson equations, which solution consumes the most of the computational time. Two parallel linear solvers are implemented based on a Rectangular Domain Decomposition Method (RDDM) and on the SMAC scheme. A parallel linear solver implemented is the Blocked MICCG (B-MICCG) method introduced by Washio and Hayami. The other linear solver is the Multi-Grid preconditioned Bi-CGSTAB(MG-Bi-CGSTAB) method. Numerical experiments has been conducted on Cenju-3 for up to 25 processors. Although the MG-Bi-CGSTAB method produces poorer speedup than the B-MICCG method, the MG-Bi-CGSTAB method is about 4 times faster than the B-MICCG method in terms of the total execution time.
This paper describes the design and implementation of a comprehensive operating system (COS) for highly parallel machine. This is one of the research themes in the Japanese University Highly parallel Research Project ...
详细信息
This paper describes the design and implementation of a comprehensive operating system (COS) for highly parallel machine. This is one of the research themes in the Japanese University Highly parallel Research Project supported by the Special Grant of the Research Funds of Ministry of Education. The target machine is called JUMP-1 originally developed in this project. Several design policies are discussed, and the user service classes are defined. COS architecture is based on the micro kernel, and it includes both Partition OS and Service OS. The development environment is also discussed briefly.< >
Mathematical modeling of various physical phenomena, ranging from fluid flow to electromagnetic field dynamics, from thermal patterns inside a semiconductor wafer to thermal patterns on a geophysical scale, and from c...
详细信息
Mathematical modeling of various physical phenomena, ranging from fluid flow to electromagnetic field dynamics, from thermal patterns inside a semiconductor wafer to thermal patterns on a geophysical scale, and from combustion inside an automotive cylinder to ion transport across a biological membrane, is a challenge for the 1990s and beyond. These processes share in common a basic underlying structure in that they are described by systems of partial differential equations. This paper proposes parallel processing concepts for the real time simulation of such physical phenomena. Specifically, we present a mathematic-physical engine (MPE) to solve in real time, and to display three-dimensionally, the solution of sets of ordinary or partial differential equations. In fact, different versions of the MPE are discussed ranging from a single commercial parallel machine, to networked supercomputers, to a stacked-wafer dedicated machine - to suit the needs of scientists/engineers around the world;High speed networking is proposed for remote usage of an expensive MPE facility. Such a capability can enhance the ability of scientists and engineers to accelerate basic research on the one hand, and product development on the other.
This paper investigates architectural and communication issues in designing heterogeneous parallel systems. The emerging high speed passive star-coupled optical interconnection using wavelength division multiplexing i...
详细信息
This paper investigates architectural and communication issues in designing heterogeneous parallel systems. The emerging high speed passive star-coupled optical interconnection using wavelength division multiplexing is considered as the system interconnect to provide high bandwidth communication demanded by heterogeneous systems. A representative master-slave computational model together with two different task scheduling strategies are used to evaluate and determine suitable communication strategy for such systems. It is shown that a hierarchical ALOHA-based communication strategy between the master and the slaves, implemented on top of the passive star-coupled network, leads to a considerable reduction in contention and task completion time compared to the direct communication from the slaves to the master. The time division multiplexed access (TDMA) approach has performance comparable to the hierarchical strategy. Depending on the application characteristics, the number of levels in the hierarchal scheme can be chosen to provide a balance between channel contention and overall execution time. These results provide guidelines for designing scalable heterogeneous systems in near future.
This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across p...
详细信息
ISBN:
(纸本)9780897916714
This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance *** implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally *** code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 30% of the peak machine rate. Preliminary timings compare favorably with other approaches.
Communication latency is central to multiprocessor design. This report presents the design principles of EM-X multiprocessor towards tolerating communication latency. Multi-threading principle is built in the EM-X to ...
详细信息
Communication latency is central to multiprocessor design. This report presents the design principles of EM-X multiprocessor towards tolerating communication latency. Multi-threading principle is built in the EM-X to overlap communication and computation for latency tolerance. In particular, we present two types of hardware support for remote memory access: (1) priority-based packet scheduling for thread invocation, and (2) direct remote memory access mechanism. The priority-based scheduling policy extends a FIFO ordered thread invocation policy to adopt to different computational needs. The direct remote memory access based on non-preemptive thread execution is designed to overlap remote memory operations while executing threads. We give two examples to explicate our approach. The 80-processor prototype of EM-X is currently being fabricated and is expected to be operational in the near future. Preliminary evaluation indicates that the EM-X can effectively overlap computation and communication toward tolerating communication latency for high performance parallel computing.
We first present a unified formulation to three problems in VLSI physical design: Layout compaction, Wire balancing and Integrated layout compaction and wire balancing problems. The aim of layout compaction is to achi...
详细信息
We first present a unified formulation to three problems in VLSI physical design: Layout compaction, Wire balancing and Integrated layout compaction and wire balancing problems. The aim of layout compaction is to achieve minimum chip width. Whereas wire balancing seeks to achieve minimum total wire length, integrated layout compaction and wire balancing seeks to minimize wire length maintaining the chip width at the optimum value. Our formulation is in terms of the dual transshipment problem. We then review our recent work on a parallel algorithm for the dual transshipment problem. We show how this algorithm called Modified Network Dual Simplex Method provides a unified approach to solve the three problems mentioned above and present experimental results. Our implementations have been on the BBN Butterfly machine. We draw attention to certain rather unusual results and argue that if the MNDS method is used, then integrated layout compaction and wire balancing will achieve minimum chip width and a total wire length close to the optimum achieved by the wire balancing algorithm.
programming SIMD hardware to interpret (in parallel) programs and data resident in each PE is a technique for obtaining a cost effective, massively parallel MIMD processing environment. Although heavily dependent on e...
详细信息
ISBN:
(纸本)0818656026
programming SIMD hardware to interpret (in parallel) programs and data resident in each PE is a technique for obtaining a cost effective, massively parallel MIMD processing environment. Although heavily dependent on each application that is interpreted, the performance of the synthesized MIMD environment is greatly influenced by the organization of the instruction interpreter. For example, it is possible to delay the interpretation of infrequent operations to improve the overall performance of the MIMD processing environment. Interpreters that attempt improved MIMD performance by deferring infrequent operations are called variable issue control loops. This paper examines the construction of optimized variable issue control loops. In particular, we study the problem of building control loops that optimize either (i) instruction throughput or (ii) PE utilization and present three heuristic algorithms for deriving near optimal variable issue control loops.
The proceedings contain 79 papers. The special focus in this conference is on algorithms and Computation. The topics include: Quasi-random combinatorial structures;Hamiltonian abstract voronoi diagrams in linear time;...
ISBN:
(纸本)9783540583257
The proceedings contain 79 papers. The special focus in this conference is on algorithms and Computation. The topics include: Quasi-random combinatorial structures;Hamiltonian abstract voronoi diagrams in linear time;every set of disjoint line segments admits a binary tree;the traveling cameraman problem, with applications to automatic optical inspection;on determining the congruity of point sets in higher dimensions;computing solutions uniquely collapses the polynomial hierarchy;verifying the determinant in parallel;hardness of approximating graph transformation problem;randomized wait-free naming;unifying themes for network selection;robust asynchronous algorithms in networks with a fault detection ring;distributed algorithms for finding small cycle covers in arbitrary networks;a neural network model for quadratic programming with simple upper and lower bounds and its application to linear programming;analysis of the convergency of topology preserving neural networks on learning;on helping and interactive proof systems;average cost to produce partial orders;two notes on low-density subset sum algorithm;finitely representable spatial objects and efficient computation;computing all small cuts in undirected networks;edge-coloring and f-coloring for various classes of graphs;edge-disjoint paths in permutation graphs;algorithmic chernoff-hoeffding inequalities in integer programming;approximate pattern matching with samples;an efficient optimal algorithm for minimizing the overall communication cost in replicated data management;realizing degree sequences in parallel;a linear-time algorithm for computing characteristic strings;fast deterministic hot-potato routing on processor arrays and optimal routing of permutations on rings.
An extension to the Common Factor Algorithm, CFA, to compute Discrete Fourier Transform, DFT, under the condition that the size of the transform is N = M2, shows that the input and output data array of the transform m...
详细信息
An extension to the Common Factor Algorithm, CFA, to compute Discrete Fourier Transform, DFT, under the condition that the size of the transform is N = M2, shows that the input and output data array of the transform may have identical index mapping. A simple planar 2-dimensional systolic array implementation of CFA algorithm is presented. The systolic array consists of N homogeneous Processing Element, PE. A DFT of size N = M2 can be computed in 2M + 1 steps of pipelined operations, achieving the Area-time complexity AT2 = O(N2log3 N). Asymptotically sub-optimal and without the necessity of complicated index mapping and data shuffling, the proposed approach is compared favorably with other existing approaches in realistic VLSI implementation. This architecture has also very good expansibility that a 2t N-size DFT transform can be computed on 2t nearest-neighbor connected N-size array with reloaded twiddle factors, which makes it more suitable for VLSI implementation of DFT transform in various practical size.
暂无评论