Simulation is an application area in which high-speed computation is critical. With the arrival of massively parallel computers, it is now possible to run very large and complicated simulations without sacrificing accuracy or simplifying the problem. For problems in which the computation at each element depends on the data of all other elements, a completely connected network is required to simulate with high efficiency. On massively parallel computers, however, efficient simulation of such problems is difficult to achieve because of their network structure. This paper describes a new network topology that handles these problems with high efficiency, together with a computation method based on that topology.
ISBN: 0818674601 (print)
Wormhole routing is a key technique in the design of the Dawning 1000, the first MPP system made in China. This paper introduces wormhole routing and describes an algorithm based on it, a chip architecture, and the chip's logic design. Finally, an interconnection network built from the wormhole routing chips is proposed for the Dawning 1000; it indicates that the chip functions correctly and is fast and reliable.
ISBN: 0818674601 (print)
In VLIW (Very Long Instruction Word) compilers, one of the most important issues is how to handle conditional branches, because conditional branches cause control dependences that limit the scope of scheduling. This paper proposes an efficient method of eliminating conditional branches. We use SSA (Static Single Assignment) information to preserve semantics. With our method, global scheduling techniques can be applied more efficiently and simply. We use φ-functions aggressively, so computations for code motion are not required, and no complex hardware support is needed. Our scheme also makes performance independent of branch outcomes.
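The abstract does not spell out the transformation, so the following is only a generic illustration of branch elimination, not the paper's method. It assumes that both arms of a branch are computed unconditionally under distinct SSA names and that the φ-function at the join point is realized as a branch-free select. The function names and the arithmetic form of `select` are hypothetical.

```python
def abs_diff_branching(a, b):
    # Original form: the value of x depends on a conditional branch.
    if a > b:
        x = a - b
    else:
        x = b - a
    return x


def select(cond, if_true, if_false):
    # Branch-free select for integers; cond must be 0 or 1.
    return cond * if_true + (1 - cond) * if_false


def abs_diff_branch_free(a, b):
    # Both arms are computed unconditionally under separate SSA names.
    x1 = a - b              # value from the "then" arm
    x2 = b - a              # value from the "else" arm
    c = int(a > b)          # branch condition as a data value
    x3 = select(c, x1, x2)  # plays the role of x3 = phi(x1, x2)
    return x3
```

Because both arms always execute, the running time of the branch-free version does not depend on which way the condition goes, which is the property the last sentence of the abstract refers to.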
The paper introduces a programming environment for HPF-like languages with emphasis on graphical support for data distribution. A novel component of this environment is a mapping design and visualization tool. The tool visualizes HPF array objects such as data arrays and logical processor arrays, and creates a number of diagrams from information gathered from other components of the environment, such as the compiler or a debugger. The diagrams relate to crucial issues such as load distribution and communication. Furthermore, we show how our environment facilitates seamless integration of additional components.
The doubly-linked list (DLL) protocol provides a memory-efficient, scalable, high-performance and yet easy-to-implement method for maintaining memory coherence in distributed shared memory (DSM) systems. This paper presents a performance analysis of the DLL family of protocols. Theoretically, the DLL protocol with stable owners has the shortest remote memory access latency in the DLL protocol family. According to the simulated performance evaluation, the DLL-S protocol is 65.7% faster than the DDM algorithm for the linear equation solver and 16.5% faster for the matrix multiplier. The trend of the performance figures suggests that the improvement due to the DLL-S protocol will be considerably greater when a larger number of processors is used, indicating that the DLL-S protocol is also the most scalable of the protocols tested.
This paper presents a multithreaded superpipelined superscalar processor design. It is expected to sustain a rate of 5.4 instructions per cycle with 4 threads on chip. Multithreading improves the superscalar CPI by interleaving the execution of threads. Operator sharing is used instead of out-of-order execution. It requires less hardware (no reservation stations, collision vectors or renamed registers) and should offer greater potential for parallelism. Arithmetic operators, including adders, shifters, a multiplier and a step divider, have been pipelined to reduce the processor cycle width to the propagation delay of a 16-bit adder. Separate, equal-length data paths controlled by a fully RISC instruction set allow efficient in-order issue and termination. Floating-point operations are emulated with integer operations, using data-dependent algorithms that provide latencies as good as those of a traditional hardware implementation. A single register file serves both integer and floating-point data.
ISBN: 9780818675829 (print)
Evaluates the High Performance Fortran (HPF) language for the compact expression and efficient implementation of conjugate-gradient iterative matrix-solvers on high-performance computing and communications (HPCC) platforms. We discuss the use of intrinsic functions, data distribution directives and explicitly parallel constructs to optimize performance by minimizing communications requirements in a portable manner. We focus on implementations using the existing HPF definitions but also discuss issues arising that may influence a revised definition for HPF-2. Some of the codes discussed are available on the World Wide Web at http://***/hpfa/, along with other educational and discussion material related to applications in HPF.
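For readers unfamiliar with the solver being evaluated, here is a minimal sequential sketch of the textbook conjugate-gradient iteration for a symmetric positive-definite system, written in plain Python rather than HPF. It is not taken from the paper's codes, and the helper names `dot` and `matvec` are illustrative. These two helpers correspond to the dot-product reductions and matrix-vector products that dominate each iteration, which is presumably where data distribution and communication matter most in a parallel version.

```python
def dot(u, v):
    # Inner product; in a data-parallel setting this is a global reduction.
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    # Dense matrix-vector product, one row at a time.
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    # Solves A x = b for symmetric positive-definite A, starting from x = 0.
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, which equals b when x = 0
    p = list(r)            # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

For example, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` approximates the exact solution (1/11, 7/11).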
ISBN: 0818674601 (print)
The distance transform (DT) and the medial axis transform (MAT) are two important image operations. Both are used to extract information about the shape and position of the foreground pixels relative to each other. These transforms have many applications in image processing and computer vision, such as expanding, shrinking, thinning and computing shape factors. Each of the two transforms is essentially a global operation, and unless the digital image is very small, global operations are prohibitively costly. To compute the transforms efficiently, it is highly desirable to develop parallel algorithms for them. In this paper, we provide the fastest parallel algorithms to compute the chessboard distance transform (CDT), which is a DT based on the chessboard metric, and the medial axis transform (MAT). Each transform of a 2-D binary image array of size N × N can be computed in O(1) time on a 2-D 2N × 2N reconfigurable array of processors (RAP).
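To make the chessboard metric concrete, the sketch below computes a chessboard distance transform sequentially with the classic two-pass raster scan. This is a swapped-in textbook technique, not the O(1) reconfigurable-array algorithm of the paper. It assumes that foreground pixels are 1 and background pixels are 0, and that each foreground pixel receives its chessboard distance max(|Δrow|, |Δcol|) to the nearest background pixel.

```python
def chessboard_distance_transform(image):
    # image: list of equal-length rows containing 0 (background) or 1 (foreground).
    rows, cols = len(image), len(image[0])
    far = rows + cols  # larger than any chessboard distance inside the image
    dist = [[0 if image[r][c] == 0 else far for c in range(cols)]
            for r in range(rows)]

    def relax(r, c, offsets):
        # Lower dist[r][c] using already-processed 8-connected neighbours.
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                dist[r][c] = min(dist[r][c], dist[rr][cc] + 1)

    # Forward pass: top-left to bottom-right, neighbours above and to the left.
    for r in range(rows):
        for c in range(cols):
            relax(r, c, ((-1, -1), (-1, 0), (-1, 1), (0, -1)))

    # Backward pass: bottom-right to top-left, neighbours below and to the right.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            relax(r, c, ((1, 1), (1, 0), (1, -1), (0, 1)))

    return dist
```

The sequential scan takes time proportional to the number of pixels, which is the cost a parallel algorithm aims to remove. If the image contains no background pixel, every entry keeps the sentinel value `rows + cols`.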
In this paper, we present an implementation of an FEM solver on a heterogeneous parallel environment consisting of two different parallel computers (the Fujitsu AP1000 and the NEC Cenju-3). We used the OLU (On Line University) Network, a wide-area ATM network connecting over 20 universities and research facilities all over Japan. As the calculation algorithm we used the substructure method applied at multiple levels. This algorithm has little data dependency in the calculation phase, and the number of data transfers between the parallel computers is limited, so the synchronization overhead can be smaller than with an iterative method. This paper also refers to a parallel triangular mesh generator based on Delaunay triangulation that is currently being implemented on the Fujitsu AP1000 parallel computer.
Discusses the advantages of computing with heterogeneous parallel machines, and examines the research challenges for automating the use of such systems. One type of heterogeneous computing system is a mixed-mode machine, where a single machine can operate in different modes of parallelism. Another is a mixed-machine system, where a suite of different kinds of parallel machines is interconnected by high-speed links. To exploit such systems, a task must be decomposed into subtasks, where each subtask is computationally homogeneous. The subtasks are then assigned to and executed on the machines (or modes) that will result in a minimal overall execution time. Typically, users must specify this decomposition and assignment. One long-term pursuit in heterogeneous computing is to do this automatically. An overview of a conceptual model of what this involves is given. As an example of the research in this area, a genetic-algorithm-based approach to the subtask assignment and scheduling problem is explored. Open problems in heterogeneous computing are described.
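As a rough illustration of what a genetic-algorithm-based assignment can look like, the sketch below evolves a mapping of subtasks to machines that minimizes the overall finish time. It is not the approach evaluated in the paper. It assumes independent subtasks with no communication cost or precedence constraints, a known table `exec_time[s][m]` giving the time of subtask `s` on machine `m`, and hypothetical parameter values for population size, generations and mutation rate.

```python
import random


def makespan(assignment, exec_time):
    # Finish time of the busiest machine for a subtask -> machine mapping.
    load = [0.0] * len(exec_time[0])
    for subtask, machine in enumerate(assignment):
        load[machine] += exec_time[subtask][machine]
    return max(load)


def genetic_assign(exec_time, pop_size=30, generations=100,
                   mutation_rate=0.1, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_sub, n_mach = len(exec_time), len(exec_time[0])
    # Each chromosome lists the machine chosen for every subtask.
    pop = [[rng.randrange(n_mach) for _ in range(n_sub)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: makespan(a, exec_time))
        survivors = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_sub) if n_sub > 1 else 0
            child = p1[:cut] + p2[cut:]   # one-point crossover
            for i in range(n_sub):        # random reassignment as mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_mach)
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda a: makespan(a, exec_time))
```

A realistic version would also encode the execution order of subtasks and charge for data transfers between machines; both are omitted here to keep the objective readable.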