The problems of measuring the performance of a highly parallel multiple-processor system, such as the 4096-element ICL Distributed Array Processor (DAP), are presented in relation to the conventional methods used for serial processors. This is preceded by a brief description of the DAP hardware, to provide a framework for the discussion, together with some of the resulting implications for algorithm design. The importance of choosing algorithms for parallel computation so as to make the best use of the hardware's parallelism for the problem to be solved is discussed, and examples are given of parallel and hybrid algorithms; in the latter, a mixture of serial and parallel techniques is used. A method for comparing performance at the problem-solving level is presented and illustrated by results obtained by DAP users studying problems that arise in a wide range of application areas.
A large-network algorithm solves a problem of size N on a network of N processors. We present a method for transforming certain large networks into quotient networks that emulate those large networks with fewer processors. Large-network algorithms are easily modified to execute on the quotient network. The emulations result in no loss in execution efficiency. Quotient networks allow algorithms to be designed assuming any number of processors and executed efficiently at a great savings in hardware cost.
Fast Fourier transform (FFT) algorithms for single instruction multiple data (SIMD) machines are developed which simultaneously solve any combination of FFTs of different sizes and even different spatial dimensionalities. The only restrictions are that all the periods must be powers of two and that the initial data must satisfy some alignment requirements with the address space in the computer. The degree of parallelism is equal to the sum of the sizes of all the subproblems, and the (parallel) solution time is proportional to log2(m), where m is the number of points in the largest subsystem. It is shown that the task of unscrambling the data can be both executed and scheduled efficiently in parallel. Finally, implementations on the MasPar computer are described. The codes can be quickly and easily employed in solving complicated problems, and the interface for the routines may therefore be of interest for sequential FFT codes as well.
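The idea of running several power-of-two FFTs in lockstep can be sketched as follows. This is a minimal illustration and not the algorithm from the paper: it assumes a radix-2 decimation-in-frequency scheme, packs all subproblems into one flat array, and uses a serial loop to stand in for the single SIMD step in which every element would update at once. The names `batched_fft` and `bit_reverse` are hypothetical, and the paper's alignment requirements are not modeled.

```python
import cmath

def bit_reverse(j, bits):
    """Reverse the lowest `bits` bits of j."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (j & 1)
        j >>= 1
    return r

def batched_fft(signals):
    """Transform several power-of-two-length signals stage by stage.

    Each pass of the inner loop models one data-parallel step over all
    elements; the number of passes is log2 of the largest signal length.
    """
    sizes = [len(s) for s in signals]
    if any(n < 1 or n & (n - 1) for n in sizes):
        raise ValueError("every length must be a power of two")

    data = [complex(v) for s in signals for v in s]
    owner = []                       # (start, size) of each element's subproblem
    start = 0
    for n in sizes:
        owner.extend((start, n) for _ in range(n))
        start += n

    stages = max(n.bit_length() - 1 for n in sizes)
    for s in range(stages):
        new = list(data)
        for i, (first, n) in enumerate(owner):   # conceptually simultaneous
            block = n >> s
            if block < 2:
                continue             # this subproblem has already finished
            half = block // 2
            k = (i - first) % block
            if k < half:             # upper butterfly output
                new[i] = data[i] + data[i + half]
            else:                    # lower output, scaled by a twiddle factor
                w = cmath.exp(-2j * cmath.pi * (k - half) / block)
                new[i] = (data[i - half] - data[i]) * w
        data = new

    # Unscramble: results are stored in bit-reversed order within each signal.
    out = []
    start = 0
    for n in sizes:
        bits = n.bit_length() - 1
        out.append([data[start + bit_reverse(j, bits)] for j in range(n)])
        start += n
    return out

if __name__ == "__main__":
    for spectrum in batched_fft([[1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]):
        print([round(abs(c), 6) for c in spectrum])
```

In this sketch the smaller signal simply idles during the later stages, so the stage count depends only on the largest subproblem, which mirrors the log2(m) solution time stated in the abstract.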
ISBN: (print) 0818638109
Using a prime number N of memory banks on a vector processor allows conflict-free access to any slice of N consecutive elements of a vector stored with a stride that is not a multiple of N. The usual argument against a prime (or odd) number N of memory banks is that address computation for such a memory system would require systematic Euclidean division by N. We first show that the well-known Chinese Remainder Theorem makes it possible to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean division. Massively parallel SIMD computers may have several thousand processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot generally be computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors from memory to the processors through the interconnection network has very specific properties. The Chinese Remainder Network presented in the paper is able to execute all these permutations in a single path and may be self-routed.
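A residue-based bank mapping of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the mapping defined in the paper: it assumes N banks of M words each with gcd(N, M) = 1, places address a in bank a mod N at offset a mod M, and walks a strided vector by adding the reduced stride and conditionally subtracting N, so no division is needed per element. The function names and the values N = 7, M = 8 are hypothetical.

```python
from math import gcd

def crt_map(addr, n_banks, words_per_bank):
    """Map a linear address to (bank, offset) by taking two residues.

    When the two moduli are coprime, the Chinese Remainder Theorem
    guarantees that every address below n_banks * words_per_bank lands
    in a distinct cell.
    """
    return addr % n_banks, addr % words_per_bank

def stride_banks(base, stride, count, n_banks):
    """Banks touched by `count` strided elements, computed incrementally."""
    bank = base % n_banks        # one-time reduction of the start address
    step = stride % n_banks      # one-time reduction of the stride
    banks = []
    for _ in range(count):
        banks.append(bank)
        bank += step
        if bank >= n_banks:      # add and conditional subtract, no division
            bank -= n_banks
    return banks

N_BANKS = 7   # illustrative prime number of banks
WORDS = 8     # illustrative words per bank, coprime to N_BANKS
assert gcd(N_BANKS, WORDS) == 1

if __name__ == "__main__":
    cells = {crt_map(a, N_BANKS, WORDS) for a in range(N_BANKS * WORDS)}
    print(len(cells))                                   # 56 distinct cells
    print(sorted(stride_banks(3, 5, N_BANKS, N_BANKS)))  # [0, 1, 2, 3, 4, 5, 6]
```

With a power-of-two M the offset is a bit mask in hardware, and the stride-5 access above touches each of the 7 banks exactly once. A stride that is a multiple of 7 would hit the same bank every time, which is the case the abstract excludes.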
ISBN: (print) 0769500870
SIMD arrays are likely to become increasingly important as coprocessors in domain-specific systems as architects continue to leverage RAM technology in their designs. The problem this work addresses is the evaluation of SIMD arrays with respect to complex applications while accounting for operating frequency and chip area. Our method bridges the gap between architecture-level and EDA-level modeling by using an EDA-based tool to calibrate architectural simulations. The resulting system retains much of the high throughput of the architecture-level simulator but also has accuracy similar to that of an early-pass EDA synthesis and circuit simulation. We have used our system to evaluate hundreds of potential SIMD array designs with respect to real applications. Some of the results were surprising: the slowdown caused by underutilized resources was significantly greater than we had anticipated.
ISBN: (print) 0898712882
We discuss the implementation of additive Schwarz-type algorithms on SIMD computers. A recursive, additive algorithm is compared with a two-level scheme. These methods are based on a subdivision of the domain into thousands of micro-patches that can reflect local properties, coupled with a coarser, global discretization where the 'macro' behavior is reflected. The two-level method shows very promising flexibility, convergence and performance properties when implemented on a massively parallel SIMD computer.