A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: (i) the introduction of process groups, (ii) the definition of semantics that ensures correctness, and (iii) the design of new and tunable algorithms based on a realistic point-to-point communications model.
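The abstract does not spell out CCL's programming interface. As a rough illustration of the kind of operations such a library exposes (a process group plus collectives such as broadcast and reduce), the following minimal C sketch uses the standard MPI interface instead; it is not CCL's API, and the even/odd group split is only an example.

/* Illustrative sketch only: this is standard MPI, not CCL's API.
 * MPI communicators and collectives play the roles the abstract
 * describes: process groups, broadcast, and reduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value, sum, grank;
    MPI_Comm group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Form a process group: even and odd ranks go to separate groups. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &group);

    /* Broadcast a value from rank 0 of each group to the whole group. */
    value = (rank == 0 || rank == 1) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, group);

    /* Reduce: sum the (now identical) values across the group. */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, group);

    MPI_Comm_rank(group, &grank);
    if (grank == 0)
        printf("group containing global rank %d: sum = %d\n", rank, sum);

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}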
This paper presents a summary of the delivery of the C-DAC mission in terms of the PARAM parallel supercomputer, its architecture, programming environment, and applications. A brief preview is also given of C-DAC's Second Mission to deliver a teraflops-architecture massively parallel supercomputer by 1997-98.
In this paper we present an improved search procedure for the General Parameter Method (GPM) [1]. Our procedure maps uniform dependence algorithms to application-specific processor arrays (PAs). It can optimize general design objectives with certain non-monotonicity properties, i.e., those that do not increase monotonically with the parameters. An example of such an objective is the minimization of the total completion time, including load and drain times. In contrast, earlier design methods can only deal with monotonic objectives. We present results for the matrix-product problem using this search technique; application of the technique to the transitive-closure problem is presented elsewhere [2]. We also show that the parameters in GPM can be expressed in terms of the schedule vector Π and allocation matrix S in the popular dependence-based method (DM), thereby allowing GPM to be used in DM for finding optimal designs for uniform dependence algorithms.
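For context on the closing sentence: in the dependence-based method, the schedule vector Π assigns each iteration (with index vector i) a time step and the allocation matrix S assigns it a processor, subject to causality along every dependence vector d. The display below is the textbook space-time mapping, included for orientation only; it is not a result of this paper.

\[
  t(\vec{\imath}) = \Pi\,\vec{\imath},
  \qquad
  p(\vec{\imath}) = S\,\vec{\imath},
  \qquad
  \Pi\,\vec{d} \geq 1 \ \text{for every dependence vector } \vec{d}.
\]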
The extended GCD algorithm is very useful for data dependence tests, for example the Power Test, on supercomputers. This paper parallelizes the extended GCD algorithm on a CREW SM MIMD computer with O(n) processors. First, we improve the sequential extended GCD algorithm. Then, we parallelize the extended GCD algorithm by two methods. In the first, we parallelize the triangularization of the matrix by reducing elements in the same column simultaneously. This algorithm has almost no algorithmic redundancy, but has efficiency O(1/log₂ n), where n is the number of variables. In the second, some rows that have already been reduced at the current column can be reduced at the next column immediately. This algorithm has efficiency O(min(m, n-1)/n), where m is the number of linear Diophantine equations, and is powerful when m is large; when m is equal to n-1, the efficiency of the algorithm is O(1).
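For reference, the classic two-operand extended GCD computes g = gcd(a, b) together with Bezout coefficients x, y such that ax + by = g. The paper's sequential algorithm generalizes this to n variables in matrix (triangularization) form, so the C sketch below shows only the familiar building block, not the improved or parallel algorithm itself.

/* Classic iterative extended GCD: returns g = gcd(a, b) and sets
 * x, y so that a*x + b*y = g.  Shown only as the sequential building
 * block; the paper's algorithm works on n variables at once. */
#include <stdio.h>

long ext_gcd(long a, long b, long *x, long *y)
{
    long x0 = 1, y0 = 0;   /* coefficients carried for a */
    long x1 = 0, y1 = 1;   /* coefficients carried for b */

    while (b != 0) {
        long q = a / b;
        long t;
        t = a - q * b;   a = b;   b = t;    /* Euclidean step      */
        t = x0 - q * x1; x0 = x1; x1 = t;   /* update x coefficient */
        t = y0 - q * y1; y0 = y1; y1 = t;   /* update y coefficient */
    }
    *x = x0;
    *y = y0;
    return a;
}

int main(void)
{
    long x, y;
    long g = ext_gcd(1071, 462, &x, &y);
    printf("gcd = %ld, x = %ld, y = %ld\n", g, x, y);  /* 21, -3, 7 */
    return 0;
}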
This paper presents the results of running some benchmarks from the Genesis suite on the Transtech Paramid. The benchmarks use the PARMACS parallel processing standard, and are based on applications in the fields of general relativity, molecular dynamics, and QCD. The Paramid is a distributed-memory parallel computer using up to 64 Intel i860-XP processors. The results demonstrate good parallel performance and the ability of the machine to run standard portable software.
Scalable parallel computer architectures provide the computational performance demanded by advanced biological computing problems. NIH has developed a number of parallel algorithms and techniques useful in determining biological structure and function. These applications include processing electron micrographs to determine the three-dimensional structure of viruses, calculating the solvent-accessible surface area of proteins to predict the three-dimensional conformation of these molecules from their primary structure, and searching for homologous DNA sequences in large genetic databases. Timing results demonstrate substantial performance improvements with parallel implementations compared with conventional sequential systems.
Performance analysis of concurrent executions in parallel systems has been recognized as a challenging problem. The aim of this research is to study approximate solution techniques for this problem. We model the structure of the parallel machine and the structure of the jobs executing on such a system. We investigate rich classes of jobs, which can be expressed by sequential, And-Fork/And-Join, Or-Fork/Or-Join, and probabilistic-fork constructs. We propose an efficient performance prediction technique for these classes of jobs.
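The abstract names the job classes but not the prediction technique. The C sketch below is only a toy way to make the class names concrete: a job is a tree of sequential sections, And-Fork/And-Join and Or-Fork/Or-Join constructs, and probabilistic forks, and a naive estimate takes sums, maxima, and probability-weighted averages while ignoring queueing and contention. None of this is the approximation method proposed in the paper.

/* Toy illustration of the job classes named in the abstract.  The
 * estimate (sum, max, worst-case branch, weighted average) is NOT the
 * paper's prediction technique; it ignores queueing, contention, and
 * the number of processors, and exists only to fix the terminology. */
#include <stdio.h>

enum kind { TASK, SEQUENCE, AND_FORK, OR_FORK, PROB_FORK };

struct job {
    enum kind kind;
    double service;        /* service demand of a TASK leaf          */
    int nchild;
    struct job **child;
    const double *prob;    /* branch probabilities for PROB_FORK     */
};

static double estimate(const struct job *j)
{
    double t = 0.0;
    switch (j->kind) {
    case TASK:                        /* single task: its service demand  */
        return j->service;
    case SEQUENCE:                    /* sequential: add up the stages    */
        for (int i = 0; i < j->nchild; i++)
            t += estimate(j->child[i]);
        return t;
    case AND_FORK:                    /* And-Join waits for every branch  */
    case OR_FORK:                     /* treat Or-Join as worst case here */
        for (int i = 0; i < j->nchild; i++) {
            double b = estimate(j->child[i]);
            if (b > t) t = b;
        }
        return t;
    case PROB_FORK:                   /* probability-weighted branch time */
        for (int i = 0; i < j->nchild; i++)
            t += j->prob[i] * estimate(j->child[i]);
        return t;
    }
    return 0.0;
}

int main(void)
{
    struct job a = { TASK, 3.0, 0, NULL, NULL };
    struct job b = { TASK, 5.0, 0, NULL, NULL };
    struct job *branches[2] = { &a, &b };
    struct job andfork = { AND_FORK, 0.0, 2, branches, NULL };

    printf("And-Fork/And-Join estimate: %.1f\n", estimate(&andfork));  /* 5.0 */
    return 0;
}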
In this paper we present a new mapping strategy for the dynamic space warping algorithm (DSWA) onto our micro-grained array processor (MGAP). This new mapping strategy reduces the communication complexity between processing elements and increases performance due to data pipelining and interleaving. The DSWA, which can be applied to image recognition, originally needs a four-dimensional array. Practically, however, this four-dimensional algorithm must be mapped onto a two-dimensional array processor. A previous mapping used O(NW) processors to compute the distance between an N × N input image and a reference image with warping distance W in O(NW) time. Our new mapping scheme uses O(N²) processors to generate each computation result in O(N+W²) time. We also show experimental results and a performance comparison between the Connection Machine (CM) 200 and the MGAP.
Previous work on parallel database systems has paid little attention to the interaction of asynchronous disk prefetching and processor parallelism. This paper investigates this issue for scan operations on shared-memory multiprocessors. Two heuristic methods are developed for the allocation of processors and memory to optimize either the speedup or the benefit/cost ratio of database scan operations. The speedup optimization balances the data production rate of the disks and the data consumption rate of the processors, aiming at optimal speedup while ensuring that resources are not allocated unnecessarily.
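The abstract gives the balancing principle but not the heuristic formulas. As a simple illustration of the idea, the sketch below picks the smallest processor count whose aggregate scan consumption rate covers the aggregate production rate of the prefetching disks; the rate figures and the plain ceiling rule are illustrative assumptions, not the paper's heuristics.

/* Rough illustration of rate balancing for a parallel scan: choose the
 * smallest number of processors whose aggregate consumption rate covers
 * what the prefetching disks can produce.  The rates and the ceiling
 * rule are illustrative assumptions, not the heuristics in the paper. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double disk_rate_mb_s = 5.0;    /* data production rate per disk       */
    double cpu_rate_mb_s  = 12.0;   /* scan consumption rate per processor */
    int    num_disks      = 8;

    double produced = num_disks * disk_rate_mb_s;            /* 40 MB/s */
    int processors  = (int)ceil(produced / cpu_rate_mb_s);   /* 4 CPUs  */

    printf("balanced allocation: %d processors keep up with %d disks (%.1f MB/s)\n",
           processors, num_disks, produced);
    return 0;
}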
The THOREAU simulation of vehicular traffic on city streets and freeways, developed by the MITRE Corporation, has been adapted to run in parallel on a network of Unix workstations connected by Ethernet. Tenfold and larger speedups were observed by running as many as 40 parallel threads on 34 processors. The performance curves show little sign of leveling off with higher degrees of parallelism, which may mean that further gains can be obtained with more processors.