Communication latency is central to multiprocessor design. This report presents the design principles of EM-X multiprocessor towards tolerating communication latency. Multi-threading principle is built in the EM-X to ...
详细信息
Communication latency is central to multiprocessor design. This report presents the design principles of EM-X multiprocessor towards tolerating communication latency. Multi-threading principle is built in the EM-X to overlap communication and computation for latency tolerance. In particular, we present two types of hardware support for remote memory access: (1) priority-based packet scheduling for thread invocation, and (2) direct remote memory access mechanism. The priority-based scheduling policy extends a FIFO ordered thread invocation policy to adapt to different computational needs. The direct remote memory access based on non-preemptive thread execution is designed to overlap remote memory operations while executing threads. We give two examples to explain our approach. The 80-processor prototype of EM-X is currently being fabricated and is expected to be operational in the near future. Preliminary evaluation indicates that the EM-X can effectively overlap computation and communication, toward tolerating communication latency for high performance parallel computing.< >
We present deterministic algorithms for k-k routing and k-k sorting on circular processor arrays with bidirectional connections. We distinguish between cases where k
We present deterministic algorithms for k-k routing and k-k sorting on circular processor arrays with bidirectional connections. We distinguish between cases where k<4, 4/spl les/k >
In distributed memory parallel machines, data access times can vary greatly depending on data location. This makes locality considerations important for improving performance. Switching locality is a special kind of l...
详细信息
In distributed memory parallel machines, data access times can vary greatly depending on data location. This makes locality considerations important for improving performance. Switching locality is a special kind of locality which conventional networks fail to exploit fully. It refers to the phenomenon in which each computation entity in a parallel application switches most of its communication between a small set of other entities. Furthermore, the membership of these sets changes infrequently. Switching locality arises naturally in many parallel applications. The Interconnection Cached Network (ICN) is a reconfigurable network especially well suited to exploiting this locality. For applications with sufficient switching locality, appropriate choices of topology and mapping in the ICN ensure that no communication request passes through more than two switches. Short communication paths reduce propagation delays and network congestion; resulting in better overall performance. In comparison, other networks are less effective in meeting these objectives. We corroborate our stand by simulating the operation of the ICN, a multi-stage interconnection network and a 2-D Mesh network on communication graphs derived from computations on unstructured grids and sparse matrices.< >
parallel algorithms developed for CAD problems today suffer from three important drawbacks. First, they are machine specific and tend to perform poorly on architectures other than the one for which they were designed....
详细信息
parallel algorithms developed for CAD problems today suffer from three important drawbacks. First, they are machine specific and tend to perform poorly on architectures other than the one for which they were designed. Second, they cannot use the latest advances in improved versions of the sequential algorithms for solving the problem. Third, the quality of results degrade significantly during parallel execution. We address these three problems for an important CAD application: standard cell placement. We have developed a parallel placement algorithm that is portable across a range of MIMD parallelarchitectures. The algorithm is part of the ProperCAD project which allows the development and implementation of a parallel algorithm such that it can be executed on a wide variety of parallel machines without any change to the source. The parallel placement algorithm is based on an existing implementation of the sequential simulated annealing algorithm, TimberWolfSC 6.0 (C. Sechen and A. Sangiovanni-Vincentelli, 1985).< >
The complete exchange (or all-to-all personalized) communication pattern occurs frequently in many important parallel computing applications. We discuss several algorithms to perform complete exchange on a two dimensi...
详细信息
The complete exchange (or all-to-all personalized) communication pattern occurs frequently in many important parallel computing applications. We discuss several algorithms to perform complete exchange on a two dimensional mesh connected computer with wormhole routing. We propose algorithms for both power-of-two and non power-of-two meshes as well as an algorithm which works for any arbitrary mesh. We have developed analytical models to estimate the performance of the algorithms on the basis of system parameters. These models take into account the effects of link contention and other characteristics of the communication system. Performance results on the Intel Touchstone Delta are presented and analyzed.< >
We consider the problems of sorting and routing on some unbounded interconnection networks, namely hypercube and de Bruijn network. We first present two efficient implementations of quicksort on the hypercube. The fir...
详细信息
We consider the problems of sorting and routing on some unbounded interconnection networks, namely hypercube and de Bruijn network. We first present two efficient implementations of quicksort on the hypercube. The first algorithm sorts N items on an N-node hypercube, one item per node, in O((log/sup 2/ N)/(log log N)) time with high probability, while the other one sorts N items on an (N/log N)-node hypercube, log N items per node, in O(log/sup 2/ N) time with high probability, which achieves optimal speedup in the sense of PT product. Both algorithms beat the fastest previous quicksort that runs in O(log/sup 2/ N) expected time on a butterfly of N nodes. We also present a deterministic (nonoblivious) permutation routing algorithm which runs in O(d/spl middot/n/sup 2/) time on a d-ary de Bruijn network of N=d/sup n/ nodes. To the best of our knowledge, this routing algorithm is so far the fastest deterministic one for the de Bruijn network of arbitrary degree. The best previous one runs in O((log d)/spl middot/d/spl middot/n/sup 2/) time. All algorithms presented are simple, the constants hidden behind the big Oh being small.< >
We consider the educational process with respect to parallel computing. Education in this area is provided by national supercomputing centers, a variety of manufacturers of parallel machines, within companies that use...
详细信息
We consider the educational process with respect to parallel computing. Education in this area is provided by national supercomputing centers, a variety of manufacturers of parallel machines, within companies that use parallel machines, as well as by colleges and universities for both undergraduate and graduate students. The panel evaluates the current system, and debate potential (realistic) improvements that could be implemented in the short-term.< >
This paper gives an O(lg/sup 3/ n)-time and n/lg n processor algorithm for solving the matrix chain ordering problem and for finding optimal triangulations of a convex polygon on the common CRCW PRAM model. This algor...
详细信息
This paper gives an O(lg/sup 3/ n)-time and n/lg n processor algorithm for solving the matrix chain ordering problem and for finding optimal triangulations of a convex polygon on the common CRCW PRAM model. This algorithm works by finding shortest paths in special digraphs modeling dynamic programming tables. Also, a key part of the algorithm is improved by computing row minima of a totally monotone matrix, this lets the algorithm run in O(lg/sup 2/ n) time with n processors on the EREW PRAM or even O(lg/sup 2/ n lg lg n) time with n/lg lg n processors on the CRCW PRAM.< >
We compare various models of parallel machines and show that they can be classified in two classes: algorithm oriented or execution oriented. None of them are really satisfying from the user's point of view. Hence...
详细信息
We compare various models of parallel machines and show that they can be classified in two classes: algorithm oriented or execution oriented. None of them are really satisfying from the user's point of view. Hence bridging models have been proposed. Contrarily to what is done in sequential where a two-level decomposition is used (programmimg-compiling), we assert that a parallelprogramming methodology must be based on a three-level decomposition. We define the notion of algorithms which scales on a distributed memory parallel computer. We propose such a methodology and advocate its advantages. Then we point out the main difficulties in parallelprogramming.< >
A network of supercomputers and high-performance workstations appears to be the only reasonable way to provide adequate computing resources for the Grand Challenge problems of the next century. Such a collection of co...
详细信息
A network of supercomputers and high-performance workstations appears to be the only reasonable way to provide adequate computing resources for the Grand Challenge problems of the next century. Such a collection of computers and supporting software environments is called a virtual computing environment (VCE). The paper describes the motivation and goals of the VCE project, followed by a description of the system. The paper concentrates on the runtime aspects of the VCE, and concludes with a discussion of a small prototype system that has been built using the Isis distributed toolkit.< >
暂无评论