The study of Regular Iterative algorithms (RIAs), which was introduced in a seminal paper by Krap, Miller, and Winograd in 1967, forms the basis for systematic design and analysis of regular processor arrays, includin...
详细信息
ISBN:
(纸本)0897913701
The study of Regular Iterative algorithms (RIAs), which was introduced in a seminal paper by Krap, Miller, and Winograd in 1967, forms the basis for systematic design and analysis of regular processor arrays, including the class of systolic arrays. The RIAs have also been studied under different contexts and different names (on the last count RIAs were reintroduced, as late as 1987, under the name of dynamic graphs). In spite of the interest such algorithms have received over the years, many important issues that were left unresolved in the original paper by Karp et al., have remained unanswered. In this paper we answer many such questions, particularly those relating to parallel scheduling and implementation of RIAs. Based on the analysis of a simple graph that captures the dependence structure of a given RIA, we are able to determine linear subspaces in the index space of a given RIA such that all variables lying on the same subspace can be scheduled at the same time;this generalizes the so-called hyperplanar scheduling which was shown by Karp et. al. to work for only a subclass of RIAs. This geometric scheduling scheme is shown to be asymptotically optimal and is used to completely characterize the extent of parallelism in any RIA. Moreover, we develop procedures to determine explicit schedules (i.e, a closed form expression for the schedule of every computation in the algorithm) that correspond to the geometric schedules, and also show that every RIA can be automatically mapped onto regular processor arrays.
The Viterbi algorithm (VA) is a common application of dynamic programming. Since it contains a nonlinear ACS (add-compare-select) feedback loop, this loop is the bottleneck in high-data-rate implementations. It is sho...
详细信息
The Viterbi algorithm (VA) is a common application of dynamic programming. Since it contains a nonlinear ACS (add-compare-select) feedback loop, this loop is the bottleneck in high-data-rate implementations. It is shown that, asymptotically, the ACS feedback no longer has to be processed recursively, i.e., there is no feedback, resulting in negligible performance loss. This can be exploited to derive purely feedforward architectures for Viterbi decoding, so that a modular cascadable implementation results. By designing one cascadable module, any speedup can be achieved simply by adding modules to the implementation. It is shown that optimization criteria, e.g., minimum latency or maximum hardware efficiency, are met by very different architectures.
Efficient, numerically stable and highly concurrent algorithms for adaptive multichannel least-squares lattice filters are presented. They require O(pm2) orthogonal operations to update the linear prediction and estim...
详细信息
Efficient, numerically stable and highly concurrent algorithms for adaptive multichannel least-squares lattice filters are presented. They require O(pm2) orthogonal operations to update the linear prediction and estimation errors as well as all reflection coefficients for a m-channel and p-tap problem. parallel array architectures for high-speed implementations are discussed.
If p(x) and q(x) are polynomials of degree n and n - 1, respectively, having real interlaced zeros, all the coefficients of the polynomials generated by the Euclidean scheme applied to p(x) and q(x) can be computed by...
详细信息
ISBN:
(纸本)0897913701
If p(x) and q(x) are polynomials of degree n and n - 1, respectively, having real interlaced zeros, all the coefficients of the polynomials generated by the Euclidean scheme applied to p(x) and q(x) can be computed by using O(log3 n) parallel arithmetic steps and n2 processors. The number of arithmetic operations is within a polylogarithmic factor from the straightforward lower bound. This result can be applied to the solution of the inverse eigen-value problem for Jacobi matrices and can be extended to any pair of polynomials provided that the Euclidean scheme is carried out in n steps. The algorithm is obtained by reducing the Euclidean scheme computation to the evaluation of the Cholesky factorization of a positive definite Hankel matrix. The same reduction can be used to compute the GCD of two general polynomials by evaluating a vector of minimum 'degree' in the kernel of a Hankel matrix. This new characterization of the GCD yields an algorithm for its computation in O(log2 n) parallel steps using n2 processors.
We present the chaos router, an asynchronous adaptive router, which under certain circumstances can send messages farther from their destinations. The chaos router greatly simplifies the routing logic by removing the ...
详细信息
ISBN:
(纸本)0897913701
We present the chaos router, an asynchronous adaptive router, which under certain circumstances can send messages farther from their destinations. The chaos router greatly simplifies the routing logic by removing the livelock protection of previous schemes. Through an effective use of randomness, whose sources include that due to the adaptively processed load, the natural timing differences of selftimed circuitry and explicitly injected randomization, the chaos router avoids long message routes with high probability. In this paper the router is described, it is argued that the chaos router is deadlock free and probabilistically livelock and starvation free, and simulation results are presented showing that the chaos router performs well.
Methods for implementing 2-D reduced-update Kalman filtering using parallel machines are described. Various types of parallelarchitectures can be used for the implementation. algorithms are described and compared for...
详细信息
ISBN:
(纸本)0818620536
Methods for implementing 2-D reduced-update Kalman filtering using parallel machines are described. Various types of parallelarchitectures can be used for the implementation. algorithms are described and compared for implementation on the Sequent Balance 21, a multiprocessor system with shared memory, CLIP4, a 96 × 96 SIMD processor array, and the Connection Machine, an SIMD array of 64K processors. All the machines show great improvement in performance. The advantage with the Connection Machine is the individual processor's ability to write and read from different locations. This allows multiple-image filtering and a great increase in the efficiency of processor usage.
We examine the issue of running algorithms with a constant factor slowdown on a faulty hypercube in a worst case scenario. We present two set of novel results related to this issue. We first consider edge faults and s...
详细信息
ISBN:
(纸本)0897913701
We examine the issue of running algorithms with a constant factor slowdown on a faulty hypercube in a worst case scenario. We present two set of novel results related to this issue. We first consider edge faults and show how to tolerate faults with a constant factor slowdown in communication and no slowdown in computation. The key to our approach is an efficient method for embedding a fault free Cube Connected Cycles (CCC) graph in the faulty hypercube. Using this embedding we can run ascend-descend algorithms (such as bitonic sort) on the faulty hypercube by implementing them on the embedded CCC. We then consider hypercubes with both edge and node faults. We prove that for any constant c there exists a fault-free subgraph of an n-dimensional hypercube with nc faulty components that can implement a large class of hypercube algorithms with only a constant factor slowdown. To the best of our knowledge, this result is the first in which a hypercube can tolerate more than O(n) faults in the worst case sense.
This paper examines the problem of constructing disjoint paths through a 3-dimensional grid in order to connect specified origin/destination terminal pairs so as to minimize the width of the grid. Specifically, we pre...
详细信息
ISBN:
(纸本)0897913701
This paper examines the problem of constructing disjoint paths through a 3-dimensional grid in order to connect specified origin/destination terminal pairs so as to minimize the width of the grid. Specifically, we present an algorithm which uses at most d/(L - p) + O(p√d(L - p)) rows, where L is the number of layers, p is the number of terminals per stack, and d is the problem density. This problem has applications in several areas of VLSI computation: it can be viewed as a model for 3-dimensional channel routing, and it can be applied to the problem of packet routing through mesh-connected arrays with no queues.
The authors present algorithms for relational database operations using mesh connected processors. An important feature of these algorithms is that they are fault tolerant, that is, errors in computation that may aris...
详细信息
The authors present algorithms for relational database operations using mesh connected processors. An important feature of these algorithms is that they are fault tolerant, that is, errors in computation that may arise owing to either permanent or transient failure in a single processor are detected and corrected. These algorithms are immediately useful for main memory databases. Data elements for mesh connected processors can be fed from main memory.
The problem of developing scalable and near-optimal algorithms for orthogonal shared-memory multiprocessing systems with a multidimensional access (MDA) memory array is considered. An orthogonal shared-memory system c...
详细信息
ISBN:
(纸本)0818620536
The problem of developing scalable and near-optimal algorithms for orthogonal shared-memory multiprocessing systems with a multidimensional access (MDA) memory array is considered. An orthogonal shared-memory system consists of 2n processors and 2m memory modules accessed in any one of m possible access modes. Data stored in memory modules are available to processors under a mapping rule that allows conflict-free data reads and writes for any given access mode. Scalable algorithms are presented for two well-known computational problems, namely, matrix multiplication and the fast Fourier transform (FFT). A complete analysis of the algorithms based on computational time and the access modes needed is also presented. The algorithms scale very well onto higher dimensional MDA architectures but are not always optimal. This reveals a tradeoff between the scalability of an algorithm and its optimality in the MDA computational model.
暂无评论