Two issues are introduced into the scheduling problem: each of the sub-tasks can be performed using one of a number of parallel algorithms running on different configurations; and there is an overhead time to move dat...
详细信息
Two issues are introduced into the scheduling problem: each of the sub-tasks can be performed using one of a number of parallel algorithms running on different configurations; and there is an overhead time to move data (reconfiguration time) from the configuration required by one algorithm to the configuration required by the algorithm of the successor sub-task. The execution time for each algorithm and the reconfiguration times between all pairs of algorithms are provided along with the precedence graph. The problem is to find for each sub-task the algorithm to be executed such that the total system time, which is the sum of the execution times and the data movement times of all the selected algorithms, is minimized. The authors show that the problem is NP-hard even when the total degree, of each node of the precedence graph is at most three. They show that it can be solved in polynomial time if the total degree of each node is at most two, thus completely closing the gap between the polynomially solvable cases and the NP-hard cases in terms of the maximum degree of the precedence graph.< >
The authors examine the problem of performing a join involving nested relations in a parallel shared-everything environment. They show the difference between joining flat relations and joining nested relations, and th...
详细信息
The authors examine the problem of performing a join involving nested relations in a parallel shared-everything environment. They show the difference between joining flat relations and joining nested relations, and then develop hash-based parallel join algorithms. Both IO and CPU parallelism are addressed. Preliminary experimental results are presented.< >
An application-specific multiprocessor system is investigated for real-time implementation of the hierarchical block matching algorithm. The proposed architecture is based on parallel processing units and local memori...
详细信息
An application-specific multiprocessor system is investigated for real-time implementation of the hierarchical block matching algorithm. The proposed architecture is based on parallel processing units and local memories which are globally preloaded via a common bus. The performance is estimated for the data transfer and the parallel computation time schedule.< >
A method of implementing neural networks on programmable, parallel machines is presented. The method is applicable to multilayer connectionist networks and two dimensional, single-instruction multiple-data stream proc...
详细信息
A method of implementing neural networks on programmable, parallel machines is presented. The method is applicable to multilayer connectionist networks and two dimensional, single-instruction multiple-data stream processor arrays. A detailed description for a mapping of a multilayer perceptron with a back-propagation learning algorithm is provided. The mapping includes partitioning of inputs larger than the processor array. The performance of the method is evaluated using the Nettalk network, and is compared to that of other methods. In particular, it is shown that the implementation of the method on the Hughes Systolic/Cellular machine results in a processing rate equal to 100 million connections per second (MCPS).< >
A unified algebraic basis for transforming algorithms to achieve unlimited parallelism is provided. In the case of recursive algorithms, a certain class of algebraic structures is shown to be sufficient for the applic...
详细信息
A unified algebraic basis for transforming algorithms to achieve unlimited parallelism is provided. In the case of recursive algorithms, a certain class of algebraic structures is shown to be sufficient for the application of the look-ahead computation. This approach is generalized to matrix-based algorithms where the operations form a semiring. Previously known approaches to speeding up recursions are shown to be special instances of the given approach. A block processing realization corresponding to the logarithmic look-ahead computation is given which is efficient in its area and time requirements.< >
The authors show how the data parallelprogramming model can be applied to high level vision tasks needed for object recognition. The architecture and programming model of the Connection Machine System are reviewed. U...
详细信息
The authors show how the data parallelprogramming model can be applied to high level vision tasks needed for object recognition. The architecture and programming model of the Connection Machine System are reviewed. Utilities for representing and manipulating sets of data, the primary representation outside of the image plane, are described using communications primitives, especially segmented scans. Several algorithms for matching and evidence accumulation, which are constructed from the utilities, are compared. The techniques emphasize the use of sorting and sparse representations of space in order to limit the combinatorial processing requirements of high-level vision.< >
The QR algorithm has proven to be the most successful sequential algorithm for finding all eigenvalues of a dense nonsymmetric matrix. Recently, several parallel implementations of the QR algorithm have been proposed....
详细信息
The QR algorithm has proven to be the most successful sequential algorithm for finding all eigenvalues of a dense nonsymmetric matrix. Recently, several parallel implementations of the QR algorithm have been proposed. These implementations have all been based on the column-wrapped storage scheme. While column-wrapped storage has been used to obtain asymptotically 100% efficiency for many algorithms for solving linear systems, eigenvalue algorithms using such storage have achieved, at best, asymptotically constant efficiency. The authors present a parallel implementation of the QR algorithm that makes use of an alternate storage scheme that allows them to achieve asymptotically 100% efficiency. They also present a technique, which they call bundling, that further improves the observed efficiency of the parallel implementation.< >
A method for parallelizing the image reconstruction of a fan beam computed tomography (CT) scanner and a processor array adapted for computing the back-projection reconstruction algorithm (BPR) are presented. By makin...
详细信息
A method for parallelizing the image reconstruction of a fan beam computed tomography (CT) scanner and a processor array adapted for computing the back-projection reconstruction algorithm (BPR) are presented. By making use of some inherent parallelism in PBR, a modular ring architecture in multiple of eight processors is shown to be well suited for a class of CT algorithms. The parallel architecture encompasses a reconfigurable binary tree to accommodate a wide range of processor interconnections needed for different image-processing schemes, ranging from front-end filtering, calibration, and fast Fourier transform to postprocessing such as artifact removal and image compression.< >
There are efficient sequential algorithms that use linear programming (LP) for computing maximum weight matchings. Finding a deterministic parallel algorithm for computing maximum weight matchings in complete graphs h...
详细信息
There are efficient sequential algorithms that use linear programming (LP) for computing maximum weight matchings. Finding a deterministic parallel algorithm for computing maximum weight matchings in complete graphs has been an open problem for some time. Since LP is known to be P-complete, then, by the parallel computation thesis, it is unlikely that there exists an NC algorithm that uses LP to solve the maximum weight matching problem. The authors present an LP-based parallel algorithm for maximum weight matching in a complete weighted graph. The algorithm is designed for the EREW PRAM model of parallel computation, and runs in O(n/sup 3//p+n/sup 2/logn) time for p >
The complexity of most machine learning techniques can be improved by transforming iterative components into their parallel equivalent. The parallel architecture of the Connection Machine provides a platform for the i...
详细信息
The complexity of most machine learning techniques can be improved by transforming iterative components into their parallel equivalent. The parallel architecture of the Connection Machine provides a platform for the implementation and evaluation of parallel learning techniques. The architecture of the Connection Machine is described along with limitations of the language interface that constrain the implementation of learning programs. Connection Machine implementations of two learning programs, perceptron and AQ, are described, and their computational complexity is compared to that of the corresponding sequential versions using actual runs on the Connection Machine. Techniques for parallelizing ID3 are also analyzed, and the advantages and disadvantages of parallel implementation on the Connection Machine are discussed in the context of machine learning.< >
暂无评论