We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation is based on the recent new algorithm of Goemans and Williamson for this ...
详细信息
We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation is based on the recent new algorithm of Goemans and Williamson for this problem. However, our work aims for an efficient, practical formulation of the algorithm with close to optimal parallelization. Our theoretical analysis and an implementation on the Connection Machine CM5 show that our parallelization achieves linear speedup. We test our implementation on several large graphs (more than 9000 vertices), particularly on large instances of the Ising model.
Data-parallel computations with regular structure - fixed data size and predictable control patterns - can be implemented efficiently on SIMD architectures. However many large applications have irregular structure, ei...
详细信息
Data-parallel computations with regular structure - fixed data size and predictable control patterns - can be implemented efficiently on SIMD architectures. However many large applications have irregular structure, either data sets that vary in size as the computation progresses or control structures that select different subsets of the processors at each stage of the computation. In this paper we describe a stochastic biology simulation and some of the methods we used to improve its performance on the MasPar MP-1104. We present a simple model for evaluating the performance of a data parallel application and use the model to improve the performance of the simulator.
Wavelet transforms have proven to be useful tools for several applications, including signal analysis, signal coding, and image compression. this paper surveys the VLSI architecturesthat have been proposed for comput...
详细信息
Wavelet transforms have proven to be useful tools for several applications, including signal analysis, signal coding, and image compression. this paper surveys the VLSI architecturesthat have been proposed for computing the Discrete and Continuous Wavelet Transforms for 1-D and 2-D signals. the proposed architectures range from SIMD arrays to folded architectures such as systolic arrays and parallel filters. the SIMD arrays have a size that is proportional to that of the data sequence and are optimal with respect to time. the folded architectures, on the other hand, support single chip implementations and are optimal with respect to both area and time under the word-serial model.
A new parallel-in-parallel-out bit-level pipelined multiplier is presented to perform multiplication in GF(2m). this new multiplier uses m2basic cells where each cell has 2 2-input AND, 2 2-input XOR and 3 1-bit latch...
详细信息
A new parallel-in-parallel-out bit-level pipelined multiplier is presented to perform multiplication in GF(2m). this new multiplier uses m2basic cells where each cell has 2 2-input AND, 2 2-input XOR and 3 1-bit latches. the system latency of this multiplier is m+1 compared to 3m in previous architectures. the number of latches required per cell has also been reduced from 7 to 3. We also present a bit-level pipelined parallel-in-parallel-out squarer. this squarer has a system latency of [m/2] compared to 3m in previous designs and is 25% more hardware efficient. the critical paths in boththese proposed designs are the same as in existing designs.
this paper deals withthe problem of unsupervised classification of images modeled by Markov Random Fields (MRF). If the model parameters are known then we have various methods to solve the segmentation problem (simul...
详细信息
ISBN:
(纸本)0818670428
this paper deals withthe problem of unsupervised classification of images modeled by Markov Random Fields (MRF). If the model parameters are known then we have various methods to solve the segmentation problem (simulated annealing, ICM, etc...). However, when they are not known, the problem becomes more difficult. One has to estimate the hidden label field parameters from the only observable image. Our approach consists of extending a recent iterative method of estimation, called Iterative Conditional Estimation (ICE) to a hierarchical markovian model. the idea resembles the Estimation-Maximization (EM) algorithm as we recursively look at the Maximum a Posteriori (MAP) estimate of the label field given the estimated parameters then we look at the Maximum Likelihood (ML) estimate of the parameters given a tentative labeling obtained at the previous step. We propose unsupervised image classification algorithms using a hierarchical model. the only parameter supposed to be known is the number of regions, all the other parameters are estimated. the presented algorithms have been implemented on a Connection Machine CM200. Comparative tests have been done on noisy synthetic and real images (remote sensing).
the parallel implementation of segmented image coding methods is investigated. Various approaches are examined and the results are evaluated using three different segmentation techniques permitting also a comparison o...
详细信息
the parallel implementation of segmented image coding methods is investigated. Various approaches are examined and the results are evaluated using three different segmentation techniques permitting also a comparison of them. the results show which approach is more efficient and also provide an insight in the design of parallel VLSI chips for faster coding.
dbC is a data parallel extension to ANSI C similar to thinking Machines C* and MasPar MPL. To facilitate bit-oriented computation, dbC supports computation with arbitrary precision integer data, bit string extraction ...
详细信息
dbC is a data parallel extension to ANSI C similar to thinking Machines C* and MasPar MPL. To facilitate bit-oriented computation, dbC supports computation with arbitrary precision integer data, bit string extraction and insertion, and function parameters with dynamic bit length. In this paper, we describe dbC and its mapping to three very different architectures: (1) Terasys, an experimental SIMD machine which associates a single-bit processor with a column of SRAM memory;(2) Splash-2, a reconfigurable logic array of Xilinx Field Programmable Gate Arrays (FPGAs) that operates as a parallel co-processor;and (3) clusters of Condor-controlled workstations communicating via the MNFS distributed shared memory.
Jacobi method has been used on special-purpose multi-processor VLSI systems for parallel singular value decomposition (SVD) of dense matrices, and CORDIC processors are often used as the basic processing elements to i...
详细信息
Jacobi method has been used on special-purpose multi-processor VLSI systems for parallel singular value decomposition (SVD) of dense matrices, and CORDIC processors are often used as the basic processing elements to implement the two-sided rotations, the fundamental operations in the Jacobi method. Recently, generalizations of the original CORDIC algorithm to multi-dimensional spaces have been used in the SVD of complex matrices to achieve faster computation speed. A further speed-up of more than 2 can be gained by gradually refining the resolution of the CORDIC algorithms used in the Jacobi method.
this paper presents a new parallel algorithm for sparse matrix factorization. this algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of another recently introduced scheme by Gupta a...
详细信息
this paper presents a new parallel algorithm for sparse matrix factorization. this algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of another recently introduced scheme by Gupta and Kumar. Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtree-to-subcube mapping of the earlier formulation causes significant load imbalance among processors, limiting overall efficiency and speedup. the new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. this new algorithm achieves up to 20 GFlops on a 1024-processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.
A Computational Electromagnetics (CEM) effort for the time-domain Maxwell’s equations, employing some of the algorithmic rigors of Computational Fluid Dynamics (CFD), is described. Using this work as a base, the goal...
详细信息
暂无评论