Parallel computing of the transient radiative transfer process in participating media is studied with an integral equation model. Two numerical quadratures are used: the discrete rectangular volume (DRV) method and the YIX method. Parallel versions of the two methods are developed for one-dimensional and three-dimensional geometries, respectively. Both quadratures achieve good parallel speedup. Because the integral equation model uses a very small amount of memory, each processor can store the full spatial domain information, which avoids the typical domain decomposition parallelism that other solution methods, such as the discrete ordinates and finite volume methods, require for large-scale simulations. The parallel computation is conducted by assigning a different portion of the quadrature to each compute node. In the DRV method, a variation of spatial domain decomposition is used. In the YIX scheme, the angular quadrature is divided up according to the number of compute nodes. These parallel schemes minimize the communication overhead. Two new discrete ordinate sets are used in the YIX angular quadrature, and their parallel performance is discussed. One of the discrete ordinate sets, called a spherical ring set, is also suitable for use in the conventional discrete ordinates method.
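The idea of giving each compute node its own portion of a quadrature can be illustrated with a minimal Python sketch. It is not the DRV or YIX method: it assumes a plain midpoint rule on an interval, simulates the compute nodes with a loop, and every name in it (`split_indices`, `partial_quadrature`, `midpoint_rule`) is hypothetical.

```python
import math


def split_indices(n_items, n_workers):
    """Return contiguous (start, stop) ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for rank in range(n_workers):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def midpoint_rule(a, b, n):
    """Nodes and weights of the n-point midpoint rule on [a, b] (illustrative quadrature)."""
    h = (b - a) / n
    nodes = [a + (i + 0.5) * h for i in range(n)]
    weights = [h] * n
    return nodes, weights


def partial_quadrature(f, nodes, weights, start, stop):
    """Work done by one (simulated) compute node: its slice of the weighted sum."""
    return sum(weights[i] * f(nodes[i]) for i in range(start, stop))


# Every simulated node sees the full node and weight lists (cheap to replicate),
# but evaluates only its own slice of the quadrature.
nodes, weights = midpoint_rule(0.0, math.pi, 1000)
ranges = split_indices(len(nodes), 8)
partials = [partial_quadrature(math.sin, nodes, weights, s, e) for s, e in ranges]

# The only "communication" is the final reduction of one number per node.
total = sum(partials)
print(total)
```

In this sketch the per-node results are combined by a single sum, which is why the communication cost stays small compared with the work of evaluating the integrand.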
Tandem repeats are ubiquitous sequence features in both prokaryotic and eukaryotic genomes. They are known to cause several inherited neurological diseases in humans. Identifying these patterns is a highly computation-intensive process. Previous parallel implementations use straightforward domain decomposition based on existing sequential algorithms and rely on parallel machines with low-latency interconnection networks and fast hardware support for processor synchronization. Our research exploits the superior cost effectiveness and flexibility of low-cost clusters to speed up biological computations by designing communication-efficient parallel algorithms for pattern identification. This paper presents a low communication-overhead parallel algorithm for pattern identification in biological sequences. Given a biological sequence of length n and a pattern of length m, we present an algorithm with five computation/communication phases, each requiring O(n) computation time and only O(p) message units. The low communication overhead of the algorithm is essential for achieving reasonable speedups on clusters, where the inter-processor communication latency is usually higher.
In this paper we consider a method for finding several eigenvalues and corresponding eigenvectors of large-scale generalized eigenvalue problems. In this method, a small matrix pencil that has only the desired eigenva...
In a two- or three-dimensional image array, the computation of the Euclidean distance transform (EDT) is an important task. With the increasing application of 3D voxel images, it is useful to consider the distance transform of a 3D digital image array. Because the EDT computation is a global operation, it is prohibitively time consuming in image processing. Parallelism is employed to make the transform computation efficient. In this paper, we first derive several important geometric relations and properties among parallel planes. We then develop a parallel algorithm for the three-dimensional Euclidean distance transform (3D-EDT) on the EREW PRAM computation model. The time complexity of our parallel algorithm is O(log^2 N) for an N x N x N image array, and this is currently the best known result. A generalized parallel algorithm for the 3D-EDT is also proposed. We implement the proposed algorithms sequentially, and their performance exceeds that of the existing algorithms (proposed by Yamada, Toriwaki). Finally, we develop the corresponding parallel programs on both an emulated EREW PRAM model computer and the IBM SP2 to verify the speedup properties of the proposed algorithms.
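To make the global nature of the transform concrete, the following Python sketch computes a 3D EDT by brute force. It is only a reference definition for tiny arrays, not the parallel algorithm of the abstract, and it assumes the convention that each voxel receives the distance to the nearest feature voxel (value 1); the function name `brute_force_edt_3d` is hypothetical.

```python
import math
from itertools import product


def brute_force_edt_3d(image):
    """Distance from every voxel to its nearest feature voxel.

    `image[z][y][x]` is 1 for a feature voxel and 0 for background
    (an assumed convention). Every voxel is compared against every
    feature voxel, so the cost grows quickly with the array size.
    """
    nz, ny, nx = len(image), len(image[0]), len(image[0][0])
    voxels = list(product(range(nz), range(ny), range(nx)))
    features = [(z, y, x) for z, y, x in voxels if image[z][y][x]]
    if not features:
        raise ValueError("image contains no feature voxel")

    dist = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z, y, x in voxels:
        # Each output value depends on the whole image: a global operation.
        squared = min(
            (z - fz) ** 2 + (y - fy) ** 2 + (x - fx) ** 2
            for fz, fy, fx in features
        )
        dist[z][y][x] = math.sqrt(squared)
    return dist


# A 3 x 3 x 3 image with a single feature voxel in one corner.
image = [[[0] * 3 for _ in range(3)] for _ in range(3)]
image[0][0][0] = 1
edt = brute_force_edt_3d(image)
print(edt[2][2][2])  # distance from the opposite corner
```

A brute-force version like this is mainly useful as a correctness check for faster implementations on small inputs.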
Motivated by the structure of a matrix factorization introduced recently by Evans (1999), we introduce a new WZ factorization for use with the partition method for the parallel solution of tridiagonal systems. The factorization helps to uncouple the partitioned subsystems so that they can be solved in parallel. A crucial question for the validity of the partition method is the existence and stability of the whole solution across the partitioning blocks. We show that if the given system is nonsingular and diagonally dominant, then within each block the WZ factorization exists and is (numerically) strongly stable, and the solution across the partitioning blocks exists (does not terminate prematurely).
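The setting can be illustrated with a small Python sketch that checks diagonal dominance and solves a tridiagonal system sequentially. It uses the standard Thomas algorithm as a baseline, not the WZ factorization or the partition method of the abstract, and it assumes strict row diagonal dominance; the function names and the storage layout (`lower`, `diag`, `upper` as separate lists) are hypothetical choices.

```python
def is_strictly_diagonally_dominant(lower, diag, upper):
    """True if |diag[i]| exceeds the sum of the off-diagonal magnitudes in every row.

    `lower` and `upper` have length n - 1; `lower[i - 1]` and `upper[i]`
    are the sub- and super-diagonal entries of row i.
    """
    n = len(diag)
    for i in range(n):
        off = (abs(lower[i - 1]) if i > 0 else 0.0) + (abs(upper[i]) if i < n - 1 else 0.0)
        if abs(diag[i]) <= off:
            return False
    return True


def thomas_solve(lower, diag, upper, rhs):
    """Sequential Thomas algorithm (a baseline, not the WZ factorization)."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# 4 x 4 system with diagonal 4 and off-diagonals -1; the exact solution is [1, 2, 3, 4].
lower = [-1.0, -1.0, -1.0]
diag = [4.0, 4.0, 4.0, 4.0]
upper = [-1.0, -1.0, -1.0]
rhs = [2.0, 4.0, 6.0, 13.0]

dominant = is_strictly_diagonally_dominant(lower, diag, upper)
solution = thomas_solve(lower, diag, upper, rhs)
print(dominant, solution)
```

The forward sweep in `thomas_solve` is inherently sequential, which is the kind of coupling that a partition method has to break up before the blocks can be processed in parallel.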
With recent DNA-microarray technology, it is possible to measure the expression levels of thousands of genes simultaneously in the same experiment. A genetic network is a model that describes how the expression level of each gene is affected by the expression levels of other genes in the network. In this paper we explore the use of parallel computers to infer genetic network architectures in gene expression analysis. Given the results of an experiment with n genes and m measures over time (m much less than n), we consider the problem of finding a subset of genes (k genes, where k is much less than n) that explains the expression level of a given target gene under study. We consider the coarse-grained multicomputer (CGM) model with p processors. We first present a sequential approximation algorithm of O(m^4 n) time and O(m^2 n) space. The main result is a new parallel approximation algorithm that determines the k genes in O(m^4 n/p) local computing time plus O(k) communication rounds, with a space requirement of O(m^2 n/p). The p factor in the parallel time and space complexities indicates a good parallelization. To our knowledge there are no CGM algorithms for the problem considered in this paper. We also show experimental results on a Beowulf machine, where we observe very promising speedups, especially in the cases where the number of genes studied exceeds 4000. Notice that even with current microarray technology, microchips with around 15,000 spots are already possible. The proposed parallel method thus constitutes an excellent example of the application of high-performance computing in this important field.
An efficient parallel numerical method is proposed for an integro-differential equation with positive memory. Instead of solving the equation with classical time-marching methods, which require massive storage of the solutions at previous time steps in order to advance to the next time step, the Fourier-Laplace transformation in time is applied to obtain a set of complex-valued elliptic problems parameterized by points on a contour in the complex plane. Because the elliptic problem corresponding to one contour point is independent of those corresponding to other contour points, all elliptic problems can be solved in parallel essentially without data communication. The time domain solution can then be obtained by the Fourier-Laplace inversion formula. An error analysis and the numerical implementation of this parallel method are presented. (C) 2003 Elsevier B.V. All rights reserved.
In this paper we present a second order PVT (parallel variable transformation) algorithm converging to second order stationary points for minimizing smooth functions, based on the first order PVT algorithm due to Fukushima (1998). The corresponding stopping criterion, descent condition and descent step for the second order PVT algorithm are given.
Parallel merge sort is useful for sorting a large quantity of data progressively. The merge sort should be parallelized carefully, since the conventional algorithm performs poorly: the number of participating processors is halved at each successive stage, down to one in the last merging stage. The proposed load-balanced merge sort utilizes all processors throughout the computation. It evenly distributes data to all processors in each stage, so every processor is forced to work in all phases. A significant performance enhancement is achieved, with a speedup of up to (P - 1)/log P, where P is the number of processors. Experimental results demonstrate a speedup of 9.6 (upper bound of 10.7) on a 32-processor Cray T3E when sorting 4M 32-bit integers, and a speedup of 2.3 (upper bound of 2.8) on an 8-node PC cluster.
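The weakness of the conventional scheme can be seen in a short Python sketch that simulates tree-style pairwise merging and records how many mergers are busy at each stage. It models the conventional algorithm only, not the proposed load-balanced merge sort; each merge stands in for one active processor, and the function name `tree_merge` is hypothetical.

```python
import heapq
import random


def tree_merge(runs):
    """Pairwise-merge sorted runs until one remains.

    Returns the fully sorted list and the number of merges performed at
    each stage, where each merge stands in for one busy processor.
    Assumes `runs` is a non-empty list of sorted lists.
    """
    active_per_stage = []
    while len(runs) > 1:
        merged = [
            list(heapq.merge(runs[i], runs[i + 1]))
            for i in range(0, len(runs) - 1, 2)
        ]
        if len(runs) % 2:  # an unpaired run is carried over unchanged
            merged.append(runs[-1])
        active_per_stage.append(len(runs) // 2)
        runs = merged
    return runs[0], active_per_stage


# Eight simulated processors, each holding a locally sorted run of 8 keys.
rng = random.Random(0)
data = [rng.randrange(1000) for _ in range(64)]
runs = [sorted(data[i:i + 8]) for i in range(0, 64, 8)]

result, active = tree_merge(runs)
print(active)
```

The recorded counts shrink by half at every stage, so most simulated processors sit idle during the later and larger merges, which is the imbalance the abstract describes.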
In discrete optimization, most exact solution approaches are based on branch and bound, which is conceptually easy to parallelize in its simplest forms. More sophisticated variants, such as the so-called branch, cut, and price algorithms, are more difficult to parallelize because of the need to share large amounts of knowledge discovered during the search process. In the first part of the paper, we survey the issues involved in parallelizing such algorithms. We then review the implementation of SYMPHONY and COIN/BCP, two existing frameworks for implementing parallel branch, cut, and price. These frameworks have limited scalability, but are effective on small numbers of processors. Finally, we briefly describe our next-generation framework, which improves scalability and further abstracts many of the notions inherent in parallel BCP, making it possible to implement and parallelize more general classes of algorithms.