Biological sequence comparison is an important tool for researchers in molecular biology. there are several algorithms for sequence comparison. the Smith-Waterman algorithm, based on dynamic programming, is one of the...
详细信息
We present a generalized self-scattering method for generating carrier free flight times in Monte Carlo simulation. Compared to traditional approaches, the added flexibility of this approach results in fewer fictitiou...
We present a generalized self-scattering method for generating carrier free flight times in Monte Carlo simulation. Compared to traditional approaches, the added flexibility of this approach results in fewer fictitious scatterings, which is especially appealing for load balance and efficiency when a SIMD parallel computer is used. Speedups from 19% to 69% over an optimized variable-Gamma approach are shown for an implementation on the Connection Machine CM-2. the performance sensitivities to applied fields and grid spacings are also presented. the conversion of existing variable-Gamma software to this new approach requires only a few changes.
Graph algorithms on distributed-memory systems typically perform heavy communication, often limiting their scalability and performance. this work presents an approach to transparently (without programmer intervention)...
详细信息
ISBN:
(纸本)9781467395243
Graph algorithms on distributed-memory systems typically perform heavy communication, often limiting their scalability and performance. this work presents an approach to transparently (without programmer intervention) allow fine-grained graph algorithms to utilize algorithmic communication reduction optimizations. In many graph algorithms, the same information is communicated by a vertex to its neighbors, which we coin algorithmic redundancy. Our approach exploits algorithmic redundancy to reduce communication between vertices located on different processing elements. We employ algorithmaware coarsening of messages sent during vertex visitation, reducing boththe number of messages and the absolute amount of communication in the system. To achieve this, the system structure is represented by a hierarchical graph, facilitating communication optimizations that can take into consideration the machine's memory hierarchy. We also present an optimization for small-world scale-free graphs wherein hub vertices (i.e., vertices of very large degree) are represented in a similar hierarchical manner, which is exploited to increase parallelism and reduce communication. Finally, we present a framework that transparently allows fine-grained graph algorithms to utilize our hierarchical approach without programmer intervention, while improving scalability and performance. Experimental results of our proposed approach on 131, 000+ cores show improvements of up to a factor of 8 times over the non-hierarchical version for various graph mining and graph analytics algorithms.
Digital FIR filters can be efficiently implemented using distributed arithmetic (DA). Original DA provides low throughput. parallel DA is proven to be a promising technique for efficient DA implementation. Block-based...
详细信息
the solution of large-scale problems in Computational Science and Engineering relies on the availability of accurate, robust and efficient numerical algorithms and software that are able to exploit the power offered b...
详细信息
ISBN:
(纸本)9783642400476
the solution of large-scale problems in Computational Science and Engineering relies on the availability of accurate, robust and efficient numerical algorithms and software that are able to exploit the power offered by modern computer architectures. Such algorithms and software provide building blocks for prototyping and developing novel applications, and for improving existing ones, by relieving the developers from details concerning numerical methods as well as their implementation in new computing environments.
In many applications, matrix multiplication involves different shapes of matrices. the shape of the matrix can significantly impact the performance of matrix multiplication algorithm. this paper describes extensions o...
详细信息
ISBN:
(纸本)0769521525
In many applications, matrix multiplication involves different shapes of matrices. the shape of the matrix can significantly impact the performance of matrix multiplication algorithm. this paper describes extensions of the SRUMMA parallel matrix multiplication algorithm [1] to improve performance of transpose and rectangular matrices. Our approach relies on a set of hybrid algorithms which are chosen based on the shape of matrices and transpose operator involved. the algorithm exploits performance characteristics of clusters and shared memory systems: it differs from the other parallel matrix multiplication algorithms by the explicit use of shared memory and remote memory access (RMA) communication rather than message passing. the experimental results on clusters and shared memory systems demonstrate consistent performance advantages over pdgemm from the ScaLAPACK parallel linear algebra package.
We propose an improved version of the CGS method for the solutions of large and sparse linear systems of equations with unsymmetric coefficient matrices. the proposed method combines elements of numerical stability an...
详细信息
ISBN:
(纸本)0769515126
We propose an improved version of the CGS method for the solutions of large and sparse linear systems of equations with unsymmetric coefficient matrices. the proposed method combines elements of numerical stability and parallel algorithm design without increasing computational costs. the algorithm is derived such that all matrix-vector multiplication, inner products and vector updates of a single iteration step are independent and communication time required for inner product can be overlapped efficiently with computation time of vector updates. therefore, the cost of global communication which represents the bottleneck of the performance can be significantly reduced. In this paper, the Bulk Synchronous parallel (BSP) model is used to design a fully efficient, scalable and portable parallel proposed algorithm and to provide accurate performance prediction of the algorithm for a wide range of architectures including the Cray T3D, the Parsytec, and a cluster of workstations connected by an Ethernet. this performance model uses only a few system dependent parameters based on a simple and accurate cost modelling to provide useful insight in the time complexity of the method. the theoretical performance prediction are compared with some preliminary measured timing results of a numerical application from ocean flow simulation.
parallelprocessing and the methods to program, coordinate, and operate such computing systems and architectures have become a vast and highly pursued research area. these computing environments have quickly become a ...
详细信息
the computational algorithms for device synthesis and nondestructive evaluation (NDE) are often the same. In both we have a goal a particular field configuration yielding the design performance in synthesis or to matc...
详细信息
ISBN:
(纸本)9780735412125
the computational algorithms for device synthesis and nondestructive evaluation (NDE) are often the same. In both we have a goal a particular field configuration yielding the design performance in synthesis or to match exterior measurements in NDE. the geometry of the design or the postulated interior defect is then computed. Several optimization methods are available for this. the most efficient like conjugate gradients are very complex to program for the required derivative information the least efficient zeroth order algorithms like the genetic algorithm take much computational time but little programming effort. this paper reports launching a Genetic Algorithm kernel on thousands of compute unified device architecture (CUDA) threads exploiting the NVIDIA graphics processing unit (GPU) architecture. the efficiency of parallelization, although below that on shared memory supercomputer architectures, is quite effective in cutting down solution time into the realm of the practicable. We carry this further into multi-physics electro-heat problems where the parameters of description are in the electrical problem and the object function in the thermal problem. Indeed, this is where the derivative of the object function in the heat problem with respect to the parameters in the electrical problem is the most difficult to compute for gradient methods, and where the genetic algorithm is most easily implemented.
暂无评论