We describe a parallel, adaptive, multiblock algorithm for explicit integration of time dependent partial differential equations on two-dimensional Cartesian grids. The grid layout we consider consists of a nested hie...
In this paper we present the design and development of a parallel Firefly meta-heuristic algorithm for option pricing. We study the performance of the parallel algorithm both theoretically and experimentally. Our implementation exhibits significant speedup for the financial option pricing problem and demonstrates the utility of the parallel algorithm even when a large number of fireflies is deployed. We also present a detailed analysis of the theoretical runtime cost of the firefly algorithm on both the RAM and P-RAM models of computation. Moreover, we identify certain issues in the algorithm regarding the global memory access pattern, which could be studied for further improvement.
The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run efficiently on a GPU. The bulk execution supports fine grained bitwise parallelism, allowing it to achieve high acceleration over a straightforward sequential computation. The main contribution of this work is to present a Bitwise parallel Bulk Computation (BPBC) to accelerate the Smith-Waterman Algorithm (SWA). More precisely, the dynamic programming for the SWA repeatedly performs the same computation O(mn) times. Thus, our idea is to convert this computation into a circuit simulation using the BPBC technique to compute multiple instances simultaneously. The proposed BPBC technique for the SWA has been implemented on the GPU and CPU. Experimental results show that the proposed BPBC for SWA accelerates the computation by over 447 times as compared to a single CPU implementation.
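For orientation, the dynamic program that the abstract refers to can be sketched sequentially as below. This is a plain reference version of Smith-Waterman local alignment scoring, not the BPBC circuit-simulation technique of the paper. The function name and the scoring parameters (`match`, `mismatch`, and a linear `gap` penalty) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Illustrative scoring scheme (an assumption, not taken from the paper):
    +match for equal characters, +mismatch otherwise, +gap per gap symbol.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score of aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The same cell update is applied to all len(a) * len(b) cells;
            # the 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because every cell performs the same fixed sequence of operations regardless of the input, the computation is oblivious in the sense used above, which is what makes it a candidate for bulk execution over many input pairs.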
ISBN: 9781509041183 (print)
We consider the problem of nonnegative tensor factorization. Our aim is to derive an efficient algorithm that is also suitable for parallel implementation. We adopt the alternating optimization (AO) framework and solve each matrix nonnegative least-squares problem via a Nesterov-type algorithm for strongly convex problems. We describe a parallel implementation of the algorithm and measure the speedup attained by its Message Passing Interface implementation on a parallel computing environment. It turns out that the attained speedup is significant, rendering our algorithm a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.
Random graphs (or networks) have attracted significantly increased interest due to their popularity in modeling and simulating many complex real-world systems. The degree sequence is one of the most important aspects of these systems. Random graphs with a given degree sequence can capture many characteristics, such as dependent edges and non-binomial degree distributions, that are absent in many classical random graph models such as the Erdős-Rényi graph model. In addition, they have important applications in uniform sampling of random graphs, counting the number of graphs having the same degree sequence, as well as in string theory, random matrix theory, and matching theory. In this paper, we present an OpenMP-based shared-memory parallel algorithm for generating a random graph with a prescribed degree sequence, which achieves a speedup of 20.4 with 32 cores. We also present a comparative study of several structural properties of the random graphs generated by our algorithm with those of real-world graphs and of random graphs generated by other popular methods. One of the steps in our parallel algorithm requires checking the Erdős-Gallai characterization, i.e., whether there exists a graph obeying the given degree sequence, in parallel. This paper presents a non-trivial parallel algorithm for checking the Erdős-Gallai characterization, which achieves a speedup of 23 with 32 cores.
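As a point of reference, the Erdős-Gallai characterization mentioned above can be checked sequentially with the short sketch below. It is a straightforward quadratic-time version written for clarity, not the parallel algorithm of the paper, and the function name `is_graphical` is an assumption. A sequence d_1 >= ... >= d_n of nonnegative integers is the degree sequence of a simple graph exactly when its sum is even and, for every k, the sum of the k largest degrees is at most k(k-1) plus the sum of min(d_i, k) over the remaining entries.

```python
def is_graphical(degrees):
    """Check the Erdős-Gallai conditions for a degree sequence.

    Sequential O(n^2) reference sketch; not the paper's parallel algorithm.
    """
    d = sorted(degrees, reverse=True)
    if any(x < 0 for x in d) or sum(d) % 2 != 0:
        return False
    prefix = 0  # sum of the k largest degrees
    for k in range(1, len(d) + 1):
        prefix += d[k - 1]
        # Edges the top-k vertices can receive from the remaining vertices.
        tail = sum(min(x, k) for x in d[k:])
        if prefix > k * (k - 1) + tail:
            return False
    return True
```

For example, `[3, 3, 3, 3]` (the complete graph on four vertices) passes the check, while `[3, 3, 1, 1]` fails it at k = 2.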
A suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values (ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(n/p + p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome given its suffix and LCP arrays in under 2 seconds on 1024 Intel Xeon cores.
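To make the ANSV problem concrete, the sketch below solves its standard sequential form with a stack: for each position it reports the index of the nearest strictly smaller value to the left and to the right. It is neither the generalized formulation nor the distributed-memory algorithm described above, and the function names are assumptions.

```python
def nearest_smaller(values, from_right=False):
    """For each index, return the index of the nearest strictly smaller
    value on the left (default) or on the right, or None if there is none."""
    n = len(values)
    order = range(n - 1, -1, -1) if from_right else range(n)
    result = [None] * n
    stack = []  # indices whose values are strictly increasing
    for i in order:
        # Discard candidates that are not strictly smaller than values[i].
        while stack and values[stack[-1]] >= values[i]:
            stack.pop()
        result[i] = stack[-1] if stack else None
        stack.append(i)
    return result


def all_nearest_smaller_values(values):
    """Return (left, right) index lists for the sequential ANSV problem."""
    return nearest_smaller(values), nearest_smaller(values, from_right=True)
```

Each index is pushed and popped at most once per pass, so the sequential cost is linear in the input length.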
This paper generalizes the parallel selected inversion algorithm called PSelInv to sparse non-symmetric matrices. We assume a general sparse matrix A has been decomposed as PAQ = LU on a distributed memory parallel m...
ISBN: 9781509045839 (print)
We develop a parallel algorithm based on a proximal method to solve the problem of minimizing a sum of convex (not necessarily smooth) functions over a star network. We show that this method converges to an optimal solution for any choice of constant stepsize for convex objective functions. Under the further assumptions of Lipschitz gradients and strong convexity of the objective functions, the method converges linearly.
Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone; it must be modeled explicitly. This paper analyzes the energy consumption of parallel algorithms executed on a model of shared memory multicore processors. Specifically, we develop a methodology to evaluate how the energy consumption of a given parallel algorithm changes as the number of cores and their frequency are varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement, and the optimal number of cores to maximize the performance of a parallel algorithm for a specific problem size under a given energy budget. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step versus the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected for a wide range of these parameters. (C) 2011 Elsevier Inc. All rights reserved.
The availability of heterogeneous CPU+GPU systems has opened the door to new opportunities for the development of parallel solutions to tackle complex biological problems. The reconstruction of evolutionary histories among species represents a grand computational challenge, which can be addressed by exploiting this kind of hardware design. In this research, we study the application of heterogeneous computing with OpenCL to accelerate one of the most well-known objective functions for inferring phylogenies, the phylogenetic parsimony function. For this purpose, we undertake the design of CPU and GPU kernel implementations of this function, proposing a heterogeneous CPU+GPU multidevice approach that distributes multiple parsimony evaluations among processing devices. Experiments on 6 real nucleotide data sets and comparisons with other parallel implementations demonstrate the benefits of the proposal, obtaining significant parallel results by combining CPU and GPU capabilities in accordance with the characteristics of the input data.
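A parsimony score of the kind discussed above is commonly computed with Fitch's algorithm, sketched below for a single nucleotide site on a rooted binary tree. The abstract does not state which formulation the authors' kernels use, so this is an assumption, as are the nested-tuple tree encoding and the names `STATE_BITS` and `fitch`.

```python
# One bit per nucleotide, so a set of candidate states is a small bitmask.
STATE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8}


def fitch(node):
    """Return (state_set_bitmask, parsimony_cost) for one site.

    A leaf is a nucleotide character; an internal node is a (left, right)
    tuple. Illustrative encoding only, not taken from the paper.
    """
    if isinstance(node, str):
        return STATE_BITS[node], 0
    left_set, left_cost = fitch(node[0])
    right_set, right_cost = fitch(node[1])
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra mutation.
        return common, left_cost + right_cost
    # Disjoint state sets: take the union and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1
```

For instance, `fitch((("A", "A"), ("C", "C")))[1]` evaluates to 1, since a single change explains the site. Scoring a full alignment repeats this per site, and evaluating many candidate trees repeats it per tree, which is the kind of independent work that can be distributed among devices.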