k-core decomposition is a social network analytic that can be applied to centrality analysis, visualization, and community detection. A simple and well-known linear-time sequential algorithm exists for k-core decomposition, and for this reason it is widely used as a network analytic. In this work, we present a new shared-memory parallel algorithm called PKC for k-core decomposition on multicore platforms. This approach improves on state-of-the-art implementations of k-core decomposition algorithms by reducing synchronization overhead and by creating a smaller graph on which to process high-degree vertices. We show that PKC consistently outperforms implementations of other methods on a 32-core multicore server and on a collection of large sparse graphs. We achieve a 2.81× speedup (geometric mean of speedups over 17 large graphs) compared with our implementation of the ParK k-core decomposition algorithm.
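For context, the simple linear-time sequential algorithm referred to above is the classic peeling procedure: repeatedly remove a vertex of minimum remaining degree and record that degree as its core number. A minimal sequential Python sketch is given below; it illustrates only the baseline technique, not the PKC or ParK implementations, and the adjacency-dict input format is an assumption made here for brevity.

from collections import defaultdict

def core_numbers(adj):
    """Sequential k-core decomposition by peeling.

    adj: dict mapping each vertex to the set of its neighbours (simple,
    undirected graph). Returns a dict mapping each vertex to its core number.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    buckets = defaultdict(set)          # vertices bucketed by current degree
    for v, d in degree.items():
        buckets[d].add(v)

    core, removed, k = {}, set(), 0
    for _ in range(len(adj)):
        while not buckets[k]:           # k only ever increases: no remaining
            k += 1                      # vertex can drop below the current level
        v = buckets[k].pop()
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            d = degree[u]
            if u not in removed and d > k:
                buckets[d].discard(u)   # move u down one bucket
                degree[u] = d - 1
                buckets[d - 1].add(u)
    return core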
We describe a parallel, adaptive, multiblock algorithm for explicit integration of time-dependent partial differential equations on two-dimensional Cartesian grids. The grid layout we consider consists of a nested hie...
Random graphs (or networks) have attracted significant interest because of their popularity in modeling and simulating many complex real-world systems. The degree sequence is one of the most important aspects of these systems. Random graphs with a given degree sequence can capture characteristics, such as dependent edges and non-binomial degree distributions, that are absent from classical random graph models such as the Erdős–Rényi model. In addition, they have important applications in uniform sampling of random graphs and in counting the number of graphs having the same degree sequence, as well as in string theory, random matrix theory, and matching theory. In this paper, we present an OpenMP-based shared-memory parallel algorithm for generating a random graph with a prescribed degree sequence, which achieves a speedup of 20.4 with 32 cores. We also present a comparative study of several structural properties of the random graphs generated by our algorithm against those of real-world graphs and random graphs generated by other popular methods. One of the steps in our parallel algorithm requires checking the Erdős–Gallai characterization, i.e., whether there exists a graph obeying the given degree sequence, in parallel. This paper presents a non-trivial parallel algorithm for checking the Erdős–Gallai characterization, which achieves a speedup of 23 with 32 cores.
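For readers unfamiliar with the Erdős–Gallai characterization mentioned above: a degree sequence d_1 >= ... >= d_n is realizable by a simple graph exactly when its sum is even and, for every k, the sum of the k largest degrees is at most k(k-1) + sum over i > k of min(d_i, k). The sequential Python sketch below checks this condition directly; it is only a reference implementation of the characterization, not the authors' parallel algorithm, and it uses a quadratic scan for clarity.

def is_graphical(degrees):
    """Erdős–Gallai check: can this degree sequence be realized by a simple graph?"""
    d = sorted(degrees, reverse=True)
    n = len(d)
    if n == 0:
        return True
    if d[-1] < 0 or sum(d) % 2 != 0:
        return False                     # negative degree or odd degree sum

    prefix = [0] * (n + 1)               # prefix[k] = sum of the k largest degrees
    for i, x in enumerate(d):
        prefix[i + 1] = prefix[i] + x

    for k in range(1, n + 1):
        # Right-hand side: k*(k-1) plus sum over remaining degrees of min(d_i, k).
        tail = sum(min(x, k) for x in d[k:])
        if prefix[k] > k * (k - 1) + tail:
            return False
    return True

# Example: [3, 3, 2, 2, 2] is graphical, [3, 3, 3, 1] is not.
print(is_graphical([3, 3, 2, 2, 2]), is_graphical([3, 3, 3, 1]))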
This paper generalizes the parallel selected inversion algorithm called PSelInv to sparse non-symmetric matrices. We assume a general sparse matrix A has been decomposed as PAQ = LU on a distributed memory parallel m...
The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run efficiently on a GPU. Bulk execution supports fine-grained bitwise parallelism, allowing it to achieve high acceleration over a straightforward sequential computation. The main contribution of this work is to present a Bitwise Parallel Bulk Computation (BPBC) technique to accelerate the Smith-Waterman Algorithm (SWA). More precisely, the dynamic programming for the SWA repeatedly performs the same computation O(mn) times. Our idea is therefore to convert this computation into a circuit simulation using the BPBC technique so that multiple instances are computed simultaneously. The proposed BPBC technique for the SWA has been implemented on both the GPU and the CPU. Experimental results show that the proposed BPBC for the SWA accelerates the computation by a factor of more than 447 compared with a single-CPU implementation.
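As a reference for the O(mn) computation mentioned above, the Smith-Waterman dynamic program fills an (m+1)×(n+1) score table with the same max-of-four recurrence in every cell. The plain sequential Python sketch below uses a linear gap penalty and illustrative scoring values; it shows the baseline recurrence only, not the bitwise-parallel GPU code.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between strings a and b (O(mn) sequential DP)."""
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                        # start a new alignment
                          H[i - 1][j - 1] + s,      # extend by (mis)match
                          H[i - 1][j] + gap,        # gap in b
                          H[i][j - 1] + gap)        # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))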
A suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values (ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem and present a distributed-memory parallel algorithm for solving it in O(n/p + p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome, given its suffix and LCP arrays, in under 2 seconds on 1024 Intel Xeon cores.
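For reference, the sequential form of the All-Nearest-Smaller-Values (ANSV) step, which finds for every entry of an array the nearest strictly smaller entry on each side, is a single stack-based pass per direction. The Python sketch below shows the left-to-right pass only; it is not the authors' distributed O(n/p + p) algorithm.

def nearest_smaller_to_left(values):
    """For each index i, return the index of the nearest element to the left of i
    that is strictly smaller than values[i], or -1 if no such element exists.
    (The right-side pass is the mirror image, scanning from the end.)"""
    left = [-1] * len(values)
    stack = []                                   # indices with increasing values
    for i, v in enumerate(values):
        while stack and values[stack[-1]] >= v:  # pop everything not smaller than v
            stack.pop()
        left[i] = stack[-1] if stack else -1
        stack.append(i)
    return left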
Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone; it must be modeled explicitly. This paper analyzes the energy consumption of parallel algorithms executed on a model of shared-memory multicore processors. Specifically, we develop a methodology to evaluate how the energy consumption of a given parallel algorithm changes as the number of cores and their frequency are varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement, and the optimal number of cores to maximize the performance of a parallel algorithm for a specific problem size under a given energy budget. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step to the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected over a wide range of these parameters. (C) 2011 Elsevier Inc. All rights reserved.
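To make this kind of trade-off concrete, the toy Python sketch below searches for the core count that minimizes energy under a deadline, using an illustrative model with perfectly parallel work, a per-core synchronization overhead, dynamic power cubic in the frequency, and a constant static power. The model form and all parameter names here are assumptions made for illustration only, not the paper's model.

def best_core_count(work, deadline, max_cores, freq=2.0e9,
                    dyn_coeff=1.0e-27, static_power=5.0, sync_overhead=1.0):
    """Return (cores, energy in joules) minimizing energy under the deadline,
    or None if no core count meets it.  Toy model, not the paper's."""
    best = None
    for p in range(1, max_cores + 1):
        time = work / (p * freq) + sync_overhead * p      # seconds
        if time > deadline:
            continue                                      # misses the deadline
        power = p * dyn_coeff * freq ** 3 + static_power  # watts
        energy = power * time                             # joules
        if best is None or energy < best[1]:
            best = (p, energy)
    return best

# Example: 1e12 operations, 600 s deadline, up to 64 cores.
print(best_core_count(1.0e12, 600.0, 64))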
In order to solve a differential problem, the Laplace Transform method, when applicable, replaces the problem with a simpler one; the solution is obtained by solving the new problem and then computing the inverse Laplace Transform of the resulting function. In a numerical context, the solution of the transformed problem consists of a sequence of Laplace Transform samples, so most software for numerical inversion cannot be used, since such software requires the transform to be passed, among its parameters, as a function. To fill this gap, we present Talbot Suite DE, a C software collection for Laplace Transform inversion, specifically designed for these problems and based on Talbot's method. It contains both sequential and parallel implementations; the latter is accomplished by means of OpenMP. We also report some performance results. Aimed at non-expert users, the software is equipped with several examples and a User Guide that includes the external documentation, explains how to use all the sample code, and reports results on accuracy and efficiency. Some examples are entirely in C and others combine different programming languages (C/MATLAB, C/FORTRAN). The User Guide also contains useful hints for avoiding possible errors issued during the compilation or execution of mixed-language code.
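For readers unfamiliar with Talbot's method, it evaluates the Bromwich inversion integral on a deformed contour that wraps around the negative real axis, so that a short trapezoidal sum of transform samples converges rapidly. The Python sketch below implements the fixed-Talbot variant of Abate and Valkó as a minimal sequential illustration; it is not code from Talbot Suite DE.

import cmath
import math

def talbot_invert(F, t, M=32):
    """Invert the Laplace transform F(s) at a single time t > 0
    using the fixed-Talbot method with M contour points."""
    r = 2.0 * M / (5.0 * t)                       # contour scale parameter
    total = 0.5 * cmath.exp(r * t) * F(r)         # endpoint term (theta = 0)
    for k in range(1, M):
        theta = k * math.pi / M
        cot = math.cos(theta) / math.sin(theta)
        s = r * theta * (cot + 1j)                # point on the Talbot contour
        sigma = theta + (theta * cot - 1.0) * cot
        total += (cmath.exp(t * s) * F(s) * (1.0 + 1j * sigma)).real
    return (r / M) * total.real

# Example: F(s) = 1/(s + 1) has inverse f(t) = exp(-t).
print(talbot_invert(lambda s: 1.0 / (s + 1.0), 1.0), math.exp(-1.0))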
This paper presents a parallel, GPU-based, deforming-mesh-enabled unsteady numerical solver for moving-body problems using OpenACC. Both 2D and 3D parallel algorithms based on spring-like deforming mesh methods are proposed and then implemented through the OpenACC programming model. Furthermore, these algorithms are coupled with an unstructured-mesh-based numerical solver with an implicit time-integration scheme, which makes the full GPU version of the solver capable of handling unsteady calculations with a deforming mesh. Experimental results show that the proposed parallel deforming mesh algorithm can achieve over a 2.5x speedup on a K20 GPU card in comparison with 20 OpenMP threads on Intel E5-2658 V2 CPU cores. Both 2D and 3D cases are conducted to validate the efficiency, correctness, and accuracy of the present solver.
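As background on the spring-analogy idea that such deforming-mesh algorithms build on, interior mesh nodes can be relaxed iteratively toward the stiffness-weighted average of their neighbours after the boundary nodes have been moved, with stiffness taken as the inverse edge length. The small 2D sequential Python sketch below is illustrative only and is unrelated to the paper's OpenACC implementation.

import math

def relax_spring_mesh(coords, edges, is_boundary, sweeps=50):
    """Spring-analogy mesh deformation (2D, Jacobi-style sweeps).

    coords      : dict node -> (x, y); boundary nodes already hold their new positions.
    edges       : iterable of (u, v) mesh edges.
    is_boundary : dict node -> bool; boundary nodes stay fixed.
    """
    neighbours = {v: [] for v in coords}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    for _ in range(sweeps):
        new_coords = dict(coords)
        for v, nbrs in neighbours.items():
            if is_boundary[v] or not nbrs:
                continue
            wx = wy = wsum = 0.0
            for u in nbrs:
                length = math.hypot(coords[u][0] - coords[v][0],
                                    coords[u][1] - coords[v][1])
                k = 1.0 / max(length, 1e-12)      # stiffer springs on short edges
                wx += k * coords[u][0]
                wy += k * coords[u][1]
                wsum += k
            new_coords[v] = (wx / wsum, wy / wsum)
        coords = new_coords
    return coords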
ISBN (print): 9781509042975
We live in an era of big data, and the analysis of these data is becoming a bottleneck in many domains, including biology and the internet. To make these analyses feasible in practice, we need efficient data reduction algorithms. The Singular Value Decomposition (SVD) is a data reduction technique that has been used in many different applications. For example, SVDs have been extensively used in text analysis. The best known sequential algorithms for computing the SVD take cubic time, which may not be acceptable in practice. As a result, many parallel algorithms have been proposed in the literature. There are two kinds of algorithms for the SVD, namely, QR decomposition and Jacobi iterations. Researchers have found that even though QR is sequentially faster than Jacobi iterations, QR is difficult to parallelize. As a result, most of the parallel algorithms in the literature are based on Jacobi iterations. For example, the Jacobi Relaxation Scheme (JRS) variant of the classical Jacobi algorithm has been shown to be very effective in parallel. In this paper, we propose a novel variant of the classical Jacobi algorithm that is more efficient than the JRS algorithm. Our experimental results confirm this assertion. The key idea behind our algorithm is to select the pivot elements for each sweep appropriately. We also show how to efficiently implement our algorithm on parallel models such as the PRAM and the mesh.
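For reference, the classical one-sided Jacobi iteration that such schemes build on sweeps over all column pairs of the matrix and rotates each pair until the columns are mutually orthogonal; the singular values are then the column norms. The sequential NumPy sketch below shows this baseline with the usual cyclic pivot order; it is neither the JRS scheme nor the pivot-selection variant proposed in the paper.

import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD: returns U, singular values s, Vt with A ≈ U @ diag(s) @ Vt."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):              # cyclic sweep over column pairs
                alpha = A[:, i] @ A[:, i]
                beta = A[:, j] @ A[:, j]
                gamma = A[:, i] @ A[:, j]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue                       # columns already orthogonal enough
                converged = False
                zeta = (beta - alpha) / (2.0 * gamma)
                if zeta == 0.0:
                    t = 1.0                        # 45-degree rotation when norms match
                else:
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                rot = np.array([[c, s], [-s, c]])  # plane rotation of columns i and j
                A[:, [i, j]] = A[:, [i, j]] @ rot
                V[:, [i, j]] = V[:, [i, j]] @ rot
        if converged:
            break
    sing = np.linalg.norm(A, axis=0)               # singular values = column norms
    U = A / np.where(sing > 0, sing, 1.0)          # normalize columns (zero columns left as-is)
    return U, sing, V.T

# Example usage:
# U, s, Vt = one_sided_jacobi_svd(np.random.rand(6, 4))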