The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run e...
详细信息
ISBN:
(纸本)9780769561493
The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run efficiently on a GPU. The bulk execution supports fine grained bitwise parallelism, allowing it to achieve high acceleration over a straightforward sequential computation. The main contribution of this work is to present a Bitwise parallel Bulk Computation (BPBC) to accelerate the Smith-Waterman Algorithm (SWA). More precisely, the dynamic programming for the SWA repeatedly performs the same computation O(mn) times. Thus, our idea is to convert this computation into a circuit simulation using the BPBC technique to compute multiple instances simultaneously. The proposed BPBC technique for the SWA has been implemented on the GPU and CPU. Experimental results show that the proposed BPBC for SWA accelerates the computation by over 447 times as compared to a single CPU implementation.
In this paper we develop a parallel approach for computing the modularity clustering often used to identify and analyse communities in social networks. We show that modularity can be approximated by looking at the lar...
详细信息
In this paper we develop a parallel approach for computing the modularity clustering often used to identify and analyse communities in social networks. We show that modularity can be approximated by looking at the largest eigenpairs of the weighted graph adjacency matrix that has been perturbed by a rank one update. Also, we generalize this formulation to identify multiple clusters at once. We develop a fast parallel implementation for it that takes advantage of the Lanczos eigenvalue solver and k-means algorithm on the GPU. Finally, we highlight the performance and quality of our approach versus existing state-of-the-art techniques. (C) 2017 The Authors. Published by Elsevier B.V.
In this work, we present a new parallel-driven approach to speed up Optimum-Path Forest (OPF) training phase. In addition, we show how to make OPF up to five times faster for training using a simple parallel-friendly ...
详细信息
ISBN:
(纸本)9783319522777;9783319522760
In this work, we present a new parallel-driven approach to speed up Optimum-Path Forest (OPF) training phase. In addition, we show how to make OPF up to five times faster for training using a simple parallel-friendly data structure, which can achieve the same accuracy results to the ones obtained by traditional OPF. To the best of our knowledge, we have not observed any work that attempted at parallelizing OPF to date, which turns out to be the main contribution of this paper. The experiments are carried out in four public datasets, showing the proposed approach maintains the trade-off between efficiency and effectiveness.
This paper introduces an optimal representation for a right-to-left parallel elliptic curve scalar point multiplication. The right-to-left approach is easier to parallelize than the conventional left-to-right approach...
详细信息
ISBN:
(纸本)9781538620878
This paper introduces an optimal representation for a right-to-left parallel elliptic curve scalar point multiplication. The right-to-left approach is easier to parallelize than the conventional left-to-right approach. However, unlike the left-to-right approach, there is still no work considering number representations for the right-to-left parallel calculation. By simplifying the implementation by Robert, we devise a mathematical model to capture the computation time of the calculation. Then, for any arbitrary amount of doubling time and addition time, we propose algorithms to generate representations which minimize the time in that model. As a result, we can show a negative result that a conventional representation like NAF is almost optimal. The parallel computation time obtained from any representation cannot be better than NAF by more than 1%.
The unsatisfactory operation of a parallel multigrid algorithm is caused by two reasons: the imbalanced load of processors and the intensive exchanges of data between them. The further development of the parallel univ...
详细信息
The longest common subsequence (LCS) problem is one of the most useful algorithms being applied in various research areas. This problem is known to be NP-hard for arbitrary data. In this paper, we present a parallel L...
详细信息
ISBN:
(纸本)9781538634417
The longest common subsequence (LCS) problem is one of the most useful algorithms being applied in various research areas. This problem is known to be NP-hard for arbitrary data. In this paper, we present a parallel LCS algorithm using the GPU-based OpenACC model, which is based on the existing dynamic approach and parallel anti-diagonal scheme that is applied in order to eliminate the data dependencies. The proposed algorithm in this paper has been benchmarked using four different computing models: OpenMPI, OpenMP, hybrid OpenMPI & OpenMP, and OpenACC model. The parallel LCS algorithm has been implemented using Swiss-Prot databases over these computing models, so that their execution times, speed-ups and speed-ratios have been measured and analogized among them extensively. Our experimental results reveal that the computation of our algorithm on OpenACC (on GPU) is around 16 times faster than the execution on a single CPU, and around 2 times faster than on the octa-core processor systems. The performance of the OpenACC model stands out among the four tested models in solving the LCS problem.
What we call "generate-test-α" is a computation pattern in which we do some extra computation, such as choosing the optimal solution, after the usual generate & test computation that enumerates all solu...
详细信息
What we call "generate-test-α" is a computation pattern in which we do some extra computation, such as choosing the optimal solution, after the usual generate & test computation that enumerates all solutions passing the test. A naive parallel algorithm of the generate-test-α can be given as a composition of parallel skeletons, but it will suffer from a heavy computation cost when the number of generated candidates is large. Such a situation often occurs when we generate a set of substructures from a source data structure. It is known in the field of skeletal parallel programming that a certain class of simplified computation without test phases can be given efficient linear cost algorithms by making systematic transformations exploiting semirings. However, no transformation is known as yet to optimize the generate-test-α computation uniformly. In this paper, we propose a novel transformation to embed the test phases into semirings so that generate-test-α computation can be transformed into a simplified generate-α computation. This transformation allows us to reuse efficient parallel algorithms of generate-α for the generate-test-α computation. In addition, we give powerful optimizations for a class of generate-α computations, so that we can give uniform optimizations for a wide class of generate-test-α computations.
Computing a minimum spanning tree (MST) of a graph is a fundamental problem in Graph Theory and arises as a subproblem in many applications. In this paper, we propose a parallel MST algorithm and implement it on a GPU...
详细信息
ISBN:
(纸本)9781538648193
Computing a minimum spanning tree (MST) of a graph is a fundamental problem in Graph Theory and arises as a subproblem in many applications. In this paper, we propose a parallel MST algorithm and implement it on a GPU (Graphics Processing Unit). One of the steps of previous parallel MST algorithms is a heavy use of parallel list ranking. Besides the fact that list ranking is present in several parallel libraries, it is very time-consuming. Using a different graph decomposition, called strut, we devised a new parallel MST algorithm that does not make use of the list ranking procedure. Based on the BSP/CGM model we proved that our algorithm is correct and it finds the MST after O(log p) iterations (communication and computation rounds). To show that our algorithm has a good performance on real parallel machines, we have implemented it on GPU. The way that we have designed the parallel algorithm allowed us to exploit the computing power of the GPU. The efficiency of the algorithm was confirmed by our experimental results. The tests performed show that, for randomly constructed graphs, with vertex numbers varying from 10,000 to 30,000 and density between 0.02 and 0.2, the algorithm constructs an MST in a maximum of six iterations. When the graph is not very sparse, our implementation achieved a speedup of more than 50, for some instances as high 296, over a minimum spanning tree sequential algorithm previously proposed in the literature.
Existing parallel algorithms for wavelet tree construction have a work complexity of O(n log sigma). This paper presents parallel algorithms for the problem with improved work complexity. Our first algorithm is based ...
详细信息
ISBN:
(纸本)9781509067213
Existing parallel algorithms for wavelet tree construction have a work complexity of O(n log sigma). This paper presents parallel algorithms for the problem with improved work complexity. Our first algorithm is based on parallel integer sorting and has either O(n log log n [log sigma/root log n log log n ]) work and polylogarithmic depth, or O(n [ log sigma/root log n ]) work and sub- linear depth. We also describe another algorithm that has O(n [ log sigma/root log n] ) work and O(sigma + logn) depth. We then show how to use similar ideas to construct variants of wavelet trees (arbitrary- shaped binary trees and multiary trees) as well as wavelet matrices in parallel with lower work complexity than prior algorithms. Finally, we show that the rank and select structures on binary sequences and multiary sequences, which are stored on wavelet tree nodes, can be constructed in parallel with improved work bounds, matching those of the best existing sequential algorithms for constructing rank and select structures.
Many real-world networks, including online social networks and communication networks, are commonly modeled as temporal graphs. Answering earliest-arrival queries in temporal graphs is one of the most fundamental stud...
详细信息
ISBN:
(纸本)9781538610428
Many real-world networks, including online social networks and communication networks, are commonly modeled as temporal graphs. Answering earliest-arrival queries in temporal graphs is one of the most fundamental studies with numerous applications, such as information diffusion and measuring temporal closeness centrality. As graph sizes are growing rapidly, speedup of query execution time becomes even more important. In this paper, we propose a novel edge-centric parallel algorithm for solving single-source earliest-arrival problem in temporal graphs based on a new data structure named Edge-Scan-Dependency Graph (ESD-Graph). We evaluate the proposed parallel algorithm by theoretical analysis as well as by empirical experiments on real-world temporal graphs and synthetic graphs. Empirical results show that the new parallel algorithm outperforms the existing serial algorithm by up to 8.2 and 9.5 times on multi-core processors for real-world data and synthetic data respectively.
暂无评论