Pair potentials or kernels, ψ(|r|), play a critical role in a number of areas; these include biophysics, electrical engineering, fluid dynamics, diffusion physics, solid state physics, and many more. The need to evaluate these potentials rapidly for N particles gives rise to the classical N-body problem. In this paper, we present scalable parallel algorithms for evaluating these potentials for highly non-uniform distributions. The underlying methodology relies on the accelerated Cartesian expansion (ACE) framework, which is quasi-kernel-independent, requiring only that the kernel be differentiable with known derivatives. The results presented demonstrate the accuracy control, low cost, and parallel scalability offered by this method for several example kernels and for distributions of up to 5 billion particles on 16384 CPU cores. Potential applications of the algorithm span computational physics, engineering, and machine learning, among other disciplines.
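For context on the scale involved, here is a minimal brute-force reference sketch in Python (my own illustration, not the ACE algorithm, which replaces this sum with hierarchical Cartesian Taylor expansions): it evaluates the O(N^2) pairwise sum φ_i = Σ_j q_j ψ(|r_i − r_j|) that fast N-body methods accelerate. The kernel, particle counts, and all names are illustrative assumptions.

```python
import numpy as np

def direct_potentials(points, charges, psi):
    """O(N^2) brute-force evaluation of phi_i = sum_j q_j * psi(|r_i - r_j|).

    This is the baseline that tree-based methods such as ACE accelerate;
    psi can be any differentiable radial kernel, e.g. 1/r or exp(-r).
    """
    n = len(points)
    phi = np.zeros(n)
    for i in range(n):
        r = np.linalg.norm(points - points[i], axis=1)
        mask = r > 0.0                      # skip self-interaction
        phi[i] = np.sum(charges[mask] * psi(r[mask]))
    return phi

# Example: Coulomb-like kernel on 1000 random particles.
rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
q = rng.random(1000)
phi = direct_potentials(pts, q, lambda r: 1.0 / r)
```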
One of the most ambitious trends in current biomedical research is the large-scale genomic sequencing of patients. Novel high-throughput (or next-generation) sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences (reads) in a single experiment at a much lower cost than previously possible. Due to this massive amount of data, efficient algorithms for mapping these sequences to a reference genome are in great demand, and many such algorithms have recently been published. One important feature of these algorithms is support for multithreaded parallel computing to speed up the mapping process. In this paper, we design parallel algorithms based on the message-passing parallelism model to address this problem efficiently. The proposed algorithms also take into account the probability score assigned to each base of occurring at a specific position of a sequence. In particular, we present parallel algorithms for mapping short degenerate and weighted DNA sequences to a reference genome.
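As a toy illustration of the weighted-sequence model the abstract refers to, the following naive Python scan scores each candidate genome position by the product of per-base probabilities (my own sketch; the data layout and the 0.5 threshold are assumptions, and this serial scan is not the paper's parallel algorithm):

```python
def match_probability(weighted_read, genome, pos):
    """Probability that a weighted read occurs at `pos` in the genome.

    weighted_read: one dict per position mapping base -> probability,
    e.g. [{'A': 0.9, 'C': 0.1}, ...]. The occurrence probability is the
    product of the probabilities of the genome bases under the read's model.
    """
    p = 1.0
    for i, dist in enumerate(weighted_read):
        p *= dist.get(genome[pos + i], 0.0)
        if p == 0.0:
            break
    return p

def map_read(weighted_read, genome, threshold=0.5):
    """Report all positions where the read occurs with probability >= threshold."""
    m = len(weighted_read)
    return [j for j in range(len(genome) - m + 1)
            if match_probability(weighted_read, genome, j) >= threshold]

genome = "ACGTACGT"
read = [{'A': 1.0}, {'C': 0.8, 'G': 0.2}, {'G': 0.9, 'T': 0.1}]
print(map_read(read, genome))   # [0, 4]: P = 1.0 * 0.8 * 0.9 = 0.72
```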
Several leading-edge applications, such as pathology detection, biometric identification, and face recognition, rely mainly on blob and line detection. Eigenvalue computing has commonly been employed for this task because of its accuracy and robustness. However, eigenvalue computing requires intensive computation, heavy memory access, and overlapping data, all of which lead to long execution times. To overcome these limitations, we propose in this paper a new parallel strategy for implementing eigenvalue computing on a graphics processing unit (GPU). Our contributions are (1) optimizing instruction scheduling to reduce computation time, (2) efficiently partitioning the processing into blocks to increase the occupancy of streaming multiprocessors, (3) splitting input data efficiently across shared memory to benefit from its lower access time, and (4) proposing a new shared-memory data layout that avoids access conflicts and reduces memory bank accesses. Experimental results show that our GPU parallel strategy for eigenvalue computing achieves speedups of 27× over a multithreaded implementation, 16× over a predefined function in the OpenCV library, and 8× over a predefined function in the cuBLAS library, all measured against a quad-core multi-CPU platform. We then evaluate our parallel strategy on an eigenvalue-based method for retinal thick vessel segmentation, which is essential for detecting ocular pathologies. Eigenvalue computing executes in 0.017 s on images from the Structured Analysis of the Retina (STARE) database. Accordingly, we achieve real-time thick retinal vessel segmentation with an average execution time of about 0.039 s.
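The eigenvalue computation in question is typically the per-pixel eigendecomposition of the 2×2 image Hessian. Here is a serial NumPy sketch of that closed form (my own reference illustration, not the paper's GPU kernel):

```python
import numpy as np

def hessian_eigenvalues(image):
    """Per-pixel eigenvalues of the 2x2 image Hessian, used for line/blob detection.

    For the symmetric matrix [[Ixx, Ixy], [Ixy, Iyy]] the eigenvalues are
    (Ixx + Iyy)/2 +/- sqrt(((Ixx - Iyy)/2)**2 + Ixy**2); vessels show one
    large-magnitude eigenvalue and one small one, blobs two large ones.
    """
    img = image.astype(float)
    Iy, Ix = np.gradient(img)          # derivatives along rows (y) and columns (x)
    Ixy, Ixx = np.gradient(Ix)
    Iyy, _ = np.gradient(Iy)
    half_trace = (Ixx + Iyy) / 2.0
    root = np.sqrt(((Ixx - Iyy) / 2.0) ** 2 + Ixy ** 2)
    return half_trace + root, half_trace - root
```

Because every pixel is independent, the computation is embarrassingly parallel, which is what makes block-wise GPU partitioning and shared-memory tiling effective.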
ISBN (print): 9781479904617
Determinants have been used intensively in a variety of applications throughout history and have influenced many fields of mathematics, such as linear algebra. The determinant of a square matrix can be computed by a variety of methods, including the well-known Leibniz formula and Laplace expansion, which compute the determinant of any N×N matrix in O(n!) time. Decomposition methods, such as LU decomposition, Cholesky decomposition, and QR decomposition, have replaced these naive methods with a significantly reduced complexity of O(n^3). In this paper, we introduce two parallel algorithms, one for Laplace expansion and one for LU decomposition. We then analyze them and compare them with their respective sequential algorithms in terms of run time, speedup, and efficiency, with the new algorithms providing better results: at best, the parallel Laplace expansion was 129% faster, and the parallel LU decomposition 44% faster.
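A minimal Python sketch contrasting the two approaches the paper parallelizes; the O(n!) recursion and the O(n^3) elimination below are the textbook serial versions, not the paper's parallel algorithms:

```python
import math

def det_laplace(A):
    """Laplace expansion along the first row: O(n!) time."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det_laplace(minor)
    return total

def det_lu(A):
    """Determinant via Gaussian elimination (LU-style), O(n^3) time."""
    A = [row[:] for row in A]          # work on a copy
    n, det = len(A), 1.0
    for k in range(n):
        # Partial pivoting: swap in the largest pivot, flipping the sign.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[p][k] == 0.0:
            return 0.0
        if p != k:
            A[k], A[p] = A[p], A[k]
            det = -det
        det *= A[k][k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
    return det

M = [[4.0, 3.0], [6.0, 3.0]]
assert math.isclose(det_laplace(M), det_lu(M))   # both give -6
```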
We present a parallel algorithm for multivariate Radial Basis Function Partition of Unity Method (RBF-PUM) interpolation. The concurrent nature of the RBF-PUM enables the design of parallel algorithms for dealing with a large number of scattered data points in high space dimensions. To exploit this concurrency efficiently, our algorithm uses shared-memory parallel processors through the OpenCL standard. This efficiency is achieved by a parallel space-partitioning strategy whose computational time complexity is linear in the number of input and evaluation points. The speed of our approach allows for a computationally more intensive construction of the interpolant. In fact, the RBF-PUM can be coupled with a cross-validation technique that searches for optimal values of the shape parameters associated with each local RBF interpolant, thus reducing the global interpolation error. The numerical experiments support our claims by illustrating the interpolation errors and the running times of our algorithm.
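A minimal sketch of a single-patch Gaussian RBF interpolant, the building block that the partition of unity stitches together with weight functions (a serial NumPy illustration; the Gaussian kernel choice and the eps value are assumptions, not the paper's OpenCL code):

```python
import numpy as np

def rbf_interpolant(centers, values, eps):
    """Fit a Gaussian RBF interpolant on one PU patch: solve A c = f with
    A_ij = exp(-(eps * |x_i - x_j|)^2). eps is the shape parameter that
    RBF-PUM tunes per patch, e.g. via cross-validation."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    A = np.exp(-(eps * d) ** 2)
    coeff = np.linalg.solve(A, values)
    def evaluate(x):
        r = np.linalg.norm(centers - x, axis=1)
        return np.exp(-(eps * r) ** 2) @ coeff
    return evaluate

# Toy patch: interpolate f(x, y) = sin(x) * y on 50 scattered points.
rng = np.random.default_rng(1)
X = rng.random((50, 2))
f = np.sin(X[:, 0]) * X[:, 1]
s = rbf_interpolant(X, f, eps=3.0)
print(abs(s(np.array([0.5, 0.5])) - np.sin(0.5) * 0.5))   # small error
```

Since each patch solves its own small linear system, the patches can be distributed across compute units independently, which is the concurrency the paper exploits.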
We propose parallel synchronous and asynchronous implementations of the coupled simulated annealing (CSA) algorithm in a shared-memory architecture. The original CSA was implemented synchronously in a distributed-memory architecture; it synchronizes at each temperature update, which leads to idling and loss of efficiency as the number of processors increases. The proposed synchronous CSA (SCSA) is implemented like the original, but in a shared-memory architecture. The proposed asynchronous CSA (ACSA) does not synchronize, allowing greater parallel efficiency at larger processor counts. Results from extensive experiments show that the proposed ACSA delivers much better solution quality than the serial version and the SCSA. The experiments also show that the ACSA outperforms the SCSA for less computationally intensive problems and when a larger number of processing cores is available. Moreover, the parallel efficiency of the ACSA improves as the problem size increases. With the advent of the multi-core era, the proposed algorithm becomes more attractive than the original synchronous CSA.
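To make the coupling concrete, here is a serial Python sketch of one synchronous CSA sweep. The acceptance formula follows a common CSA variant with an ensemble-wide normalization term gamma; treat the exact form as an assumption, and note this toy is not the paper's shared-memory implementation:

```python
import math, random

def csa_sweep(states, energies, objective, neighbor, t_acc, t_gen):
    """One synchronous sweep over m coupled optimizers (a sketch).

    Worse candidates are accepted with a probability coupled to the
    energies of ALL current states (assumed variant):
    A_i = exp((E_i - E_max)/T_acc) / gamma, with
    gamma = sum_j exp((E_j - E_max)/T_acc). The asynchronous variant lets
    each worker read a possibly stale gamma instead of synchronizing here.
    """
    e_max = max(energies)
    gamma = sum(math.exp((e - e_max) / t_acc) for e in energies)
    for i in range(len(states)):
        cand = neighbor(states[i], t_gen)
        e_cand = objective(cand)
        accept = (e_cand < energies[i] or
                  random.random() < math.exp((energies[i] - e_max) / t_acc) / gamma)
        if accept:
            states[i], energies[i] = cand, e_cand

# Toy usage: minimize f(x) = x^2 with 4 coupled optimizers.
random.seed(0)
f = lambda x: x * x
nbr = lambda x, t: x + random.gauss(0.0, t)
xs = [random.uniform(-10, 10) for _ in range(4)]
es = [f(x) for x in xs]
for step in range(200):
    csa_sweep(xs, es, f, nbr, t_acc=1.0 / (1 + step), t_gen=5.0 / (1 + step))
print(min(es))   # close to 0
```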
ISBN (print): 9781450362177
We design a generic method to reduce the task of finding weighted matchings to that of finding short augmenting paths in unweighted graphs. This method enables us to provide efficient implementations for approximating weighted matchings in the massively parallel computation (MPC) model and in the streaming model. For the MPC and the multi-pass streaming model, we show that any algorithm computing a (1 − δ)-approximate unweighted matching in bipartite graphs can be translated into an algorithm that computes a (1 − ε(δ))-approximate maximum weighted matching. Furthermore, this translation incurs only a constant-factor overhead in the complexity (the constant depends on ε > 0). Instantiating this with the current best MPC algorithm for unweighted matching yields a (1 − ε)-approximation algorithm for maximum weighted matching that uses O_ε(log log n) rounds, O(m/n) machines per round, and Õ_ε(n poly(log n)) memory per machine. This improves upon the previous best approximation guarantee of (1/2 − ε) for weighted graphs. In the context of single-pass streaming with random edge arrivals, our techniques yield a (1/2 + c)-approximation algorithm, thus breaking the natural barrier of 1/2.
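For contrast with the guarantees above, here is the classic offline greedy 1/2-approximation for maximum weight matching, the kind of baseline the reduction improves upon (a Python illustration of the textbook algorithm, not the paper's method):

```python
def greedy_weighted_matching(edges):
    """Offline greedy: scan edges by decreasing weight and keep an edge if
    both endpoints are still free. This is the textbook 1/2-approximation
    for maximum weight matching -- the guarantee the paper's reduction
    pushes past (to 1 - eps in MPC, 1/2 + c in random-arrival streaming).
    """
    matched, matching = set(), []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching

# Greedy keeps (a, b, 3) and (c, d, 2) for total weight 5.
print(greedy_weighted_matching([("a", "b", 3), ("b", "c", 2.5), ("c", "d", 2)]))
```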
The surge in the availability of genomic data holds promise for determining the genetic causes of observed individual traits, with applications such as discovering the genetic roots of phenotypes, be they molecular phenotypes such as gene expression or metabolite concentrations, or complex phenotypes such as diseases. However, the growing sizes of these datasets and the quadratic, cubic, or higher scaling characteristics of the relevant algorithms pose a serious computational challenge, necessitating leadership-scale computing. In this paper we describe a new approach to computing vector similarity metrics, suitable for parallel systems equipped with graphics processing units (GPUs) or Intel Xeon Phi processors. Our primary focus is the Proportional Similarity metric applied to Genome Wide Association Studies (GWAS) and Phenome Wide Association Studies (PheWAS). We describe the implementation of the algorithms on accelerated processors, the methods used to eliminate redundant calculations arising from symmetries, and techniques for efficiently mapping the calculations to many-node parallel systems. Results demonstrate high per-node performance and parallel scalability, with rates of more than five quadrillion (5 × 10^15) elementwise comparisons per second achieved on the ORNL Titan system. In a companion paper we describe corresponding techniques applied to calculations of the Custom Correlation Coefficient for comparative genomics applications.
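The Proportional Similarity metric is commonly given as the Czekanowski index; a naive serial NumPy reference is sketched below (the formula and the all-pairs loop are for illustration only, not the paper's accelerated implementation):

```python
import numpy as np

def proportional_similarity(u, v):
    """Czekanowski's Proportional Similarity between nonnegative vectors:
    PS(u, v) = 2 * sum_i min(u_i, v_i) / sum_i (u_i + v_i).
    The min/sum structure is what maps well to fused GPU operations."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return 2.0 * np.minimum(u, v).sum() / (u.sum() + v.sum())

def all_pairs_ps(X):
    """Naive O(n^2 m) all-pairs reference over the rows of X; symmetry
    (PS(u, v) == PS(v, u)) lets optimized codes halve the work."""
    n = len(X)
    return np.array([[proportional_similarity(X[i], X[j]) for j in range(n)]
                     for i in range(n)])

print(proportional_similarity([1, 2, 0], [2, 2, 1]))   # 2*(1+2+0)/(3+5) = 0.75
```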
Multiple Sequence Alignment (MSA) is a basic operation in bioinformatics, used to highlight the similarities among a set of sequences. The MSA problem has been proven NP-hard and thus requires large amounts of memory and computing power. It can be modeled as a search for the minimum-cost path in a graph, and the A-Star algorithm has been adapted to solve it both sequentially and in parallel. Designing a parallel A-Star-based MSA solver faces challenges such as an irregular dependency pattern and substantial memory requirements. In this paper, we propose PA-Star, a locality-sensitive multithreaded strategy based on A-Star that computes optimal MSAs using both RAM and disk to store nodes. Experimental results obtained on 3 different machines show that the optimizations used in PA-Star achieve a speedup of 1.88× in serial execution, and that parallel execution attains a speedup of 5.52× with 8 cores. We also show that PA-Star outperforms a state-of-the-art MSA tool based on A-Star, executing up to 4.77× faster. Finally, we show that our disk-assisted strategy retrieves the optimal alignment when other tools fail.
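For reference, here is a textbook A-Star skeleton in Python, the serial core that PA-Star parallelizes (the state representation, heuristic, and all names here are illustrative assumptions, not PA-Star's code):

```python
import heapq, itertools

def a_star(start, is_goal, neighbors, heuristic):
    """Textbook A* search. neighbors(state) yields (next_state, edge_cost);
    heuristic must be admissible (never overestimate the remaining cost)
    for the returned cost to be optimal. For MSA, states are positions in
    the k-dimensional alignment lattice and the heuristic typically sums
    optimal pairwise alignment costs."""
    tie = itertools.count()                       # tiebreaker for the heap
    open_heap = [(heuristic(start), next(tie), 0, start)]
    best_g = {start: 0}
    while open_heap:
        f, _, g, state = heapq.heappop(open_heap)
        if is_goal(state):
            return g
        if g > best_g.get(state, float("inf")):
            continue                              # stale heap entry
        for nxt, cost in neighbors(state):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(open_heap, (ng + heuristic(nxt), next(tie), ng, nxt))
    return None

# Toy usage: shortest path from 0 to 5 on the integer line with unit costs.
print(a_star(0, lambda s: s == 5,
             lambda s: [(s + 1, 1), (s - 1, 1)],
             lambda s: abs(5 - s)))               # 5
```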
Combinatorial branch and bound searches are a common technique for solving global optimisation and decision problems. Their performance often depends on good search-order heuristics, refined over decades of algorithms research. Parallel search necessarily deviates from the sequential search order, sometimes dramatically and unpredictably, e.g. by distributing work at random. This can disrupt effective search-order heuristics and lead to unexpected and highly variable parallel performance, which makes it hard to reason about the parallel performance of combinatorial searches. This paper presents a generic parallel branch and bound skeleton, implemented in Haskell, with replicable parallel performance. The skeleton aims to preserve the search-order heuristic by distributing work in an ordered fashion, closely following the sequential search order. We demonstrate the generality of the approach by applying the skeleton to 40 instances of three combinatorial problems: Maximum Clique, 0/1 Knapsack, and Travelling Salesperson. The overheads of our Haskell skeleton are reasonable, giving slowdown factors of between 1.9 and 6.2 compared with a class-leading, dedicated, and highly optimised C++ Maximum Clique solver. We demonstrate scaling up to 200 cores of a Beowulf cluster, achieving speedups of 100× for several Maximum Clique instances. We demonstrate low variance of parallel performance across all instances of the three combinatorial problems and at all scales up to 200 cores, with median relative standard deviation (RSD) below 2%. Parallel solvers that do not follow the sequential search order exhibit far higher variance, with median RSD exceeding 85% for Knapsack.
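To make the role of search order concrete, here is a compact serial branch and bound for 0/1 Knapsack, one of the paper's three benchmark problems (a Python toy under my own assumptions, not the Haskell skeleton; the skeleton parallelizes trees like this while handing out subtrees close to this heuristic-guided sequential order):

```python
def knapsack_bnb(items, capacity):
    """Depth-first branch and bound for 0/1 knapsack.

    Items are (value, weight) pairs; the search-order heuristic explores
    items sorted by value density and tries 'include' before 'exclude'.
    The bound is the fractional-knapsack relaxation of the remaining items.
    """
    items = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    best = 0

    def bound(i, value, room):
        # Greedy fractional relaxation: an upper bound on any completion.
        for v, w in items[i:]:
            if w <= room:
                room -= w
                value += v
            else:
                return value + v * room / w
        return value

    def search(i, value, room):
        nonlocal best
        best = max(best, value)
        if i == len(items) or bound(i, value, room) <= best:
            return                               # prune: bound can't beat incumbent
        v, w = items[i]
        if w <= room:
            search(i + 1, value + v, room - w)   # heuristic: include item i first
        search(i + 1, value, room)               # then exclude it

    search(0, 0, capacity)
    return best

print(knapsack_bnb([(60, 10), (100, 20), (120, 30)], 50))   # 220
```

How quickly a good incumbent is found determines how much gets pruned, which is why random work distribution (and thus a different visit order) can make parallel run times so variable.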