We present parallel algorithms for computing cycle orders and cycle perimeters in relative neighborhood graphs. This parallel algorithm has wide-ranging applications from microscopic to macroscopic domains, e.g., in h...
详细信息
ISBN:
(纸本)9781538610428
We present parallel algorithms for computing cycle orders and cycle perimeters in relative neighborhood graphs. This parallel algorithm has wide-ranging applications from microscopic to macroscopic domains, e.g., in histopathological image analysis and wireless network routing. Our algorithm consists of the following steps (sub-algorithms): (1) Uniform partitioning of the graph vertices across processes, (2) parallel Delaunay triangulation and (3) parallel computation of the relative neighborhood graph and the cycle orders and perimeters. We evaluated our algorithm on a large dataset with 6.5 Million points and demonstrate excellent fixed-size scalability. We also demonstrate excellent isogranular scalability up to 131K processes. Our largest run was on a dataset with 13 billion points on 131K processes on ORNL's Cray XK7 "Titan" supercomputer.
The Lovasz Local Lemma (LLL) is a cornerstone principle in the probabilistic method of combinatorics, and a seminal algorithm of Moser & Tardos (2010) provides an efficient randomized algorithm to implement it. Th...
详细信息
ISBN:
(纸本)9781611974782
The Lovasz Local Lemma (LLL) is a cornerstone principle in the probabilistic method of combinatorics, and a seminal algorithm of Moser & Tardos (2010) provides an efficient randomized algorithm to implement it. This algorithm can be parallelized to give an algorithm that uses polynomially many processors and runs in O(log(3) n) time, stemming from O(log n) adaptive computations of a maximal independent set (MIS). Chung et al. (2014) developed faster local and parallel algorithms, potentially running in time O(log(2) n), but these algorithms work under significantly more stringent conditions than the LLL. We give a new parallel algorithm that works under essentially the same conditions as the original algorithm of Moser & Tardos but uses only a single MIS computation, thus running in O(log(2) n) time. This conceptually new algorithm also gives a clean combinatorial description of a satisfying assignment which might be of independent interest. Our techniques extend to the deterministic LLL algorithm given by Chandrasekaran et al. (2013) leading to an NC-algorithm running in time O(log(2) n) as well. We also provide improved bounds on the runtimes of the sequential and parallel resampling-based algorithms originally developed by Moser & Tardos. Our bounds extend to any problem instance in which the tighter Shearer LLL criterion is satisfied. We also improve on the analysis of Kolipaka & Szegedy (2011) to give tighter concentration results.
Networks (or graphs) are an effective abstraction for representing many real-world complex systems. Analyzing various structural properties of and dynamics on such networks reveal valuable insights about the behavior ...
详细信息
Networks (or graphs) are an effective abstraction for representing many real-world complex systems. Analyzing various structural properties of and dynamics on such networks reveal valuable insights about the behavior of such systems. In today's data-rich world, we are deluged by the massive amount of heterogeneous data from various sources, such as the web, infrastructure, and online social media. Analyzing this huge amount of data may take a prohibitively long time and even may not fit into the main memory of a single processing unit, thus motivating the necessity of efficient parallel algorithms in various high-performance computing (HPC) platforms. In this dissertation, we present distributed and shared memory parallel algorithms for some important network analytic problems. First, we present distributed memory parallel algorithms for switching edges in a network. Edge switch is an operation on a network, where two edges are selected randomly, and one of their end vertices are swapped with each other. This operation is repeated either a given number of times or until a specified criterion is satisfied. It has diverse real-world applications such as in generating simple random networks with a given degree sequence and in modeling and studying various dynamic networks. One of the steps in our edge switch algorithm requires generating multinomial random variables in parallel. We also present the first non-trivial parallel algorithm for generating multinomial random variables. Next, we present efficient algorithms for assortative edge switch in a labeled network. Assuming each vertex has a label, an assortative edge switch operation imposes an extra constraint, i. e., two edges are randomly selected and one of their end vertices are swapped with each other if the labels of the end vertices of the edges remain the same as before. It can be used to study the effect of the network structural properties on dynamics over a network. Although the problem of assortative edge
The greatest common divisor (GCD) is used for numerous applications in number theory, modular arithmetic, encryption algorithms such as RSA, random number generation, and solving linear Diophantine equations. High-per...
详细信息
ISBN:
(数字)9781728186559
ISBN:
(纸本)9781728186566
The greatest common divisor (GCD) is used for numerous applications in number theory, modular arithmetic, encryption algorithms such as RSA, random number generation, and solving linear Diophantine equations. High-performance algorithms, which are efficiently and accurately find the GCD of two integers and n (>2) integers, are needed in the modern world for many applications in science and engineering. parallel hardware and parallel programming solve such emerging challenges that are not possible in a serial world. Modern desktop and laptop computers are equipped with multicore processors with shared memory architecture. In this paper, we develop novel efficient parallel n integers GCD algorithms for multicore shared memory architecture. The brute force, divide-and-conquer, linear recursive and finding minimum first techniques are adopted in our novel algorithms to reduce the size and the complexity of the n integers GCD problem. Various working models of OpenMP, such as the thread-centric, loop-centric and task-centric models are utilized, which promised a more natural way of exploiting and expressing regular and irregular algorithms. A comprehensive performance analysis applies to prove the efficiency of the proposed algorithms.
PANDA is a future hadron and nuclear physics experiment at the FAIR facility in construction in Darmstadt, Germany. Unlike the majority of current experiments, PANDA's strategy for data acquisition is based on onl...
详细信息
Polyspectral estimation is a problem of great importance in the analysis of nonlinear time series that has applications in biomedical signal processing, communications, geophysics, image, radar, sonar and speech proce...
详细信息
Hierarchical matrices (H-matrices) are an approximation technique for dense matrices, such as the coefficient matrix of the boundary element method (BEM). An H-matrix is expressed by a set of low-rank approximated and...
详细信息
Hierarchical matrices (H-matrices) are an approximation technique for dense matrices, such as the coefficient matrix of the boundary element method (BEM). An H-matrix is expressed by a set of low-rank approximated and small dense sub-matrices, each of which has various ranks. The use of H-matrices reduces the required memory footprint of dense matrices from O(N^2) to O(NlogN) and is suitable for many-core processors that have relatively small memory capacities compared to traditional CPUs. However, existing parallel adaptive cross approximation (ACA) algorithms, which are low-rank approximation algorithms used to construct H-matrices, are not designed to exploit many-core processors in terms of load balancing. In existing parallel algorithms, the ACA process is independently applied to each sub-matrix. The computational load of the ACA process for each sub-matrix depends on the sub-matrix's rank; however, the rank is defined after the ACA process is applied. This makes it difficult to balance the load. We propose load-balancing-aware parallel ACA algorithms for H-matrices that focus on many-core processors. We implemented the proposed algorithms into HACApK, which is an open-source H-matrix library originally developed for CPU-based clusters. The proposed algorithms were evaluated using BEM problems on an NVIDIA Tesla P100 GPU (P100) and an Intel Xeon Broadwell processor. The evaluation results demonstrate the improved performance of the proposed algorithms in all GPU cases. For example, in a case where it is difficult for existing parallel algorithms to balance the load, the proposed algorithms achieved a 12.9 times performance improvement for the P100.
In this paper, new parallel algorithms based on the Coarse-Grained Multicomputer (CGM) model for solving the Longest Common Subsequence (LCS) problem with a string exclusion constraint (STR-EC-LCS) is presented. Based...
详细信息
ISBN:
(纸本)9781538678800
In this paper, new parallel algorithms based on the Coarse-Grained Multicomputer (CGM) model for solving the Longest Common Subsequence (LCS) problem with a string exclusion constraint (STR-EC-LCS) is presented. Based on a previous sequential algorithm, we propose two CGM parallel algorithms for STR-EC-LCS problem. We perform an experimental study of our two solutions to validate our theoretical predictions, and conclude that our first algorithm that minimizes idleness of the processors is better than the second. To the best of our knowledge, these algorithms are the first CGM-based parallel algorithms for the generalized-constrained-LCS problem.
The complete Voronoi map of a binary image with black and white pixels is a matrix of the same size such that each element is the closest black pixel of the corresponding pixel. The complete Voronoi map visualizes the...
详细信息
ISBN:
(纸本)9781538610428
The complete Voronoi map of a binary image with black and white pixels is a matrix of the same size such that each element is the closest black pixel of the corresponding pixel. The complete Voronoi map visualizes the influence region of each black pixel. However, each region may not be connected due to exclave pixels. The connected Voronoi map is a modification of the complete Voronoi map so that all regions are connected. The Euclidean distance map of a binary image is a matrix, in which each element is the distance to the closest black pixel. It has many applications of image processing such as dilation, erosion, blurring effects, skeletonization and matching. The main contribution of this paper is to present simple and fast parallel algorithms for computing the complete/connected Voronoi maps and the Euclidean distance map and implement them in the GPU. Our parallel algorithm first computes the mixed Voronoi map, which is a mixture of the complete and connected Voronoi maps, and then converts it into the complete/connected Voronoi by exposing/hiding all exclave pixels. After that, the complete Voronoi map is converted into the Euclidean distance map by computing the distance to the closest black pixel for every pixel in an obvious way. The experimental results on GeForce GTX 1080 GPU show that the computing time for these conversions is relatively small. The throughput of our GPU implementation for computing the Euclidean distance maps of 2K x 2K binary images is up to 2.08 times larger than the previously published best GPU implementation, and up to 172 times larger than CPU implementation using Intel Core i7-4790.
Many randomized algorithms can be derandomized efficiently using either the method of conditional expectations or probability spaces with low (almost-) independence. A series of papers, beginning with work by Luby (19...
详细信息
ISBN:
(纸本)9781611974782
Many randomized algorithms can be derandomized efficiently using either the method of conditional expectations or probability spaces with low (almost-) independence. A series of papers, beginning with work by Luby (1988) and continuing with Berger & Rompel (1991) and Chari et al. (1994), showed that these techniques can be combined to give deterministic parallel algorithms for combinatorial optimization problems involving sums of w-juntas. We improve these algorithms through derandomized variable partitioning. This reduces the processor complexity to essentially independent of w while the running time is reduced from exponential in w to linear in w. For example, we improve the time complexity of an algorithm of Berger & Rompel (1991) for rainbow hypergraph coloring by a factor of approximately log(2) n and the processor complexity by a factor of approximately m(ln2). As a major application of this, we give an NC algorithm for the Lovasz Local Lemma Previous NC algorithms, including the seminal algorithm of Moser & Tardos (2010) and the work of Chandrasekaran et. al (2013), required that (essentially) the bad-events could span only O(log n) variables;we relax this to allowing polylog(n) variables. As two applications of our new algorithm, we give algorithms for defective vertex coloring and domatic graph partition. One main sub-problem encountered in these algorithms is to generate a probability space which can "fool" a given list of GF(2) Fourier characters. Schulman (1992) gave an NC algorithm for this;we dramatically improve its efficiency to near-optimal time and processor complexity and code dimension. This leads to a new algorithm to solve the heavy-codeword problem, introduced by Naor & Naor (1993), with a near-linear processor complexity (mn)(l+o(1)).
暂无评论