ISBN:
(Print) 9783319413211; 9783319413204
Community detection is an important data clustering technique for studying graph structures. Many serial algorithms have been developed and well studied in the literature. As the problem size grows, research attention has recently been turning to parallelizing the technique. However, the conventional parallelization strategies that divide the problem domain into non-overlapping subdomains do not scale with problem size and the number of processes. The main obstacle lies in the fact that graph algorithms often exhibit a high degree of data dependency, which makes developing scalable parallel algorithms a great challenge. We present PMEP, a distributed-memory parallel community detection algorithm that adopts an unconventional data partitioning strategy. PMEP divides a graph into subgraphs and assigns each pair of subgraphs to one process. This method duplicates a portion of the computational workload among processes in exchange for a significantly reduced communication cost in the later stages. After data partitioning, each process runs MEP on the assigned subgraph pair. MEP is a community detection algorithm based on the idea of maximizing equilibrium and purity. Our data partitioning method effectively simplifies the communication required for combining the local results into a global one and hence allows us to achieve better scalability than existing parallel algorithms without sacrificing result quality. Our experimental results show a speedup of 126.95 on 190 MPI processes using synthetic data sets and a speedup of 204.22 on 1225 processes using a real-world data set.
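The pair-based partitioning can be sketched in a few lines. The rank-to-pair mapping below is an illustrative stand-in, not PMEP's actual code, but it shows why 1225 processes correspond to 50 subgraphs (one process per unordered pair).

```python
# Minimal sketch of pair-based data partitioning: with k subgraphs there
# are k*(k-1)/2 unordered pairs, and each pair is assigned to one process.
# The function name and layout are illustrative assumptions.
from itertools import combinations

def pair_for_rank(rank, k):
    """Map a process rank to the pair of subgraph indices it owns.

    Building the full list is O(k^2); fine for a sketch.
    """
    pairs = list(combinations(range(k), 2))
    return pairs[rank]

k = 50                      # number of subgraphs
n_procs = k * (k - 1) // 2  # 1225 processes, matching the reported run
print(n_procs, pair_for_rank(0, k), pair_for_rank(n_procs - 1, k))
```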
ISBN:
(Print) 9781509007462
The aim of this paper is to develop a theoretical framework for training neural network (NN) models when data is distributed over a set of agents that are connected to each other through a sparse network topology. The framework builds on a distributed convexification technique, while leveraging dynamic consensus to propagate the information over the network. It can be customized to work with different loss and regularization functions typically used when training NN models, while guaranteeing provable convergence to a stationary solution under mild assumptions. Interestingly, it naturally leads to distributed architectures where agents solve local optimization problems exploiting parallel multi-core processors. Numerical results corroborate our theoretical findings and assess the performance of the framework for parallel and distributed training of neural networks.
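The consensus mechanism the abstract relies on can be illustrated with a toy mixing step: each agent averages its parameters with its neighbors' using a doubly stochastic weight matrix. This is a generic consensus sketch under assumed Metropolis-style weights, not the paper's specific convexification scheme.

```python
# One round of consensus mixing over a sparse network: agent i replaces
# its value with sum_j W[i][j] * x[j]. W here is an illustrative
# doubly stochastic matrix for a 3-agent line graph.
def consensus_step(params, weights):
    """new_i = sum_j W[i][j] * params[j] for each agent i."""
    n = len(params)
    return [sum(weights[i][j] * params[j] for j in range(n)) for i in range(n)]

W = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
x = [0.0, 3.0, 6.0]        # each agent starts with a local value
for _ in range(50):        # repeated mixing drives all agents to the average
    x = consensus_step(x, W)
print([round(v, 3) for v in x])  # → [3.0, 3.0, 3.0]
```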
ISBN:
(Print) 9783319321493; 9783319321486
The main contribution of this paper is to present an FPGA-targeted architecture, called the hierarchical GCD cluster, that computes the GCDs of all pairs in a set of numbers. It is designed based on the FDFM (Few DSP slices and Few Memory blocks) approach and consists of 1408 processors, each equipped with one block RAM and one DSP slice. All processors work in parallel and compute GCDs independently. We have measured the performance of our architecture in computing the GCDs of all pairs of RSA moduli. Implementation results show that it takes 0.057 µs per GCD computation of two 1024-bit RSA moduli on a Xilinx Virtex-7 family FPGA XC7VX485T-2. It is 6.0 times faster than the best GPU implementation and 500 times faster than a sequential implementation on an Intel Xeon CPU.
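The point of the all-pairs GCD computation over RSA moduli is that any pair sharing a prime factor yields a non-trivial GCD, exposing weak keys. A toy software analogue with small stand-in moduli (the FPGA design does this in hardware at 1024-bit scale):

```python
# Any two RSA moduli sharing a prime factor are broken by a single GCD.
# The moduli here are tiny illustrative stand-ins for 1024-bit values.
from math import gcd
from itertools import combinations

moduli = [7 * 11, 13 * 17, 7 * 19]  # 77 and 133 share the prime 7

weak = [(a, b, gcd(a, b))
        for a, b in combinations(moduli, 2)
        if gcd(a, b) > 1]
print(weak)  # → [(77, 133, 7)]: the shared factor 7 is exposed
```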
ISBN:
(Print) 9781467387767
In this paper a parallel algorithm for branch-and-bound applications is proposed. The algorithm is general-purpose and can be used to parallelize, with little effort, any sequential branch-and-bound algorithm written in a certain format. It is a distributed dynamic scheduling algorithm, i.e., each node schedules the load of its own cores; it can be used with different programming platforms and architectures and is a hybrid (OpenMP/MPI) algorithm. To demonstrate its validity and efficiency, the proposed algorithm has been implemented and tested on numerous examples, which are described in detail in this paper. A speedup of about 9 was achieved on the tested examples using a cluster of three nodes with four cores each.
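The "certain format" a sequential algorithm must follow can be pictured as a generic branch-and-bound skeleton with pluggable branch, bound, and evaluation callbacks; in the hybrid scheme, each MPI node would pull subproblems from such a queue and fan them out to its OpenMP threads. This serial sketch and the tiny knapsack instance below are illustrative, not the paper's interface.

```python
# Generic best-first branch-and-bound skeleton with pluggable callbacks.
import heapq

def branch_and_bound(root, branch, bound, is_solution, value):
    best_val, best = float("-inf"), None
    heap = [(-bound(root), root)]          # best-first on the upper bound
    while heap:
        neg_ub, node = heapq.heappop(heap)
        if -neg_ub <= best_val:
            continue                       # prune: bound cannot beat incumbent
        if is_solution(node):
            if value(node) > best_val:
                best_val, best = value(node), node
            continue
        for child in branch(node):
            if bound(child) > best_val:
                heapq.heappush(heap, (-bound(child), child))
    return best_val, best

# Tiny 0/1 knapsack instance to exercise the skeleton.
items = [(60, 10), (100, 20), (120, 30)]   # (value, weight)
CAP = 50

def branch(n):                              # node = (next item, value, weight)
    i, val, wt = n
    out = [(i + 1, val, wt)]                # skip item i
    if wt + items[i][1] <= CAP:
        out.append((i + 1, val + items[i][0], wt + items[i][1]))
    return out

def bound(n):                               # loose bound: take all remaining
    i, val, _ = n
    return val + sum(v for v, _ in items[i:])

best_val, _ = branch_and_bound((0, 0, 0), branch, bound,
                               lambda n: n[0] == len(items), lambda n: n[1])
print(best_val)  # → 220 (items 2 and 3)
```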
ISBN:
(Print) 9783319321493; 9783319321486
The induction of a minimal nondeterministic finite automaton (NFA) consistent with a given set of examples and counterexamples, which is known to be computationally hard, is discussed. The paper is an extension of the novel approach of transforming the problem of NFA induction into an integer nonlinear programming (INLP) problem. An improved formulation of the problem is proposed, along with two parallel algorithms to solve it. Methods for the distribution of tasks among processors, along with distributed termination detection, are presented. Experimental results for selected benchmarks are also reported.
ISBN:
(Print) 9781509025916
In Moving Horizon Estimation (MHE) the computed estimate is found by solving a constrained finite-time optimal estimation problem in real-time at each sample in a receding horizon fashion. The constrained estimation problem can be solved by, e.g., interior-point (IP) or active-set (AS) methods, where the main computational effort in both methods is known to be the computation of the search direction, i.e., the Newton step. This is often done using generic sparsity-exploiting algorithms or serial Riccati recursions, but as parallel hardware is becoming more commonly available, the need for parallel algorithms for computing the Newton step is increasing. In this paper, a newly developed, tailored, non-iterative parallel algorithm for computing the Newton step using the Riccati recursion, originally developed for Model Predictive Control (MPC), is extended to MHE problems. The algorithm exploits the special structure of the Karush-Kuhn-Tucker system for the optimal estimation problem. As a result it is possible to obtain logarithmic complexity growth in the estimation horizon length, which can be used to reduce the computation time for IP and AS methods when applied to what are today considered challenging estimation problems. Furthermore, promising numerical results have been obtained using an ANSI-C implementation of the proposed algorithm, which uses the Message Passing Interface (MPI) together with InfiniBand and is executed on true parallel hardware. Beyond MHE, due to similarities in the problem structure, the algorithm can be applied to various forms of on-line and off-line smoothing problems.
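The logarithmic growth in horizon length comes from combining adjacent subproblems pairwise in a reduction tree: halving the number of subproblems per round gives O(log N) parallel depth. The sketch below shows only that depth argument with a generic associative combine; it is not the Riccati-recursion mathematics itself.

```python
# Pairwise (tree) reduction: the number of rounds grows as O(log n) in the
# number of subproblems, which is the source of the logarithmic complexity
# growth claimed above. The combine function here is a trivial stand-in.
def parallel_reduce(xs, combine):
    """Combine elements pairwise until one remains; count the rounds."""
    rounds = 0
    while len(xs) > 1:
        xs = [combine(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
        rounds += 1
    return xs[0], rounds

# A horizon split into 16 subproblems needs only 4 parallel combine steps.
total, rounds = parallel_reduce(list(range(16)), lambda a, b: a + b)
print(total, rounds)  # → 120 4
```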
ISBN:
(Print) 9781509021406
We present a parallel algorithm for computing the approximate factorization of an N-by-N kernel matrix. Once this factorization has been constructed (with N log^2 N work), we can solve linear systems with this matrix with N log N work. Kernel matrices represent pairwise interactions of points in metric spaces. They appear in machine learning, approximation theory, and computational physics. Kernel matrices are typically dense (matrix multiplication scales quadratically with N) and ill-conditioned (solves can require hundreds of Krylov iterations). Thus, fast algorithms for matrix multiplication and factorization are critical for scalability. Recently we introduced ASKIT, a new method for approximating a kernel matrix that resembles N-body methods. Here we introduce INV-ASKIT, a factorization scheme based on ASKIT. We describe the new method, derive complexity estimates, and conduct an empirical study of its accuracy and scalability. We report results on real-world datasets including "COVTYPE" (0.5M points in 54 dimensions), "SUSY" (4.5M points in 8 dimensions) and "MNIST" (2M points in 784 dimensions) using shared and distributed memory parallelism. In our largest run we approximately factorize a dense matrix of size 32M x 32M (generated from points in 64 dimensions) on 4,096 Sandy-Bridge cores. To our knowledge these results improve the state of the art by several orders of magnitude.
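A kernel matrix of the kind the abstract describes is easy to write down exactly; the cost structure is the point. The sketch below builds a small dense Gaussian kernel matrix (an illustrative kernel choice, not necessarily the paper's) to make the O(N^2) storage and symmetry concrete; ASKIT/INV-ASKIT exist precisely to avoid this dense construction at scale.

```python
# Exact dense kernel matrix of pairwise Gaussian interactions: K[i][j] =
# exp(-|x_i - x_j|^2 / (2 h^2)). Storage and multiplication are O(N^2).
import math

def gaussian_kernel(X, h=1.0):
    def k(a, b):
        d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-d2 / (2 * h * h))
    return [[k(a, b) for b in X] for a in X]

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # 3 points in the plane
K = gaussian_kernel(X)
print(len(K), K[0][0], K[0][1] == K[1][0])  # → 3 1.0 True
```

The unit diagonal and symmetry follow directly from the kernel definition; ill-conditioning shows up as nearby points producing nearly identical rows.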
ISBN:
(Print) 9781509021406
Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in many areas such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only the cases in which both operands are dense or both are sparse. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which, while suboptimal in the dense-dense and sparse-sparse cases, outperform the 2D and 3D variants both theoretically and in practice for sparse-dense multiplication. Our analysis separates one-time costs from per-iteration costs in an iterative machine learning context. Experiments demonstrate speedups of up to 100x over a baseline 3D SUMMA implementation and show parallel scaling over 10 thousand cores.
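The local computation each process performs in such distributed algorithms is a sparse-times-dense multiply; a common serial form over CSR storage is sketched below. Partitioning and communication are omitted, and the function name is illustrative.

```python
# Y = A @ B where A is sparse in CSR form (indptr/indices/data) and B is
# dense. Only the nonzeros of A are touched, so work is O(nnz * n_cols).
def csr_matmul(indptr, indices, data, B):
    n_rows, n_cols = len(indptr) - 1, len(B[0])
    Y = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for idx in range(indptr[i], indptr[i + 1]):  # nonzeros of row i
            j, a_ij = indices[idx], data[idx]
            for c in range(n_cols):
                Y[i][c] += a_ij * B[j][c]
    return Y

# A = [[2, 0], [0, 3]] in CSR form, times a dense 2x2 matrix B.
Y = csr_matmul([0, 1, 2], [0, 1], [2.0, 3.0], [[1.0, 2.0], [4.0, 5.0]])
print(Y)  # → [[2.0, 4.0], [12.0, 15.0]]
```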
ISBN:
(Print) 9781509021406
Ahead-of-time data layout optimization by vertex reordering is a widely used technique to improve memory access locality in graph analysis. While reordered graphs yield better analysis performance, the existing reordering algorithms spend significant amounts of computation time to produce an efficient vertex ordering; hence, they fail to reduce end-to-end processing time. This paper presents the first algorithm for just-in-time parallel reordering, named Rabbit Order. It reduces end-to-end runtime by achieving high locality and fast reordering at the same time through two approaches. The first approach is hierarchical community-based ordering, which exploits the locality derived from hierarchical community structures in real-world graphs. Our ordering fully leverages low-latency cache levels by mapping hierarchical communities into hierarchical caches. The second approach is parallel incremental aggregation, which improves the runtime efficiency of reordering by decreasing the number of vertices to be processed. In addition, this approach utilizes lightweight atomic operations for concurrency control to avoid locking overheads and achieve high scalability. Our experiments show that Rabbit Order significantly outperforms state-of-the-art reordering algorithms.
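Community-based ordering in miniature: vertices in the same community receive consecutive new IDs, so their adjacency data lands in nearby cache lines. The sketch below stubs out community detection entirely and shows only the relabeling step; it is an illustration of the idea, not Rabbit Order's implementation.

```python
# Relabel vertices so that members of each community get contiguous IDs,
# improving spatial locality when the adjacency arrays are rebuilt in the
# new order. Communities are given as input here (detection is stubbed).
def reorder_by_community(communities):
    """Map old vertex IDs to new IDs, contiguous within each community."""
    perm, nxt = {}, 0
    for comm in communities:
        for v in comm:
            perm[v] = nxt
            nxt += 1
    return perm

# Two communities: {3, 0} and {2, 1, 4}.
perm = reorder_by_community([[3, 0], [2, 1, 4]])
print(perm)  # → {3: 0, 0: 1, 2: 2, 1: 3, 4: 4}
```

In the hierarchical version, the same relabeling is applied recursively, so sub-communities are also contiguous within their parent community's ID range.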
ISBN:
(Print) 9781509034840
The article presents an algorithmic model of sound propagation in rooms designed to run on parallel and distributed computer systems. This algorithm is used by the authors in an implementation of an adaptable high-performance computer system that simulates various fields and provides scalability over an arbitrary number of parallel central and graphics processors as well as distributed computer clusters. Many general-purpose computer simulation systems have limited usability when it comes to high-precision simulation involving large numbers of elementary computations, due to their lack of scalability on various parallel and distributed platforms. The higher the required adequacy of the model, the greater the number of steps in the simulation algorithms. Scalability permits the use of hybrid parallel computer systems and improves the efficiency of the simulation with respect to adequacy, time consumption, and the total cost of simulation experiments. The article covers such an algorithm, which is based on an approximate superposition of acoustic fields and provides adequate results as long as the underlying acoustic equations are linear. The algorithm represents reflecting surfaces as sets of vibrating pistons and uses the Rayleigh integral to calculate their scattering properties. The article also provides a parallel form of the algorithm and an analysis of its properties in parallel and sequential forms.
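The superposition idea can be shown in a toy form: the field at a listening point is the sum of contributions from many sources, each attenuated and phase-shifted by its distance. The monopole-like 1/r term below is a deliberate simplification, not the Rayleigh integral over piston surfaces, and all constants are illustrative.

```python
# Linear superposition of source contributions: each "piston" adds a term
# amp/r * cos(k*r) at the listening point. Because the model is linear,
# contributions can be computed independently and summed in parallel.
import math

def field_at(point, pistons, k=2.0):
    """Sum of simplified monopole-like contributions from each piston."""
    total = 0.0
    for (px, py), amp in pistons:
        r = math.hypot(point[0] - px, point[1] - py)  # distance to source
        total += amp / r * math.cos(k * r)
    return total

pistons = [((0.0, 0.0), 1.0), ((1.0, 0.0), 0.5)]  # (position, amplitude)
p = field_at((0.0, 2.0), pistons)
```

The independence of each term is what makes the algorithm scale: sources can be partitioned across processors and the partial sums combined at the end.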