The Lovász Local Lemma (LLL) is a keystone principle in probability theory, guaranteeing the existence of configurations which avoid a collection B of "bad" events which are mostly independent and have ...
ISBN: (print) 9789881925282
Finding a minimum spanning tree of a graph is a well-known problem in graph theory with many practical applications. We study serial variants of Prim's and Kruskal's algorithms and present their parallelization for message-passing parallel machines with distributed memory. We consider large graphs that cannot fit into the memory of a single process. Experimental results show that Prim's algorithm is a good choice for dense graphs, while Kruskal's algorithm is better for sparse ones. The poor scalability of Prim's algorithm comes from its high communication cost, whereas Kruskal's algorithm scales much better to larger numbers of processes.
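
As a point of reference for the serial variant mentioned above, the following is a minimal sketch of Kruskal's algorithm with a union-find structure. It is not the authors' distributed implementation; the function name `kruskal` and the `(weight, u, v)` edge format are assumptions made for illustration.

```python
def kruskal(num_vertices, edges):
    """Return (total_weight, tree_edges) of a minimum spanning forest.

    `edges` is a list of (weight, u, v) tuples with 0 <= u, v < num_vertices.
    This is a serial sketch; the edge format is an illustrative assumption.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, tree = 0, []
    # Process edges in order of increasing weight.
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two different components, so keep it
            parent[ru] = rv
            total += w
            tree.append((u, v, w))
    return total, tree


edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
total, tree = kruskal(4, edges)
```

The sort dominates the running time, which is one reason the edge-centric Kruskal approach tends to suit sparse graphs.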
ISBN: (print) 9783030410322; 9783030410315
In this work we present recent results on the application of low-rank tensor decompositions to the modelling of aggregation kinetics that takes multi-particle collisions (of three or more particles) into account. Such kinetics can be described by a system of nonlinear differential equations whose right-hand side requires N^D operations for straightforward evaluation, where N is the number of particle size classes and D is the number of particles colliding simultaneously. This complexity can be significantly reduced by applying low-rank tensor decompositions (either Tensor Train or Canonical Polyadic) to accelerate the evaluation of the sums and convolutions in the right-hand side. Building on this drastic reduction in the cost of evaluating the right-hand side, we use a standard second-order Runge-Kutta time integration scheme and demonstrate that our approach yields numerical solutions of the studied equations with very high accuracy in modest times. We also show preliminary results on the parallel scalability of the novel approach and conclude that it can be used efficiently on supercomputers.
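
The complexity reduction described above can be illustrated on a single triple sum with D = 3. The sketch below assumes a kernel given in Canonical Polyadic form with hypothetical factor lists `A`, `B`, `C` of rank R; it compares the direct N^3 evaluation with the factored one, which needs only about 3NR operations. It covers only a plain sum, not the convolution terms or the Tensor Train variant from the abstract.

```python
import math


def full_triple_sum(K, n):
    """Direct evaluation of sum_{i,j,k} K[i][j][k] * n[i] * n[j] * n[k]: N^3 terms."""
    N = len(n)
    return sum(
        K[i][j][k] * n[i] * n[j] * n[k]
        for i in range(N) for j in range(N) for k in range(N)
    )


def cp_triple_sum(A, B, C, n):
    """Same sum when K[i][j][k] = sum_r A[r][i] * B[r][j] * C[r][k].

    The sum factorises into R products of three dot products.
    """
    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    return sum(dot(a, n) * dot(b, n) * dot(c, n) for a, b, c in zip(A, B, C))


# Hypothetical rank-2 factors for N = 4 size classes (illustrative values only).
A = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
B = [[1.0, 1.0, 1.0, 1.0], [4.0, 3.0, 2.0, 1.0]]
C = [[2.0, 0.0, 1.0, 0.0], [1.0, 2.0, 3.0, 4.0]]
n = [0.4, 0.3, 0.2, 0.1]
N, R = len(n), len(A)

# Assemble the full kernel only to check the factored formula against it.
K = [[[sum(A[r][i] * B[r][j] * C[r][k] for r in range(R))
       for k in range(N)] for j in range(N)] for i in range(N)]

full = full_triple_sum(K, n)
fast = cp_triple_sum(A, B, C, n)
```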
We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation of the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate the QDWH convergence. Based on the Zolotarev rational functions, introduced by Zolotarev (ZOLO) in 1877, this new PD algorithm, ZOLO-PD, converges within two iterations even for ill-conditioned matrices, instead of the six iterations originally needed for QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence rate of up to 17, in contrast to the cubic convergence rate of QDWH. This comes at the price of higher arithmetic cost and a larger memory footprint. These extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers. In the presence of a large number of processing units, ZOLO-PD is able to outperform QDWH by up to 2.3x, especially in situations where QDWH runs out of work, for instance in the strong-scaling mode of operation.
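
To make the link between rational iterations and the scalar sign function concrete, here is a small sketch of the classical, unweighted Halley iteration applied to a scalar. It is neither QDWH (which adds dynamic weights and works on matrices via QR) nor ZOLO-PD; the function name `halley_sign` and its tolerance parameters are assumptions for illustration.

```python
def halley_sign(x, tol=1e-14, max_iter=100):
    """Approximate sign(x) for a nonzero float with the unweighted Halley iteration.

    The update is x <- x * (3 + x^2) / (1 + 3 x^2). Since
    x (3 + x^2) - (1 + 3 x^2) = (x - 1)^3, the error near +1 is cubed
    at every step, which is the cubic convergence referred to above.
    Returns (approximation, iterations used).
    """
    if x == 0:
        raise ValueError("sign iteration is undefined for x == 0")
    for it in range(1, max_iter + 1):
        x_new = x * (3.0 + x * x) / (1.0 + 3.0 * x * x)
        if abs(x_new - x) <= tol:
            return x_new, it
        x = x_new
    return x, max_iter


pos, pos_iters = halley_sign(0.01)
neg, neg_iters = halley_sign(-250.0)
```

Inputs far from ±1 first need several steps in which the iterate only changes by roughly a factor of three; reducing that slow initial phase is the motivation for weighted and higher-order rational variants.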
The present paper develops a fast method to simulate the solidification structure of continuous billets with a Cellular Automaton (CA) model. The traditional solution of the CA model on a single CPU takes a long time because of the massive datasets and complicated calculations, making it unrealistic to optimize the parameters through numerical simulation. In this paper, a parallel method based on Graphics Processing Units (GPUs) is proposed to accelerate the calculation; it introduces new algorithms for solute redistribution and neighbor capture that avoid data races in parallel computing. The new method was applied to simulate the solidification structure of Fe0.64C alloy, and the simulation results were in good agreement with experimental results obtained with the same parameters. The absolute computational time of the fast method on a Tesla P100 GPU is 277 s, while the traditional method on a single core of an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz takes 24.57 h. The speedup, the ratio between the absolute computational times of GPU-CA and CPU-CA, varies from 300 to 400 as the grid size increases.
The aim of this paper is to develop new optimized Schwarz algorithms for the one-dimensional Schrödinger equation with linear and nonlinear potentials. The classical algorithm is an iterative process. In the case of a time-independent linear potential, we construct the interface problem explicitly and apply a direct LU method to it. The algorithm thus becomes a direct process: it is independent of the transmission condition and its computational cost is lower. To our knowledge, this is the first time that the Schwarz algorithm has been constructed as a direct process. For a time-dependent linear potential or a nonlinear potential, we propose using a pre-processed linear operator as a preconditioner, which leads to a preconditioned algorithm. Numerically, the convergence is also independent of the transmission condition. In addition, both new algorithms, implemented on a parallel cluster, are robust, scale up to 256 subdomains (MPI processes), and take much less computation time than the classical algorithm, especially in the nonlinear case. (C) 2020 Elsevier B.V. All rights reserved.
We consider a multi-agent setting with agents exchanging information over a possibly time-varying network, aiming at minimising a separable objective function subject to constraints. To achieve this objective we propose a novel subgradient averaging algorithm that allows for non-differentiable objective functions and different constraint sets per agent. Allowing different constraints per agent simultaneously with a time-varying communication network constitutes a distinctive feature of our approach, extending existing results on distributed subgradient methods. To highlight the necessity of dealing with different constraint sets within a distributed optimisation context, we analyse a problem instance where an existing algorithm does not exhibit convergent behaviour if adapted to account for different constraint sets. For our proposed iterative scheme we show asymptotic convergence of the iterates to a minimum of the underlying optimisation problem for step sizes of the form η/(k+1), η > 0. We also analyse this scheme under a step size choice of η/√(k+1), η > 0, and establish a convergence rate of O(ln k/√k) in objective value. To demonstrate the efficacy of the proposed method, we investigate a robust regression problem and an ℓ2 regression problem with regularisation. (C) 2021 Elsevier Ltd. All rights reserved.
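
The role of the diminishing step size η/√(k+1) can be seen in a plain single-agent projected subgradient method. The sketch below is not the subgradient averaging algorithm proposed in the abstract (there is no network and no averaging step); the names `projected_subgradient`, `subgrad` and `project`, and the toy problem of minimising |x - 3| over the interval [0, 2], are assumptions chosen for illustration.

```python
import math


def projected_subgradient(subgrad, project, x0, eta=1.0, iters=200):
    """Single-agent projected subgradient method with step size eta / sqrt(k + 1)."""
    x = x0
    for k in range(iters):
        step = eta / math.sqrt(k + 1)
        # Move against a subgradient, then project back onto the constraint set.
        x = project(x - step * subgrad(x))
    return x


def subgrad(x):
    # A subgradient of f(x) = |x - 3|; at the kink x = 3 any value in [-1, 1]
    # is valid, and -1 is chosen here.
    return 1.0 if x > 3 else -1.0


def project(x):
    # Euclidean projection onto the constraint set [0, 2].
    return min(max(x, 0.0), 2.0)


x_star = projected_subgradient(subgrad, project, x0=0.0)
```

The objective is non-differentiable and its unconstrained minimiser lies outside the feasible set, so the iterates settle on the boundary point closest to it.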
Abstract: The convergence and accuracy of a method for solving high-order accurate bicompact schemes having the fourth order of approximation in spatial variables on a minimum stencil for a multidimensional inhomogene...
ISBN: (print) 9781538683866
Determining whether a graph is k-connected and identifying its k-connected components are fundamental problems in graph theory. For this reason, several algorithms have been proposed for these problems in both the sequential and parallel settings. Several recent sequential and parallel algorithms for k-connectivity rely on one or more breadth-first traversals of the input graph. While BFS can be made very efficient in a sequential setting, the same cannot be said of parallel environments. A major factor in this difficulty is the inherent need to use a shared queue, balance work among multiple threads in every round, synchronize, and the like. Optimizing the execution of BFS on many current parallel architectures is therefore quite challenging. As a result, the time spent on BFS operations by current parallel graph connectivity algorithms is usually a significant portion of their overall runtime. In this paper, we study how one can, in the context of algorithms for graph connectivity, mitigate the practical inefficiency of relying on BFS operations in parallel. Our technique suggests that such algorithms may not require a BFS of the input graph but can instead work with a sparse spanning subgraph of the input graph. The incorrectness introduced by not using a BFS spanning tree can then be offset by further post-processing steps on suitably defined small auxiliary graphs. Our experiments on finding the 2- and 3-connectivity of graphs on Nvidia K40c GPUs improve the state of the art on the corresponding problems by factors of 2.2x and 2.1x, respectively.
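
The round structure that makes parallel BFS costly can be seen in a level-synchronous formulation. The following is a serial sketch under the assumption of an adjacency-list dictionary `adj`; the comments mark where a parallel version would need work balancing, atomic updates, and a barrier. It illustrates the BFS bottleneck only, not the spanning-subgraph technique proposed in the paper.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: return {vertex: distance from source}."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # In a parallel version, this loop is split among threads, and the
        # work per thread depends on the degrees of its frontier vertices.
        for u in frontier:
            for v in adj[u]:
                # In parallel, this check-and-set must be atomic so that a
                # vertex is claimed by exactly one thread.
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)  # shared queue in parallel code
        # All threads must finish the round before the next one starts.
        frontier = next_frontier
    return level


# Hypothetical small graph; vertex 5 is isolated and never reached from 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
levels = bfs_levels(adj, 0)
```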
Network of interactions among bio-molecules is fundamental to biological processes. Many works have shown that molecular networks can be analyzed by decomposing the networks into smaller modules named network motifs. ...