Synchronization, the problem of bringing an automaton to a particular state regardless of its initial state, is an important problem for automata. It has several practical applications and is related to a fifty-year-old conjecture on the length of the shortest synchronizing word. Although shorter words are more effective in practice, finding a shortest one (which is not necessarily unique) is NP-hard. For this reason, various heuristics exist in the literature. However, high-quality heuristics such as SYNCHROP, which produce relatively short sequences, are very expensive and can take hours when the automaton has tens of thousands of states. The SYNCHROP heuristic has frequently been used as a benchmark to evaluate the performance of new heuristics. In this work, we first improve the runtime of SYNCHROP and its variants using algorithmic techniques. We then focus on adapting SYNCHROP to many-core architectures and, overall, obtain more than a 1000x speedup on GPUs compared to the naive sequential implementation commonly used as a baseline in the literature. We also propose two SYNCHROP variants and evaluate their performance.
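The abstract leaves the heuristics themselves abstract. As a point of reference, here is a minimal greedy synchronizing-word heuristic in the pair-merging style such methods share; it is not SYNCHROP (which scores pairs with a cost function rather than picking them naively), and the representation and function names are ours.

```python
from collections import deque

def apply_word(delta, s, w):
    """Run the automaton from state s on word w; delta[state][letter] -> state."""
    for a in w:
        s = delta[s][a]
    return s

def merge_word(delta, p, q):
    """BFS over state pairs: shortest word w with delta*(p, w) == delta*(q, w)."""
    start = frozenset((p, q))
    seen, queue = {start}, deque([(start, [])])
    while queue:
        pair, word = queue.popleft()
        if len(pair) == 1:
            return word
        for a in range(len(delta[0])):
            nxt = frozenset(delta[s][a] for s in pair)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + [a]))
    return None  # p and q cannot be merged: the automaton is not synchronizing

def greedy_sync_word(delta):
    """Greedy heuristic: repeatedly merge a pair of still-active states."""
    active, word = set(range(len(delta))), []
    while len(active) > 1:
        p, q = sorted(active)[:2]   # naive pair choice; SYNCHROP scores pairs instead
        w = merge_word(delta, p, q)
        if w is None:
            return None
        word += w
        active = {apply_word(delta, s, w) for s in active}
    return word

# Cerny automaton with 4 states: letter 0 is a cyclic shift, letter 1 sends 0 to 1.
cerny4 = [[1, 1], [2, 1], [3, 2], [0, 3]]
print(greedy_sync_word(cerny4))     # a synchronizing word (not necessarily shortest)
```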
With the emergence of Online Social Networks (OSNs) as an effective medium of information dissemination, their abuse in spreading misinformation has become a great concern to users. Hence, the misinformation containment problem, in its various forms, has emerged as an important research topic. In general, given a snapshot of an online social network with a set of misinformed nodes and a budget limiting the maximum number of seed nodes, the goal is to determine a set of seed nodes with the correct information that contains the misinformation as early as possible. In this paper, we leverage the community structure of the online social network to select the seed nodes statically, independent of the distribution of misinformed nodes, for faster misinformation containment with a simple one-time computation. We extend the work to OSNs with overlapping communities as well. To the best of our knowledge, ours is the first work in which the topology of the OSN is exploited to combat the spread of misinformation faster. Experiments on real OSNs reveal that the proposed techniques significantly outperform state-of-the-art algorithms in terms of maximum and average infected time and the point of decline, manifesting the key role of community structure in misinformation containment in a social network. Moreover, parallel implementations of the proposed algorithms achieve around a 10x speed-up over the sequential ones, enhancing the scalability of the proposed approach.
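As an illustration of the community-based static selection idea (the paper's exact selection rule is not given in this abstract), the sketch below picks one high-degree node per detected community up to the budget, using networkx; the specific rule and all names are our assumptions.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def community_seeds(G, budget):
    """Statically pick seed nodes from the community structure: one high-degree
    node per community, largest communities first (illustrative rule only)."""
    communities = greedy_modularity_communities(G)
    seeds = []
    for comm in sorted(communities, key=len, reverse=True):
        if len(seeds) >= budget:
            break
        # use the highest-degree node of the community as its local "corrector"
        seeds.append(max(comm, key=G.degree))
    return seeds

G = nx.karate_club_graph()
print(community_seeds(G, budget=3))   # one seed in each of the 3 largest communities
```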
The Lovász Local Lemma (LLL) is a keystone principle in probability theory, guaranteeing the existence of configurations that avoid a collection B of "bad" events that are mostly independent and have ...
ISBN (Print): 9789881925282
Finding a minimum spanning tree of a graph is a well-known problem in graph theory with many practical applications. We study serial variants of Prim's and Kruskal's algorithms and present their parallelization targeting message-passing parallel machines with distributed memory. We consider large graphs that cannot fit into the memory of one process. Experimental results show that Prim's algorithm is a good choice for dense graphs, while Kruskal's algorithm is better for sparse ones. The poor scalability of Prim's algorithm comes from its high communication cost, while Kruskal's algorithm showed much better scaling to larger numbers of processes.
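For concreteness, a serial Prim's algorithm with a binary heap might look like the following sketch; the adjacency-list representation and names are ours, and the paper's serial variants may differ.

```python
import heapq

def prim(adj):
    """Minimum spanning tree via Prim's algorithm with a binary heap.
    adj[u] is a list of (weight, v) pairs; vertices are 0..n-1."""
    n = len(adj)
    in_tree = [False] * n
    mst, total = [], 0.0
    heap = [(0.0, 0, -1)]                  # (edge weight, vertex, parent)
    while heap:
        w, u, p = heapq.heappop(heap)
        if in_tree[u]:
            continue                       # stale entry: u was already reached
        in_tree[u] = True
        if p >= 0:
            mst.append((p, u, w))
            total += w
        for wv, v in adj[u]:
            if not in_tree[v]:
                heapq.heappush(heap, (wv, v, u))
    return mst, total

# Small 4-vertex example; the MST has weight 1 + 2 + 3 = 6.
adj = [[(1, 1), (4, 2)], [(1, 0), (2, 2), (5, 3)],
       [(4, 0), (2, 1), (3, 3)], [(5, 1), (3, 2)]]
print(prim(adj))
```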
ISBN (Print): 9783030410322; 9783030410315
In this work we present recent results on the application of low-rank tensor decompositions to the modelling of aggregation kinetics that takes multi-particle collisions (three and more particles) into account. Such kinetics can be described by a system of nonlinear differential equations whose right-hand side requires N^D operations for its straightforward evaluation, where N is the number of particle size classes and D is the number of particles colliding simultaneously. This complexity can be significantly reduced by applying low-rank tensor decompositions (either Tensor Train or Canonical Polyadic) to accelerate the evaluation of the sums and convolutions in the right-hand side. Based on this drastic reduction in the cost of evaluating the right-hand side, we use a standard second-order Runge-Kutta time-integration scheme and demonstrate that our approach yields numerical solutions of the studied equations with very high accuracy in modest time. We also show preliminary results on the parallel scalability of the novel approach and conclude that it can be used efficiently on supercomputers.
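The cost reduction can be illustrated in the simplest pairwise (D = 2) setting: for a separable, low-rank kernel, each rank-one term of the Smoluchowski right-hand side is a discrete convolution computable in O(N log N) by FFT. The toy sketch below is our own illustration, not the paper's Tensor Train / Canonical Polyadic machinery.

```python
import numpy as np

def smoluchowski_rhs(n, a, b):
    """RHS of the pairwise (D = 2) Smoluchowski equations for a separable kernel
    K[i, j] = sum_r a[r, i] * b[r, j]:

        dn_k/dt = 1/2 * sum_{i+j=k} K[i, j] n_i n_j  -  n_k * sum_j K[k, j] n_j

    Arrays are indexed by cluster size (entry 0 is unused and kept at zero).
    Each rank-one term of the gain sum is a discrete convolution, costing
    O(N log N) via FFT instead of O(N^2) for direct evaluation."""
    N = n.size
    gain, loss = np.zeros(N), np.zeros(N)
    for ar, br in zip(a, b):
        u, v = ar * n, br * n
        conv = np.fft.irfft(np.fft.rfft(u, 2 * N) * np.fft.rfft(v, 2 * N), 2 * N)
        gain += conv[:N]                  # (u * v)[k] = sum_{i+j=k} u_i v_j
        loss += ar * n * np.sum(br * n)   # n_k * sum_r a_r(k) * <b_r, n>
    return 0.5 * gain - loss

# The constant kernel K = 1 is rank one: a = b = all-ones.
N = 1024
n = np.zeros(N); n[1] = 1.0               # monodisperse initial condition
ones = np.ones((1, N))
print(smoluchowski_rhs(n, ones, ones)[:4])   # approx [0, -1, 0.5, 0]
```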
We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate QDWH convergence. Based on the Zolotarev rational functions, introduced by Zolotarev (ZOLO) in 1877, this new PD algorithm, ZOLO-PD, converges within two iterations even for ill-conditioned matrices, instead of the original six iterations needed by QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence order of up to seventeen, in contrast to the cubic convergence of QDWH. This comes at the price of higher arithmetic cost and memory footprint; these extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers, and show that, in the presence of a large number of processing units, ZOLO-PD outperforms QDWH by up to 2.3x, especially in situations where QDWH runs out of work, for instance in the strong-scaling mode of operation.
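To make the object concrete: the polar factor can already be computed by the classical Newton iteration X_{k+1} = (X_k + X_k^{-T})/2, which QDWH and ZOLO-PD improve upon with inverse-free, dynamically weighted rational iterations that need far fewer steps. Below is a minimal sketch of the classical iteration, not the paper's algorithm.

```python
import numpy as np

def polar_newton(A, tol=1e-12, max_iter=50):
    """Polar factor U of A = U H via the classical Newton iteration
        X_{k+1} = (X_k + X_k^{-T}) / 2,
    which converges quadratically for nonsingular real A."""
    X = A.copy()
    for _ in range(max_iter):
        Xn = 0.5 * (X + np.linalg.inv(X).T)
        done = np.linalg.norm(Xn - X, 'fro') <= tol * np.linalg.norm(Xn, 'fro')
        X = Xn
        if done:
            break
    H = X.T @ A                    # symmetric positive semidefinite factor
    return X, 0.5 * (H + H.T)      # symmetrize H against round-off

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
U, H = polar_newton(A)
print(np.linalg.norm(U.T @ U - np.eye(5)))   # near machine precision: U is orthogonal
```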
The present paper develops a fast method to simulate the solidification structure of continuous billets with a Cellular Automaton (CA) model. A traditional solution of the CA model on a single CPU takes a long time because of the massive datasets and complicated calculations, making it unrealistic to optimize the parameters through numerical simulation. In this paper, a parallel method based on Graphics Processing Units (GPUs) is proposed to accelerate the calculation, with new algorithms for solute redistribution and neighbor capture that avoid data races in parallel computing. The new method was applied to simulate the solidification structure of an Fe-0.64C alloy, and the simulation results were in good agreement with experimental results under the same parameters. The absolute computational time for the fast method implemented on a Tesla P100 GPU is 277 s, while the traditional method implemented on a single core of an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz takes 24.57 h. The speedup, the ratio of the CPU-CA computational time to that of GPU-CA, varies from 300 to 400 as the number of grid cells increases.
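A standard way to avoid the data races mentioned above is double buffering: every cell reads only the old grid and writes only a new one, so all cell updates are independent and can run in parallel. The toy capture rule below is ours, not the paper's solute-redistribution or neighbor-capture algorithm.

```python
import numpy as np

def ca_step(state):
    """One synchronous cellular-automaton step with double buffering: reads come
    from the old grid and writes go to a separate new grid, so the per-cell
    updates are race-free and trivially parallelizable."""
    pad = np.pad(state, 1)
    # count solid von Neumann neighbors of every cell (toy capture rule)
    neighbors = (pad[:-2, 1:-1] + pad[2:, 1:-1] +
                 pad[1:-1, :-2] + pad[1:-1, 2:])
    new_state = state.copy()                     # write target: separate buffer
    new_state[(state == 0) & (neighbors >= 1)] = 1   # empty cell gets "captured"
    return new_state

grid = np.zeros((8, 8), dtype=np.int8)
grid[3:5, 3:5] = 1                   # small solid nucleus of 4 cells
grid = ca_step(grid)
print(grid.sum())                    # prints 12: the 8 boundary cells were captured
```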
The aim of this paper is to develop new optimized Schwarz algorithms for the one-dimensional Schrödinger equation with linear and nonlinear potentials. The classical algorithm is an iterative process. In the case of a time-independent linear potential, we construct the interface problem explicitly and apply a direct LU method to it. The algorithm therefore becomes a direct process: it is independent of the transmission condition, and the numerical computation is cheaper. To our knowledge, this is the first time the Schwarz algorithm has been constructed as a direct process. For the case of a time-dependent linear potential or a nonlinear potential, we propose using a pre-processed linear operator as a preconditioner, which leads to a preconditioned algorithm whose convergence is, numerically, also independent of the transmission condition. In addition, both new algorithms implemented on a parallel cluster are robust, scale up to 256 subdomains (MPI processes), and take much less computation time than the classical one, especially in the nonlinear case.
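As background, the classical iterative Schwarz process that the paper turns into a direct one can be sketched on a simpler model problem. The following alternates direct subdomain solves of a 1D Poisson problem, exchanging Dirichlet data across an overlap; the paper's setting (the Schrödinger equation with optimized transmission conditions) is more involved, and all details below are our illustrative choices.

```python
import numpy as np

def solve_dirichlet(f, left, right, h):
    """Direct solve of -u'' = f on one subdomain with Dirichlet boundary data."""
    m = f.size
    A = (np.diag(2.0 * np.ones(m)) - np.diag(np.ones(m - 1), 1)
         - np.diag(np.ones(m - 1), -1)) / h**2
    rhs = f.copy()
    rhs[0] += left / h**2        # fold known boundary values into the RHS
    rhs[-1] += right / h**2
    return np.linalg.solve(A, rhs)

# -u'' = 1 on (0, 1), u(0) = u(1) = 0; exact solution u(x) = x(1-x)/2.
N = 99; h = 1.0 / (N + 1)
x = np.linspace(h, 1 - h, N)
f = np.ones(N)
mid, ov = N // 2, 5              # interface index and overlap half-width
u = np.zeros(N)
for _ in range(30):              # classical alternating Schwarz sweeps
    # left subdomain: points 0 .. mid+ov-1, right boundary value taken from u
    u[:mid + ov] = solve_dirichlet(f[:mid + ov], 0.0, u[mid + ov], h)
    # right subdomain: points mid-ov .. N-1, left boundary value taken from u
    u[mid - ov:] = solve_dirichlet(f[mid - ov:], u[mid - ov - 1], 0.0, h)
print(np.abs(u - x * (1 - x) / 2).max())   # remaining error after 30 sweeps
```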
We consider a multi-agent setting with agents exchanging information over a possibly time-varying network, aiming at minimising a separable objective function subject to constraints. To achieve this objective we propose a novel subgradient averaging algorithm that allows for non-differentiable objective functions and different constraint sets per agent. Allowing different constraints per agent simultaneously with a time-varying communication network constitutes a distinctive feature of our approach, extending existing results on distributed subgradient methods. To highlight the necessity of dealing with different constraint sets within a distributed optimisation context, we analyse a problem instance where an existing algorithm does not exhibit convergent behaviour if adapted to account for different constraint sets. For our proposed iterative scheme we show asymptotic convergence of the iterates to a minimum of the underlying optimisation problem for step sizes of the form η/(k+1), η > 0. We also analyse this scheme under a step-size choice of η/√(k+1), η > 0, and establish a convergence rate of O(ln k/√k) in objective value. To demonstrate the efficacy of the proposed method, we investigate a robust regression problem and an ℓ2 regression problem with regularisation.
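A minimal instance of subgradient averaging with per-agent constraint sets and the η/(k+1) step size might look as follows. The ring network, the objective f_i(x) = |a_i x - b_i|, and the box constraints are illustrative choices of ours, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T, eta = 6, 5000, 1.0
a, b = rng.standard_normal(m), rng.standard_normal(m)
boxes = [(-1.0 + 0.1 * i, 1.0 + 0.1 * i) for i in range(m)]   # per-agent sets

x = np.zeros(m)                      # one scalar iterate per agent
W = np.zeros((m, m))                 # doubly stochastic ring weights
for i in range(m):
    W[i, i] = 0.5
    W[i, (i - 1) % m] = W[i, (i + 1) % m] = 0.25

for k in range(T):
    v = W @ x                        # average with neighbors
    g = a * np.sign(a * v - b)       # subgradient of f_i(x) = |a_i x - b_i|
    x = v - eta / (k + 1) * g        # diminishing step size eta/(k+1)
    x = np.clip(x, [lo for lo, _ in boxes], [hi for _, hi in boxes])  # project
print(x)                             # agents approach agreement on a minimiser
```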
ISBN (Print): 9781538683866
Finding whether a graph is k-connected, and identifying its k-connected components, is a fundamental problem in graph theory. For this reason, there have been several algorithms for this problem in both the sequential and parallel settings. Several recent sequential and parallel algorithms for k-connectivity rely on one or more breadth-first traversals of the input graph. While BFS can be made very efficient in a sequential setting, the same cannot be said of parallel environments: much of the difficulty stems from the inherent requirements of a shared queue, per-round work balancing among threads, synchronization, and the like. Optimizing the execution of BFS on many current parallel architectures is therefore quite challenging, and the time spent by current parallel graph-connectivity algorithms on BFS operations is usually a significant portion of their overall runtime. In this paper, we study how one can, in the context of algorithms for graph connectivity, mitigate the practical inefficiency of relying on BFS operations in parallel. Our technique shows that such algorithms may not require a BFS of the input graph but can instead work with a sparse spanning subgraph of it. The incorrectness introduced by not using a BFS spanning tree can then be offset by post-processing steps on suitably defined small auxiliary graphs. Our experiments on finding the 2- and 3-connectivity of graphs on Nvidia K40c GPUs improve the state of the art on the corresponding problems by factors of 2.2x and 2.1x, respectively.
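The "sparse spanning subgraph instead of BFS" idea can be illustrated with a union-find spanning forest, which scans edges in arbitrary order and needs no level-synchronous traversal. This is a sequential sketch of our own; the paper's GPU construction and auxiliary-graph post-processing are not shown.

```python
def spanning_forest(n, edges):
    """Build a sparse spanning subgraph (a spanning forest) with union-find.
    Edges can be processed in any order, which is far friendlier to parallel
    hardware than the round-by-round structure of BFS."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                # u and v are in different components
            parent[ru] = rv
            forest.append((u, v))
    return forest                   # at most n - 1 edges, same components as input

print(spanning_forest(5, [(0, 1), (1, 2), (0, 2), (3, 4)]))
# [(0, 1), (1, 2), (3, 4)] - a forest spanning the input's two components
```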