The problem of designing efficient parallel algorithms to calculate the product of n numbers when the multipliers are large is a fundamental problem in many applications of computer science such as cryptography. In th...
This paper presents how state-of-the-art parallel algorithms designed to solve the Satisfiability (SAT) problem can be applied in the domain of product configuration. During an interactive configuration process, a use...
The scale of communities is expanding, and regional transportation networks are growing ever more intricate against the backdrop of the population explosion. This poses an obstacle to people, especially guests, in find...
Complex systems simulations are well suited to the SIMT paradigm of GPUs, enabling millions of actors to be processed in fractions of a second. At the core of many such simulations, fixed radius near neighbours (FRNN) search provides the actors with spatial awareness of their neighbours. The FRNN search process is frequently the limiting factor of performance, due to the disproportionate level of scattered memory reads demanded by the query stage, leading to FRNN search runtimes exceeding that of simulation logic. In this paper, we propose and evaluate two novel optimisations (Strips and Proportional Bin Width) for improving the performance of uniform spatially partitioned FRNN searches and apply them in combination to demonstrate the impact on the performance of multi-agent simulations. The two approaches aim to reduce latency in search and reduce the amount of data considered (i.e. more efficient searching), respectively. When the two optimisations are combined, the peak speedups observed in a benchmark model are 1.27x and 1.34x in two- and three-dimensional implementations, respectively. Due to additional non-FRNN search computation, the peak speedup obtained when applied to complex system simulations within FLAMEGPU is 1.21x. (C) 2019 The Authors. Published by Elsevier Inc.
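As a point of reference for the abstract above, the following is a minimal CPU sketch of a plain uniform spatially partitioned FRNN search. It is not the Strips or Proportional Bin Width optimisation and not a GPU implementation; the function names `build_grid` and `frnn_query`, and the choice of a cell width equal to the search radius, are assumptions made for illustration only.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid(points, radius):
    """Bin each point index by the integer coordinates of its grid cell.

    Assumption: the cell width equals the search radius, so every
    neighbour of a query lies in the query's cell or an adjacent one.
    """
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(math.floor(c / radius) for c in p)
        grid[cell].append(idx)
    return grid


def frnn_query(points, grid, radius, query):
    """Return the sorted indices of all points within `radius` of `query`."""
    home = tuple(math.floor(c / radius) for c in query)
    r2 = radius * radius
    found = []
    # Visit the home cell and all adjacent cells (3^d cells in d dimensions).
    for offset in product((-1, 0, 1), repeat=len(query)):
        cell = tuple(h + o for h, o in zip(home, offset))
        for idx in grid.get(cell, ()):
            d2 = sum((a - b) ** 2 for a, b in zip(points[idx], query))
            if d2 <= r2:
                found.append(idx)
    return sorted(found)
```

For example, with `points = [(0.1, 0.1), (0.9, 0.2), (2.5, 2.5), (1.05, 0.1)]` and a radius of `1.0`, a query at `(1.0, 0.0)` inspects nine cells and never touches the far point at `(2.5, 2.5)`. The scattered reads the abstract describes correspond to the inner loop over each visited cell's contents.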
Recently, a new trend has emerged in the field of parallel and high performance computing, the hybrid implementation using CPU-GPU modules. In such implementations, the computational load is shared between the CPU and GPU, in order to improve the computational efficiency. However, the task of sharing the computational load between the two modules is a rather difficult one, with a number of limitations being imposed. This paper extends our recent work on community detection, which is based on transforming a network of nodes into a set of threaded binary trees. In this work, we share the computational load between the two units: the CPU takes specific samples of the network communities and organizes them in the form of threaded binary trees. The GPU takes over the heavy load of reading this data and transforming it into a path-matrix. Finally, this matrix is sent back to the CPU for analysis, community detection and overlaps, as well as network information upgrades. Our simulation results show significant improvement over our previous strategy and other known community detection strategies found in the literature.
As distributed energy resources (DERs) become widespread in power systems, distributed algorithms are required for economic dispatch. Renewable generators accrue along with the randomness and volatility of the generat...
Partitioning graphs into blocks of roughly equal size such that few edges run between blocks is a frequently needed operation in processing graphs. Recently, the size, variety, and structural complexity of these networks have grown dramatically. Unfortunately, previous approaches to parallel graph partitioning have problems in this context since they often show a negative trade-off between speed and quality. We present an approach to multi-level shared-memory parallel graph partitioning that produces balanced solutions, shows high speedups for a variety of large graphs and yields very good quality independently of the number of cores used. For example, in an extensive experimental study, at 79 cores, one of our closest competitors is faster but fails to meet the balance criterion in the majority of cases and another is mostly slower and incurs about 13 percent larger cut size. Important ingredients include parallel label propagation for both coarsening and refinement, parallel initial partitioning, a simple yet effective approach to parallel localized local search, and fast locality-preserving hash tables.
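The label propagation ingredient named above can be illustrated with a small sequential sketch. This is not the authors' parallel implementation: the function name `label_propagation`, the `max_cluster_size` bound, the fixed node order, and the tie-breaking rule are all assumptions chosen to keep the example deterministic.

```python
from collections import Counter


def label_propagation(adj, max_cluster_size, rounds=5):
    """Cluster nodes by repeatedly moving each node to the neighbouring
    label with the largest total edge weight.

    adj maps node -> {neighbour: edge_weight} (undirected, symmetric).
    Assumption: a cluster may hold at most `max_cluster_size` nodes.
    """
    labels = {v: v for v in adj}          # every node starts in its own cluster
    sizes = Counter(labels.values())
    for _ in range(rounds):
        moved = False
        for v in sorted(adj):             # fixed order keeps the sketch deterministic
            weight_to = Counter()
            for u, w in adj[v].items():
                weight_to[labels[u]] += w
            current = labels[v]
            best, best_w = current, weight_to.get(current, 0)
            for lab in sorted(weight_to):
                if lab == current or sizes[lab] + 1 > max_cluster_size:
                    continue
                if weight_to[lab] > best_w:   # strict: ties keep the current label
                    best, best_w = lab, weight_to[lab]
            if best != current:
                sizes[current] -= 1
                sizes[best] += 1
                labels[v] = best
                moved = True
        if not moved:
            break
    return labels
```

On two triangles with heavy internal edges joined by a single light edge, and `max_cluster_size=3`, this sketch groups each triangle into its own cluster. In a multi-level scheme, such clusters would then be contracted to form the next coarser graph.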
We design and implement an efficient parallel algorithm for finding a perfect matching in a weighted bipartite graph such that weights on the edges of the matching are large. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization. Due to the lack of scalable alternatives, distributed solvers use sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64. To overcome this limitation, we propose a fully parallel distributed memory algorithm that first generates a perfect matching and then iteratively improves the weight of the perfect matching by searching for weight-increasing cycles of length 4 in parallel. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.
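The improvement step described above, searching for weight-increasing cycles of length 4, can be sketched sequentially as follows. A length-4 alternating cycle involves two rows and their two matched columns, so improving along it amounts to swapping the partners of the two rows. This is only an illustrative sketch: the function name `improve_by_4_cycles`, the dense weight matrix, and the use of `None` for a missing edge are assumptions, and the abstract's algorithm runs in parallel on distributed memory rather than in a simple double loop.

```python
def improve_by_4_cycles(weights, match):
    """Raise the weight of a perfect matching by swapping along 4-cycles.

    weights[i][c] is the weight of edge (row i, column c), or None if the
    edge is absent (an assumption of this sketch). match[i] is the column
    matched to row i. Returns a new matching; the input is not modified.
    """
    n = len(match)
    match = list(match)
    improved = True
    while improved:                        # total weight strictly increases, so this terminates
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                ci, cj = match[i], match[j]
                cross_i, cross_j = weights[i][cj], weights[j][ci]
                if cross_i is None or cross_j is None:
                    continue               # the 4-cycle does not exist in the graph
                if cross_i + cross_j > weights[i][ci] + weights[j][cj]:
                    match[i], match[j] = cj, ci
                    improved = True
    return match
```

The result is a perfect matching that no single 4-cycle swap can improve, which is a local optimum and not necessarily the maximum weight perfect matching.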
This paper proposes a class of graph association rules, denoted by GARs, to specify regularities between entities in graphs. A GAR is a combination of a graph pattern and a dependency; it may take as predicates ML (machine learning) classifiers for link prediction. We show that GARs help us catch incomplete information in schemaless graphs, predict links in social graphs, identify potential customers in digital marketing, and extend graph functional dependencies (GFDs) to capture both missing links and inconsistencies. We formalize association deduction with GARs in terms of the chase, and prove its Church-Rosser property. We show that the satisfiability, implication and association deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively, retaining the same complexity bounds as their GFD counterparts, despite the increased expressive power of GARs. The incremental deduction problem is DP-complete for GARs versus coNP-complete for GFDs. In addition, we provide parallel algorithms for association deduction and incremental deduction. Using real-life and synthetic graphs, we experimentally verify the effectiveness, scalability and efficiency of the parallel algorithms.