Given a temporal graph G, a source vertex s, and a departure time at source vertex ts, the earliest arrival time problem (EAT) is to start from s on or after ts and reach all the vertices in G as early as possible. Ni...
详细信息
ISBN:
(纸本)9781665422925
Given a temporal graph G, a source vertex s, and a departure time at source vertex ts, the earliest arrival time problem (EAT) is to start from s on or after ts and reach all the vertices in G as early as possible. Ni et al. have proposed a parallel algorithm for EAT and obtained a speedup up to 9.5x on real-world graphs with respect to the connection-scan serial algorithm by using multi-core processors. We propose a topology-driven parallel algorithm for EAT on public transport networks and implement using general-purpose programming on the graphics processing unit (GPU). A temporal connection in a temporal graph for a public transport network is associated with a departure time and a duration time, and many connections exist from u to v for an edge (u, v). We propose two pruning techniques connection-type and clustering, and use arithmetic progression technique appropriately to process many connections of an edge, without scanning all of them. In the connection-type technique, the connections of an edge with the same duration are grouped together. In the clustering technique, we follow 24-hour format and the connections of an edge are partitioned into 24 clusters so that the departure time of connections in the ith cluster is at least i-hour and at most i + 1-hour. The arithmetic progression technique helps to store a sequence of departure times of various connections in a compact way. We propose a hybrid approach to combine the three techniques (connection-type, clustering and arithmetic progression) in an efficient way. Our techniques achieve an average speedup up to 61x when compared to the existing connection-scan serial algorithm running on CPU. Also, the average speedup of our algorithm is 12.65x against the parallel edge-scan-dependency graph algorithm running on GPU.
Simulation of quantum systems is challenging due to the exponential size of the state space. Tensor networks provide a systematically improvable approximation for quantum states. 2D tensor networks such as Projected E...
详细信息
ISBN:
(纸本)9781728199986
Simulation of quantum systems is challenging due to the exponential size of the state space. Tensor networks provide a systematically improvable approximation for quantum states. 2D tensor networks such as Projected Entangled Pair States (PEPS) are well-suited for key classes of physical systems and quantum circuits. However, direct contraction of PEPS networks has exponential cost, while approximate algorithms require computations with large tensors. We propose new scalable algorithms and software abstractions for PEPS-based methods, accelerating the bottleneck operation of contraction and refactorization of a tensor subnetwork. We employ randomized SVD with an implicit matrix to reduce cost and memory footprint asymptotically. Further, we develop a distributed-memory PEPS library and study accuracy and efficiency of alternative algorithms for PEPS contraction and evolution on the Stampede2 supercomputer. We also simulate a popular near-term quantum algorithm, the Variational Quantum Eigensolver (VQE), and benchmark Imaginar:k Time Evolution (ITE), which compute ground states of Hamiltonians.
Boolean satisfiability (SAT) is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing, pa...
详细信息
Boolean satisfiability (SAT) is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing, parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers, which all happen to be of the so-called Conflict-Directed Clause Learning variety. We show the potential for speedups of up to 382x across a variety of problem instances. We hope that these results will stimulate future research, particularly with respect to a computer architecture open problem we present.
We present an MPI + OpenACC implementation of the kernel-independent barycentric Lagrange treecode (BLTC) for fast summation of particle interactions on GPUs. The distributed memory parallelization uses recursive coor...
详细信息
ISBN:
(数字)9781728174457
ISBN:
(纸本)9781728174457
We present an MPI + OpenACC implementation of the kernel-independent barycentric Lagrange treecode (BLTC) for fast summation of particle interactions on GPUs. The distributed memory parallelization uses recursive coordinate bisection for domain decomposition and MPI remote memory access to build locally essential trees on each rank. The particle interactions are organized into target batch/source cluster interactions which efficiently map onto the GPU;target batching provides an outer level of parallelism, while the direct sum form of the barycentric particle-cluster approximation provides an inner level of parallelism. The GPU-accelerated BLTC performance is demonstrated on several test cases up to 1 billion particles interacting via the Coulomb potential and Yukawa potential.
Machine Learning algorithms try to provide an adequate forecast for predicting and understanding a multitude of phenomena. However, due to the chaotic nature of real systems, it is very difficult to predict data: a sm...
详细信息
ISBN:
(纸本)9783030504335;9783030504328
Machine Learning algorithms try to provide an adequate forecast for predicting and understanding a multitude of phenomena. However, due to the chaotic nature of real systems, it is very difficult to predict data: a small perturbation from initial state can generate serious errors. Data Assimilation is used to estimate the best initial state of a system in order to predict carefully the future states. Therefore, an accurate and fast Data Assimilation can be considered a fundamental step for the entire Machine Learning process. Here, we deal with the Gaussian convolution operation which is a central step of the Data Assimilation approach and, in general, in several data analysis procedures. In particular, we propose a parallel algorithm, based on the use of Recursive Filters to approximate the Gaussian convolution in a very fast way. Tests and experiments confirm the efficiency of the proposed implementation.
The problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles appears frequently in the context of agent-based simulation studies. For this reason, the High Level Architecture (HL...
详细信息
The problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles appears frequently in the context of agent-based simulation studies. For this reason, the High Level Architecture (HLA) specification a standard framework for interoperability among simulators includes a Data Distribution Management (DDM) service whose responsibility is to report all intersections between a set of subscription and update regions. The algorithms at the core of the DDM service are CPU-intensive, and could greatly benefit from the large computing power of modern multi-core processors. In this article, we propose two parallel solutions to the DDM problem that can operate effectively on shared-memory multiprocessors. The first solution is based on a data structure (the interval tree) that allows concurrent computation of intersections between subscription and update regions. The second solution is based on a novel parallel extension of the Sort Based Matching algorithm, whose sequential version is considered among the most efficient solutions to the DDM problem. Extensive experimental evaluation of the proposed algorithms confirm their effectiveness on taking advantage of multiple execution units in a shared-memory architecture.
We have found provably optimal algorithms for full-domain discrete-ordinate transport sweeps on a class of grids in 2D and 3D Cartesian geometry that are regular at a coarse level but arbitrary within the coarse block...
详细信息
We have found provably optimal algorithms for full-domain discrete-ordinate transport sweeps on a class of grids in 2D and 3D Cartesian geometry that are regular at a coarse level but arbitrary within the coarse blocks. We describe these algorithms and show that they always execute the full eight-octant (or four-quadrant if 2D) sweep in the minimum possible number of stages for a given P-x x P-y x P-z, partitioning. Computational results confirm that our optimal scheduling algorithms execute sweeps in the minimum possible stage count. Observed parallel efficiencies agree well with our performance model. Our PDT transport code has achieved approximately 68% parallel efficiency with > 1.5M parallel threads, relative to 8 threads, on a simple weak-scaling problem with only three energy groups, 10 directions per octant, and 4096 cells/thread. Our ARDRA code has achieved 71% efficiency with > 1.5M cores, relative to 16 cores, with 36 directions per octant and 48 energy groups. We demonstrate similar efficiencies with PDT on a realistic set of nuclear-reactor test problems, with unstructured meshes that resolve fine geometric details. These results demonstrate that discrete-ordinates transport sweeps can be executed with high efficiency using more than 10(6) parallel processes. (C) 2020 Published by Elsevier Inc.
The alpha complex, a subset of the Delaunay triangulation, has been extensively used as the underlying representation for biomolecular structures. We propose a GPU-based parallel algorithm for the computation of the a...
详细信息
The alpha complex, a subset of the Delaunay triangulation, has been extensively used as the underlying representation for biomolecular structures. We propose a GPU-based parallel algorithm for the computation of the alpha complex, which exploits the knowledge of typical spatial distribution and sizes of atoms in a biomolecule. Unlike existing methods, this algorithm does not require prior construction of the Delaunay triangulation. The algorithm computes the alpha complex in two stages. The first stage proceeds in a bottom-up fashion and computes a superset of the edges, triangles, and tetrahedra belonging to the alpha complex. The false positives from this estimation stage are removed in a subsequent pruning stage to obtain the correct alpha complex. Computational experiments on several biomolecules demonstrate the superior performance of the algorithm, up to a factor of 50 when compared to existing methods that are optimized for biomolecules. (C) 2020 Elsevier B.V. All rights reserved.
Determining the inner organizational structure of sets of networked elements is of paramount importance to analyze real-world systems such as social, biological, or economic networks. To such an end, it is necessary t...
详细信息
Determining the inner organizational structure of sets of networked elements is of paramount importance to analyze real-world systems such as social, biological, or economic networks. To such an end, it is necessary to identify communities of interrelated nodes within the networks. Recently, a fuzzy community detection approach based on the minimization of a topological error functional has been proposed in the form of a gradient-based algorithm design pattern. However, the intrinsic quadratic algorithmic complexity of the procedure limits the problem size that can be efficiently treated. Here, we extend the ability of this approach to analyze larger networks resorting to parallelism. Thus, we identify the concurrency sources in the gradient-based algorithm design pattern. To determine the parallelization limits, we develop a two-dimensional performance model as a function of the number of processors and network size. The model permits to compute the maximum possible speedup. Another model is presented to find the maximum problem size tractable in a given amount of time. Application of the previous models to a set of benchmark networks shows that parallelization enhances the proposed fuzzy community detection approach in more than an order of magnitude. This allows treatment of networks with several hundred thousand nodes in a time frame of hours.
In this work we deal with the solution of a two-dimensional inverse time fractional diffusion equation, involving a Caputo fractional derivative in his expression. Since we deal with a huge practical problem with a la...
详细信息
ISBN:
(纸本)9783030390815
In this work we deal with the solution of a two-dimensional inverse time fractional diffusion equation, involving a Caputo fractional derivative in his expression. Since we deal with a huge practical problem with a large domain, by starting from an accurate meshless localized collocation method using RBFs, here we propose a fast algorithm, implemented in a multicore architecture, which exploits suitable parallel computational kernels. More in detail, we firstly developed, a C code based on the numerical library LAPACK to perform the basic linear algebra operations and to solve linear systems, then, due to the high computational complexity and the large size of the problem, we propose a parallel algorithm specifically designed for multicore architectures and based on the Pthreads library. Performance analysis will show accuracy and reliability of our parallel implementation.
暂无评论