In this work we formally derive and prove the correctness of the algorithms and data structures in a parallel, distributed-memory, generic finite element framework that supports h-adaptivity on computational domains represented as forest-of-trees. The framework is grounded on a rich representation of the adaptive mesh suitable for generic finite elements that is built on top of a low-level, light-weight forest-of-trees data structure handled by a specialized, highly parallel adaptive meshing engine, for which we have identified the requirements it must fulfill in order to be coupled into our framework. Atop this two-layered mesh representation, we build the rest of the data structures required for the numerical integration and assembly of the discrete system of linear equations. We consider algorithms that are suitable for both subassembled and fully assembled distributed data layouts of linear system matrices. The proposed framework has been implemented within the FEMPAR scientific software library, using p4est as a practical forest-of-octrees demonstrator. A strong scaling study of this implementation when applied to Poisson and Maxwell problems reveals remarkable scalability up to 32.2K CPU cores and 482.2M degrees of freedom. In addition, a comparative performance study of FEMPAR and the state-of-the-art deal.II finite element software shows at least comparable performance, and at most a factor of 2-3 improvement, in the h-adaptive approximation of a Poisson problem with first- and second-order Lagrangian finite elements, respectively.
This paper solves the Black-Scholes equation for European options using a time-parallel algorithm combined with the Kansa method. First, the partial differential equation for the price of deri...
Hash tables are a fundamental data structure for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively parallel, many-core GPU architectures. It identifies key factors affecting the performance of different techniques and suggests directions for further research.
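As a concrete illustration of one family of techniques such surveys cover, the sketch below shows a lock-free open-addressing insert driven by atomic compare-and-swap, written as a CPU analogue in C++ rather than GPU code; the capacity, key encoding, and absence of deletion or resizing are simplifying assumptions for the example, not features of any particular surveyed implementation.

    // Minimal sketch of lock-free linear-probing insertion, a CPU analogue of a
    // common data-parallel hashing scheme in which each thread claims a slot
    // with compare-and-swap.
    #include <atomic>
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct LockFreeHashSet {
        static constexpr uint64_t EMPTY = 0;          // 0 is reserved as "empty slot"
        std::vector<std::atomic<uint64_t>> table;
        explicit LockFreeHashSet(size_t capacity) : table(capacity) {
            for (auto& slot : table) slot.store(EMPTY, std::memory_order_relaxed);
        }
        bool insert(uint64_t key) {                   // key must be nonzero
            size_t h = std::hash<uint64_t>{}(key) % table.size();
            for (size_t probe = 0; probe < table.size(); ++probe) {
                size_t i = (h + probe) % table.size();
                uint64_t expected = EMPTY;
                if (table[i].compare_exchange_strong(expected, key)) return true; // slot claimed
                if (expected == key) return false;    // key already present
            }
            return false;                             // table full
        }
    };

    int main() {
        LockFreeHashSet set(1 << 20);
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t)
            workers.emplace_back([&, t] {
                for (uint64_t k = 1; k <= 100000; ++k) set.insert(k * 4 + t + 1);
            });
        for (auto& w : workers) w.join();
        std::cout << "done\n";
    }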
The stability of a social network has been widely studied as an important indicator for both the network holders and the participants. Existing works on reinforcing networks focus on a local view, e.g., the anchored k...
The problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles appears frequently in the context of agent-based simulation studies. For this reason, the High Level Architecture (HLA) specification, a standard framework for interoperability among simulators, includes a Data Distribution Management (DDM) service whose responsibility is to report all intersections between a set of subscription and update regions. The algorithms at the core of the DDM service are CPU-intensive, and could greatly benefit from the large computing power of modern multi-core processors. In this article, we propose two parallel solutions to the DDM problem that can operate effectively on shared-memory multiprocessors. The first solution is based on a data structure (the interval tree) that allows concurrent computation of intersections between subscription and update regions. The second solution is based on a novel parallel extension of the Sort Based Matching algorithm, whose sequential version is considered among the most efficient solutions to the DDM problem. Extensive experimental evaluation of the proposed algorithms confirms their effectiveness in taking advantage of multiple execution units in a shared-memory architecture.
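To make the matching problem concrete, the following C++ sketch implements the sequential sweep idea behind Sort Based Matching for one dimension: sort all interval endpoints, then report an intersection whenever an interval opens while intervals of the other kind are still open. It illustrates the sequential baseline the article parallelizes, not the authors' parallel extension; all names are invented for the example.

    // One-dimensional Sort-Based Matching sketch: sweep over sorted endpoints,
    // keeping the sets of currently open subscription and update intervals.
    #include <algorithm>
    #include <cstdio>
    #include <set>
    #include <vector>

    struct Interval { double lo, hi; int id; };
    struct Event    { double x; bool isStart; bool isSubscription; int id; };

    void sortBasedMatching(const std::vector<Interval>& subs,
                           const std::vector<Interval>& upds) {
        std::vector<Event> events;
        for (const auto& s : subs) { events.push_back({s.lo, true,  true, s.id});
                                     events.push_back({s.hi, false, true, s.id}); }
        for (const auto& u : upds) { events.push_back({u.lo, true,  false, u.id});
                                     events.push_back({u.hi, false, false, u.id}); }
        // Start events sort before end events at equal coordinates, so touching
        // intervals count as intersecting (a modeling choice for this sketch).
        std::sort(events.begin(), events.end(), [](const Event& a, const Event& b) {
            return a.x < b.x || (a.x == b.x && a.isStart && !b.isStart);
        });
        std::set<int> openSubs, openUpds;
        for (const auto& e : events) {
            auto& own   = e.isSubscription ? openSubs : openUpds;
            auto& other = e.isSubscription ? openUpds : openSubs;
            if (!e.isStart) { own.erase(e.id); continue; }
            for (int otherId : other)   // every open interval of the other kind overlaps
                std::printf("subscription %d intersects update %d\n",
                            e.isSubscription ? e.id : otherId,
                            e.isSubscription ? otherId : e.id);
            own.insert(e.id);
        }
    }

    int main() {
        std::vector<Interval> subs = {{0.0, 2.0, 0}, {1.5, 4.0, 1}};
        std::vector<Interval> upds = {{1.0, 3.0, 0}};
        sortBasedMatching(subs, upds);   // expect: sub 0 x upd 0, sub 1 x upd 0
    }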
One of the simplest problems on directed graphs is that of identifying the set of vertices reachable from a designated source vertex. This problem can be solved easily sequentially by performing a graph search, but efficient parallel algorithms have eluded researchers for decades. For sparse high-diameter graphs in particular, there is no known work-efficient parallel algorithm with nontrivial parallelism. This amounts to one of the most fundamental open questions in parallel graph algorithms: Is there a parallel algorithm for digraph reachability with nearly linear work? This article shows that the answer is yes, presenting a randomized parallel algorithm for digraph reachability and related problems with expected work Õ(m) and span Õ(n^(2/3)), and hence parallelism Õ(m/n^(2/3)) = Ω̃(n^(1/3)), on any graph with n vertices and m arcs. This is the first parallel algorithm having both nearly linear work and strongly sublinear span, i.e., span Õ(n^(1-ε)) for some constant ε > 0. The algorithm can be extended to produce a directed spanning tree, determine whether the graph is acyclic, topologically sort the strongly connected components of the graph, or produce a directed ear decomposition, all with work Õ(m) and span Õ(n^(2/3)). The main technical contribution is an efficient Monte Carlo algorithm that, through the addition of Õ(n) shortcuts, reduces the diameter of the graph to Õ(n^(2/3)) with high probability. While both sequential and parallel algorithms are known with those combinatorial properties, even the sequential algorithms are not efficient, having sequential runtime Ω(m·n^(Ω(1))). This article presents a surprisingly simple sequential algorithm that achieves the stated diameter reduction and runs in Õ(m) time. Parallelizing that algorithm yields the main result, but doing so involves overcoming several other challenges.
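The toy C++ sketch below illustrates only the generic idea of shortcutting, i.e., adding arcs from sampled vertices to the vertices they reach so that later searches need fewer hops. It uses quadratic work and is not the paper's work-efficient Monte Carlo construction; the sample count and graph are arbitrary choices for the example.

    // Toy diameter reduction by shortcutting: sample a few vertices, BFS from
    // each sample, and add direct "shortcut" arcs to every vertex it reaches.
    #include <cstdio>
    #include <queue>
    #include <random>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists of a digraph

    std::vector<int> bfsReach(const Graph& g, int src) {
        std::vector<char> seen(g.size(), 0);
        std::vector<int> order;
        std::queue<int> q;
        q.push(src); seen[src] = 1;
        while (!q.empty()) {
            int v = q.front(); q.pop();
            order.push_back(v);
            for (int w : g[v]) if (!seen[w]) { seen[w] = 1; q.push(w); }
        }
        return order;
    }

    // Quadratic-work toy version; the paper's contribution is achieving a
    // comparable diameter reduction within nearly linear work.
    void addToyShortcuts(Graph& g, int k, std::mt19937& rng) {
        std::uniform_int_distribution<int> pick(0, (int)g.size() - 1);
        for (int i = 0; i < k; ++i) {
            int s = pick(rng);
            for (int v : bfsReach(g, s))
                if (v != s) g[s].push_back(v);     // shortcut arc s -> v
        }
    }

    int main() {
        // A directed path 0 -> 1 -> ... -> 9 (diameter 9 before shortcutting).
        Graph g(10);
        for (int v = 0; v + 1 < 10; ++v) g[v].push_back(v + 1);
        std::mt19937 rng(42);
        addToyShortcuts(g, 3, rng);
        std::printf("vertices reachable from 0: %zu\n", bfsReach(g, 0).size());
    }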
Motivated by large-scale optimization problems arising in the context of machine learning, there have been several advances in the study of asynchronous parallel and distributed optimization methods during the past decade. Asynchronous methods do not require all processors to maintain a consistent view of the optimization variables. Consequently, they generally can make more efficient use of computational resources than synchronous methods, and they are not sensitive to issues like stragglers (i.e., slow nodes) and unreliable communication links. Mathematical modeling of asynchronous methods involves proper accounting of information delays, which makes their analysis challenging. This article reviews recent developments in the design and analysis of asynchronous optimization methods, covering both centralized methods, where all processors update a master copy of the optimization variables, and decentralized methods, where each processor maintains a local copy of the variables. The analysis provides insights into how the degree of asynchrony impacts convergence rates, especially in stochastic optimization methods.
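For intuition about what "asynchronous" means here, the following C++ sketch runs several workers that update a shared parameter vector using relaxed atomic reads and writes, so an update may be computed from a stale value or overwritten by another worker. The quadratic objective, step size, and iteration counts are illustrative choices only, not taken from the article.

    // Hogwild-style asynchronous gradient descent sketch: no locks, no barriers.
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int dim = 4, nWorkers = 4, stepsPerWorker = 100000;
        const double stepSize = 1e-4;
        std::vector<double> target = {1.0, 2.0, 3.0, 4.0};
        std::vector<std::atomic<double>> x(dim);           // shared iterate
        for (auto& xi : x) xi.store(10.0, std::memory_order_relaxed);

        auto worker = [&]() {
            for (int t = 0; t < stepsPerWorker; ++t)
                for (int i = 0; i < dim; ++i) {
                    // Relaxed load/store: reads may be stale and updates may be
                    // overwritten -- exactly the asynchrony being modeled.
                    double xi = x[i].load(std::memory_order_relaxed);
                    double grad = xi - target[i];          // gradient of 0.5*(xi - target_i)^2
                    x[i].store(xi - stepSize * grad, std::memory_order_relaxed);
                }
        };

        std::vector<std::thread> pool;
        for (int w = 0; w < nWorkers; ++w) pool.emplace_back(worker);
        for (auto& th : pool) th.join();

        for (int i = 0; i < dim; ++i)
            std::printf("x[%d] = %.4f (target %.1f)\n", i,
                        x[i].load(std::memory_order_relaxed), target[i]);
    }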
Herein, a parallel implementation in OpenMP of the Image Block Representation (IBR) for binary images is investigated. The IBR is a region-based image representation scheme that represents the binary image as a set of non-overlapping rectangular areas of object-level pixels, called blocks. The IBR permits the execution of operations on image areas instead of image points and therefore leads to a substantial reduction of the required computational complexity. The experimental and analytically derived results of the parallel OpenMP implementation on a multicore computer show that very good overall performance can be achieved.
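The C++/OpenMP sketch below illustrates the block-level style of processing: object pixels are grouped into rectangular blocks and a quantity (here the object area) is accumulated per block rather than per pixel. For brevity each block is a single-row run, a simplification of IBR, which also merges runs vertically; the image and quantity computed are arbitrary examples.

    // Block-level processing sketch: extract row runs as blocks, then accumulate
    // a per-block quantity with an OpenMP parallel reduction.
    #include <cstdio>
    #include <vector>

    struct Block { int x1, x2, y1, y2; };            // inclusive pixel bounds

    std::vector<Block> rowRunBlocks(const std::vector<std::vector<int>>& img) {
        std::vector<Block> blocks;
        for (int y = 0; y < (int)img.size(); ++y)
            for (int x = 0; x < (int)img[y].size(); ) {
                if (!img[y][x]) { ++x; continue; }
                int start = x;
                while (x < (int)img[y].size() && img[y][x]) ++x;
                blocks.push_back({start, x - 1, y, y});   // one maximal run per block
            }
        return blocks;
    }

    int main() {
        std::vector<std::vector<int>> img = {
            {0, 1, 1, 1, 0},
            {0, 1, 1, 1, 0},
            {0, 0, 0, 1, 1},
        };
        std::vector<Block> blocks = rowRunBlocks(img);

        long long area = 0;
        #pragma omp parallel for reduction(+ : area)
        for (int b = 0; b < (int)blocks.size(); ++b) {
            const Block& blk = blocks[b];
            area += (long long)(blk.x2 - blk.x1 + 1) * (blk.y2 - blk.y1 + 1);
        }
        std::printf("object area = %lld pixels over %zu blocks\n", area, blocks.size());
    }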
Edit distance has applications in many domains such as bioinformatics, spell checking, plagiarism checking, query optimization, speech recognition, and data mining. Traditionally, edit distance is computed by a dynamic-programming-based sequential solution, which becomes infeasible for large problems. In this paper, we introduce NvPD, a novel algorithm for parallel edit distance computation that resolves the dependencies in the conventional dynamic-programming-based solution. We also establish the correctness of the modified dependencies. NvPD exhibits characteristics such as a balanced workload among processors, low synchronization overhead, and maximum utilization of resources, and it can exploit spatial locality. It requires min(m, n) steps to complete, compared to the diagonal-based approach, which completes in max(m, n) steps. Experimental evaluation using a variety of random and real-life data sets on shared-memory multi-core systems and graphics processing units (GPUs) shows that NvPD outperforms state-of-the-art parallel edit distance algorithms.
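For reference, the following C++/OpenMP sketch shows the conventional diagonal (wavefront) parallelization that the abstract contrasts NvPD with: cells on the same anti-diagonal depend only on earlier diagonals and can be filled concurrently. This simple version sweeps the m+n-1 anti-diagonals one at a time; it is the baseline, not NvPD itself.

    // Wavefront edit distance: parallelize across each anti-diagonal i + j = diag.
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    int editDistanceWavefront(const std::string& a, const std::string& b) {
        const int m = (int)a.size(), n = (int)b.size();
        std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1));
        for (int i = 0; i <= m; ++i) d[i][0] = i;
        for (int j = 0; j <= n; ++j) d[0][j] = j;
        for (int diag = 2; diag <= m + n; ++diag) {          // cells with i + j == diag
            int iLo = std::max(1, diag - n), iHi = std::min(m, diag - 1);
            #pragma omp parallel for
            for (int i = iLo; i <= iHi; ++i) {
                int j = diag - i;
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                    d[i][j - 1] + 1,          // insertion
                                    d[i - 1][j - 1] + cost}); // match / substitution
            }
        }
        return d[m][n];
    }

    int main() {
        std::printf("edit distance = %d\n", editDistanceWavefront("kitten", "sitting")); // 3
    }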
Existing parallel algorithms for wavelet tree construction have a work complexity of O(n log σ). This paper presents parallel algorithms for the problem with improved work complexity. Our first algorithm is based on parallel integer sorting and has either O(n log log n ⌈log σ / √(log n log log n)⌉) work and polylogarithmic depth, or O(n ⌈log σ / √(log n)⌉) work and sub-linear depth. We also describe another algorithm that has O(n ⌈log σ / √(log n)⌉) work and O(σ + log n) depth. We then show how to use similar ideas to construct variants of wavelet trees (arbitrary-shaped binary trees and multiary trees) as well as wavelet matrices in parallel with lower work complexity than prior algorithms. Finally, we show that the rank and select structures on binary sequences and multiary sequences, which are stored on wavelet tree nodes, can be constructed in parallel with improved work bounds, matching those of the best existing sequential algorithms for constructing rank and select structures.
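For context, the C++ sketch below is the textbook O(n log σ)-work wavelet tree construction that these results improve on: each node over an alphabet range stores one bit per symbol (left or right half) and the sequence is stably split before recursing. It is only the simple baseline, not one of the paper's parallel algorithms; the input sequence is an arbitrary example.

    // Textbook wavelet tree construction; bitvectors are collected in preorder.
    #include <cstdio>
    #include <vector>

    void buildWaveletNode(const std::vector<unsigned>& seq, unsigned lo, unsigned hi,
                          std::vector<std::vector<char>>& nodes) {
        if (hi - lo <= 1 || seq.empty()) return;     // leaf: single symbol, no bits stored
        unsigned mid = lo + (hi - lo) / 2;
        std::vector<char> bits;
        std::vector<unsigned> left, right;
        for (unsigned s : seq) {
            char b = (s >= mid);
            bits.push_back(b);
            (b ? right : left).push_back(s);         // stable split around mid
        }
        nodes.push_back(bits);
        buildWaveletNode(left, lo, mid, nodes);
        buildWaveletNode(right, mid, hi, nodes);
    }

    int main() {
        std::vector<unsigned> seq = {3, 1, 0, 2, 1, 3};
        std::vector<std::vector<char>> nodes;
        buildWaveletNode(seq, 0, 4, nodes);          // alphabet {0,1,2,3}, sigma = 4
        for (size_t i = 0; i < nodes.size(); ++i) {
            std::printf("node %zu bits: ", i);
            for (char b : nodes[i]) std::printf("%d", (int)b);
            std::printf("\n");
        }
    }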