SPPARKS is an open-source parallel simulation code for developing and running various kinds of on-lattice Monte Carlo models at the atomic or meso scales. It can be used to study the properties of solid-state materials as well as model their dynamic evolution during processing. The modular nature of the code allows new models and diagnostic computations to be added without modification to its core functionality, including its parallel algorithms. A variety of models for microstructural evolution (grain growth), solid-state diffusion, thin film deposition, and additive manufacturing (AM) processes are included in the code. SPPARKS can also be used to implement grid-based algorithms such as phase field or cellular automata models, to run either in tandem with a Monte Carlo method or independently. For very large systems such as AM applications, the Stitch I/O library is included, which enables only a small portion of a huge system to be resident in memory. In this paper we describe SPPARKS and its parallel algorithms and performance, explain how new Monte Carlo models can be added, and highlight a variety of applications which have been developed within the code.
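For readers unfamiliar with the on-lattice Monte Carlo models of grain growth mentioned above, a minimal serial sketch of a Potts-model update sweep is shown below. This is illustrative Python, not SPPARKS code; the zero-temperature accept rule and all names are assumptions made for the sketch:

```python
import random

def potts_grain_growth_step(lattice, n, rng=random):
    """One zero-temperature Metropolis sweep of an n x n Potts lattice.

    Each visited site tries to adopt a random neighbor's spin; the move
    is accepted only if the number of unlike neighbors (the local
    boundary energy) does not increase, which coarsens the grains.
    Periodic boundaries via modular indexing.
    """
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        neighbors = [lattice[(i - 1) % n][j], lattice[(i + 1) % n][j],
                     lattice[i][(j - 1) % n], lattice[i][(j + 1) % n]]
        candidate = rng.choice(neighbors)
        unlike = lambda spin: sum(nb != spin for nb in neighbors)
        if unlike(candidate) <= unlike(lattice[i][j]):
            lattice[i][j] = candidate
    return lattice
```

SPPARKS itself distributes such sweeps across spatial subdomains in parallel; this sketch only shows the per-site update rule.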
The successive over-relaxation (SOR) method is a common iterative algorithm for solving systems of linear equations. When the coefficient matrix is symmetric positive definite, it converges quickly. However,...
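The abstract above is truncated, but the SOR iteration it refers to is standard. A minimal Python sketch follows; the function name, dense list-of-lists interface, and default relaxation factor ω = 1.5 are illustrative choices, not taken from the paper:

```python
def sor_solve(A, b, omega=1.5, tol=1e-10, max_iter=10_000):
    """Successive over-relaxation for Ax = b.

    Converges for symmetric positive definite A when 0 < omega < 2.
    Each component update blends the previous iterate with the
    Gauss-Seidel update, weighted by omega.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        max_delta = 0.0
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new = (1 - omega) * x[i] + omega * (b[i] - s) / A[i][i]
            max_delta = max(max_delta, abs(x_new - x[i]))
            x[i] = x_new
        if max_delta < tol:
            break
    return x
```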
ISBN (print): 9781450399135
Computing routing schemes that support both high throughput and low latency is one of the core challenges of network optimization. Such routes can be formalized as h-length flows, which are defined as flows whose flow paths have length at most h. Many well-studied algorithmic primitives, such as maximal and maximum length-constrained disjoint paths, are special cases of h-length flows. Likewise, the optimal h-length flow is a fundamental quantity in network optimization, characterizing, up to poly-log factors, how quickly a network can accomplish numerous distributed primitives. In this work, we give the first efficient algorithms for computing (1 − ε)-approximate h-length flows that are nearly "as integral as possible." We give deterministic algorithms that take Õ(poly(h, 1/ε)) parallel time and Õ(poly(h, 1/ε) · 2^O(√log n)) distributed CONGEST time. We also give a CONGEST algorithm that succeeds with high probability and only takes Õ(poly(h, 1/ε)) time. Using our h-length flow algorithms, we give the first efficient deterministic CONGEST algorithms for the maximal disjoint paths problem with length constraints, settling an open question of Chang and Saranurak (FOCS 2020), as well as essentially-optimal parallel and distributed approximation algorithms for maximum length-constrained disjoint paths. The former greatly simplifies deterministic CONGEST algorithms for computing expander decompositions. We also use our techniques to give the first efficient and deterministic (1 − ε)-approximation algorithms for bipartite b-matching in CONGEST. Lastly, using our flow algorithms, we give the first algorithms to efficiently compute h-length cutmatches, an object at the heart of recent advances in length-constrained expander decompositions.
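The paper's algorithms are parallel and distributed; purely as a sequential baseline, the maximal length-constrained disjoint paths primitive it mentions can be sketched with a depth-bounded BFS that repeatedly extracts an s-t path of at most h edges and removes its edges. All names below are hypothetical, and this is not the authors' method:

```python
from collections import deque

def maximal_h_length_disjoint_paths(adj, s, t, h):
    """Greedily collect edge-disjoint s-t paths with at most h edges.

    `adj` maps each vertex to a set of out-neighbors (directed graph).
    The input dict is copied, so the caller's graph is not mutated.
    """
    adj = {u: set(vs) for u, vs in adj.items()}
    paths = []
    while True:
        # BFS from s, never expanding beyond depth h
        parent, depth = {s: None}, {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            if u == t:
                break
            if depth[u] == h:
                continue
            for v in adj.get(u, ()):
                if v not in parent:
                    parent[v], depth[v] = u, depth[u] + 1
                    q.append(v)
        if t not in parent:
            return paths  # no s-t path of length <= h remains: maximal
        # reconstruct the path and delete its edges
        path, v = [t], t
        while parent[v] is not None:
            u = parent[v]
            adj[u].discard(v)
            path.append(u)
            v = u
        paths.append(path[::-1])
```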
ISBN (print): 9781665477062
In today's world, Online Social Networks (OSNs) play a crucial role in our everyday life, but their abuse to disseminate misinformation has become a major concern. Hence, the misinformation containment (MC) problem has attracted a lot of attention in recent times. For a given OSN with a fixed budget, this paper proposes a trust-based static technique, independent of the distribution of misinformed nodes, that leverages the topology of the network to select a set of trusted seed nodes and thereby contain and decimate misinformation faster. We follow a modified form of the Competitive Linear Threshold Model with One-Direction state Transition (LT1DT) to study the propagation dynamics of both correct information and misinformation. Simulation studies on three real-world OSNs show that the proposed method significantly outperforms earlier work [1] in terms of the maximum number of misinformed nodes, infection time, point of inflection, and number of misinformed nodes in steady state. Moreover, its parallel implementation achieves almost 32x speedup, making the procedure scalable enough to contain and decimate misinformation in large-scale OSNs in real time.
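LT1DT extends the classic Linear Threshold model with competing cascades and one-direction state transitions. As background only, here is a sketch of the plain single-cascade Linear Threshold diffusion that LT1DT builds on; the data-structure choices are assumptions, and this is not the paper's model:

```python
def linear_threshold_spread(adj, weights, thresholds, seeds):
    """Run classic Linear Threshold diffusion to a fixed point.

    A node activates once the total weight of its active in-neighbors
    reaches its threshold. `adj` maps u -> out-neighbors,
    `weights[(u, v)]` is u's influence on v.
    """
    # Build reverse adjacency: who can influence each node
    in_nbrs = {v: [] for v in adj}
    for u, vs in adj.items():
        for v in vs:
            in_nbrs.setdefault(v, []).append(u)

    active = set(seeds)
    changed = True
    while changed:  # iterate until no new activations
        changed = False
        for v in in_nbrs:
            if v in active:
                continue
            influence = sum(weights.get((u, v), 0.0)
                            for u in in_nbrs[v] if u in active)
            if influence >= thresholds.get(v, 1.0):
                active.add(v)
                changed = True
    return active
```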
In this paper, we present PARASOF, an algorithm for solving linear systems with BABD matrices on massively parallel computing systems such as graphics processing units (GPUs). The algorithm is compared with state-of-the-art algorithms, in particular SOF, from which it is inspired and whose stability properties it inherits. We detail its design and implementation issues and give the main figures of its theoretical and experimental performance.
ISBN (print): 9798350308600
S-t connectivity is a decision problem asking, for vertices s and t in a graph, whether t is reachable from s. Many parallel solutions for GPUs have been proposed in the literature to solve the problem. The most efficient, which rely on two concurrent BFS traversals starting from s and t, have shown limitations when applied to sparse graphs (i.e., graphs with low average degree). In this paper we present FAST-CON, an alternative solution based on multi-source BFS and an adjacency matrix that better exploits the massive parallelism of GPU architectures on any type of graph. The results show that FAST-CON achieves speedups of up to one order of magnitude on dense graphs and up to two orders of magnitude on sparse graphs compared to state-of-the-art solutions.
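The two-concurrent-BFS baseline that FAST-CON improves on can be sketched sequentially for undirected graphs as follows. This is illustrative Python, not the GPU implementation, and the smaller-frontier-first heuristic is a common bidirectional-search convention rather than a detail from the paper:

```python
from collections import deque

def st_connected(adj, s, t):
    """s-t connectivity via two BFS frontiers grown toward each other.

    Reachability is reported as soon as one frontier touches a vertex
    already seen by the other. `adj` maps each vertex to its neighbors
    in an undirected graph.
    """
    if s == t:
        return True
    seen_s, seen_t = {s}, {t}
    front_s, front_t = deque([s]), deque([t])
    while front_s and front_t:
        # always expand the smaller frontier (swap roles if needed)
        if len(front_s) > len(front_t):
            front_s, front_t = front_t, front_s
            seen_s, seen_t = seen_t, seen_s
        next_front = deque()
        while front_s:
            u = front_s.popleft()
            for v in adj.get(u, ()):
                if v in seen_t:
                    return True  # the two searches met
                if v not in seen_s:
                    seen_s.add(v)
                    next_front.append(v)
        front_s = next_front
    return False  # one side exhausted: no path exists
```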
ISBN (print): 9798350305487
Modern classification problems tackled using Decision Tree (DT) models often impose demanding constraints in terms of accuracy and scalability. These are often hard to meet due to the ever-increasing volume of data used for training and testing. Bayesian approaches to DTs using Markov Chain Monte Carlo (MCMC) methods have demonstrated great accuracy in a wide range of applications. However, the inherently sequential nature of MCMC makes it unsuitable for meeting both accuracy and scaling constraints. One could run multiple MCMC chains in an embarrassingly parallel fashion, but despite the improved runtime, this approach sacrifices accuracy in exchange for strong scaling. Sequential Monte Carlo (SMC) samplers are another class of Bayesian inference methods with the appealing property of being parallelizable without trading off accuracy. Nevertheless, finding an effective parallelization for the SMC sampler is difficult, because its bottleneck, redistribution, must be parallelized so that the workload is divided equally across the processing elements, especially when dealing with variable-size models such as DTs. This study presents a parallel SMC sampler for DTs on Shared Memory (SM) architectures, with an O(log² N) parallel redistribution for variable-size samples. On an SM machine with 32 cores, experimental results show that the proposed method scales by up to a factor of 16 over its serial implementation and provides accuracy comparable to MCMC while being 51 times faster.
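For context, the redistribution step of an SMC sampler maps N weighted samples to N equally weighted ones. A sequential systematic-resampling sketch is shown below; the paper's contribution is performing this copying in O(log² N) parallel time for variable-size samples, which this sketch makes no attempt at:

```python
import random

def redistribute(samples, weights, rng=random):
    """Systematic resampling: draw one uniform offset, then stride
    through the cumulative weights at equal spacing, copying each
    sample in proportion to its weight."""
    n = len(samples)
    total = sum(weights)
    cumulative, acc = [], 0.0
    for w in weights:
        acc += w
        cumulative.append(acc)
    u = rng.random() * total / n  # single random offset
    out, j = [], 0
    for i in range(n):
        target = u + i * total / n
        while cumulative[j] < target:
            j += 1
        out.append(samples[j])
    return out
```

After redistribution every surviving sample carries equal weight, which is what makes the subsequent SMC steps embarrassingly parallel; the hard part, as the abstract notes, is balancing the copying itself when samples (here, decision trees) have different sizes.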
Massively parallel, distributed-memory algorithms for the Lagrangian particle hydrodynamic method (Samulyak et al., 2018) have been developed, verified, and implemented. The key component of the parallel algorithms is a particle management module that includes the parallel construction of octree databases, dynamic adaptation and refinement of octrees, and particle migration between parallel subdomains. The particle management module is based on the p4est (parallel forest of k-trees) library. The massively parallel Lagrangian particle code has been applied to a variety of fundamental-science and applied problems. A summary of its applications to the injection of impurities into thermonuclear fusion devices and to the simulation of supersonic hydrogen jets in support of laser-plasma wakefield acceleration research is also presented.
ISBN (print): 9798400701559
In the exascale computing era, optimizing MPI collective performance in high-performance computing (HPC) applications is critical. Current algorithms suffer performance degradation from system call overhead, page faults, or data-copy latency, affecting the efficiency and scalability of HPC applications. To address these issues, we propose PiP-MColl, a Process-in-Process-based Multi-object interprocess MPI Collective design that maximizes small-message MPI collective performance at scale. PiP-MColl features efficient multiple-sender and multiple-receiver collective algorithms and leverages Process-in-Process shared-memory techniques to eliminate unnecessary system calls, page-fault overhead, and extra data copies, improving intra- and inter-node message rate and throughput. The design also boosts performance for larger messages, yielding comprehensive improvements across message sizes. Experimental results show that PiP-MColl outperforms popular MPI libraries, including OpenMPI, MVAPICH2, and Intel MPI, by up to 4.6X for MPI collectives such as MPI_Scatter and MPI_Allgather.
Processing-in-memory (PIM) seeks to eliminate computation-memory data transfer using devices that support both storage and logic. Stateful logic techniques such as IMPLY, MAGIC, and FELIX can perform logic gates within memristive crossbar arrays with massive parallelism. Multiplication via stateful logic is an active field of research due to its wide implications. Recently, RIME became the state-of-the-art algorithm for stateful single-row multiplication by using memristive partitions, reducing the latency of the previous state of the art by 5.1x. In this brief, we begin by proposing novel partition-based computation techniques for broadcasting and shifting data. Then, we design an in-memory multiplication algorithm based on the carry-save add-shift (CSAS) technique. Finally, we develop a novel stateful full adder that significantly improves on the state-of-the-art (FELIX) design. These contributions constitute MultPIM, a multiplier that reduces state-of-the-art time complexity from quadratic to linear-log. For 32-bit numbers, MultPIM improves latency by an additional 4.2x over RIME, while slightly reducing area overhead. Furthermore, we optimize MultPIM for full-precision matrix-vector multiplication and improve latency by 25.5x over FloatPIM matrix-vector multiplication.
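The carry-save add-shift (CSAS) idea underlying MultPIM can be illustrated at the bit level in software: partial products are folded into separate sum and carry words using bitwise full-adder logic, with no carry propagation inside the loop, and a single carry-propagating addition finishes the product. This Python sketch is illustrative only, not the memristive in-memory implementation:

```python
def csas_multiply(a, b, bits=32):
    """Carry-save add-shift multiplication of unsigned integers.

    The accumulator is kept as a (sum, carry) pair whose true value is
    s + c. Each set bit of b contributes a shifted copy of a, folded in
    with a bitwise full adder: sum bits are the XOR of the three words,
    carry bits are the majority, shifted left one place.
    """
    mask = (1 << (2 * bits)) - 1
    s, c = 0, 0
    for i in range(bits):
        if (b >> i) & 1:
            p = (a << i) & mask  # partial product, shifted into place
            s, c = s ^ c ^ p, ((s & c) | (s & p) | (c & p)) << 1
            c &= mask
    return (s + c) & mask  # one final carry-propagating add
```

Deferring carry propagation is what lets a hardware CSAS multiplier (and MultPIM's in-memory variant) process each partial product in constant depth instead of paying a full ripple-carry per step.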