ISBN (digital): 9783030759339
ISBN (print): 9783030759339; 9783030759322
We present an algorithm for the solution of a simultaneous space-time discretization of linear parabolic evolution equations with a symmetric differential operator in space. Building on earlier work, we recast this discretization into a Schur complement equation whose solution is a quasi-optimal approximation to the weak solution of the equation at hand. Choosing a tensor-product discretization, we arrive at a remarkably simple linear system. Using wavelets in time and standard finite elements in space, we solve the resulting system in linear complexity on a single processor, and in polylogarithmic complexity when parallelized in both space and time. We complement these theoretical findings with large-scale parallel computations showing the effectiveness of the method.
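One generic reason a tensor-product discretization gives a simple system is that operators of the form $A \otimes B$ never need to be assembled. Here $A$ is a temporal matrix and $B$ a spatial one, and $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top})$ holds. The sketch below checks this identity on small dense matrices in plain Python. It assumes nothing about the paper's actual Schur complement operator, and the helper names `kron`, `matvec` and `kron_matvec` are illustrative only.

```python
def kron(A, B):
    """Explicit Kronecker product A ⊗ B (used here only to check the fast path)."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q] for j in range(n * q)]
            for i in range(m * p)]


def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def kron_matvec(A, B, v):
    """Compute (A ⊗ B) v without forming A ⊗ B.

    v is split into len(A[0]) blocks of length len(B[0]); block j is column j of X.
    Output block i is sum_j A[i][j] * (B x_j), i.e. column i of B X A^T.
    """
    n, q = len(A[0]), len(B[0])
    blocks = [v[j * q:(j + 1) * q] for j in range(n)]  # columns of X
    bx = [matvec(B, x) for x in blocks]                # apply the "spatial" factor once per column
    out = []
    for row in A:                                      # combine with the "temporal" factor
        out.extend(sum(a * y[k] for a, y in zip(row, bx)) for k in range(len(B)))
    return out
```

The fast path costs one application of $B$ per block plus a small dense combination. This is why solvers for tensor-product systems can work with the two factors separately.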
The Jaya algorithm is a recent heuristic approach for solving optimisation problems. It performs a random search for the global optimum by generating new individuals from both the best and the worst individuals in the population, thus moving solutions towards the optimum while avoiding the worst current solution. In addition to its optimisation performance, the absence of control parameters is another significant advantage of this algorithm. However, the number of iterations needed to reach the optimal solution, or close to it, may be very high, and the computational cost can make it difficult to meet time requirements. In this work, a chaotic two-dimensional (2D) map is used to accelerate convergence, and parallel algorithms are developed to alleviate the computational cost. Both coarse- and fine-grained parallel algorithms are developed, the former based on multiple populations and the latter at the individual level; in both cases they are accelerated by an improved (computational) use of the chaotic map.
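The basic Jaya update moves each candidate toward the best individual and away from the worst. It uses $x'_j = x_j + r_1 (x^{\text{best}}_j - |x_j|) - r_2 (x^{\text{worst}}_j - |x_j|)$ with $r_1, r_2 \sim U(0,1)$, and a candidate replaces its parent only if it is better. The sequential sketch below applies this rule to the sphere function, which is an assumed test problem. It draws $r_1, r_2$ from Python's `random` module, whereas the paper substitutes a chaotic 2D map. It does not model the coarse- or fine-grained parallel versions.

```python
import random


def sphere(x):
    """Assumed test objective: sum of squares, minimum 0 at the origin."""
    return sum(xi * xi for xi in x)


def jaya(f, dim, bounds, pop_size=20, iters=200, seed=0):
    """Minimise f with the basic Jaya rule; r1, r2 come from a PRNG, not a chaotic map."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    for _ in range(iters):
        # Best and worst are fixed at the start of each iteration.
        best = pop[min(range(pop_size), key=fit.__getitem__)]
        worst = pop[max(range(pop_size), key=fit.__getitem__)]
        for i, x in enumerate(pop):
            cand = []
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = x[j] + r1 * (best[j] - abs(x[j])) - r2 * (worst[j] - abs(x[j]))
                cand.append(min(max(v, lo), hi))  # clip to the search box
            fc = f(cand)
            if fc < fit[i]:  # greedy acceptance: keep the candidate only if it improves
                pop[i], fit[i] = cand, fc
    k = min(range(pop_size), key=fit.__getitem__)
    return pop[k], fit[k]
```

Because acceptance is greedy, the best fitness never gets worse from one iteration to the next. The parallel variants in the abstract distribute this loop over populations or individuals.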
ISBN (digital): 9783030888060
ISBN (print): 9783030888060; 9783030888053
Many real-world problems, such as internet routing, are graph problems. To solve such problems efficiently, an increasing number of parallel graph algorithms have been proposed. This paper discusses the mechanized verification of a commonly used parallel graph algorithm, the Bellman-Ford algorithm, which provides an inherently parallel solution to the Single-Source Shortest Path problem. Concretely, we verify an unoptimized GPU version of the Bellman-Ford algorithm using the VerCors verifier. The main challenge we had to address was finding suitable global invariants of the graph-based properties for automated verification. This case study is the first deductive verification to prove the functional correctness of the parallel Bellman-Ford algorithm. It provides the basis for verifying other, optimized implementations of the algorithm, and it may also be a good starting point for verifying other parallel graph-based algorithms.
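A GPU-friendly form of Bellman-Ford relaxes every edge in each round using only the previous round's distances, so all relaxations within a round are independent. Below is a minimal sequential simulation of that scheme. The function name and the edge-list format are assumptions. On a GPU each edge would map to a thread, and the update of `new[v]` would need an atomic minimum. After round $k$, `dist[v]` is the length of the shortest path from the source using at most $k$ edges. This is the kind of global invariant a deductive verifier has to be given.

```python
import math


def bellman_ford_rounds(n, edges, source):
    """Round-synchronous Bellman-Ford.

    edges: list of (u, v, w) for a directed edge u -> v of weight w.
    Returns the list of distances; raises ValueError on a reachable negative cycle.
    """
    dist = [math.inf] * n
    dist[source] = 0
    for _ in range(n - 1):
        new = dist[:]              # every relaxation reads only last round's values
        for u, v, w in edges:      # independent per edge: one GPU thread each
            if dist[u] + w < new[v]:
                new[v] = dist[u] + w
        if new == dist:            # no change: distances are final
            break
        dist = new
    for u, v, w in edges:          # one more pass detects negative cycles
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist
```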
ISBN (print): 9789082797060
We consider the problem of nonnegative tensor completion. We adopt the alternating optimization framework and solve each nonnegative matrix completion problem via a stochastic variation of the accelerated gradient algorithm. We experimentally test the effectiveness and efficiency of our algorithm on both real-world and synthetic data. We develop a shared-memory implementation of our algorithm using the multi-threaded API OpenMP, which attains significant speedup. We believe that our approach is a very competitive candidate for solving very large nonnegative tensor completion problems.
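Each subproblem in the alternating scheme is a nonnegative matrix completion problem: fit only the observed entries while keeping the factors nonnegative. The sketch below applies plain projected stochastic gradient descent to a factorization $M \approx W H^{\top}$. Each update takes a gradient step on one observed entry and then clips at zero. It is a simplified stand-in, not the accelerated stochastic method of the paper. The names `nn_complete` and `loss`, the learning rate and the rank are illustrative choices.

```python
import random


def nn_complete(observed, rows, cols, rank=2, lr=0.01, epochs=200, seed=0):
    """Fit M ≈ W H^T with W, H >= 0 by projected SGD on observed (i, j, value) triples."""
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(rank)] for _ in range(rows)]
    H = [[rng.random() for _ in range(rank)] for _ in range(cols)]
    data = list(observed)
    for _ in range(epochs):
        rng.shuffle(data)  # stochastic pass over the observed entries only
        for i, j, m in data:
            err = sum(a * b for a, b in zip(W[i], H[j])) - m
            for r in range(rank):
                wi, hj = W[i][r], H[j][r]
                # Gradient step on 0.5 * err^2, then projection onto the nonnegative orthant.
                W[i][r] = max(0.0, wi - lr * err * hj)
                H[j][r] = max(0.0, hj - lr * err * wi)
    return W, H


def loss(W, H, observed):
    """Sum of squared errors on the observed entries."""
    return sum((sum(a * b for a, b in zip(W[i], H[j])) - m) ** 2 for i, j, m in observed)
```

Usage: with six observed entries of a rank-1 nonnegative 3 by 3 matrix, calling `nn_complete(obs, 3, 3, epochs=500)` fits those entries. The missing ones are then read off as the row-by-column products of `W` and `H`.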
ISBN (print): 9781611976465
In this paper we show a deterministic parallel all-pairs shortest paths algorithm for real-weighted directed graphs. The algorithm has $\tilde{O}(nm + (n/d)^3)$ work and $\tilde{O}(d)$ depth for any depth parameter $d \in [1, n]$. To the best of our knowledge, such a trade-off has only been previously described for the real-weighted single-source shortest paths problem using randomization [Bringmann et al., ICALP'17]. Moreover, our result improves upon the parallelism of the state-of-the-art randomized parallel algorithm for computing transitive closure, which has $\tilde{O}(nm + n^3/d^2)$ work and $\tilde{O}(d)$ depth [Ullman and Yannakakis, SIAM J. Comput. '91]. Our APSP algorithm turns out to be a powerful tool for designing efficient planar graph algorithms in both parallel and sequential regimes. By suitably adjusting the depth parameter $d$ and applying known techniques, we obtain: (1) nearly work-efficient $\tilde{O}(n^{1/6})$-depth parallel algorithms for the real-weighted single-source shortest paths problem and for finding a bipartite perfect matching in a planar graph, (2) an $\tilde{O}(n^{9/8})$-time sequential strongly polynomial algorithm for computing a minimum mean cycle or a minimum cost-to-time-ratio cycle of a planar graph, (3) a slightly faster algorithm for computing so-called external dense distance graphs of all pieces of a recursive decomposition of a planar graph. One notable ingredient of our parallel APSP algorithm is a simple deterministic $\tilde{O}(nm)$-work $\tilde{O}(d)$-depth procedure for computing $\tilde{O}(n/d)$-size hitting sets of shortest $d$-hop paths between all pairs of vertices of a real-weighted digraph. Such hitting sets have also been called $d$-hub sets. Hub sets have previously proved especially useful in designing parallel or dynamic shortest paths algorithms and are typically obtained via random sampling. Our procedure implies, for example, an $\tilde{O}(nm)$-time deterministic
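A hitting set for a family of paths is a vertex set that intersects every path, and the abstract notes that $d$-hub sets are usually obtained by random sampling. A classical deterministic alternative, sketched below for comparison, greedily picks the vertex that hits the most remaining paths. This is a generic textbook greedy with a logarithmic approximation factor, not the paper's $\tilde{O}(nm)$-work procedure. The function name and the assumption of integer vertex ids are illustrative.

```python
from collections import Counter


def greedy_hitting_set(sets):
    """Deterministically pick vertices until every nonempty set in `sets` contains one.

    Assumes integer vertex ids; ties are broken toward the smallest id.
    """
    remaining = [set(s) for s in sets if s]  # empty sets cannot be hit and are skipped
    hubs = set()
    while remaining:
        counts = Counter(v for s in remaining for v in s)
        v, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
        hubs.add(v)
        remaining = [s for s in remaining if v not in s]
    return hubs
```

In the APSP setting the input sets would be the vertex sets of shortest $d$-hop paths. Computing those sets explicitly for all pairs is exactly the cost a work-efficient procedure has to avoid.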
ISBN (print): 9781450392686
Genetic programming (GP) has been applied to image classification and has achieved promising results. However, most GP-based image classification methods are only applied to small-scale image datasets because of their high computational cost. Efficient acceleration techniques are needed to extend GP-based image classification methods to large-scale datasets. Since fitness evaluation is the most time-consuming phase of the GP evolutionary process and is highly parallelizable, this paper proposes a CPU multi-processing and GPU parallel approach to perform it, and thus effectively accelerate GP for image classification. Experimental results show that the highly parallelized approach significantly accelerates GP-based image classification without performance degradation. The training time of the GP-based image classification method is reduced from several weeks to tens of hours, enabling it to run on large-scale image datasets.
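Every individual's fitness can be computed independently, so fitness evaluation parallelizes as a simple map over the population. The sketch below shows that structure with `concurrent.futures`. The fitness function, which scores a callable by its accuracy on labelled examples, is hypothetical. Threads keep the example self-contained. For CPU-bound Python code, real speedup would need a `ProcessPoolExecutor` with picklable module-level functions, or a GPU kernel as in the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_program(program, dataset):
    """Hypothetical fitness: classification accuracy of `program` (a callable) on dataset."""
    correct = sum(1 for features, label in dataset if program(features) == label)
    return correct / len(dataset)


def evaluate_population(population, dataset, workers=4):
    """Evaluate all individuals concurrently; results keep population order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: evaluate_program(p, dataset), population))
```

`pool.map` returns results in submission order. Selection can therefore pair each fitness value with its individual exactly as in a sequential loop.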
ISBN (print): 9781665440660
Stochastic Gradient Descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous shared-memory parallel SGD (AsyncSGD), including synchronization-free algorithms such as HOGWILD!, has received interest in certain contexts because of its reduced overhead compared to synchronous parallelization. Although these algorithms induce staleness and inconsistency, they have shown speedup for problems with smooth, strongly convex objectives and sparse gradients. Recent works take important steps towards understanding the potential of parallel SGD for problems that do not conform to these strong assumptions, in particular for deep learning (DL). There is, however, a gap in the current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We help close this gap by studying a spectrum of parallel algorithmic implementations of AsyncSGD, aiming to understand how shared-data synchronization influences the convergence properties in fundamental DL applications. We focus on the impact of consistency-preserving non-blocking synchronization on SGD convergence and on sensitivity to hyperparameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD that employs lock-free synchronization, effectively balancing throughput and latency. Leashed-SGD features a natural contention-regulating mechanism, as well as dynamic memory management that allocates space only when needed. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing it to baselines for two prominent deep learning (DL) applications: multilayer perceptrons (MLP) and convolutional neural networks (CNN). We o
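HOGWILD!-style AsyncSGD lets several threads read and update a shared parameter vector with no locks, accepting stale or inconsistent reads. A minimal sketch of that pattern on a noiseless linear regression problem follows. It illustrates the unsynchronized baseline rather than Leashed-SGD's lock-free, consistency-preserving scheme. Because of CPython's global interpreter lock, it shows the access pattern, not a speedup. The function name and hyperparameters are assumptions.

```python
import random
import threading


def hogwild_sgd(data, n_threads=4, steps=2000, lr=0.01, seed=0):
    """Fit y ≈ w[0] * x + w[1]; all threads update the shared list `w` without locks."""
    w = [0.0, 0.0]  # shared parameters: no lock, no versioning, no consistency guarantee

    def worker(tid):
        rng = random.Random(seed + tid)  # per-thread sampling stream
        for _ in range(steps):
            x, y = rng.choice(data)
            pred = w[0] * x + w[1]       # may read values another thread is mid-way updating
            err = pred - y
            w[0] -= lr * err * x         # unsynchronized read-modify-write (updates can be lost)
            w[1] -= lr * err

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

Lost or stale updates only perturb the trajectory on this well-conditioned problem. The abstract's question is when such inconsistency stops being harmless, as in deep learning.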
Power consumption is a significant challenge in the sustainability of computational science. The growing energy demands of increasingly complex simulations and algorithms lead to substantial resource use, which conflicts with global sustainability goals. This paper investigates the energy efficiency of different parallel implementations, on modern GPU architectures, of a Cellular Potts model, which models cellular behavior through Hamiltonian energy minimization techniques. By evaluating alternative solvers, it demonstrates that specific methods can significantly enhance computational efficiency and reduce energy use compared to traditional approaches. The results confirm notable improvements in execution time and energy consumption. In particular, the experiments show a reduction in power consumption of up to 53%, providing a pathway towards more sustainable high-performance computing practices for complex biological simulations.
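The Cellular Potts model evolves a lattice of cell ids ("spins") by Metropolis sampling of a Hamiltonian. A common simplified form is $H = \sum_{\langle i,j \rangle} J\,(1 - \delta_{\sigma_i,\sigma_j}) + \lambda \sum_{\sigma > 0} (a_\sigma - A)^2$, an adhesion term over neighbouring sites plus an area constraint per cell. The sketch below assumes a single adhesion constant $J$, a common target area $A$, periodic boundaries, and spin 0 as the medium. It recomputes $H$ from scratch for clarity, whereas optimized (for example GPU) implementations evaluate $\Delta H$ locally. None of these names or parameters come from the paper.

```python
import math
import random


def hamiltonian(grid, J=1.0, lam=1.0, target_area=4):
    """Simplified CPM energy on a periodic square lattice.

    J per unlike neighbour pair, plus lam * (area - target_area)^2 for each cell.
    Spin 0 is the medium (no area term); a cell whose area drops to zero
    simply disappears from the sum in this sketch.
    """
    n = len(grid)
    adhesion = 0.0
    area = {}
    for i in range(n):
        for j in range(n):
            s = grid[i][j]
            area[s] = area.get(s, 0) + 1
            for di, dj in ((0, 1), (1, 0)):  # right and down: each pair counted once
                if grid[(i + di) % n][(j + dj) % n] != s:
                    adhesion += J
    volume = sum(lam * (a - target_area) ** 2 for s, a in area.items() if s != 0)
    return adhesion + volume


def metropolis_step(grid, rng, T=1.0, **params):
    """Propose copying a random neighbour's spin into a random site; accept by Metropolis."""
    n = len(grid)
    i, j = rng.randrange(n), rng.randrange(n)
    di, dj = rng.choice(((0, 1), (1, 0), (0, -1), (-1, 0)))
    src = grid[(i + di) % n][(j + dj) % n]
    if src == grid[i][j]:
        return False  # copying an identical spin changes nothing
    old_energy = hamiltonian(grid, **params)
    old = grid[i][j]
    grid[i][j] = src
    dH = hamiltonian(grid, **params) - old_energy
    if dH <= 0 or rng.random() < math.exp(-dH / T):
        return True   # accept the copy
    grid[i][j] = old  # reject: restore the previous spin
    return False
```

For a single 2 by 2 cell of target area 4 in a 4 by 4 medium, only the eight boundary pairs contribute, so $H = 8J$.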
ISBN (print): 9781665410168
Lattice problems are a class of notably hard optimization problems. No classical or quantum algorithms are known to solve them efficiently. Their hardness has made lattices a major cryptographic primitive for post-quantum cryptography. Several approaches with different computational profiles have been used for lattice problems; some suffer from super-exponential time, and others require exponential space. This motivated us to develop a novel lattice problem solver, CMAP-LAP, based on the clever coordination of different algorithms that run massively in parallel. With our flexible framework, heterogeneous modules run asynchronously in parallel on a large-scale distributed system while exchanging information, which drastically boosts the overall performance. We also implement full checkpoint-and-restart functionality, which is vital for high-dimensional lattice problems. CMAP-LAP facilitates the implementation of large-scale parallel strategies for lattice problems since all the functions are designed to be customizable and abstract. Through numerical experiments with up to 103,680 cores, we evaluated the performance and stability of our system and demonstrated its high capability for future massive-scale experiments.
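A concrete instance of a lattice problem is the shortest vector problem (SVP). In two dimensions it is solved exactly by Lagrange-Gauss reduction, which repeatedly subtracts the nearest-integer multiple of the shorter basis vector from the longer one. The expensive algorithms the abstract refers to only become necessary in high dimensions. The sketch below is this textbook 2D algorithm, shown only to make the problem concrete; it is not part of CMAP-LAP.

```python
from fractions import Fraction


def dot(u, v):
    """Integer inner product."""
    return sum(a * b for a, b in zip(u, v))


def gauss_reduce(b1, b2):
    """Lagrange-Gauss reduction of a 2D integer lattice basis.

    Returns a reduced basis (u, v); u is a shortest nonzero lattice vector.
    """
    u, v = list(b1), list(b2)
    if dot(u, u) > dot(v, v):
        u, v = v, u  # keep u as the shorter vector
    while True:
        mu = round(Fraction(dot(u, v), dot(u, u)))  # exact nearest-integer projection coefficient
        v = [vi - mu * ui for vi, ui in zip(v, u)]
        if dot(v, v) >= dot(u, u):
            return u, v
        u, v = v, u
```

For example, the basis `[5, 8]`, `[8, 13]` has determinant $-1$, so it generates $\mathbb{Z}^2$, and the reduction finds a vector of squared length 1.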
ISBN (print): 9781728184265
In this paper we present a low-latency interface for high-speed multi-FPGA real-time simulation. The interface is based on a parallel bus structure and has been implemented using two Virtex UltraScale+ devices. The operation of the interface is first evaluated using a linear feedback shift register to compare numerical values exchanged over the bus. We then provide an example of how the interface is used to simulate a power electronics system, composed of two dual active bridge converters, with a time step of 70 ns. The results of the decoupled simulation are verified against those of a monolithic solution running on a single FPGA.
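A linear feedback shift register (LFSR) makes a convenient bus test pattern. The receiver can regenerate the same pseudo-random sequence from the seed and flag any word that differs. The sketch below uses a 16-bit Fibonacci LFSR with feedback polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$ (period $2^{16} - 1$). The register width, taps and seed are assumptions, since the abstract does not specify them. The helpers `transmit` and `check_link` are hypothetical.

```python
def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR, polynomial x^16 + x^14 + x^13 + x^11 + 1."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def transmit(n, seed=0xACE1):
    """Hypothetical sender: the first n LFSR words that would be driven onto the bus."""
    words, s = [], seed
    for _ in range(n):
        words.append(s)
        s = lfsr16(s)
    return words


def check_link(received, seed=0xACE1):
    """Hypothetical receiver: regenerate the sequence locally and return mismatched indices."""
    expected, errors = seed, []
    for k, word in enumerate(received):
        if word != expected:
            errors.append(k)
        expected = lfsr16(expected)
    return errors
```

The receiver advances its own register regardless of what arrives. A single corrupted word is therefore reported once and does not desynchronize the comparison.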