This work is devoted to the synthesis and investigation of a parallel algorithm for the finite-difference solution of the Poisson equation using the Jacobi method. Using the two-dimensional case as an example, it demonstrates the efficacy of the pyramid method in synthesizing this algorithm.
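To make the underlying iteration concrete, here is a minimal serial sketch of Jacobi sweeps for the 2D Poisson problem. It is an illustration only and does not reproduce the pyramid-based parallel synthesis described in the abstract. The function name `jacobi_poisson`, the sign convention -∇²u = f, the uniform grid spacing `h`, and the zero Dirichlet boundary are all assumptions made for the example.

```python
def jacobi_poisson(f, h, iterations):
    """Run Jacobi sweeps for -laplace(u) = f on a square grid.

    Assumptions (not from the source): f is an n x n list of lists,
    the grid spacing h is uniform, and u = 0 on the boundary.
    """
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        # Every interior update reads only the previous iterate, so all
        # points of one sweep are independent of each other. This is the
        # property that makes the Jacobi method easy to parallelize.
        new = [row[:] for row in u]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (
                    u[i - 1][j] + u[i + 1][j]
                    + u[i][j - 1] + u[i][j + 1]
                    + h * h * f[i][j]
                )
        u = new
    return u
```

In a parallel version, the grid would typically be split into blocks, each block updating its own interior points and exchanging boundary rows and columns with its neighbours after every sweep.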
Achieving optimal performance of MPI applications on current multi-core architectures, composed of multiple shared communication channels and deep memory hierarchies, is not trivial. Formal analysis using parallel performance models allows one to depict the underlying behavior of the algorithms and their communication complexities, with the aim of estimating their cost and improving their performance. The LogGP model was initially conceived to predict the cost of algorithms in mono-processor clusters based on point-to-point transmissions, with parameters derived from network latency and bandwidth. It remains the representative model, with multiple extensions for handling high-performance networks and for covering particular contention cases, channel hierarchies, or protocol costs. These very specific branches cause LogGP to partially lose its initial purpose as an abstract model. The more recent log(n)P model represents a point-to-point transmission as a sequence of implicit transfers or data movements. Nevertheless, like LogGP, it models an algorithm in a parallel architecture as a sequence of message transmissions, an approach ill-suited to modeling algorithms more advanced than simple tree-based ones, as we show in this work. In this paper, the tau-Lop model is extended to multi-core clusters and compared to previous models. It demonstrates the ability to predict, with high accuracy, the cost of advanced algorithms and mechanisms used by mainstream MPI implementations such as MPICH or Open MPI. tau-Lop is based on the concept of concurrent transfers, and applies it to meaningfully represent the behavior of parallel algorithms in complex platforms with hierarchical shared communication channels, taking into account the effects of contention and of the deployment of processes on the processors. In addition, an exhaustive and reproducible methodology for measuring the parameters of the model is described. (C) 2016 Elsevier B.V. All rights reserved.
ISBN: (print) 9783319219097; 9783319219080
The article describes an extension of the lambda calculus for creating parallel data mining algorithms. The proposed approach represents an algorithm as a sequence of pure functions with unified interfaces. For parallel execution, we use a special function that makes it possible to change the structure of the algorithm and to implement various strategies for processing the data set and the model.
A framework is proposed for the design and analysis of network-oblivious algorithms, namely algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network-oblivious algorithm be specified on a parallel model of computation where the only parameter is the problem's input size, and then evaluated on a model with two parameters, capturing parallelism granularity and communication latency. It is shown that for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in the decomposable bulk synchronous parallel model, which is known to effectively describe a wide and significant class of parallel platforms. The proposed framework can be regarded as an attempt to port the notion of obliviousness, well established in the context of cache hierarchies, to the realm of parallel computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms for a number of key problems. Some limitations of the oblivious approach are also discussed.
Recent advancements in high-performance parallel computing platforms and parallel algorithms have significantly enhanced the opportunities for real-time power system protection and control. This paper investigates the application of the Parareal-in-time algorithm to fast dynamic simulations. The Parareal algorithm belongs to the class of temporal decomposition methods, which divide the time interval into sub-intervals and solve them concurrently. Time-parallel algorithms face the difficulty of providing correct initial conditions for all the sub-intervals, which affects their convergence rates. Parareal overcomes this difficulty by using an approximate trajectory. It has become popular in recent years for long transient simulations (e.g., molecular dynamics, fusion, reacting flows). This paper presents an approach for the reliable implementation of Parareal with detailed models of power systems, including saturation. A windowing approach is proposed to improve convergence. Parareal is compared with the Newton-based time-parallel method. The effectiveness of the algorithm is analyzed by parallel emulation using extensive case studies on a 10-generator 39-bus system and a 327-generator 2383-bus system for various disturbances. Parareal with simulation windows of 1 s converged in 1 to 3 iterations for the majority of the simulated cases, irrespective of the size of the system and the nature of the disturbance. All the cases tested converged with the proposed implementation.
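The temporal decomposition described above can be sketched on a toy problem. The following is a minimal serial emulation of the standard Parareal update, in which a cheap coarse propagator supplies the approximate trajectory and an accurate fine propagator corrects it. It is not the paper's power-system implementation: the scalar test equation dy/dt = -y, the forward Euler propagators, and the names `parareal`, `decay_coarse`, and `decay_fine` are assumptions chosen for illustration.

```python
def decay_coarse(y, dt):
    """Coarse propagator (assumed): one forward Euler step for dy/dt = -y."""
    return y + dt * (-y)


def decay_fine(y, dt, substeps=100):
    """Fine propagator (assumed): many small forward Euler steps."""
    h = dt / substeps
    for _ in range(substeps):
        y = y + h * (-y)
    return y


def parareal(y0, t0, t1, n_windows, iterations, coarse, fine):
    """Return the Parareal iterate at the n_windows + 1 sub-interval boundaries."""
    dt = (t1 - t0) / n_windows
    # Approximate trajectory: a serial sweep with the cheap coarse solver
    # provides initial conditions for every sub-interval.
    u = [y0]
    for n in range(n_windows):
        u.append(coarse(u[n], dt))
    for _ in range(iterations):
        # These evaluations depend only on the previous iterate, so each
        # sub-interval could be handled by a different processor.
        f_old = [fine(u[n], dt) for n in range(n_windows)]
        g_old = [coarse(u[n], dt) for n in range(n_windows)]
        # Serial correction sweep: new coarse prediction plus the
        # fine-minus-coarse correction from the previous iterate.
        new = [y0]
        for n in range(n_windows):
            new.append(coarse(new[n], dt) + f_old[n] - g_old[n])
        u = new
    return u
```

After k iterations, the first k boundary values agree with a serial run of the fine solver up to rounding, so running as many iterations as there are sub-intervals reproduces the serial result. The method only pays off when far fewer iterations suffice.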
This paper investigates a variant of the work-stealing algorithm that we call the localized work-stealing algorithm. The intuition behind this variant is that because of locality, processors can benefit from working on their own work. Consequently, when a processor is free, it makes a steal attempt to get back its own work. We call this type of steal a steal-back. We show that the expected running time of the algorithm is $T_1/P + O(T_\infty P)$, and that under the "even distribution of free agents assumption", the expected running time of the algorithm is $T_1/P + O(T_\infty \lg P)$. In addition, we obtain another running-time bound based on ratios between the sizes of serial tasks in the computation. If $M$ denotes the maximum ratio between the largest and the smallest serial tasks of a processor after removing a total of $O(P)$ serial tasks across all processors from consideration, then the expected running time of the algorithm is $T_1/P + O(T_\infty M)$. (C) 2015 Elsevier B.V. All rights reserved.
Many graphics and also non-graphics applications need efficient techniques to find the nearest neighbors of a given query point. There are two approaches to address this problem: space-partitioning and data partitioning. We present a data-partitioning error-controlled strategy for solving the nearest neighbor search (NNS) problem using spatial sorting as the basic building block. We improve on the neighborhood grid method by doing an extensive study on novel spatial sorting strategies for bidimensional NNS, providing significant performance and precision gains over previous works. Experiments demonstrate that, for many dense 2D point distributions, our solution is competitive with more complex and traditional techniques, such as k-d trees and index sorting. We also show comparable results for the 3D case. Our primary contribution is a dynamic, simple to implement, memory efficient, and highly parallelizable solution for low-dimensional approximate nearest neighbor search. (C) 2016 Elsevier Ltd. All rights reserved.
In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and, as such, a well-established algorithm. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards (Fermi, Kepler, Maxwell) and on the Intel Xeon CPU and Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of the L1/texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems; we also present results in terms of bandwidth (GB/s), compute (GFLOP/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate that can be processed in real time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower-precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.5x to 1.92x greater than our CPU implementation, but is insufficient to compete with the performance of GPUs. We conclude with a comparison of our best-performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data. (C) 2016 Els
In the research process of the coverage blind spots detection methods in wireless sensor networks, when the current methods are used for detection, the calculation burden is heavy, and a wide range of wireless sensor ...
We present parallel algorithms for exact and approximate pattern matching with suffix arrays, using a CREW-PRAM with p processors. Given a static text of length n, we first show how to compute the suffix array inter...