As distributed systems grow, so does the risk of communication bottlenecks, and the past decade has seen growing interest in communication-avoiding algorithms. The distributed-memory Fast Fourier Transform (FFT) is an important algorithm that suffers from major communication bottlenecks. In this work, we examine an existing communication-avoiding algorithm, FMM-FFT, an alternative to FFT that uses the Fast Multipole Method (FMM) to reduce communication to a single all-to-all exchange. We present a detailed implementation of FMM-FFT relying on modern libraries and demonstrate it on two distinct distributed-memory architectures: a traditional Intel Xeon based HPC cluster and a Beowulf cluster. We show that while FMM-FFT is significantly slower than FFT on the traditional HPC cluster, on the Beowulf cluster it outperforms standard FFT, consistently achieving speedups of 1.5x or more against FFTW. We then show how the communication-to-computation cost metric helps explain the performance of FMM-FFT against standard FFT. The source code pertaining to this work is being made publicly available under a permissive open source licence on GitHub.
Parallel reduction is a major component of parallel programming and is widely used for summarisation and aggregation. It is not well understood, however, what sorts of non-trivial summarisations can be implemented as parallel reductions. This paper develops a calculus named lambda(AS), a simply typed lambda calculus with algebraic simplification. This calculus provides a foundation for studying the parallelisation of complex reductions by equational reasoning. Its key feature is delta abstraction. A delta abstraction is observationally equivalent to the standard lambda abstraction, but its body is simplified before its arguments arrive, using algebraic properties such as associativity and commutativity. In addition, the type system of lambda(AS) guarantees that simplifications due to delta abstractions do not lead to serious overheads. The usefulness of lambda(AS) is demonstrated on examples of developing complex parallel reductions, including those containing more than one reduction operator, loops with conditional jumps, prefix sum patterns and even tree manipulations.
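To make the role of associativity concrete, here is a minimal Python sketch of a non-trivial summarisation (the maximum prefix sum) written as a chunked reduction. It is not lambda(AS) and is not taken from the paper; the names `summarise`, `combine` and `max_prefix_sum` are hypothetical, and the chunks are processed one after another here, although each could be handed to a separate worker because `combine` is associative.

```python
from functools import reduce

def summarise(chunk):
    """Sequentially summarise one non-empty chunk as (total, best_prefix_sum)."""
    total, best = 0, float("-inf")
    for x in chunk:
        total += x
        best = max(best, total)
    return total, best

def combine(left, right):
    """Associative merge of the summaries of two adjacent chunks."""
    s1, p1 = left
    s2, p2 = right
    # A prefix of the concatenation either ends in the left chunk (p1)
    # or covers the whole left chunk plus a prefix of the right (s1 + p2).
    return s1 + s2, max(p1, s1 + p2)

def max_prefix_sum(xs, n_chunks=4):
    """Maximum prefix sum of a non-empty list via a chunked reduction."""
    size = max(1, -(-len(xs) // n_chunks))  # ceiling division
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    # Each summarise(chunk) is independent; only combine sees two chunks.
    return reduce(combine, map(summarise, chunks))[1]
```

The sequential loop carries a dependency through `total`, which is what makes the reduction non-trivial; pairing `total` with `best` is one way to recover an associative operator.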
Web directories like Wikipedia and Open Directory Mozilla facilitate efficient information retrieval (IR) of web documents from a huge web corpus. Maintaining these web directories is understandably a difficult task that requires manual curation by human editors or semi-automated mechanisms. Research on parallel algorithms for the automated curation of these web directories will be beneficial to the IR domain. Hence, in this article, we propose a parallel algorithm for automatically creating web directories from a corpus of web documents. We use centrality-based techniques to split the corpus into fine-grained clusters, followed by an agglomeration based on locality sensitive hashing to identify coarse-grained clusters in the web directory. Experimental results show that the algorithm generates meaningful hierarchies of the input corpus as measured by cluster-validity indices such as F-measure, Rand index, and cluster purity. The algorithm achieves a significant speedup and scales well both with the number of processors and the size of the input corpus.
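The abstract does not say which locality sensitive hashing family is used. As an illustration only, the sketch below assumes MinHash over word shingles with banding, a common choice for grouping similar documents; the functions `shingles`, `minhash`, `estimated_jaccard` and `lsh_candidates` are hypothetical names, not the authors' implementation.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Set of k-word shingles of a document (assumed tokenisation: whitespace)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash(items, num_hashes=64):
    """MinHash signature: the minimum hash of the set under each seed."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidates(signatures, bands=16):
    """Group ids whose signatures agree on every row of at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}
```

Documents that land in the same bucket become candidates for merging into a coarser cluster, so only candidate pairs need an exact similarity check.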
The synchronous language SIGNAL is a formal specification formalism for developing safety-critical real-time systems. It is a multi-clocked data-flow modeling language suitable for specifying deterministic concurrent behaviors. Thanks to its concurrent semantics, its model of computation and communication matches well the recent trend of executing real-time systems on multi-core processors. The SIGNAL compiler generates code from data-flow specifications while analyzing and verifying safety properties of the system under design, such as deadlock-freedom and determinism. However, most recent work has focused on generating sequential code from SIGNAL. Choosing the parallel library OpenMP as the target, this paper proposes a methodology to generate and verify concurrent code automatically from SIGNAL specifications. This is done by first exploring clock relations among signals by applying a so-called clock calculus. Then, specifications are translated into EDGs (Equation-Dependency Graphs) to analyze global data-dependency relations. An EDG is then partitioned into concurrent tasks to help explore parallelism in the original specification while preserving its semantics. Combined with clock relations, parallel tasks are finally mapped onto the OpenMP structures. The proposed approach is illustrated by a realistic case study. (C) 2021 Elsevier B.V. All rights reserved.
This paper presents the general framework of a parallel cooperative hyper-heuristic optimizer (PCHO) to solve systems of nonlinear algebraic equations with equality and inequality constraints. The algorithm comprises the classical metaheuristics called Genetic Algorithms, Simulated Annealing and Particle Swarm Optimization, whose parameters are adaptively chosen during the executions. A Master-Worker architecture was designed and implemented, where the Master processor ranks the solution candidates informed by the metaheuristics and immediately communicates the most promising candidate to update all Workers. Algorithmic performance was tested with general models, most of them corresponding to PSE process systems. The results confirmed the efficiency of the proposed approach since both online parameter retuning and parallel processing sped up the search. (c) 2021 Elsevier Ltd. All rights reserved.
N-body simulations consist of computing the mutual gravitational forces exerted on each body in O(N). The Barnes-Hut approximation allows processing a group of bodies in O(1) if they are far enough from a given body, which drops the complexity of the whole simulation to O(N log N). The octree is used to ease the pruning process, but at the cost of some irregularity in the access pattern. In a parallel N-body implementation the bodies are partitioned among threads that are executed on multiple cores. A depth-first traversal of the octree is used for processing each body, which causes repeated cache misses during traversal. This paper proposes different types of tiling methods to improve the performance of N-body simulations. It presents an experimental analysis of octree traversal using these tiling methods to identify the potential for cache data reuse. It then evaluates these tiling methods for varying tile sizes, different galaxy sizes and a varying number of threads on several machine architectures. The efficiency of the tiling approaches depends on the chosen tile size. It is shown that a speedup of 8 times can be achieved by choosing the appropriate tile size on a 60-core Intel accelerator. To determine an appropriate tile size, the paper proposes an adaptive tiling approach that implicitly adapts the tile size to the distribution of threads, the cache capacity, cache latency, problem size and dynamic changes in the access pattern over the iterations. The proposed adaptive tiling approach can be used as an optimization option in parallel compilers. (C) 2021 Elsevier B.V. All rights reserved.
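The pruning step described above can be sketched as follows. This is a simplified illustration, not the paper's code: it uses a 2D quadtree instead of an octree, sets G = 1, assumes bodies at distinct positions inside the root cell, and uses the common opening criterion "cell width / distance < theta" as an assumed definition of "far enough". It shows the untiled depth-first traversal only; the tiling methods themselves are not reproduced.

```python
import math

class Node:
    """Square cell of a quadtree (2D stand-in for the octree)."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell centre, half-width
        self.count = 0            # number of bodies inside this cell
        self.mass = 0.0           # total mass inside this cell
        self.mx = self.my = 0.0   # mass-weighted coordinate sums
        self.body = None          # the single body held by a leaf
        self.children = None      # four sub-cells once the cell is split

    def _child_for(self, body):
        x, y, _ = body
        return self.children[(x >= self.cx) + 2 * (y >= self.cy)]

    def insert(self, body):
        """Insert (x, y, mass); positions must be distinct."""
        x, y, m = body
        if self.count == 0:
            self.body = body
        else:
            if self.children is None:  # split a leaf that already holds a body
                h = self.half / 2
                self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                                 for dy in (-1, 1) for dx in (-1, 1)]
                old, self.body = self.body, None
                self._child_for(old).insert(old)
            self._child_for(body).insert(body)
        self.count += 1
        self.mass += m
        self.mx += m * x
        self.my += m * y

def acceleration(node, x, y, theta=0.5):
    """Depth-first traversal returning the acceleration at (x, y)."""
    if node.count == 0:
        return 0.0, 0.0
    if node.children is None:          # leaf: exact pairwise term
        bx, by, bm = node.body
        dx, dy = bx - x, by - y
        d = math.hypot(dx, dy)
        if d == 0.0:
            return 0.0, 0.0            # skip the body itself
        return bm * dx / d ** 3, bm * dy / d ** 3
    dx = node.mx / node.mass - x       # vector to the cell's centre of mass
    dy = node.my / node.mass - y
    d = math.hypot(dx, dy)
    if d > 0.0 and 2 * node.half / d < theta:
        f = node.mass / d ** 3         # far enough: treat the cell as one body
        return f * dx, f * dy
    ax = ay = 0.0
    for child in node.children:        # too close: open the cell
        cax, cay = acceleration(child, x, y, theta)
        ax += cax
        ay += cay
    return ax, ay
```

With `theta=0.0` no cell is ever accepted, so the traversal degenerates to the exact pairwise sum; larger values prune more cells at the cost of accuracy.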
Thunderstorms represent a major hazard for flights, as they compromise the safety of both the airframe and the passengers. To address trajectory planning under thunderstorms, three variants of the scenario-based rapidly exploring random trees (SB-RRTs) are proposed. During an iterative process, the so-called SB-RRT, the SB-RRT* and the Informed SB-RRT* find safe trajectories by meeting a user-defined safety threshold. Additionally, the last two techniques converge to solutions of minimum flight length. Through parallelization on graphics processing units, the required computational times are reduced substantially and become compatible with near real-time operation. The proposed methods are tested considering a kinematic model of an aircraft flying between two waypoints at constant flight level and airspeed; the test scenario is based on a realistic weather forecast and assumed to be described by an ensemble of equally likely members. Lastly, the influence of the number of scenarios, safety margin and iterations on the results is analyzed. Results show that the SB-RRTs are able to find safe and, in two of the algorithms, close-to-optimum solutions.
This paper proposes a method for accelerating an enhanced resolution 3D Multiple Input Multiple Output (MIMO) radar on a Graphics Processing Unit (GPU). Due to the size of the data required for range, bearing, and dop...
ISBN: (print) 9781728174457
K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently because the number of k-mers increases exponentially as k increases. Many algorithms exist for compressed storage of k-mers, but they suffer from slow insert times or are probabilistic, resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers, such as a color in colored de Bruijn graphs or a k-mer count. We present kcollections, a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for C++ and provides set- and map-like structures as well as a k-mer counting data structure, all of which use parallel operations designed around a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers.
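As a rough illustration of k-mer counting in a MapReduce style, the sketch below uses only the Python standard library. It does not use or imitate the kcollections API, and it stores k-mers uncompressed; `kmers`, `split_with_overlap` and `count_kmers` are hypothetical helper names. The map step counts each chunk independently and the reduce step merges the per-chunk counters.

```python
from collections import Counter
from functools import reduce

def kmers(seq, k):
    """Yield every length-k substring of seq, in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def split_with_overlap(seq, k, n_chunks):
    """Split seq so adjacent chunks share k-1 characters.

    Every k-mer then falls into exactly one chunk: none is lost at a
    chunk boundary and none is counted twice.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return []
    step = -(-n_kmers // n_chunks)  # k-mers per chunk, ceiling division
    return [seq[s:s + step + k - 1] for s in range(0, n_kmers, step)]

def count_kmers(seq, k, n_chunks=4):
    """Map: count per chunk (independent work). Reduce: add the counters."""
    partial = (Counter(kmers(chunk, k))
               for chunk in split_with_overlap(seq, k, n_chunks))
    return reduce(lambda a, b: a + b, partial, Counter())
```

For a DNA alphabet there are up to 4^k distinct k-mers, which is the exponential growth that motivates a compressed structure in place of a plain `Counter`.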
ISBN: (print) 9783030410506; 9783030410490
The amount of data generated is increasing exponentially. However, processing data and producing fast results is a technological challenge. Parallel stream processing can be implemented for handling high-frequency and big data flows. The MPI parallel programming model offers low-level and flexible mechanisms for dealing with distributed architectures such as clusters. This paper aims to use it to accelerate video analytics and data visualization applications so that insight can be obtained as soon as the data arrives. Experiments were conducted with a Domain-Specific Language for Geospatial Data Visualization and a Person Recognizer video application. We applied the same stream parallelism strategy and two task distribution strategies. The dynamic task distribution achieved better performance than the static distribution in the HPC cluster. The data visualization application achieved lower throughput than the video analytics application due to its I/O-intensive operations. Also, the MPI programming model shows promising performance outcomes for stream processing applications.
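The difference between the two task distribution strategies can be illustrated with a small simulation. This sketch makes no MPI calls and is not the paper's implementation; it assumes static distribution means round-robin assignment fixed in advance and dynamic distribution means each stream item goes to whichever worker becomes free first. The function names and the cost list in the usage note are made up for illustration.

```python
import heapq

def static_makespan(costs, n_workers):
    """Round-robin: item i is bound to worker i % n_workers in advance."""
    loads = [0.0] * n_workers
    for i, cost in enumerate(costs):
        loads[i % n_workers] += cost
    return max(loads)

def dynamic_makespan(costs, n_workers):
    """On demand: the next item goes to the worker that frees up first."""
    free_at = [0.0] * n_workers    # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)       # earliest idle worker
        heapq.heappush(free_at, start + cost)
    return max(free_at)
```

With per-item costs `[8, 1, 1, 1, 8, 1, 1, 1]` on two workers, the round-robin schedule finishes at time 18 because both expensive items land on the same worker, while the on-demand schedule finishes at time 11. When item costs are uniform the two strategies give the same finish time, so any benefit of dynamic distribution depends on how irregular the workload is.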