In this paper the diakoptics-based branch-tearing method for solving large electric networks, known as the Multi-Area Thevenin Equivalents (MATE), is combined with the alternating method for parallelizing the transient stability simulations of bulk power systems. In the proposed framework, equations associated with dynamic and static devices and the passive network are distributed among computing processes. The paper discusses an implementation of the parallel transient stability simulator along with results for two power systems with about 4,000 and 15,000 buses, both based on the Brazilian Interconnected Power System. Performance metrics for assessing the effectiveness of the proposed methodology are also presented and discussed. In order to validate the implementation, the results are also compared with those from ANATEM, an industrial-grade transient stability program developed by CEPEL.
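The MATE idea can be illustrated on a toy two-area network: each area is reduced to a Thevenin equivalent seen from the tie (link) branch, the link current is solved from a single scalar equation, and each area can then be updated independently. A minimal Python sketch under that simplification; all names and numbers are illustrative, not taken from the paper:

```python
# Toy MATE-style solve: two areas joined by one tie branch.
# Each area is reduced to a Thevenin equivalent (E_th, R_th)
# as seen from the tie branch; the link current then follows
# from one scalar equation, after which the areas could be
# updated in parallel.

def thevenin(e_source, r_series):
    """Hypothetical per-area reduction; here each area is already a
    single source behind an impedance, so the equivalent is trivial."""
    return e_source, r_series

def link_current(area_a, area_b, r_link):
    e_a, r_a = area_a
    e_b, r_b = area_b
    return (e_a - e_b) / (r_a + r_b + r_link)

area_1 = thevenin(1.05, 0.1)   # per-unit values, illustrative only
area_2 = thevenin(1.00, 0.2)
i_link = link_current(area_1, area_2, 0.2)
print(i_link)                  # tie-branch current in per unit
```

In the full method each Thevenin reduction is itself a network solve inside one computing process, which is what makes the scheme parallel.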
We present a modular approach to implementing dynamic algorithm switching for parallel scientific simulations. Our approach leverages modern software engineering techniques to implement fine-grained control of algorithmic behavior in scientific simulations, as well as to improve modularity when integrating the algorithm-switching functionality into existing application source code. Through fine-grained control of functional behavior in an application, our approach enables the design and implementation of application-specific dynamic algorithm switching scenarios. To ensure modularity, our approach treats dynamic algorithm switching as a separate concern with regard to a given application and encourages separate development and transparent integration of the switching functionality without directly modifying the original application code. By applying and evaluating our approach with a real-world scientific application to switch its simulation algorithms dynamically, we demonstrate the applicability and effectiveness of our approach to constructing efficient parallel simulations.
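The separation of concerns the abstract describes can be sketched with a small registry pattern: the application calls one entry point, while the switching policy lives entirely outside the application code. A hedged Python illustration; the names and the policy are invented for this sketch and are not the paper's API:

```python
# Dynamic algorithm switching as a separate concern: the
# simulation only calls advance(); which algorithm runs is
# decided by a policy kept apart from the application code.

_algorithms = {}

def register(name):
    def deco(fn):
        _algorithms[name] = fn
        return fn
    return deco

@register("coarse")
def integrate_coarse(state):
    return state + 1.0          # cheap, low-accuracy stand-in

@register("fine")
def integrate_fine(state):
    return state + 0.999        # expensive, high-accuracy stand-in

def pick_algorithm(step):
    # Illustrative switching policy: refine after step 100.
    return "fine" if step > 100 else "coarse"

def advance(state, step):
    # The application sees only this call; the switching concern is
    # confined to pick_algorithm() and the registry above.
    return _algorithms[pick_algorithm(step)](state)

state = 0.0
for step in (1, 50, 101):
    state = advance(state, step)
```

Swapping the policy or adding an algorithm requires no change to the loop that drives the simulation, which is the modularity property the paper targets.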
The increasing size of Big Data is often heralded, but how data are transformed and represented is also profoundly important to knowledge discovery, and this is exemplified in Big Graph analytics. Much attention has been placed on the scale of the input graph, but the product of a graph algorithm can be many times larger than the input. This is true for many graph problems, such as listing all triangles in a graph. Enabling scalable graph exploration for Big Graphs requires new approaches to algorithms, architectures, and visual analytics. A brief tutorial is given to aid the argument for thoughtful representation of data in the context of graph analysis. Then a new algebraic method to reduce the arithmetic operations in counting and listing triangles in graphs is introduced. Additionally, a scalable triangle listing algorithm in the MapReduce model is presented, followed by a description of the experiments with that algorithm that led to the largest and fastest triangle listing benchmarks to date. Finally, a method for identifying triangles in new visual graph exploration technologies is proposed.
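The triangle-listing problem the abstract refers to can be stated compactly: for every edge (u, v), any common neighbor w closes a triangle. A minimal sequential Python sketch of this neighborhood-intersection formulation, not the paper's MapReduce algorithm:

```python
# List all triangles in an undirected graph by intersecting the
# neighbor sets of each edge's endpoints. Enforcing u < v < w
# ensures each triangle is emitted exactly once. Note that the
# output can be far larger than the edge list, which is the
# scaling point the abstract makes.

def triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = []
    for u, v in edges:
        u, v = min(u, v), max(u, v)
        for w in adj[u] & adj[v]:
            if w > v:                  # canonical order u < v < w
                found.append((u, v, w))
    return found

# K4 (the complete graph on 4 vertices) contains 4 triangles.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(triangles(k4_edges)))  # → 4
```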
This paper presents a new metaprogramming library, CL_ARRAY, that offers multiplatform and generic multidimensional data containers for C++ specifically adapted for parallel programming. The CL_ARRAY containers are built around a new formalism for representing the multidimensional nature of data as well as the semantics of multidimensional pointers and contiguous data structures. We also present OCL_ARRAY VIEW, a concept based on metaprogrammed enveloped objects that supports multidimensional transformations and multidimensional iterators designed to simplify and formalize the interfacing process between OpenCL APIs, standard template library (STL) algorithms and CL_ARRAY containers. Our results demonstrate improved performance and energy savings over the three most popular container libraries available to the developer community for use in the context of multilinear algebraic applications. (C) 2017 Elsevier Ltd. All rights reserved.
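The container semantics described above rest on the standard row-major mapping from a multidimensional index to an offset in a contiguous buffer. A toy Python illustration of that layout contract; this is not CL_ARRAY's implementation, only the addressing rule any contiguous multidimensional container must honor:

```python
# Row-major (C-order) addressing: a multidimensional index maps
# to a flat offset through per-dimension strides, computed as
# running products of the trailing extents.

def row_major_strides(shape):
    strides, acc = [], 1
    for extent in reversed(shape):
        strides.append(acc)
        acc *= extent
    return list(reversed(strides))

def flat_offset(index, strides):
    return sum(i * s for i, s in zip(index, strides))

shape = (2, 3, 4)                      # a 2x3x4 array
strides = row_major_strides(shape)     # [12, 4, 1] for this shape
buffer = list(range(2 * 3 * 4))        # contiguous storage 0..23
print(buffer[flat_offset((1, 2, 3), strides)])  # last element → 23
```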
Computing evolutionary relationships on data sets containing hundreds to thousands of taxa easily becomes a daunting task. With recent advances in next-generation sequencing technologies, biological data sets are growing at an unprecedented pace, which makes it much harder, in terms of both complexity and scale, to conduct analyses over such large data sets. Phylogenetics therefore requires new algorithms, methods, and tools that take advantage of parallel hardware and can handle the unprecedented growth of biological data. In this paper, we present parallel SuperFine, a tool for fast and accurate supertree estimation, and its features. The tool was derived from SuperFine, a state-of-the-art supertree (meta)method. We describe an extension made to SuperFine that significantly improves its performance, and how the EPIC framework is used to boost the overall performance of parallel SuperFine. Additionally, we pinpoint current limitations that prevent even better performance. Our studies reveal that parallel SuperFine significantly reduces the time required to perform supertree estimation. Moreover, we show that parallel SuperFine exhibits good scalability, even in the presence of asymmetric biological data sets. Furthermore, the achieved results show that this radical improvement in performance does not impair tree accuracy, which is a key issue in phylogenetic inference. (C) 2016 Elsevier B.V. All rights reserved.
The aim of this paper is to evaluate the performance of two newer CUDA mechanisms, unified memory and dynamic parallelism, for real parallel applications compared to standard CUDA API versions. In order to gain insight into the performance of these mechanisms, we implemented three applications with control and data flow typical of SPMD, geometric SPMD and divide-and-conquer schemes, which were then used for tests and experiments. Specifically, the tested applications include verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration. We experimented with various ways in which dynamic parallelism can be deployed into an existing implementation and optimized further. Subsequently, we compared the best dynamic parallelism and unified memory versions to their respective standard API counterparts. Dynamic parallelism improved performance for the heat simulation, performed better than the static version but worse than an iterative version for numerical integration, and gave worse results for Goldbach's conjecture verification. In most cases, unified memory results in a decrease in performance. On the other hand, both mechanisms can contribute to simpler and more readable code. For dynamic parallelism, this applies to algorithms in which it can be naturally applied. Unified memory generally makes it easier for a programmer to enter the CUDA programming paradigm, as it resembles the traditional memory allocation/usage pattern.
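The CUDA kernels themselves are beyond a short sketch, but the first test application, verification of Goldbach's conjecture, is easy to state in plain Python. A sequential reference for the computation the GPU versions parallelize (the GPU code partitions the even numbers across threads instead of looping):

```python
# Goldbach verification: every even n >= 4 should be expressible
# as the sum of two primes. A sieve supplies primality tests;
# goldbach_holds() then searches for one valid decomposition.

def sieve(limit):
    is_prime = [False, False] + [True] * (limit - 1)
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return is_prime

def goldbach_holds(n, is_prime):
    return any(is_prime[p] and is_prime[n - p]
               for p in range(2, n // 2 + 1))

limit = 1000
primes = sieve(limit)
print(all(goldbach_holds(n, primes)
          for n in range(4, limit + 1, 2)))  # → True
```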
In this work we present the first design and implementation of a wait-free hash map. Our multiprocessor data structure allows a large number of threads to concurrently insert, get, and remove information. Wait-freedom means that all threads make progress in a finite amount of time, an attribute that can be critical in real-time environments. This is opposed to the traditional blocking implementations of shared data structures, which suffer from the negative impact of deadlock and related correctness and performance issues. We only use atomic operations that are provided by the hardware; therefore, our hash map can be utilized by a variety of data-intensive applications, including those within the domains of embedded systems and supercomputers. The challenges of providing this guarantee make the design and implementation of wait-free objects difficult. As such, there are few wait-free data structures described in the literature; in particular, there are no wait-free hash maps. It often becomes necessary to sacrifice performance in order to achieve wait-freedom. However, our experimental evaluation shows that our hash map design is, on average, 7 times faster than a traditional blocking design. Our solution outperforms the best available alternative non-blocking designs in a large majority of cases, typically by a factor of 15 or higher.
Linear spectral unmixing is one of the hottest research topics within the hyperspectral imaging community today, as evidenced by the vast number of papers about this challenging task in the scientific literature. A subset of these works is devoted to accelerating previously published unmixing algorithms for application under tight time constraints. For this purpose, hyperspectral unmixing algorithms are typically implemented on high-performance computing architectures in which the operations involved are executed in parallel, which reduces the time required to unmix a given hyperspectral image with respect to the sequential versions of these algorithms. The speedup factors that can be achieved on these high-performance computing platforms heavily depend on the inherent level of parallelism of the algorithms executed on them. However, the majority of state-of-the-art unmixing algorithms were not originally conceived to be parallelized at a later stage, which clearly restricts the acceleration that can be reached. As advanced hyperspectral sensors attain increasingly high spatial, spectral, and temporal resolutions, it becomes mandatory to follow a new approach: developing a class of highly parallel unmixing solutions that can take full advantage of the characteristics of today's high-performance computing architectures. This paper represents a step in this direction, as it proposes a new parallel algorithm for fully unmixing a hyperspectral image, together with its implementation on two different NVIDIA graphics processing units (GPUs). The results obtained reveal that our proposal is able to unmix hyperspectral images with very different spatial patterns and sizes better and much faster than the best GPU-based unmixing chains published to date, independently of the characteristics of the selected GPU.
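Linear unmixing models each pixel spectrum as a combination of endmember spectra; recovering the abundances is, at its core, a per-pixel least-squares solve, which is what makes the problem so amenable to GPU parallelism. A toy unconstrained two-endmember example via the normal equations; this is only the algebraic core, not the paper's algorithm, and real unmixing chains add nonnegativity and sum-to-one constraints:

```python
# Linear mixing model: pixel = a1 * e1 + a2 * e2. With two
# endmembers, the unconstrained least-squares abundances solve
# the 2x2 normal equations G a = b, where G holds endmember
# inner products and b the endmember-pixel inner products.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unmix_two(e1, e2, pixel):
    g11, g12, g22 = dot(e1, e1), dot(e1, e2), dot(e2, e2)
    b1, b2 = dot(e1, pixel), dot(e2, pixel)
    det = g11 * g22 - g12 * g12
    a1 = (g22 * b1 - g12 * b2) / det
    a2 = (g11 * b2 - g12 * b1) / det
    return a1, a2

e1 = [1.0, 0.0, 1.0]           # illustrative endmember spectra
e2 = [0.0, 1.0, 1.0]
pixel = [0.3, 0.7, 1.0]        # exactly 0.3 * e1 + 0.7 * e2
print(unmix_two(e1, e2, pixel))
```

Because every pixel is solved independently against the same endmember matrix, the work maps naturally onto one GPU thread (or thread block) per pixel.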
The nonlinear signal propagation in fibers can be described by the nonlinear Schrödinger equation and the Manakov equation. Most commonly, split-step Fourier methods (SSFM) are applied to solve these nonlinear equations. The numerical simulation of the nonlinear signal propagation is especially challenging for multimode fibers, particularly if the calculation of very small step sizes or a large number of steps is required. Instead of utilizing SSFM, the fourth-order Runge-Kutta in the Interaction Picture (RK4IP) method can be applied. This method has the potential to reduce the numerical error while simultaneously allowing an increased step size. These advantages come at the price of a higher numerical effort compared to the SSFM method for the same step size. Since the simulation of the signal propagation in multimode fibers is already quite challenging, parallelization becomes an even more interesting option. We demonstrate the adaptation of the RK4IP method to simulate the nonlinear signal propagation in multimode fibers, including its parallelization. Besides comparing the performance of a parallelized implementation for multicore CPUs and a GPU-accelerated version, we discuss efficient strategies to implement the RK4IP method on a GPU accelerator with CUDA. In addition, the RK4IP implementation is numerically compared with a conventional SSFM implementation.
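RK4IP combines a change of variables into the interaction picture with a classical fourth-order Runge-Kutta step; the interaction-picture transform is fiber-specific, but the RK4 core is standard. A minimal Python reminder of that core on a scalar test ODE (not the fiber model itself), showing the fourth-order accuracy that lets RK4IP take larger steps than SSFM:

```python
import math

# Classical fourth-order Runge-Kutta step, the time-stepping core
# that RK4IP wraps around the interaction-picture transform. Here
# it integrates the scalar test problem y' = -y, y(0) = 1, whose
# exact solution is exp(-t); in RK4IP, f is the (transformed)
# nonlinear propagation operator instead.

def rk4_step(f, t, y, h):
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

f = lambda t, y: -y
t, y, h = 0.0, 1.0, 0.1
for _ in range(10):                    # integrate to t = 1
    y = rk4_step(f, t, y, h)
    t += h
print(abs(y - math.exp(-1.0)))         # O(h^4)-small global error
```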
bsp is a bridging model between abstract execution and concrete parallel systems. The structure and abstraction brought by bsp allow portable parallel programs with scalable performance predictions, without dealing with the low-level details of architectures. In the past, we designed bsml for programming bsp algorithms in ml. However, the simplicity of the bsp model does not fit the complexity of today's hierarchical architectures, such as clusters of machines with multiple multi-core processors. The multi-bsp model is an extension of the bsp model which brings a tree-based view of the nested components of hierarchical architectures. To program multi-bsp algorithms in ml, we propose the multi-ml language as an extension of bsml, where a specific kind of recursion is used to go through a hierarchy of computing nodes. We define a formal semantics of the language and present preliminary experiments which show performance improvements with respect to bsml.
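A bsp computation is a sequence of supersteps: local computation, then communication, then a global barrier. A small sequential simulation of that structure in Python, computing a sum by pairwise reduction (illustrative only; multi-bsp additionally nests this scheme over a tree of components, and bsml/multi-ml express it in ml):

```python
# Sequential simulation of a BSP reduction: each while-iteration
# is one superstep. In the communication phase the upper half of
# the "processors" send their values to the lower half; the
# barrier is implicit because supersteps run one after another.

def bsp_reduce(values):
    while len(values) > 1:
        half = len(values) // 2
        # communication phase: processor i + half sends to processor i
        received = values[half:2 * half]
        # ---- barrier: all messages delivered before the next superstep
        values = ([values[i] + received[i] for i in range(half)]
                  + values[2 * half:])   # odd leftover carries over
    return values[0]

print(bsp_reduce([1, 2, 3, 4]))  # → 10
```

The number of supersteps is logarithmic in the number of processors, which is the kind of cost a bsp performance model can predict portably.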