This paper presents a new metaprogramming library, CL_ARRAY, that offers multiplatform and generic multidimensional data containers for C++ specifically adapted for parallel programming. The CL_ARRAY containers are built around a new formalism for representing the multidimensional nature of data as well as the semantics of multidimensional pointers and contiguous data structures. We also present OCL_ARRAY VIEW, a concept based on metaprogrammed enveloped objects that supports multidimensional transformations and multidimensional iterators designed to simplify and formalize the interfacing process between OpenCL APIs, standard template library (STL) algorithms and CL_ARRAY containers. Our results demonstrate improved performance and energy savings over the three most popular container libraries available to the developer community for use in the context of multilinear algebraic applications. © 2017 Elsevier Ltd. All rights reserved.
The aim of this paper is to evaluate the performance of two new CUDA mechanisms, unified memory and dynamic parallelism, in real parallel applications compared to standard CUDA API versions. To gain insight into the performance of these mechanisms, we implemented three applications with control and data flow typical of SPMD, geometric SPMD and divide-and-conquer schemes, which were then used for tests and experiments. Specifically, the tested applications are verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration. We experimented with various ways in which dynamic parallelism can be deployed into an existing implementation and optimized further. Subsequently, we compared the best dynamic parallelism and unified memory versions to their respective standard API counterparts. Dynamic parallelism improved performance for the heat simulation; for numerical integration it was better than a static version but worse than an iterative one; and for Goldbach's conjecture verification it gave worse results. In most cases, unified memory decreases performance. On the other hand, both mechanisms can contribute to simpler and more readable code. For dynamic parallelism, this holds for algorithms to which it can be naturally applied. Unified memory generally makes it easier for a programmer to enter the CUDA programming paradigm, as it resembles the traditional memory allocation and usage pattern.
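To make the divide-and-conquer control flow of adaptive numerical integration concrete, here is a minimal sequential sketch in Python. It is not the paper's CUDA code: the abstract does not say which quadrature rule was used, so adaptive Simpson is an assumption, and the names `simpson` and `adaptive_integrate` are hypothetical. The two recursive calls are independent of each other, which is the kind of structure that dynamic parallelism could, in principle, map to child kernel launches.

```python
import math

def simpson(f, a, b):
    """Simpson's rule on the single interval [a, b]."""
    mid = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(mid) + f(b))

def adaptive_integrate(f, a, b, tol=1e-9, depth=50):
    """Divide-and-conquer integration: refine only where the error estimate is too large.

    Assumption: adaptive Simpson is used here purely for illustration.
    """
    mid = 0.5 * (a + b)
    whole = simpson(f, a, b)
    left = simpson(f, a, mid)
    right = simpson(f, mid, b)
    # Error estimate: difference between the coarse and the refined approximation.
    error = left + right - whole
    if depth == 0 or abs(error) <= 15.0 * tol:
        # Accept the refined value, with the usual Richardson correction term.
        return left + right + error / 15.0
    # The two halves are independent subproblems (candidates for parallel execution).
    return (adaptive_integrate(f, a, mid, tol / 2.0, depth - 1)
            + adaptive_integrate(f, mid, b, tol / 2.0, depth - 1))
```

For example, `adaptive_integrate(math.sin, 0.0, math.pi)` approximates the exact value 2. The recursion depth varies across the interval, so the amount of work per subinterval is not known in advance; this irregularity is what distinguishes the scheme from the regular SPMD applications listed in the abstract.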
In this paper, the diakoptics-based branch-tearing method for solving large electric networks, known as the Multi-Area Thevenin Equivalents (MATE), is combined with the alternating method for parallelizing the transient stability simulations of bulk power systems. In the proposed framework, the equations associated with dynamic and static devices and the passive network are distributed among computing processes. The paper discusses an implementation of the parallel transient stability simulator along with results for two power systems with about 4,000 and 15,000 buses, both based on the Brazilian Interconnected Power System. Performance metrics for assessing the effectiveness of the proposed methodology are also presented and discussed. To validate the implementation, the results are also compared with those from an industrial-grade transient stability program, ANATEM, developed by CEPEL.
The nonlinear signal propagation in fibers can be described by the nonlinear Schrödinger equation and the Manakov equation. Most commonly, split-step Fourier methods (SSFM) are applied to solve these nonlinear equations. The numerical simulation of the nonlinear signal propagation is especially challenging for multimode fibers, particularly if very small step sizes or a large number of steps are required. Instead of utilizing SSFM, the fourth-order Runge-Kutta in the Interaction Picture (RK4IP) method can be applied. This method has the potential to reduce the numerical error while simultaneously allowing an increased step size. These advantages come at the price of a higher numerical effort compared to SSFM for the same step size. Since the simulation of the signal propagation in multimode fibers is already quite challenging, parallelization becomes an even more interesting option. We demonstrate the adaptation of the RK4IP method to simulate the nonlinear signal propagation in multimode fibers, including its parallelization. Besides comparing the performance of a parallelized implementation for multicore CPUs and a GPU-accelerated version, we discuss efficient strategies to implement the RK4IP method on a GPU accelerator with CUDA. In addition, the RK4IP implementation is numerically compared with a conventional SSFM implementation.
A fast multipole method (FMM)/graphics processing unit-accelerated boundary element method (BEM) for computational magnetics and electrostatics via the Laplace equation is presented. The BEM is an integral method, but the FMM is typically designed around monopole and dipole sources. To apply the FMM to the integral expressions in the BEM, the internal data structures and logic of the FMM must be changed. However, this can be difficult. For example, computing the multipole expansions due to the boundary elements requires computing single and double surface integrals over them. Moreover, FMM codes for monopole and dipole sources are widely available and highly optimized. This paper describes a method for applying the FMM unchanged to the integral expressions in the BEM. This method, called the correction factor matrix method, works by approximating the integrals using a quadrature. The quadrature points are treated as monopole and dipole sources, which can be plugged directly into current FMM codes. The FMM is effectively treated as a black box. Inaccuracies from the quadrature are corrected during a correction factor step. The method is derived, and example problems are presented showing accuracy and performance.
In this work we present the first design and implementation of a wait-free hash map. Our multiprocessor data structure allows a large number of threads to concurrently insert, get, and remove information. Wait-freedom means that all threads make progress in a finite amount of time, an attribute that can be critical in real-time environments. This is opposed to the traditional blocking implementations of shared data structures, which suffer from the negative impact of deadlock and related correctness and performance issues. We only use atomic operations that are provided by the hardware; therefore, our hash map can be utilized by a variety of data-intensive applications, including those within the domains of embedded systems and supercomputers. The challenges of providing this guarantee make the design and implementation of wait-free objects difficult. As such, there are few wait-free data structures described in the literature; in particular, there are no wait-free hash maps. It often becomes necessary to sacrifice performance in order to achieve wait-freedom. However, our experimental evaluation shows that our hash map design is, on average, 7 times faster than a traditional blocking design. Our solution outperforms the best available alternative non-blocking designs in a large majority of cases, typically by a factor of 15 or higher.
SyDPaCC is a set of libraries for the Coq proof assistant. It allows one to write naive functional programs (i.e. with high complexity) that are considered as specifications, and to transform them into more efficient versions. These more efficient versions can then be automatically parallelised before being extracted from Coq into source code for the functional language OCaml, together with calls to the Bulk Synchronous Parallel ML library. In this paper we present a new core version of SyDPaCC for the development of correct-by-construction parallel programs using the theory of list homomorphisms and algorithmic skeletons implemented and verified in Coq. The framework is illustrated on the maximum prefix sum problem.
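The maximum prefix sum problem is a standard illustration of list homomorphisms, and the idea can be sketched outside Coq. The Python below is only an informal analogue, not SyDPaCC code or its extracted OCaml: the function names are hypothetical, and the pairing of the maximum prefix sum with the total sum is the textbook construction, assumed here rather than taken from the abstract. A naive specification is placed next to a map-and-reduce version whose combining operator is associative, which is what permits chunk-wise parallel evaluation.

```python
from functools import reduce

def mps_spec(xs):
    """Naive specification: largest sum over all prefixes (the empty prefix counts as 0)."""
    return max(sum(xs[:i]) for i in range(len(xs) + 1))

def single(x):
    """Image of a one-element list: (maximum prefix sum, total sum)."""
    return (max(0, x), x)

def combine(left, right):
    """Associative operator merging the results of two adjacent sublists."""
    m1, s1 = left
    m2, s2 = right
    return (max(m1, s1 + m2), s1 + s2)

def mps_homomorphism(xs):
    """Map `single` over the list, then reduce with `combine`; (0, 0) is the neutral element."""
    return reduce(combine, map(single, xs), (0, 0))[0]

def mps_chunked(xs, chunks):
    """Simulate a parallel run: reduce each chunk independently, then merge the partial results."""
    size = max(1, -(-len(xs) // chunks))  # ceiling division
    partials = [reduce(combine, map(single, xs[i:i + size]), (0, 0))
                for i in range(0, len(xs), size)]
    return reduce(combine, partials, (0, 0))[0]
```

On the list `[1, -2, 3, 4, -10, 6]` all three functions return 6. The specification recomputes a sum for every prefix, while the homomorphic version makes a single pass, and `mps_chunked` shows that the per-chunk reductions could be done by separate processors because `combine` is associative.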
Computing evolutionary relationships on data sets containing hundreds to thousands of taxa easily becomes a daunting task. With recent advances in next-generation sequencing technologies, biological data sets are growing at an unprecedented pace. This growth makes analyses over such large data sets much harder to conduct, in terms of either complexity or scale. Therefore, phylogenetics requires new algorithms, methods, and tools to take advantage of parallel hardware and to handle the unprecedented growth of biological data. In this paper, we present parallel SuperFine, a tool for fast and accurate supertree estimation, and its features. Parallel SuperFine was derived from SuperFine, a state-of-the-art supertree (meta)method. We describe an extension made to SuperFine that significantly improves its performance, and how the EPIC framework is used to boost the overall performance of parallel SuperFine. Additionally, we pinpoint current limitations that prevent it from attaining even better performance. Our studies reveal that parallel SuperFine significantly reduces the time required to perform supertree estimation. Moreover, we show that parallel SuperFine exhibits good scalability, even in the presence of asymmetric biological data sets. Furthermore, the results show that the radical improvement in performance does not impair tree accuracy, which is a key issue in phylogenetic inference. © 2016 Elsevier B.V. All rights reserved.
Dynamic dataflow allows simultaneous execution of instructions in different iterations of a loop, boosting parallelism exploitation. In this model, operands are tagged with their associated instance number, which is incremented as they go through the loop. Instruction execution is triggered when all input operands with the same tag become available. However, this traditional tagging mechanism often requires the generation of several control instructions to manipulate tags and guarantee the correct match. To address this problem, this work presents three dataflow loop optimisation techniques. Stack-tagged dataflow is a tagging mechanism that uses stacks of tags to reduce control overheads in dataflow. On the other hand, as nested loops may increase the overhead of stack-tag comparison, tag resetting can be used to set the tag to zero whenever it is safe, allowing a one-level reduction in the stack depth. Finally, loop skipping further avoids stack comparison overhead in loops when the number of iterations can be determined by the compiler. Experimental results show the overhead, drawbacks and benefits of the three optimisations presented. Moreover, the results suggest that a hybrid compiling approach can be used to get the best performance from each technique.
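The firing rule described above (an instruction executes once all of its input operands carrying the same tag have arrived) can be illustrated with a small matching store. This Python sketch models only the traditional tagging scheme, not the stack-tagged, tag-resetting or loop-skipping optimisations; the class `MatchingStore` and its `deliver` method are hypothetical names introduced for illustration.

```python
from collections import defaultdict

class MatchingStore:
    """Holds operands until every input port of an instruction has a value with the same tag."""

    def __init__(self, arity):
        self.arity = arity                 # instruction name -> number of input ports
        self.waiting = defaultdict(dict)   # (instruction, tag) -> {port: value}

    def deliver(self, instruction, tag, port, value):
        """Store one tagged operand; return the operand list if the instruction can fire."""
        key = (instruction, tag)
        slot = self.waiting[key]
        slot[port] = value
        if len(slot) == self.arity[instruction]:
            # All ports for this tag are filled: release the operands and free the slot.
            del self.waiting[key]
            return [slot[p] for p in range(self.arity[instruction])]
        return None
```

With `store = MatchingStore({"add": 2})`, delivering port 0 for tag 0 and then port 1 for tag 1 fires nothing, because the two operands belong to different loop iterations; the instruction fires for tag 0 only when its own port 1 arrives. Operands from several iterations can therefore be in flight at once, which is the source of the parallelism the abstract refers to.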
BSP is a bridging model between abstract execution and concrete parallel systems. The structure and abstraction brought by BSP allow for portable parallel programs with scalable performance predictions, without dealing with low-level details of architectures. In the past, we designed BSML for programming BSP algorithms in ML. However, the simplicity of the BSP model does not fit the complexity of today's hierarchical architectures, such as clusters of machines with multiple multi-core processors. The Multi-BSP model is an extension of the BSP model which brings a tree-based view of nested components of hierarchical architectures. To program Multi-BSP algorithms in ML, we propose the Multi-ML language as an extension of BSML where a specific kind of recursion is used to go through a hierarchy of computing nodes. We define a formal semantics of the language and present preliminary experiments which show performance improvements with respect to BSML.