Purpose: To present the implementation of a new option for parallel processing of the EGSnrc Monte Carlo system using the OpenMP API, as an alternative to the provided method based on the use of a batch queuing system (BQS). Methods: The parallel solution presented, called OMP_EGS, makes use of OpenMP features to control the workload distribution between the compute units. These features were inserted into the original EGSnrc source code through properly defined macros. In order to validate the platform, the possibility of producing results in exact agreement with the serial implementation was assessed. The performance of OMP_EGS was evaluated against the BQS method in terms of parallel speedup and efficiency. Results: As the OpenMP features can be activated or deactivated depending on the compilation options, the implementation of the platform allowed the direct recovery of the original serial implementation. The validation tests showed that OMP_EGS was able to reproduce the exact same results as the serial implementation. The performance and scalability tests showed that OMP_EGS is a better alternative than the EGSnrc BQS parallel implementation, both in terms of runtime and parallel efficiency. Conclusions: The presented solution has several advantages over the BQS-based parallel implementation available for the EGSnrc system. One of the main advantages is that, in contrast to the BQS alternative, it can be implemented using different compilers and operating systems, making it a compact and portable solution that can be used on a wide range of working environments. It does not introduce artifacts into the simulated distributions, as it only handles the distribution of work among the available computing resources, and it proved to have better performance. (C) 2017 American Association of Physicists in Medicine.
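The abstract does not reproduce OMP_EGS's macros, so the following is only a minimal sketch of the general pattern it describes: OpenMP worksharing over independent particle histories, compiled so that disabling OpenMP recovers the serial loop. All identifiers and the toy "physics" are illustrative assumptions, not EGSnrc code.

```cpp
// Minimal sketch of history-level OpenMP parallelism for a Monte Carlo loop;
// all names and the toy "physics" are hypothetical, not taken from EGSnrc.
#include <cmath>
#include <cstdio>

// Stand-in for one particle history: its contribution depends only on its
// seed, never on thread scheduling, which is what makes the parallel and
// serial runs statistically equivalent.
double simulate_history(long seed) {
    double x = std::sin(seed * 0.001);
    return x * x;
}

int main() {
    const long n_histories = 1000000;
    double dose = 0.0;

    // Compiled without OpenMP, the pragma is ignored and the original serial
    // loop is recovered, mirroring the macro-based activation described above.
    #pragma omp parallel for reduction(+ : dose)
    for (long h = 0; h < n_histories; ++h)
        dose += simulate_history(h);

    std::printf("total score: %f\n", dose);
    return 0;
}
```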
Power distribution networks operate in a radial topology, but also include extra tie switches to allow for their reconfiguration in case of scheduled maintenance or unexpected failure. With the implementation of the smart grid and the development of fast high-power switching devices, it is now possible to automate this reconfiguration to also adjust to demand fluctuation and always operate the network in the optimal topology, minimizing power transmission losses. This automation requires the development of highly efficient and powerful optimization algorithms that can compute the optimal configuration with minimum delay. This paper presents a parallel genetic algorithm on graphics processing unit for distribution feeder reconfiguration. By exploiting the massively parallel architecture of graphics processors, the execution time of the solver is reduced by a factor of 66.2, resulting in a very fast solver. Moreover, the metaheuristic uses a unique solution encoding based on the minimum spanning tree to maintain the radial structure of the candidate topologies. This novel encoding drastically improves the effectiveness of the genetic algorithm and allows for the optimal reconfiguration of networks of up to 4400 buses, five times larger than any network in the surveyed references.
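The encoding itself is not detailed in the abstract; a common way to realize an MST-based encoding (a sketch under assumed details, not necessarily the paper's exact scheme) is to let the genome assign a weight to every candidate line and decode it with Kruskal's algorithm, which yields a radial topology by construction:

```cpp
// Sketch of MST-based genome decoding for feeder reconfiguration: the genome
// weights every candidate line; Kruskal's algorithm keeps exactly the tree
// edges, so every decoded topology is radial (cycle-free) by construction.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

struct Edge { int u, v; double gene_weight; };

struct DSU {  // union-find structure for cycle detection
    std::vector<int> p;
    explicit DSU(int n) : p(n) { std::iota(p.begin(), p.end(), 0); }
    int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        p[a] = b;
        return true;
    }
};

std::vector<Edge> decode(int n_buses, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(), [](const Edge& a, const Edge& b) {
        return a.gene_weight < b.gene_weight;
    });
    DSU dsu(n_buses);
    std::vector<Edge> closed;                  // switches to close
    for (const Edge& e : edges)
        if (dsu.unite(e.u, e.v)) closed.push_back(e);
    return closed;                             // n_buses - 1 edges: radial
}

int main() {
    // 4-bus toy network with one tie switch creating a loop (edge 1-3).
    std::vector<Edge> lines = {{0, 1, 0.2}, {1, 2, 0.9}, {2, 3, 0.4}, {1, 3, 0.7}};
    for (const Edge& e : decode(4, lines))
        std::printf("close switch %d-%d\n", e.u, e.v);
    return 0;
}
```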
Graphs can be used to model many kinds of data, from traditional datasets to social networks or semi-structured datasets. To process large graphs, many systems have been proposed. The Pregel programming model is popular thanks to its scalability. Although Pregel is simple to understand and use, it is low-level, requiring developers to write programs that are hard to maintain and must be carefully optimized. On the other hand, structural recursion is a powerful tool for systematically constructing efficient parallel programs on lists, arrays, and trees, but it has not yet been applied to graphs. In this paper, we propose an efficient method for parallel evaluation of structural recursion on graphs, which is suitable for Pregel. We design and implement a high-level parallel programming framework where a domain-specific language (DSL) is provided to ease the programming task. Specifications written in the DSL are automatically compiled into Pregel programs that are scalable to large graphs. Experimental results show that our framework outperforms the original evaluation of structural recursion, and achieves good scalability and speedup on real datasets.
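The DSL and its compilation scheme are beyond the abstract; as a reference point for the target model, a Pregel computation is a sequence of supersteps in which each vertex updates its value from incoming messages and sends new messages to its neighbours until all vertices halt. A minimal single-process sketch (min-label propagation, i.e., connected components; the toy graph and loop structure are assumptions):

```cpp
// Minimal single-process sketch of the Pregel superstep model: min-label
// propagation, i.e., connected-components labelling.
#include <cstdio>
#include <vector>

int main() {
    // Undirected toy graph as adjacency lists: components {0,1,2} and {3,4}.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1}, {4}, {3}};
    const int n = (int)adj.size();
    std::vector<int> label(n);
    for (int v = 0; v < n; ++v) label[v] = v;

    bool active = true;
    while (active) {                        // one iteration == one superstep
        active = false;
        std::vector<int> incoming = label;  // "messages": neighbours' labels
        for (int v = 0; v < n; ++v)         // every vertex runs the same program
            for (int u : adj[v])
                if (incoming[u] < label[v]) {
                    label[v] = incoming[u]; // vertex program: keep the minimum
                    active = true;          // value changed, stay active
                }
    }
    for (int v = 0; v < n; ++v)
        std::printf("vertex %d -> component %d\n", v, label[v]);
    return 0;
}
```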
Genetic programming (GP) is a computationally intensive technique which also has a high degree of natural parallelism. Parallel computing architectures have become commonplace, especially with regard to Graphics Processing Units (GPUs). Hence, versions of GP have been implemented that utilise these highly parallel computing platforms, enabling significant gains in the computational speed of GP. However, recently a two-dimensional stack approach to GP using a multi-core CPU also demonstrated considerable performance gains, with performance equivalent to or exceeding that achieved by a GPU. This paper demonstrates that a similar two-dimensional stack approach can also be applied to a GPU-based approach to GP to better exploit the underlying technology. Performance gains are achieved over a standard single-dimensional stack approach when utilising a GPU. Overall, a peak computational speed of over 55 billion Genetic Programming Operations per Second is observed, a twofold improvement over the best GPU-based single-dimensional stack approach from the literature.
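The paper's GPU kernel layout is not given in the abstract, but the two-dimensional stack idea can be pictured as a postfix interpreter whose stack entries are whole vectors of fitness-case values, so a single traversal of a program evaluates every case at once; the inner per-case loop is what maps onto GPU threads or SIMD lanes. A serial sketch with an assumed, illustrative instruction set:

```cpp
// Sketch of a two-dimensional stack GP interpreter: each stack slot holds one
// value per fitness case, so one traversal of the program evaluates all cases.
#include <cstdio>
#include <vector>

enum Op { PUSH_X, PUSH_ONE, ADD, MUL };

int main() {
    std::vector<double> x = {1.0, 2.0, 3.0, 4.0};                   // fitness cases
    std::vector<Op> program = {PUSH_X, PUSH_X, MUL, PUSH_ONE, ADD}; // x*x + 1
    const int cases = (int)x.size();

    std::vector<std::vector<double>> stack;  // dim 1: depth, dim 2: cases
    for (Op op : program) {
        switch (op) {
            case PUSH_X:   stack.push_back(x); break;
            case PUSH_ONE: stack.push_back(std::vector<double>(cases, 1.0)); break;
            case ADD:
            case MUL: {
                std::vector<double> b = stack.back();
                stack.pop_back();
                std::vector<double>& a = stack.back();
                // This inner per-case loop is what maps onto GPU threads
                // (or SIMD lanes on a multi-core CPU).
                for (int c = 0; c < cases; ++c)
                    a[c] = (op == ADD) ? a[c] + b[c] : a[c] * b[c];
                break;
            }
        }
    }
    for (int c = 0; c < cases; ++c)
        std::printf("f(%g) = %g\n", x[c], stack.back()[c]);
    return 0;
}
```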
Big Data concerns large-volume, complex, growing data. Given the fast development of data storage and networking, organizations are collecting large, ever-growing datasets that can hold useful information. In order to extract information from these datasets within a useful time frame, it is important to use distributed and parallel algorithms. One common use of big data is machine learning, in which collected data is used to predict future behavior. Deep-learning using Artificial Neural Networks is one of the popular methods for extracting information from complex datasets, and is capable of creating more complex models than traditional probabilistic machine learning techniques. This work presents a step-by-step guide on how to prototype a Deep-Learning application that executes both on GPU and CPU clusters. Python and Redis are the core supporting tools of this guide. This tutorial will allow the reader to understand the basics of building a distributed high-performance GPU application in a few hours. Since we do not depend on any deep-learning application or framework (we use low-level building blocks), this tutorial can be adjusted for any other parallel algorithm the reader might want to prototype on Big Data. Finally, we discuss how to move from a prototype to a full-blown production application. (C) 2017 Elsevier Inc. All rights reserved.
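The guide itself builds on Python and Redis; keeping to C++ like the other sketches in this listing, the same coordination pattern (workers blocking on a shared Redis list for serialized tasks) can be sketched with the hiredis C client. The queue name "tasks" and the payload handling are made up for illustration and are not taken from the tutorial:

```cpp
// Worker side of a Redis-backed task queue, the coordination pattern the
// tutorial builds on; queue name and payload are illustrative assumptions.
// Link with -lhiredis.
#include <cstdio>
#include <hiredis/hiredis.h>

int main() {
    redisContext* c = redisConnect("127.0.0.1", 6379);
    if (c == nullptr || c->err) {
        std::fprintf(stderr, "could not connect to redis\n");
        return 1;
    }
    for (;;) {
        // Block until some producer LPUSHes a serialized task/minibatch.
        redisReply* r = (redisReply*)redisCommand(c, "BRPOP tasks 0");
        if (r == nullptr) break;  // connection dropped
        if (r->type == REDIS_REPLY_ARRAY && r->elements == 2) {
            // element[0] is the list name, element[1] the popped payload.
            std::printf("worker received: %s\n", r->element[1]->str);
            // ... run the GPU/CPU compute step on the payload here ...
        }
        freeReplyObject(r);
    }
    redisFree(c);
    return 0;
}
```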
Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory and the exploitation of shared-memory models. In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities. Our experimental results show that, despite our runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes and with codes generated by auto-parallelizing compilers. (C) 2017 Elsevier B.V. All rights reserved.
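The paper's actual algorithm is not reproduced in the abstract; the flavor of an exact runtime communication calculation for uniform accesses can be illustrated by intersecting each process's owned block with the index range its access expression touches, yielding precisely the remote elements to exchange once runtime sizes are known. Everything below (1-D array, block distribution, an A[i-1..i+1] stencil) is an assumed toy setting, not Trasgo's representation:

```cpp
// Toy runtime calculation of exact communications for a 1-D block-distributed
// array accessed with the uniform stencil A[i-1..i+1]; purely illustrative.
#include <algorithm>
#include <cstdio>

struct Range { long lo, hi; };  // half-open interval [lo, hi)

// Block of the n-element array owned by rank p out of nprocs.
Range owned(long n, int p, int nprocs) {
    long chunk = (n + nprocs - 1) / nprocs;
    return {p * chunk, std::min(n, (p + 1) * chunk)};
}

int main() {
    const long n = 100;
    const int nprocs = 4;
    for (int p = 0; p < nprocs; ++p) {
        Range own = owned(n, p, nprocs);
        // The access A[i-1..i+1] over owned iterations means this process
        // needs the following index range, known only at runtime:
        Range need = {std::max(0L, own.lo - 1), std::min(n, own.hi + 1)};
        for (int q = 0; q < nprocs; ++q) {
            if (q == p) continue;
            Range other = owned(n, q, nprocs);
            long lo = std::max(need.lo, other.lo);
            long hi = std::min(need.hi, other.hi);
            if (lo < hi)  // exact, coarse-grained message to post
                std::printf("rank %d receives [%ld,%ld) from rank %d\n", p, lo, hi, q);
        }
    }
    return 0;
}
```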
Hadoop on the datacentre is a popular analytical platform for enterprises. Cloud vendors host Hadoop clusters on the datacentre to provide high-performance analytical computing facilities to their customers, who demand a parallel programming model to deal with huge data. Effective cost/time management and efficient resource consumption among concurrent users must be the primary concern; without it, the key aspiration behind high-performance cloud computing would suffer. Workflows portray such high-performance applications in terms of individual jobs and the dependencies between them. Workflows can be scheduled on virtual machines (VMs) in the datacentre to make the best possible use of resources. In the authors' earlier work, a mechanism was proposed to pack and execute customer jobs as workflows on the Hadoop platform, minimising VM cost while executing the workflow jobs within their deadlines. In this work, the authors try to optimise other parameters, such as the load on the cloud, workflow response time, and resource usage effectiveness, by applying soft computing methods. Stochastic hill climbing (SHC) is a soft computing approach used to solve many optimisation problems. In this study, the SHC approach is employed to schedule workflow jobs to VMs and thereby optimise the above-mentioned parameters in the cloud datacentre.
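The abstract does not specify the neighbourhood move or cost function; a generic stochastic hill-climbing loop over job-to-VM assignments (with a makespan proxy standing in for the paper's multi-parameter objective, and all data made up) looks like this:

```cpp
// Generic stochastic hill climbing over job-to-VM assignments; the makespan
// cost below is a placeholder for the paper's multi-parameter objective.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

double cost(const std::vector<int>& assign, const std::vector<double>& job_len, int vms) {
    std::vector<double> load(vms, 0.0);
    for (int j = 0; j < (int)assign.size(); ++j) load[assign[j]] += job_len[j];
    return *std::max_element(load.begin(), load.end());  // makespan proxy
}

int main() {
    const int vms = 3;
    std::vector<double> job_len = {4, 2, 7, 1, 3, 5};
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick_job(0, (int)job_len.size() - 1);
    std::uniform_int_distribution<int> pick_vm(0, vms - 1);

    std::vector<int> assign(job_len.size(), 0);  // start: everything on VM 0
    double best = cost(assign, job_len, vms);
    for (int it = 0; it < 10000; ++it) {
        int j = pick_job(rng);
        int old_vm = assign[j];
        assign[j] = pick_vm(rng);                // random neighbour move
        double c = cost(assign, job_len, vms);
        if (c <= best) best = c;                 // keep improving moves
        else assign[j] = old_vm;                 // otherwise undo the move
    }
    std::printf("best makespan found: %g\n", best);
    return 0;
}
```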
The proliferation of parallel processing in shared-memory applications has encouraged the development of supporting frameworks such as OpenMP. OpenMP has become increasingly prevalent due to the simplicity with which it allows parallelism to be introduced elegantly and incrementally. However, it still lacks some high-level language features that are essential in object-oriented programming. One such mechanism is that of exception handling. In languages such as Java, exception handling has been an integral aspect of the language since its first release. For OpenMP to be truly embraced within this object-oriented community, essential object-oriented concepts such as exception handling need to be given some attention. The official OpenMP standard has little specification on error recovery, as the challenges of supporting exception-based error recovery in OpenMP extend to both the semantic specifications and the related runtime support. This paper proposes a systematic mechanism for exception handling with the co-use of OpenMP directives, based on a Java implementation of OpenMP. The concept of exception handling with OpenMP directives has been formalized and categorized. Hand in hand with this exception handling proposal, a flexible approach to thread cancellation is also proposed (as an extension of OpenMP directives) that supports this exception handling within parallel execution. The runtime support and its implementation are discussed. The evaluation shows that, while no significant overhead is introduced, the new approach provides a more elegant coding style that increases parallel development efficiency and software robustness.
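The paper's mechanism targets a Java implementation of OpenMP and is not reproduced here; for contrast, the baseline pattern it improves upon can be shown in C++ OpenMP, where an exception must not escape a parallel region: capture the first exception in a std::exception_ptr, request cancellation, and rethrow after the region. A minimal sketch:

```cpp
// Baseline C++/OpenMP pattern for the problem addressed above: an exception
// may not escape an OpenMP parallel region, so it is captured inside the
// region, cancellation is requested, and it is rethrown afterwards.
#include <cstdio>
#include <exception>
#include <stdexcept>

int main() {
    std::exception_ptr first_error = nullptr;

    #pragma omp parallel for
    for (int i = 0; i < 100; ++i) {
        try {
            if (i == 42) throw std::runtime_error("failure in iteration 42");
            // ... normal loop body ...
        } catch (...) {
            #pragma omp critical
            if (!first_error) first_error = std::current_exception();
            // Cooperative cancellation; honoured only when the runtime has
            // cancellation enabled (e.g. OMP_CANCELLATION=true).
            #pragma omp cancel for
        }
    }

    try {
        if (first_error) std::rethrow_exception(first_error);
        std::puts("completed without error");
    } catch (const std::exception& e) {
        std::printf("caught after the region: %s\n", e.what());
    }
    return 0;
}
```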
Computing a maximal independent set is an important step in many parallel graph algorithms. This article introduces ECL-MIS, a maximal independent set implementation that works well on GPUs. It includes key optimizations to speed up computation, reduce the memory footprint, and increase the set size. Its CUDA implementation requires fewer than 30 kernel statements, runs asynchronously, and produces a deterministic result. It outperforms the maximal independent set implementations of Pannotia, CUSP, and IrGL on each of the 16 tested graphs of various types and sizes. On a Titan X GPU, ECL-MIS is between 3.9 and 100 times faster (11.5 times, on average). ECL-MIS running on the GPU is also faster than the parallel CPU codes Ligra, Ligra+, and PBBS running on 20 Xeon cores, which it outperforms by 4.1 times, on average. At the same time, ECL-MIS produces maximal independent sets that are up to 52% larger (over 10%, on average) compared to these preexisting CPU and GPU implementations. Whereas these codes produce maximal independent sets that are, on average, about 15% smaller than the largest possible such sets, ECL-MIS sets are less than 6% smaller than the maximum independent sets.
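ECL-MIS's CUDA kernels are not reproduced in the abstract; the underlying priority-based scheme (in the style of Luby's algorithm) is: give every vertex a priority, and in each round any still-undecided vertex whose priority beats all undecided neighbours joins the set and excludes its neighbours. A serial sketch with random priorities (an assumption; the actual prioritization in ECL-MIS may differ):

```cpp
// Serial sketch of priority-based (Luby-style) maximal independent set
// computation, the algorithmic core that ECL-MIS parallelizes on the GPU.
#include <cstdio>
#include <random>
#include <vector>

enum Status { UNDECIDED, IN_SET, EXCLUDED };

int main() {
    // Small undirected toy graph as adjacency lists.
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
    const int n = (int)adj.size();

    std::vector<double> prio(n);
    std::mt19937 rng(1);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (double& p : prio) p = u(rng);  // random priority per vertex

    std::vector<Status> st(n, UNDECIDED);
    bool progress = true;
    while (progress) {  // each sweep corresponds to one parallel round
        progress = false;
        for (int v = 0; v < n; ++v) {
            if (st[v] != UNDECIDED) continue;
            bool local_max = true;  // does v beat all undecided neighbours?
            for (int w : adj[v])
                if (st[w] == UNDECIDED && prio[w] >= prio[v]) {
                    local_max = false;
                    break;
                }
            if (local_max) {
                st[v] = IN_SET;                        // v joins the set and
                for (int w : adj[v]) st[w] = EXCLUDED; // knocks out neighbours
                progress = true;
            }
        }
    }
    for (int v = 0; v < n; ++v)
        if (st[v] == IN_SET) std::printf("vertex %d is in the MIS\n", v);
    return 0;
}
```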
The power flow (PF) analysis provides the steady state of the power system and is key to the simulation of transmission networks. It is a tool commonly used by system operators to visualize the effect of generator settings on the network prior to making a change. In situations involving large networks, hundreds or even thousands of PF analyses may have to be run on the network before finding the optimal power dispatch. This process requires significant computation time and does not allow for rapid control of the network. To address this problem, this paper presents two parallel PF solvers that exploit the massively parallel architecture of graphics processing units (GPUs) in a hybrid GPU-central processing unit (CPU) computing environment, using compute unified device architecture (CUDA) and OpenMP, in order to significantly speed up the concurrent analysis of many instances of a network. Both implementations use sparse matrices, double-precision operations, and enforce the reactive power limit of generators. The parallel Gauss-Seidel (G-S) and Newton-Raphson (N-R) PF algorithms are tested on networks ranging from 4 to 2383 buses. The accuracy is validated using MATPOWER, and the maximum speedup achieved, compared with a sequential execution on CPU, is 45.2x for G-S and 17.8x for N-R.
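For reference, one G-S sweep updates each non-slack bus voltage in place from its scheduled injection and admittance row: V_i <- (1/Y_ii) * ((P_i - jQ_i)/conj(V_i) - sum over k != i of Y_ik * V_k). A dense, serial sketch on a made-up two-bus case (no sparse storage and no reactive-limit enforcement, unlike the paper's solvers):

```cpp
// Serial Gauss-Seidel power-flow sweep on a made-up two-bus system; the
// paper's solvers run many such network instances concurrently on the GPU.
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;

int main() {
    // Bus 0: slack (fixed voltage). Bus 1: PQ bus. One line of admittance y.
    const cd y(1.0, -10.0);
    std::vector<std::vector<cd>> Y = {{y, -y}, {-y, y}};  // bus admittance matrix
    std::vector<cd> V = {cd(1.0, 0.0), cd(1.0, 0.0)};     // flat start
    std::vector<cd> S = {cd(0.0, 0.0), cd(-0.5, -0.2)};   // injected P + jQ per bus

    for (int it = 0; it < 100; ++it) {
        for (int i = 1; i < (int)V.size(); ++i) {  // skip the slack bus
            cd sum(0.0, 0.0);
            for (int k = 0; k < (int)V.size(); ++k)
                if (k != i) sum += Y[i][k] * V[k];
            // V_i <- (1/Y_ii) * ((P_i - jQ_i)/conj(V_i) - sum_{k!=i} Y_ik V_k)
            V[i] = (std::conj(S[i]) / std::conj(V[i]) - sum) / Y[i][i];
        }
    }
    std::printf("V1 = %.4f %+.4fj, |V1| = %.4f\n",
                V[1].real(), V[1].imag(), std::abs(V[1]));
    return 0;
}
```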