ISBN (Print): 9783030049188; 9783030049171
SHMEM has a long history as a parallel programming model. It has been used extensively since 1993, starting with the Cray T3D systems. Over the past two decades, the SHMEM library implementation on Cray systems has evolved through different generations. The current generation of the SHMEM implementation for Cray XC and XK systems is called Cray SHMEM, a proprietary SHMEM implementation from Cray Inc. In this work, we provide an in-depth analysis of the need for a new SHMEM implementation and then introduce the next evolution of the Cray SHMEM implementation for current and future generations of Cray systems. We call this new implementation Cray OpenSHMEMX. We provide a brief design overview, along with a review of the functional and performance differences of Cray OpenSHMEMX compared with the existing Cray SHMEM implementation.
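For context on the programming model discussed above, the following is a minimal sketch of a one-sided put and barrier using the standard OpenSHMEM C API; it is a generic illustration and is not taken from Cray SHMEM or Cray OpenSHMEMX.

    /* Minimal OpenSHMEM sketch: each PE writes its rank into a symmetric
     * variable on its right-hand neighbour, then all PEs synchronize.
     * Compile with an OpenSHMEM wrapper, e.g. `oshcc ring.c` (illustrative). */
    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        static int from_left = -1;      /* symmetric: same address on every PE */

        shmem_init();
        int me    = shmem_my_pe();
        int npes  = shmem_n_pes();
        int right = (me + 1) % npes;

        /* One-sided put: deposit my rank into `from_left` on the next PE. */
        shmem_int_put(&from_left, &me, 1, right);

        shmem_barrier_all();            /* complete all puts before reading */
        printf("PE %d of %d received %d from its left neighbour\n",
               me, npes, from_left);

        shmem_finalize();
        return 0;
    }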
"In silico" experimentation allows us to simulate the effect of different therapies by handling model parameters. Although the computational simulation of tumors is currently a well-known technique, it is ho...
详细信息
ISBN (Print): 9783319987026; 9783319987019
"In silico" experimentation allows us to simulate the effect of different therapies by handling model parameters. Although the computational simulation of tumors is currently a well-known technique, it is however possible to contribute to its improvement by parallelizing simulations on computer systems of many and multi-cores. This work presents a proposal to parallelize a tumor growth simulation that is based on cellular automata by partitioning of the data domain and by dynamic load balancing. The initial results of this new approach show that it is possible to successfully accelerate the calculations of a known algorithm for tumor-growth.
The synthesis of electrically large, highly performing reflectarray antennas can be computationally very demanding, both from the analysis and from the optimization points of view. It therefore requires the combined use of numerical and hardware strategies to control the computational complexity and provide the needed acceleration. Recently, we have set up a multi-stage approach in which the first stage employs global optimization with a rough, computationally convenient modeling of the radiation, while the subsequent stages employ local optimization on gradually refined radiation models. The purpose of this paper is to show how reflectarray antenna synthesis can benefit from parallel computing on Graphics Processing Units (GPUs) using the CUDA language. In particular, parallel computing is adopted along two lines. First, the presented approach accelerates the Particle Swarm Optimization procedure exploited in the first stage. Second, it accelerates the computation of the field radiated by the reflectarray using a GPU-implemented Non-Uniform FFT routine that is used by all the stages. The numerical results show how the first stage of the optimization process is crucial to achieving, at an acceptable computational cost, a good starting point.
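As a rough illustration of the first-stage optimizer, the following C sketch shows the core update loop of a basic Particle Swarm Optimization; the objective function, dimensions, and coefficients are placeholders, and the paper's CUDA acceleration and reflectarray radiation model are not reproduced here.

    /* Basic PSO update loop (serial sketch). In a GPU implementation the
     * per-particle cost evaluations are the natural target for offloading. */
    #include <stdlib.h>

    #define N_PART 64
    #define DIM    8

    static double rnd(void) { return (double)rand() / RAND_MAX; }

    static double cost(const double x[DIM]) {      /* placeholder: sphere function */
        double s = 0.0;
        for (int d = 0; d < DIM; d++) s += x[d] * x[d];
        return s;
    }

    void pso(double x[N_PART][DIM], double v[N_PART][DIM],
             double pbest[N_PART][DIM], double pcost[N_PART],
             double gbest[DIM], double *gcost, int iters) {
        const double w = 0.7, c1 = 1.5, c2 = 1.5;   /* typical PSO coefficients */
        for (int t = 0; t < iters; t++) {
            for (int p = 0; p < N_PART; p++) {
                for (int d = 0; d < DIM; d++) {
                    v[p][d] = w * v[p][d]
                            + c1 * rnd() * (pbest[p][d] - x[p][d])
                            + c2 * rnd() * (gbest[d]    - x[p][d]);
                    x[p][d] += v[p][d];
                }
                double c = cost(x[p]);              /* expensive evaluation */
                if (c < pcost[p]) {                 /* update personal best */
                    pcost[p] = c;
                    for (int d = 0; d < DIM; d++) pbest[p][d] = x[p][d];
                }
                if (c < *gcost) {                   /* update global best */
                    *gcost = c;
                    for (int d = 0; d < DIM; d++) gbest[d] = x[p][d];
                }
            }
        }
    }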
ISBN (Print): 9781450368131
Dataflow execution models are used to build highly scalable parallel systems. A programming model that targets parallel dataflow execution must answer the following question: how can parallelism between two dependent nodes in a dataflow graph be exploited? This is difficult when the dataflow language or programming model is implemented by a monad, as is common in the functional community, since expressing dependence between nodes by a monadic bind suggests sequential execution. Even in monadic constructs that explicitly separate state from computation, problems arise due to the need to reason about opaquely defined state. Specifically, when the abstractions of the chosen programming model do not enable adequate reasoning about state, it is difficult to detect parallelism between composed stateful computations. In this paper, we propose a programming model that enables the composition of stateful computations while still exposing opportunities for parallelization. We also introduce smap, a higher-order function that can exploit parallelism in stateful computations. We present an implementation of our programming model and smap in Haskell and show that basic concepts from functional reactive programming can be built on top of our programming model with little effort. We compare these implementations to a state-of-the-art approach that uses monad-par and LVars to expose parallelism explicitly and reach the same level of performance, showing that our programming model successfully extracts the parallelism that is present in an algorithm. Further evaluation shows that smap is expressive enough to implement parallel reductions and that our programming model resolves shortcomings of the stream-based programming model used by current state-of-the-art big data processing systems.
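For readers unfamiliar with the pattern, this is what a conventional parallel reduction, one of the constructs the paper expresses with smap, looks like in C with OpenMP; it is offered only as a familiar baseline and does not reflect the paper's Haskell programming model.

    /* Conventional OpenMP parallel reduction (baseline for comparison only).
     * Compile with OpenMP enabled, e.g. `cc -fopenmp`. */
    #include <stdio.h>

    double sum_array(const double *a, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    int main(void) {
        double a[1000];
        for (int i = 0; i < 1000; i++) a[i] = 1.0;
        printf("%f\n", sum_array(a, 1000));   /* prints 1000.000000 */
        return 0;
    }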
ISBN (Print): 9781728153049
A many-core implementation of the multilevel fast multipole algorithm (MLFMA), based on the Athread parallel programming model, for computing electromagnetic scattering by a 3-D object on China's homegrown many-core SW26010 CPU is presented. In the proposed many-core implementation of the MLFMA, data-access efficiency is improved by using data structures based on a Structure-of-Arrays (SoA) layout. Adaptive workload-distribution strategies are adopted on different MLFMA tree levels to ensure full utilization of the computing capability and the scratchpad memory (SPM). A double-buffering scheme is specially designed to overlap communication with computation. The resulting Athread-based many-core implementation of the MLFMA is capable of solving real-life problems with over four hundred thousand unknowns with a remarkable speed-up. Numerical results show that, with the proposed parallel scheme, a total speed-up of more than 7x can be achieved compared with execution on the CPU master core.
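Two of the techniques named above, the Structure-of-Arrays layout and double buffering, are generic enough to sketch without the SW26010-specific Athread API. The C sketch below is illustrative only; the real implementation relies on Athread DMA transfers into scratchpad memory, which are not reproduced here.

    /* Structure-of-Arrays layout: each field lives in its own contiguous
     * array, which favours unit-stride and vectorized access compared with
     * an Array-of-Structures layout. */
    #include <string.h>

    typedef struct { float re, im, weight; } CoeffAoS;   /* AoS: one record per unknown */

    typedef struct {                                      /* SoA: one array per field */
        float *re;
        float *im;
        float *weight;
    } CoeffSoA;

    /* Double-buffering pattern: block b is processed while block b+1 is
     * staged into the other buffer. On the SW26010 the fetch would be an
     * asynchronous DMA into scratchpad memory; the plain memcpy here only
     * models the buffering structure, not the actual overlap. */
    #define BLOCK 256

    static void fetch_block(float *dst, const float *src, int b) {
        memcpy(dst, src + (size_t)b * BLOCK, BLOCK * sizeof(float));
    }

    static float process_block(const float *buf) {
        float acc = 0.0f;
        for (int i = 0; i < BLOCK; i++) acc += buf[i];
        return acc;
    }

    float double_buffered_sum(const float *data, int nblocks) {
        float buf[2][BLOCK];
        float total = 0.0f;
        fetch_block(buf[0], data, 0);                        /* prefetch block 0 */
        for (int b = 0; b < nblocks; b++) {
            if (b + 1 < nblocks)
                fetch_block(buf[(b + 1) & 1], data, b + 1);  /* stage next block */
            total += process_block(buf[b & 1]);              /* compute current */
        }
        return total;
    }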
ISBN (Print): 9781728147871
The importance of concurrent and distributed programming is increasing in Computer Science curricula. This exploratory research identifies additional notions required by the official topics of the "Parallel and Concurrent Programming" course taught at the University of Costa Rica. This paper characterizes the previous knowledge that students had about these notions and the extracurricular effort they made to overcome the missing ones. Findings show that students were able to overcome the lack of these notions at the expense of additional extracurricular effort. Exploratory evidence indicates that students' choice of professors in previous courses influenced their performance and extracurricular effort in the parallel programming course.
ISBN (Print): 9783030105495; 9783030105488
Since the computing world has become fully parallel, every software developer today should be familiar with the notion of "parallel algorithm structure." Whereas in recent years students have studied only a basic introduction to algorithms, today parallel algorithm structure must become a vital part of computer science education. In this work we present two years of experience teaching a "Supercomputer Modeling and Technologies" course and running practical assignments at the Computational Mathematics and Cybernetics faculty of Lomonosov Moscow State University, aimed at teaching students a methodology for analyzing the properties of parallel algorithms.
ISBN (Print): 9781728159751
Peachy parallel assignments are high-quality assignments for teaching parallel and distributed computing. They have been successfully used in class and are selected on the basis of their suitability for adoption and for being cool and inspirational for students. Here we present a fire fighting simulation, thread-to-core mapping on NUMA nodes, introductory cloud computing, interesting variations on prefix-sum, searching for a lost PIN, and Big Data analytics.
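One of the assignments listed above builds on prefix-sum; as a generic reference point, here is a minimal two-phase parallel inclusive prefix sum in C with OpenMP. It is a textbook-style sketch, not one of the Peachy assignments themselves.

    /* Two-phase parallel inclusive prefix sum: each thread scans its own
     * block, the block totals are scanned serially, and each thread then
     * adds its block offset. Compile with e.g. `cc -fopenmp`. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    void prefix_sum(long *a, int n) {
        int nt;
        long *offset;
        #pragma omp parallel
        {
            #pragma omp single
            {
                nt = omp_get_num_threads();
                offset = calloc(nt + 1, sizeof(long));
            }
            int t  = omp_get_thread_num();
            int lo = (int)((long long)n * t / nt);
            int hi = (int)((long long)n * (t + 1) / nt);

            for (int i = lo + 1; i < hi; i++)       /* phase 1: scan own block */
                a[i] += a[i - 1];
            if (hi > lo) offset[t + 1] = a[hi - 1]; /* record block total */
            #pragma omp barrier
            #pragma omp single
            for (int k = 1; k <= nt; k++)           /* scan the block totals */
                offset[k] += offset[k - 1];
            for (int i = lo; i < hi; i++)           /* phase 2: add block offset */
                a[i] += offset[t];
        }
        free(offset);
    }

    int main(void) {
        long a[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        prefix_sum(a, 10);
        for (int i = 0; i < 10; i++) printf("%ld ", a[i]);  /* 1 2 3 ... 10 */
        printf("\n");
        return 0;
    }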
With the decline of Moore's law and the ever-increasing availability of cheap, massively parallel hardware, it becomes more and more important to embrace parallel programming methods when implementing Agent-Based Simulations (ABS). This has been acknowledged in the field for some time, and a substantial body of research on distributed parallel ABS exists, focusing primarily on parallel Discrete Event Simulation as the underlying mechanism. However, these concepts and tools are inherently difficult to master and apply, and are often overkill when implementers simply want to parallelise their own custom agent-based model implementation. Moreover, with the programming languages established in the field (Python, Java and C++), it is not easy to address the complexities of parallel programming, due to unrestricted side effects and the intricacies of low-level locking semantics. Therefore, in this paper we propose a lock-free approach to parallel ABS using Software Transactional Memory (STM) in conjunction with the pure functional programming language Haskell, a combination which removes some of the problems and complexities of parallel implementations in imperative approaches. We present two case studies in which we compare the performance of lock-based and lock-free STM implementations of two different, well-known agent-based models, investigating both the scaling performance under an increasing number of CPU cores and under an increasing number of agents. We show that the lock-free STM implementations consistently outperform the lock-based ones and scale much better to an increasing number of CPU cores, both on local hardware and on Amazon EC. Further, by utilizing the pure functional language Haskell we gain the benefits of immutable data and a lack of unrestricted side effects guaranteed at compile time, making validation easier and leading to increased confidence in the correctness of an implementation, something of fundamental importance and benefit in parallel programming.
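To make the "intricacies of low-level locking semantics" mentioned above concrete, here is a minimal C/pthreads sketch of a lock-protected shared-environment update of the kind an imperative ABS typically requires; it illustrates the style of code the paper's Haskell/STM approach avoids and is not taken from the paper.

    /* Lock-based update of a shared environment by concurrent agents.
     * Every read-modify-write must be wrapped in explicit lock/unlock calls,
     * and forgetting one, or taking two locks in the wrong order, is a bug
     * the compiler will not catch; this is the hazard STM removes.
     * Compile with `-pthread`. */
    #include <pthread.h>
    #include <stdio.h>

    #define CELLS 64

    static int cell[CELLS];
    static pthread_mutex_t cell_lock[CELLS];

    void agent_move(int from, int to) {
        /* Lock in a fixed (ascending) order to avoid deadlock. */
        int a = from < to ? from : to;
        int b = from < to ? to : from;
        pthread_mutex_lock(&cell_lock[a]);
        if (a != b) pthread_mutex_lock(&cell_lock[b]);

        cell[from]--;                  /* critical section: move one agent */
        cell[to]++;

        if (a != b) pthread_mutex_unlock(&cell_lock[b]);
        pthread_mutex_unlock(&cell_lock[a]);
    }

    int main(void) {
        for (int i = 0; i < CELLS; i++) {
            pthread_mutex_init(&cell_lock[i], NULL);
            cell[i] = 1;
        }
        agent_move(3, 7);
        printf("cell[3]=%d cell[7]=%d\n", cell[3], cell[7]);  /* 0 and 2 */
        return 0;
    }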
ISBN (Digital): 9783030239763
ISBN (Print): 9783030239763; 9783030239756
We propose a stock market software architecture extended by a graphics processing unit (GPU), which employs parallel programming techniques to optimize long-running tasks such as computing daily trends and performing statistical analysis of stock market data in real time. The system uses the ability of Nvidia's CUDA parallel computing application programming interface (API) to integrate with traditional web development frameworks. The web application offers extensive statistics and stock information, which is periodically recomputed through scheduled batch jobs or calculated in real time. To illustrate the advantages of many-core programming, we explore several use cases and evaluate the improvement in performance and the speedup obtained in comparison with the traditional approach of executing long-running jobs on a central processing unit (CPU).
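As a concrete example of a "daily trend" computation of the kind such a system offloads, here is a plain C simple-moving-average routine in which each output value is independent and therefore maps naturally onto GPU threads; the CUDA kernel and the web-framework integration themselves are not sketched here, and all names are illustrative.

    /* Simple moving average over closing prices: out[i] is the mean of the
     * last `window` prices ending at day i. Each out[i] is independent, so
     * a GPU version can assign one thread per output element. */
    #include <stdio.h>

    void moving_average(const double *close, double *out, int days, int window) {
        for (int i = 0; i < days; i++) {
            int start = (i + 1 >= window) ? i + 1 - window : 0;
            double sum = 0.0;
            for (int j = start; j <= i; j++) sum += close[j];
            out[i] = sum / (i - start + 1);
        }
    }

    int main(void) {
        double close[] = {10, 11, 12, 13, 14, 15};
        double sma[6];
        moving_average(close, sma, 6, 3);
        for (int i = 0; i < 6; i++) printf("%.2f ", sma[i]);
        printf("\n");   /* 10.00 10.50 11.00 12.00 13.00 14.00 */
        return 0;
    }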