Details
ISBN:
(Print) 9781450371964
Parallel programming methodologies are fundamentally dissimilar to those of conventional programming, and software developers without the requisite skillset often find it difficult to adapt to these new methods. This is particularly true for parallel programming in a distributed address space, which is necessary for any meaningful degree of scalability. As such, an approach that combines a more intuitive interface with excellent performance within the distributed address space model is desired. In this work, we present our initial API design and implementation, as well as the underlying algorithms, for a collective communication library built for the Extended Base Global Address Space (xBGAS) extension to the RISC-V instruction set architecture. Our runtime library is designed to implement the Partitioned Global Address Space (PGAS) model in an attempt to alleviate the difficulty associated with traditional distributed address space programming, while the underlying collective implementation is formulated to preserve, and even improve, performance relative to traditional solutions.
Details
ISBN:
(Print) 9781728114361
Popular language extensions for parallel programming such as OpenMP or CUDA require considerable compiler support and runtime libraries and are therefore only available for a few programming languages and/or targets. We present an approach to vectorizing kernels written in an existing general-purpose language that requires minimal changes to compiler front-ends. Programmers annotate parallel (SPMD) code regions with a few intrinsic functions, which then guide an ordinary automatic vectorization algorithm. This mechanism allows programming SIMD and vector processors effectively while avoiding much of the implementation complexity of more comprehensive and powerful approaches to parallel programming. Our prototype implementation, based on a custom vectorization pass in LLVM, is integrated into C, C++ and Rust compilers using only 29-37 lines of frontend-specific code each.
Details
ISBN:
(Print) 9781450371896
The success of Deep Learning (DL) algorithms in computer vision tasks has created an ongoing demand for dedicated hardware architectures that can keep up with their required computation and memory complexities. This task is particularly challenging when embedded smart camera platforms have constrained resources such as power consumption, Processing Elements (PEs), and communication. This article describes a heterogeneous system embedding an FPGA and a GPU for executing CNN inference for computer vision applications. The built system addresses some challenges of embedded CNN such as task and data partitioning, and workload balancing. The selected heterogeneous platform embeds an Nvidia (R) Jetson TX2 for the CPU-GPU side and an Intel Altera (R) Cyclone10GX for the FPGA side, interconnected by PCIe Gen2, with a MIPI-CSI camera for prototyping. This test environment will be used as a support for future work on a methodology for optimized model partitioning.
Details
ISBN:
(Print) 9781450362597
Parallel programming skills are becoming more popular due to the unprecedented boom in artificial intelligence and high-performance computing. Programming assignments are widely used in parallel programming courses to measure student performance and expose students to constraints in real projects. However, due to the difficulty level of these assignments, many students struggle to write fully functional and adequately documented programs. To improve student performance, we implemented a moderated two-stage format for five course projects in a graduate-level introductory parallel programming class. Each project is divided into two stages: students complete the assignment individually without any collaboration in the first stage, then work in pairs on the same project in the second stage, so they can review each other's work from the first stage and improve their programs collaboratively. For two of the five projects, a moderated meeting is conducted between the two stages, in which the instructor moderates a group discussion on general issues raised by students. We found that students' performance improved from stage one to stage two. In addition, the two projects with a moderated meeting show better performance gains. This paper also examines students' perceptions of and experiences with the moderated two-stage projects. Students favor working on two-stage projects because they have a chance to discuss challenging concepts, and the moderated discussion session tends to guide them to the correct path should they make mistakes in stage one.
Details
ISBN:
(Print) 9783030011741; 9783030011734
The primary purpose of parallel streams in the recent release of Java 8 is to help Java programs make better use of multi-core processors for improved performance. However, in some cases, parallel streams can actually perform considerably worse than ordinary sequential Java code. This paper presents a Map-Reduce parallel programming pattern for Java parallel streams that produces good speedup over sequential code. An important component of the Map-Reduce pattern is two optimizations: grouping and locality. Three parallel application programs are used to illustrate the Map-Reduce pattern and its optimizations: Histogram of an Image, Document Keyword Search, and Solution to a Differential Equation. A proposal is included for a new terminal stream operation for the Java language called MapReduce() that applies this pattern and its optimizations automatically.
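A minimal sketch of the kind of Map-Reduce pattern with concurrent grouping the abstract describes, applied to its histogram example (the class name, bin count, and use of `groupingByConcurrent` are illustrative assumptions, not the paper's actual code):

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapReduceHistogram {
    // Hypothetical sketch: map each 8-bit pixel value to a bin index, then
    // reduce by counting per bin. groupingByConcurrent lets parallel stream
    // workers merge partial counts into one concurrent map, avoiding the
    // merge overhead of per-thread intermediate maps.
    static Map<Integer, Long> histogram(int[] pixels, int bins) {
        return IntStream.of(pixels)
                .parallel()
                .boxed()
                .collect(Collectors.groupingByConcurrent(
                        p -> p * bins / 256,      // map: pixel -> bin index
                        Collectors.counting()));  // reduce: count per bin
    }

    public static void main(String[] args) {
        int[] pixels = {0, 10, 130, 200, 255, 255};
        System.out.println(histogram(pixels, 4));
    }
}
```

Locality, the paper's second optimization, would additionally require each worker to process a contiguous slice of the pixel array; the range-splitting done by `IntStream.parallel()` over an array already approximates this.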
Details
ISBN:
(Print) 9781728147895
Nowadays, the development of various distributed STMs, which aid parallel programming of distributed systems, attracts the interest of many researchers. In this paper, we develop a Python distributed STM based on data replication, which provides better performance as well as tolerance to replica faults. The solution supports both eventual and sequential data consistency. Experimental results show that reading t-variables from a local replica is up to 16 times faster than reading them from the base replica.
Details
ISBN:
(Print) 9781450366328
Many systems used in the HPC field have multiple accelerators on a single compute node. However, programming for multiple accelerators is more difficult than programming for a single accelerator. Therefore, in this paper, we propose an OpenMP extension that allows easy programming for multiple accelerators. We extend existing OpenMP syntax to create a Partitioned Global Address Space (PGAS) across the separate memories of several accelerators. This feature enables users to program multiple accelerators with ease. In the performance evaluation, we implement the STREAM Triad and HIMENO benchmarks using the proposed OpenMP extension. As a result of evaluating the performance on a compute node equipped with up to four GPUs, we confirm that the proposed OpenMP extension demonstrates sufficient performance.
Details
ISBN:
(Print) 9781728153049
A many-core implementation of the multilevel fast multipole algorithm (MLFMA) based on the Athread parallel programming model for computing electromagnetic scattering by a 3-D object on China's homegrown many-core SW26010 CPU is presented. In the proposed many-core implementation of the MLFMA, data access efficiency is improved by using data structures based on the Structure-of-Arrays (SoA) layout. Adaptive workload distribution strategies are adopted on different MLFMA tree levels to ensure full utilization of the computing capability and the scratchpad memory (SPM). A double-buffering scheme is specially designed to overlap communication with computation. The resulting Athread-based many-core implementation of the MLFMA is capable of solving real-life problems with over four hundred thousand unknowns with a remarkable speed-up. Numerical results show that with the proposed parallel scheme, a total speed-up larger than 7 times can be achieved compared with the CPU master core.
Details
ISBN:
(Print) 9781450376389
Structured parallel programming has been studied and applied in several programming languages. This approach has proven suitable for abstracting low-level and architecture-dependent parallelism implementations. Our goal is to provide a structured, high-level library for the Rust language, targeting parallel stream processing applications for multi-core servers. Rust is an emerging programming language developed by the Mozilla Research group, focusing on performance, memory safety, and thread safety. However, it lacks parallel programming abstractions, especially for stream processing applications. This paper contributes a new API based on the structured parallel programming approach to simplify parallel software development. Our experiments highlight that our solution provides higher-level parallel programming abstractions for stream processing applications in Rust. We also show that the throughput and speedup are comparable to the state of the art for certain workloads.
Details
The synthesis of electrically large, highly performing reflectarray antennas can be computationally very demanding, both from the analysis and from the optimization points of view. It therefore requires the combined usage of numerical and hardware strategies to control the computational complexity and provide the needed acceleration. Recently, we have set up a multi-stage approach in which the first stage employs global optimization with a rough, computationally convenient modeling of the radiation, while the subsequent stages employ local optimization on gradually refined radiation models. The purpose of this paper is to show how reflectarray antenna synthesis can benefit from parallel computing on Graphics Processing Units (GPUs) using the CUDA language. In particular, parallel computing is adopted along two lines. First, the presented approach accelerates a Particle Swarm Optimization procedure exploited for the first stage. Second, it accelerates the computation of the field radiated by the reflectarray using a GPU-implemented Non-Uniform FFT routine which is used by all the stages. The numerical results show how the first stage of the optimization process is crucial to achieve, at an acceptable computational cost, a good starting point.