Component-oriented programming has been applied to large-scale applications from computational sciences and engineering that have high performance computing (HPC) requirements. However, parallelism continues to be a challenging requirement in the design of CBHPC (Component-Based High Performance Computing) platforms. This paper presents strong evidence for the efficacy and efficiency of HPE (Hash Programming Environment), a CBHPC platform that provides full support for parallel programming, in the development, deployment, and execution of numerical simulation code on cluster computing platforms.
ISBN (digital): 9781665488020
ISBN (print): 9781665488020
Programming parallel architectures from a hierarchical point of view is becoming today's standard, as machines are structured by multiple layers of memory. To handle such architectures, we focus on the MULTI-BSP bridging model. This model extends BSP and proposes a structured way of programming multi-level architectures. In this context of parallel programming, we now need to manage new concerns such as memory coherence, deadlocks, and safe data communication. To do so, we propose a type system for MULTI-ML, an ML-like programming language based on the MULTI-BSP model. This type system introduces data locality using type annotations and effects in order to detect incorrect uses of multi-level architectures. We thus ensure that "Well-typed programs cannot go wrong" on hierarchical architectures.
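The abstract does not show MULTI-ML syntax, but the core idea of rejecting cross-level data access at compile time via locality annotations can be sketched in C++ with a phantom template parameter standing in for the type-and-effect annotation. This is a loose analog, not the paper's type system; all names below are illustrative.

```cpp
// Hypothetical C++ analog of locality annotations: every value carries the
// memory level it lives at, so cross-level accesses fail to compile.
enum class Level { Node, Core };  // two levels of a MULTI-BSP hierarchy

template <Level L, typename T>
struct Local {
    T value;  // data pinned to memory level L
};

// A computation declared to run at core level may only touch core-local data.
template <typename T>
T core_step(const Local<Level::Core, T>& x) {
    return x.value + 1;
}

int main() {
    Local<Level::Core, int> c{41};
    Local<Level::Node, int> n{0};
    core_step(c);      // OK: locality annotations agree
    // core_step(n);   // compile error: node-local data used at core level
    return 0;
}
```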
ISBN (print): 9781665469586
The triangular Shepard interpolation method is an extension of the well-known bivariate Shepard method for interpolating large sets of scattered data. In particular, the classical point-based weight functions are replaced by basis functions built upon a triangulation of the scattered points. As shown in the literature, this method exhibits advantages over other interpolation methods for scattered bivariate data. Nevertheless, as the size of the data set increases, an efficient implementation of the method becomes more and more necessary. In this paper, we present a parallel implementation of the triangular Shepard interpolation method that, besides exploiting the benefits of parallelization itself, introduces a novel approach to the triangulation of the scattered data.
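As a concrete reference point, here is a serial sketch of evaluating a triangular Shepard interpolant at one query point, following the formulation common in the literature (Little's method): each triangle contributes its linear interpolant, weighted by a normalized product of inverse distances to its three vertices. The names and the exponent mu are assumptions for illustration, not code from the paper.

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y, f; };   // scattered datum (x, y, f(x, y))
struct Triangle { int v[3]; };      // vertex indices into the point set

double interpolate(double qx, double qy,
                   const std::vector<Point>& pts,
                   const std::vector<Triangle>& tris,
                   double mu = 2.0) {
    double num = 0.0, den = 0.0;
    // Each triangle's contribution is independent, so a parallel version
    // can split this loop across threads with a reduction over num/den.
    // (Query points coinciding with data points need special handling,
    // omitted here.)
    for (const Triangle& t : tris) {
        double w = 1.0;
        for (int k = 0; k < 3; ++k) {
            const Point& p = pts[t.v[k]];
            double d2 = (qx - p.x) * (qx - p.x) + (qy - p.y) * (qy - p.y);
            w *= std::pow(d2, -mu / 2.0);   // inverse-distance factor
        }
        // Barycentric evaluation of the linear interpolant on triangle t.
        const Point &a = pts[t.v[0]], &b = pts[t.v[1]], &c = pts[t.v[2]];
        double det = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        double l1 = ((b.x - qx) * (c.y - qy) - (c.x - qx) * (b.y - qy)) / det;
        double l2 = ((c.x - qx) * (a.y - qy) - (a.x - qx) * (c.y - qy)) / det;
        double lin = l1 * a.f + l2 * b.f + (1.0 - l1 - l2) * c.f;
        num += w * lin;
        den += w;
    }
    return num / den;
}
```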
We introduce SpDISTAL, a compiler for sparse tensor algebra that targets distributed systems. SpDISTAL combines separate descriptions of tensor algebra expressions, sparse data structures, data distribution, and compu...
Frequent subgraph mining (FSM) is a subset of the graph mining domain that is extensively used for graph classification and clustering. Over the past decade, many efficient FSM algorithms have been developed, with improvements generally focused on reducing time complexity by changing the algorithm structure or using parallel programming techniques. FSM algorithms also have high memory consumption, which is another problem that should be solved. In this paper, we propose a new approach called Predictive Dynamic Sized Structure Packing (PDSSP) to minimize the memory needs of FSM algorithms. Our approach redesigns the internal data structures of FSM algorithms without making algorithmic modifications. PDSSP offers two contributions: the first is the Dynamic Sized Integer Type, a newly designed unsigned integer data type, and the second is a data structure packing technique that changes the behavior of the compiler. We examined the effectiveness and efficiency of the PDSSP approach by experimentally embedding it into two state-of-the-art algorithms, gSpan and ***, and compared our implementations to the performance of the originals. Nearly all results show that our proposed implementation consumes less memory at each support level, suggesting that PDSSP extensions could save memory, with peak memory usage decreasing by up to 38% depending on the dataset.
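The abstract does not disclose PDSSP's exact layouts, but the two mechanisms it names correspond to familiar C/C++ techniques: choosing the smallest integer width a field's value range needs, and packing structures so the compiler inserts no alignment padding. The record below is a hypothetical example of the kind of saving involved (sizes assume a typical LP64 platform).

```cpp
#include <cstdint>

// Naive edge record in a subgraph-mining code: 24 bytes after padding.
struct EdgeNaive {
    int      from;      // 4 bytes
    int      to;        // 4 bytes
    long     label;     // 8 bytes on LP64
    bool     forward;   // 1 byte + 7 bytes of alignment padding
};

// Packed variant: widths shrunk to the ranges actually needed (vertex ids
// below 2^16, labels below 2^8 -- purely an example), padding suppressed.
#pragma pack(push, 1)
struct EdgePacked {
    uint16_t from;      // 2 bytes
    uint16_t to;        // 2 bytes
    uint8_t  label;     // 1 byte
    uint8_t  forward;   // 1 byte
};                      // 6 bytes total instead of 24
#pragma pack(pop)

static_assert(sizeof(EdgePacked) == 6, "no padding inserted");
```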
ISBN (print): 9781665497473
The presentation of Peachy Parallel Assignments in several workshops on parallel and distributed computing education aims to promote the reuse of high-quality assignments, both saving precious faculty time and improving the quality of course assignments. Presented assignments are selected competitively: they must have been successfully used in a real classroom, be easy for other instructors to adopt, and be "cool and inspirational" to encourage students to spend time on them and talk about them with others. Winning assignments are also archived on the Peachy Parallel Assignments website. In this installment of Peachy Parallel Assignments, we present three new assignments. The first assignment is to simulate an Abelian sandpile, with grains of sand moving from tall piles to shorter ones; this is a discrete simulation that creates colorful and intricate images. The second assignment is a Big Data problem in which students use the MapReduce paradigm to recreate "Warming Stripes", a visualization of climate data that highlights climate change. The third assignment introduces climate-oriented optimization by asking students to schedule distributed workflows to minimize their carbon footprint.
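For the first assignment, the core update is a toppling sweep; here is a minimal serial sketch. The 4-grain threshold and double-buffered grid are the standard formulation of the Abelian sandpile; the assignment's exact specification may differ. Because each cell reads only the previous grid, the sweep parallelizes directly across cells.

```cpp
#include <utility>
#include <vector>

// One sweep of the Abelian sandpile: a cell holding 4 or more grains
// topples, sending one grain to each of its four neighbors (grains that
// fall off the boundary are lost). Returns whether anything changed.
bool topple_sweep(std::vector<std::vector<int>>& grid) {
    const int n = static_cast<int>(grid.size());
    bool changed = false;
    std::vector<std::vector<int>> next = grid;  // double buffer
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (grid[i][j] >= 4) {              // unstable pile
                next[i][j] -= 4;
                if (i > 0)     next[i - 1][j] += 1;
                if (i < n - 1) next[i + 1][j] += 1;
                if (j > 0)     next[i][j - 1] += 1;
                if (j < n - 1) next[i][j + 1] += 1;
                changed = true;
            }
        }
    }
    grid = std::move(next);
    return changed;  // repeat sweeps until the configuration is stable
}
```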
There has been rapid growth in the field of graphical processing unit (GPU) programming due to the drastic increase in the computing hardware manufacturing. The technology used in these devices is now more affordable ...
ISBN (print): 9781450393393
The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, machine learning, and other areas necessitates efficient compiler and runtime support for a growing number of different platforms. Existing SYCL implementations provide support for various devices like CPUs, GPUs, DSPs, FPGAs, etc., typically via OpenCL or CUDA backends. While accelerators have increased the performance of user applications significantly, employing CPU devices for further performance improvement is beneficial due to the significant presence of CPUs in existing datacenters. SYCL applications on CPUs currently go through an OpenCL backend. Though an OpenCL backend is valuable for supporting accelerators, it may introduce additional overhead for CPUs, since the host and device are the same. Overheads such as runtime compilation of the kernel, transfer of input/output memory to/from the OpenCL device, and invocation of the OpenCL kernel may not be necessary when running on the CPU. While some of these overheads (such as data transfer) can be avoided by modifying the application, doing so can undermine the SYCL application's ability to achieve performance portability on other devices. In this paper, we propose an alternative approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU-directed compilation flow, along with the integration of Whole Function Vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU-directed compilation flow, against an OpenCL backend for existing SYCL-based applications, with no code modification. We run experiments across various CPU architectures to attest to the efficacy of our proposed approach.
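For context, here is a minimal SYCL 2020 kernel of the kind at issue, written against the standard API and explicitly selecting the CPU device. This is a generic vector-add, not code from the paper; with the proposed CPU-directed flow, the same source would compile to host and device code in one translation unit rather than round-tripping through an OpenCL backend.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q{sycl::cpu_selector_v};  // target the CPU device
    {
        // Buffers wrap host memory; results copy back on buffer destruction.
        sycl::buffer<float> ba{a.data(), sycl::range<1>{n}};
        sycl::buffer<float> bb{b.data(), sycl::range<1>{n}};
        sycl::buffer<float> bc{c.data(), sycl::range<1>{n}};
        q.submit([&](sycl::handler& h) {
            sycl::accessor xa{ba, h, sycl::read_only};
            sycl::accessor xb{bb, h, sycl::read_only};
            sycl::accessor xc{bc, h, sycl::write_only};
            h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                xc[i] = xa[i] + xb[i];  // one work-item per element
            });
        });
    }  // buffer destructors synchronize and write results back into c
    return c[0] == 3.0f ? 0 : 1;
}
```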
ISBN (print): 9781665409346
While parallel hardware has become common, most typical engineers and scientists tend to follow the traditional single-core processing approach, causing major drawbacks in their developments. In this study, to fully utilize the available computing power, we present practical parallelization approaches for typical engineers. We implement a BRDF estimation algorithm, pursuing parallelism at various levels using CUDA. Experiments with a set of real environmental data show that even a simple parallelization can drastically improve performance: the speedup ranges from 4.93 to 64.10, depending on the parallelization approach and the problem size. A little effort in parallel programming can deliver efficient computing.
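The paper's implementation uses CUDA; as a portable stand-in, the sketch below expresses the same embarrassingly parallel pattern (one independent estimate per sample) with C++17 parallel algorithms. In a CUDA version, the same lambda body would map onto one GPU thread per sample. Sample and estimate_brdf are hypothetical names, not the paper's API.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

struct Sample { float in[3], out[3], radiance; };  // hypothetical record

// Placeholder for the real per-sample estimation work.
float estimate_brdf(const Sample& s) { return 0.5f * s.radiance; }

std::vector<float> estimate_all(const std::vector<Sample>& samples) {
    std::vector<float> result(samples.size());
    // Each sample is independent, so the transform can use all cores.
    std::transform(std::execution::par_unseq,
                   samples.begin(), samples.end(), result.begin(),
                   [](const Sample& s) { return estimate_brdf(s); });
    return result;
}
```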
The graphics processing unit (GPU) is an ideal solution to problems involving parallel data computations. A serial CPU-based program for the dynamic analysis of multi-body systems is rebuilt as a parallel program that exploits the GPU's advantages. We developed an analysis code named GMAP to investigate how the dynamic analysis algorithm for multi-body systems can be implemented with GPU parallel programming. The numerical accuracy of GMAP is compared with the commercial program MSC/ADAMS, and its numerical efficiency is compared with the sequential CPU-based program. Multiple pendulums with bodies and joints, and a net-shaped system with bodies and spring-dampers, are employed for the computer simulations. The simulation results indicate that the accuracy of GMAP's solution is the same as that of ADAMS. For the net-type system with 2370 spring-dampers, GMAP reduces computation time by about 566.7 seconds (a 24.7% improvement). Notably, the larger the system, the better the time efficiency.
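The abstract does not detail GMAP's kernels, but one step of such a simulation that maps naturally onto a GPU is evaluating every spring-damper force independently. The sketch below shows that pattern with a C++ parallel algorithm on a 1-D toy force law; SpringDamper and its fields are hypothetical, not GMAP's data structures.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

struct SpringDamper {
    int i, j;           // indices of the two connected bodies
    float k, c, rest;   // stiffness, damping coefficient, rest length
};

// Per-spring force on a 1-D toy model:
//   f = -k * (x_i - x_j - rest) - c * (v_i - v_j)
// Writing one value per spring keeps the parallel pass free of data races;
// a second (gather) pass then sums the forces acting on each body.
std::vector<float> spring_forces(const std::vector<SpringDamper>& springs,
                                 const std::vector<float>& x,
                                 const std::vector<float>& v) {
    std::vector<float> f(springs.size());
    std::transform(std::execution::par, springs.begin(), springs.end(),
                   f.begin(), [&](const SpringDamper& s) {
                       return -s.k * (x[s.i] - x[s.j] - s.rest)
                              - s.c * (v[s.i] - v[s.j]);
                   });
    return f;
}
```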