In order to take advantage of the processing power of current computing platforms, programmers typically need to develop software versions for different target devices. This task is time-consuming and requires significant programming and computer architecture expertise. A possible and more convenient alternative is to start with a single high-level description of a program with minimum implementation details, and generate custom implementations according to the target platform. In this paper, we use MATLAB as a high-level programming language and propose a compiler that targets CPU/GPU computing platforms by generating customized implementations in C and OpenCL. We propose a number of compiler techniques to automatically generate efficient C and OpenCL code from MATLAB programs. One such technique relies on heuristics to decide when and how to use Shared Virtual Memory (SVM). The experimental results show that our approach is able to generate code that provides significant speedups (e.g., a geometric mean speedup of 11x for a set of simple benchmarks) using a discrete GPU over equivalent sequential C code executing on a CPU. With more complex benchmarks, for which only some code regions can be parallelized and are thus offloaded, the generated code achieved speedups of up to 2.2x. We also show the impact of using SVM, specifically fine-grained buffers, and the results show that the compiler is able to achieve significant speedups, both over the versions without SVM and with naive aggressive SVM use, across three CPU/GPU platforms.
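To make the reported metric concrete: a geometric mean speedup is the n-th root of the product of the per-benchmark speedups. The Python sketch below shows one way to compute it. The function name and the sample timings are illustrative assumptions, not data or code from the paper.

```python
import math


def geometric_mean_speedup(cpu_times, gpu_times):
    """Geometric mean of per-benchmark speedups (cpu_time / gpu_time)."""
    if len(cpu_times) != len(gpu_times) or not cpu_times:
        raise ValueError("need two non-empty lists of equal length")
    speedups = [c / g for c, g in zip(cpu_times, gpu_times)]
    # Average in log space so long products cannot overflow.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical timings in seconds, made up for illustration only.
cpu = [8.0, 2.0, 4.0]
gpu = [1.0, 1.0, 1.0]
result = geometric_mean_speedup(cpu, gpu)
print(result)
```

The geometric mean is the usual choice for summarizing ratios such as speedups, because one very large ratio does not dominate it the way it would dominate an arithmetic mean.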
Haskell is a modern, functional programming language with an interesting story to tell about parallelism: rather than using concurrent threads and locks, Haskell offers a variety of libraries that enable concise, high-level parallel programs with results that are guaranteed to be deterministic (independent of the number of cores and the scheduling being used).
Global data movement is the most general, and therefore important, function of inter-node communication in partitioned global address space programming models such as XcalableMP. Our implementation consists of compile-time and run-time optimization for specific cases and, for general cases, run-time processing based on the calculus of common-stride section descriptors, which allows efficient construction of communication schedules for global data movement. An evaluation on the K computer and a common Linux cluster verifies that the implementation is effective and useful as a compiler feature in most cases. (C) 2020 Elsevier B.V. All rights reserved.
Molecular diffusion plays a vital role in production from fractured reservoirs in all stages of recovery, especially for fractured reservoirs with small matrix sizes and unfavorable wettability conditions. Molecular diffusion can only be simulated by compositional reservoir simulators, which have historically employed a decoupled phase equilibrium-mass transfer model. Despite its higher performance, such a model cannot properly simulate intra- and cross-phase molecular diffusion. In the current research, a compositional fractured reservoir simulator, called Osiris, has been developed in C++ using the coupled formulation. After presenting the primary equations and algorithms, the performance of Osiris is evaluated through a series of case studies. By utilizing MPI, Osiris keeps its runtime reasonable despite the high computational demand of coupled modeling. Additionally, the simulation results of Osiris clearly demonstrate the precision of the coupled modeling and the considerable effects of diffusive mass transfer on fractured reservoir performance.
Recent years have seen rapid growth in data-driven distributed systems, such as Hadoop MapReduce, Spark, and Dryad. However, the counterparts for high-performance or compute-intensive applications, including large-scale optimizations, modeling, and simulations, are still nascent. In this paper, we introduce DtCraft, a modern C++ based distributed execution engine to streamline the development of high-performance parallel applications. Users need no understanding of distributed computing and can focus on high-level development, leaving difficult details, such as concurrency control, workload distribution, and fault tolerance, to be handled transparently by our system. We have evaluated DtCraft on both micro-benchmarks and large-scale optimization problems, and shown promising performance from single multicore machines to clusters of computers. In a particular semiconductor design problem, we achieved a 30x speedup with 40 nodes and 15x less development effort compared with a hand-crafted implementation.
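As a small worked example of how to read the scaling figure above: parallel efficiency is speedup divided by the number of workers. The sketch below assumes the 30x speedup is measured against a single-node baseline, which the abstract does not state. The function name is an illustrative choice.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved: speedup / workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Figures taken from the abstract: 30x speedup on 40 nodes.
efficiency = parallel_efficiency(30, 40)
print(efficiency)  # 0.75
```

Under that assumption, the reported run reaches 75% of ideal linear scaling.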
ISBN: 9781450359337 (print)
Bulk Synchronous Parallel (BSP) is a simple but powerful high-level model for parallel computation. Using BSPlib, programmers can write BSP programs in the general-purpose language C. Direct Remote Memory Access (DRMA) communication in BSPlib is enabled using registrations: associations between the local memories of all processes in the BSP computation. However, the semantics of registration is non-trivial and ambiguously specified, and thus its faulty usage is a potential source of errors. We give a formal semantics of BSPlib with which we characterize correct registration. Anticipating a static analysis, we give a simplified programming model that guarantees correct registration usage, drawing upon previous work on textual alignment.
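The idea of a registration can be illustrated with a toy, single-threaded model in Python. This is a sketch under simplifying assumptions, not BSPlib and not the formal semantics of the paper: the class `ToyBSP` and its method names are hypothetical, registrations are matched across processes by the order in which they are requested, and both registrations and remote writes take effect only at the next synchronization.

```python
class ToyBSP:
    """Toy model of BSP registration and buffered remote writes (not BSPlib)."""

    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.memory = [dict() for _ in range(nprocs)]    # local arrays per process
        self.pending_regs = [[] for _ in range(nprocs)]  # requests made this superstep
        self.registrations = []                          # slot -> one array name per process
        self.pending_puts = []                           # (dst_pid, slot, offset, values)

    def push_reg(self, pid, name, size):
        """Process `pid` asks to register a local array; active after sync()."""
        self.memory[pid][name] = [0] * size
        self.pending_regs[pid].append(name)

    def put(self, dst_pid, slot, offset, values):
        """Buffer a remote write into the area that `dst_pid` holds for `slot`."""
        if slot >= len(self.registrations):
            raise RuntimeError("registration is not active until after a sync")
        self.pending_puts.append((dst_pid, slot, offset, list(values)))

    def sync(self):
        """End the superstep: activate registrations, then deliver puts."""
        if len({len(r) for r in self.pending_regs}) != 1:
            raise RuntimeError("processes requested different numbers of registrations")
        # The i-th request of every process forms one registration.
        self.registrations.extend(zip(*self.pending_regs))
        self.pending_regs = [[] for _ in range(self.nprocs)]
        for dst_pid, slot, offset, values in self.pending_puts:
            target = self.memory[dst_pid][self.registrations[slot][dst_pid]]
            if offset < 0 or offset + len(values) > len(target):
                raise RuntimeError("put outside the registered area")
            target[offset:offset + len(values)] = values
        self.pending_puts = []


bsp = ToyBSP(2)
bsp.push_reg(0, "x", 4)
bsp.push_reg(1, "y", 4)
bsp.sync()                             # slot 0 now associates x (process 0) with y (process 1)
bsp.put(1, 0, 1, [7, 8])               # buffered write into process 1's area
before = list(bsp.memory[1]["y"])      # still all zeros: the put is not delivered yet
bsp.sync()
after = list(bsp.memory[1]["y"])
print(before, after)
```

In this toy model, the kind of faulty usage the abstract alludes to shows up as a runtime error, for example when only one process requests a registration before a synchronization.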
ISBN: 9781450371919 (print)
Parallel programming skills may take a long time to acquire. "Thinking in parallel" is a skill that requires time, effort, and experience. In this work, we propose to facilitate the learning process in parallel programming through the use of instant messaging by students. Our aim is to find out whether students' interaction through instant messaging is beneficial for the learning process. We asked several students of an HPC course of the Master's degree in Computer Science to develop a specific parallel application, each of them using a different application programming interface: OpenMP, MPI, CUDA, or OpenCL. Even though the APIs used are different, there are common points in the design process. We proposed that these students interact with each other using Gitter, an instant messaging tool for GitHub users. Our analysis of the communications and results demonstrates that the direct interaction of students through the Gitter tool has a positive impact on the learning process.
ISBN: 9781728104669 (print)
The dataflow model is gradually becoming the de facto standard for big data applications. While many popular frameworks are built around this model, very little research has been done on understanding its inner workings, which in turn has led to inefficiencies in existing frameworks. It is important to note that understanding the relationship between dataflow and HPC building blocks allows us to address and alleviate many of these fundamental inefficiencies by learning from the extensive research literature in the HPC community. In this paper we present TSet's, the dataflow abstraction of Twister2, which is a big data framework designed for high-performance dataflow and iterative computations. We discuss the dataflow model adopted by TSet's and the rationale behind implementing iteration handling at the worker level. Finally, we evaluate TSet's to show the performance of the framework.
In the modern filmmaking industry, image matting has been one of the common tasks in video side effects and a necessary intermediate step in computer vision. It pulls the foreground object from the background of an image by estimating the alpha values. However, matting high-resolution images can be very slow because of the algorithm's complexity and because the computation is proportional to the size of the unknown region. In order to improve the performance, we implement a parallel alpha matting code with OpenMP from existing sequential code for running on multicore servers. We present and discuss the algorithm and experimental results from the perspective of the parallel application developer. The development takes less effort, and the results show significant performance improvement of the entire program.
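For readers unfamiliar with alpha values: the abstract does not spell out its model, but matting is commonly defined through the standard compositing equation I = alpha * F + (1 - alpha) * B, where F is the foreground color, B the background color, and alpha the opacity of the foreground at a pixel. The Python sketch below applies that equation to one pixel. The function name and the pixel values are illustrative assumptions.

```python
def composite(alpha, foreground, background):
    """Blend one pixel channel by channel: I = alpha * F + (1 - alpha) * B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b
                 for f, b in zip(foreground, background))


# Hypothetical RGB values, chosen only to illustrate the blend.
pixel = composite(0.25, (200.0, 100.0, 0.0), (0.0, 100.0, 200.0))
print(pixel)  # (50.0, 100.0, 150.0)
```

Matting solves the inverse problem: given the observed color, it estimates alpha for every pixel in the unknown region, which is why the cost grows with the size of that region.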
Objectives: The electroencephalographic signal is heavily exposed to external disturbances. Therefore, an important element of its processing is thorough cleaning. Methods: One of the common methods of signal improvement is independent component analysis (ICA). However, it is a computationally expensive algorithm, hence methods are needed to decrease its execution time. One of the ICA algorithms (fastICA) and parallel computing on the CPU and GPU were used to reduce the execution time. Results: This paper presents the results of a study on an implementation of fastICA that uses a multi-core architecture and GPU computation capabilities. Conclusions: The use of such a hybrid approach shortens the execution time of the algorithm.