ABSTRACT: Although the teaching of programming has evolved over 50 years, all methodologies rely on a simple structure that was born a long time ago: the loop, shared by all high-level programming languages, and the preferred choice for any repetitive task programmers face. We analyze here how "loops" skew the way programmers solve problems, and prevent them from taking advantage of the available parallel/distributed computing architectures. To do so, we state our initial hypothesis: eliminating loops will allow a more natural parallel programming approach. The idea is to mimic a common practice today that was established in the past for a different purpose: prohibiting goto statements to improve code maintainability. This paper describes a new computer programming teaching strategy that we tested for 7 years and provides evidence on how loop prohibition, in the context of functional programming, makes students aware of data dependencies and produces 21st-century programmers who benefit from widely available parallel architectures.
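To make the loop-free idea concrete, here is a minimal sketch in Python (the abstract names no language, so the choice of Python and the names `square`, `values`, and `total_in_chunks` are assumptions made for illustration). It expresses a sum of squares with `map` and `reduce` instead of a loop, which makes the data dependencies explicit: the map step has none between elements, and the reduce step only needs an associative operator.

```python
from functools import reduce
from operator import add


def square(x: int) -> int:
    """Pure function: the result depends only on its argument."""
    return x * x


# Illustrative input, not taken from the paper.
values = [1, 2, 3, 4]

# Element-wise step: no element depends on another, so this map could be
# split across workers without changing the result.
squares = list(map(square, values))

# Combining step: addition is associative, so partial sums computed on
# separate chunks can be merged in any grouping.
total = reduce(add, squares, 0)

# The same reduction done on two independent halves gives the same answer.
half = len(squares) // 2
total_in_chunks = reduce(add, squares[:half], 0) + reduce(add, squares[half:], 0)
```

The chunked version is the kind of decomposition a parallel runtime might apply; whether the course described above teaches it this way is not stated in the abstract.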
As frame rates rise and image sizes grow, processing image sequences becomes harder. A multi-core DSP in the embedded image processing system can ensure good real-time performance. TMS320C6670 which is the ...
To take advantage of the processing power of current computing platforms, programmers typically need to develop software versions for different target devices. This task is time-consuming and requires significant programming and computer architecture expertise. A more convenient alternative is to start with a single high-level description of a program with minimal implementation details, and generate custom implementations according to the target platform. In this paper, we use MATLAB as a high-level programming language and propose a compiler that targets CPU/GPU computing platforms by generating customized implementations in C and OpenCL. We propose a number of compiler techniques to automatically generate efficient C and OpenCL code from MATLAB programs. One such technique relies on heuristics to decide when and how to use Shared Virtual Memory (SVM). The experimental results show that our approach generates code that provides significant speedups (e.g., a geometric mean speedup of 11x for a set of simple benchmarks) using a discrete GPU over equivalent sequential C code executing on a CPU. With more complex benchmarks, for which only some code regions can be parallelized and are thus offloaded, the generated code achieved speedups of up to 2.2x. We also show the impact of using SVM, specifically fine-grained buffers: the compiler achieves significant speedups, both over the versions without SVM and over versions with naive aggressive SVM use, across three CPU/GPU platforms.
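For readers unfamiliar with the metric, the following sketch shows how a geometric mean speedup is usually computed. The function name `geometric_mean` and the values in `example_speedups` are hypothetical and chosen only for illustration; they are not the paper's measurements.

```python
import math


def geometric_mean(speedups):
    """Return the n-th root of the product of n positive speedups."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    # Averaging logarithms is equivalent to taking the n-th root of the
    # product and avoids overflow for long lists.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-benchmark speedups, NOT taken from the paper.
example_speedups = [2.0, 8.0, 4.0]
gm = geometric_mean(example_speedups)  # cube root of 64, approximately 4.0
```

The geometric mean is commonly preferred over the arithmetic mean for ratios such as speedups because a single very large ratio does not dominate the summary.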
Global data movement is the most general, and therefore important, function of inter-node communication in partitioned global address space programming models such as XcalableMP. Our implementation consists of compile-time and run-time optimization for specific cases and, for general cases, run-time processing based on the calculus of common-stride section descriptors, which allows efficient construction of communication schedules for global data movement. Evaluation on the K computer and a common Linux cluster shows the implementation to be effective and useful as a compiler feature in most cases. (C) 2020 Elsevier B.V. All rights reserved.
Haskell is a modern, functional programming language with an interesting story to tell about parallelism: rather than using concurrent threads and locks, Haskell offers a variety of libraries that enable concise, high-level parallel programs with results that are guaranteed to be deterministic (independent of the number of cores and the scheduling being used).
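The determinism property mentioned here can be illustrated outside Haskell. The sketch below is an analogy in Python, not an example of the Haskell libraries the abstract refers to; `cube` and `parallel_map` are names invented for this illustration. It applies a pure function with different worker counts and obtains the same ordered result each time. It demonstrates determinism only and makes no claim about speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def cube(n: int) -> int:
    """Pure function: no shared state, so evaluation order cannot matter."""
    return n ** 3


def parallel_map(func, items, workers):
    """Apply func to every item using a pool of the given size.

    Executor.map yields results in input order regardless of which
    worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


inputs = list(range(1, 50))
result_one_worker = parallel_map(cube, inputs, workers=1)
result_four_workers = parallel_map(cube, inputs, workers=4)
```

Both results are identical because the function is pure and the results are collected in input order, which is the essence of "independent of the number of cores and the scheduling being used".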
Recent years have seen rapid growth in data-driven distributed systems, such as Hadoop MapReduce, Spark, and Dryad. However, the counterparts for high-performance or compute-intensive applications, including large-scale optimizations, modeling, and simulations, are still nascent. In this paper, we introduce DtCraft, a modern C++ based distributed execution engine to streamline the development of high-performance parallel applications. Users need no understanding of distributed computing and can focus on high-level development, leaving difficult details, such as concurrency controls, workload distribution, and fault tolerance, to be handled transparently by our system. We have evaluated DtCraft on both micro-benchmarks and large-scale optimization problems, and shown promising performance from single multicore machines to clusters of computers. In a particular semiconductor design problem, we achieved a 30x speedup with 40 nodes and 15x less development effort than a hand-crafted implementation.
Molecular diffusion plays a vital role in production from fractured reservoirs in all stages of recovery, especially for fractured reservoirs with small matrix sizes and unfavorable wettability conditions. Molecular diffusion can only be simulated by compositional reservoir simulators, which have historically employed a decoupled phase equilibrium-mass transfer model. Despite its higher performance, such a model cannot properly simulate intra- and cross-phase molecular diffusion. In the current research, a compositional fractured reservoir simulator, called Osiris, has been developed in C++ using the coupled formulation. After presenting the primary equations and algorithms, we evaluate the performance of Osiris through a series of case studies. By utilizing MPI, Osiris keeps its runtime reasonable despite the high computational demand of coupled modeling. Additionally, the simulation results of Osiris clearly show the precision of coupled modeling and the considerable effects of diffusive mass transfer on fractured reservoir performance.
ISBN: 9781450359337 (print)
Bulk Synchronous Parallel (BSP) is a simple but powerful high-level model for parallel computation. Using BSPlib, programmers can write BSP programs in the general-purpose language C. Direct Remote Memory Access (DRMA) communication in BSPlib is enabled using registrations: associations between the local memories of all processes in the BSP computation. However, the semantics of registration is non-trivial and ambiguously specified, and thus its faulty usage is a potential source of errors. We give a formal semantics of BSPlib with which we characterize correct registration. Anticipating a static analysis, we give a simplified programming model that guarantees correct registration usage, drawing upon previous work on textual alignment.
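The superstep structure behind BSP can be sketched with a toy model. The class below is a simplification written for illustration and is NOT the BSPlib API: `ToyBSP`, `put`, and `sync` are hypothetical names, and a shared string key stands in for a registration that associates memory locations across processes. It shows the one property the model relies on: a remote write issued during a superstep becomes visible only after the synchronization that ends it.

```python
class ToyBSP:
    """Toy model of BSP supersteps; a teaching sketch, not BSPlib."""

    def __init__(self, nprocs: int):
        # One local store per simulated process.
        self.memory = [{} for _ in range(nprocs)]
        # Remote writes buffered until the next synchronization.
        self.pending = []

    def put(self, dest: int, name: str, value):
        # The write is only recorded here; it is not yet visible at dest.
        self.pending.append((dest, name, value))

    def sync(self):
        # Barrier: deliver all buffered writes, ending the superstep.
        for dest, name, value in self.pending:
            self.memory[dest][name] = value
        self.pending.clear()


bsp = ToyBSP(nprocs=2)
bsp.memory[1]["x"] = 0                 # "x" plays the role of a registered area
bsp.put(dest=1, name="x", value=42)    # issued by process 0 in this superstep
before_sync = bsp.memory[1]["x"]       # still the old value
bsp.sync()
after_sync = bsp.memory[1]["x"]        # the write is now visible
```

In this toy model a write to a key that the destination never set up simply creates it; real registration is stricter, and that gap is precisely where the abstract locates the potential for errors.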
ISBN: 9783030105495; 9783030105488 (print)
In recent years, increasing attention has been given to the possibility of guaranteeing Service Level Objectives (SLOs) to users about their applications, regarding either performance or power consumption. SLOs can be implemented for parallel applications since these provide many control knobs (e.g., the number of threads to use, the clock frequency of the cores, etc.) to tune the performance and power consumption of the application. Unlike most existing approaches, we target sequential stream processing applications by proposing a solution based on C++ annotations. The user specifies which parts of the code to parallelize and what type of requirements should be enforced on that part of the code. Our solution first automatically parallelizes the annotated code and then applies self-adaptation approaches at run-time to enforce the user-expressed objectives. We ran experiments on different real-world applications, showing the simplicity and effectiveness of our solution.
ISBN: 9781728104669 (print)
The dataflow model is gradually becoming the de facto standard for big data applications. While many popular frameworks are built around this model, very little research has been done on understanding its inner workings, which in turn has led to inefficiencies in existing frameworks. It is important to note that understanding the relationship between dataflow and HPC building blocks allows us to address and alleviate many of these fundamental inefficiencies by learning from the extensive research literature in the HPC community. In this paper we present TSet's, the dataflow abstraction of Twister2, which is a big data framework designed for high-performance dataflow and iterative computations. We discuss the dataflow model adopted by TSet's and the rationale behind implementing iteration handling at the worker level. Finally, we evaluate TSet's to show the performance of the framework.