ISBN: (Print) 9781450333573
The need for easily extendable programming language tools has anything but disappeared with the proliferation of languages, paradigms and new compilation tools. A particular area of new language research is the domain of parallel programming, which often requires new language abstractions on various levels. We introduce a framework for building extendable compilers with composable abstractions, utilizing features of object-oriented design, functional programming and dependent types. We demonstrate our approach with examples of parsing and intermediate-representation formats extracted from the compiler for the parallel Replica language, and briefly evaluate the approach from the point of view of developer productivity.
The problem of load balancing arises in parallel mesh-based numerical solution of problems of continuum mechanics, energetics, electrodynamics, etc. on high-performance computing systems. The number of processors on which a computational problem will run is often unknown. It therefore makes sense to partition a mesh into a great number of microdomains, which are then used to create subdomains. The graph partitioning methods implemented in the state-of-the-art parallel partitioning tools ParMETIS, Jostle, PT-Scotch and Zoltan are based on multilevel algorithms. That approach has the shortcoming of forming unconnected subdomains. Another shortcoming of present graph partitioning methods is the generation of strongly imbalanced partitions. We developed GridSpiderPar, a program package for parallel decomposition of large meshes. We compared different partitions into microdomains, microdomain graph partitions and partitions into subdomains of several meshes (10^8 vertices, 10^9 elements) obtained by means of the partitioning tool GridSpiderPar and the packages ParMETIS, Zoltan and PT-Scotch. The balance of the partitions, the edge-cut and the number of unconnected subdomains in the different partitions were compared, as well as the computational performance of gas-dynamic problem simulations run on the different partitions. The obtained results demonstrate the advantages of the devised algorithms.
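The quality criteria the abstract compares (balance, edge-cut, number of unconnected subdomains) can be computed directly on a toy partition. The sketch below uses our own naming (`partition_metrics`) and is an illustration of the metrics only, not part of GridSpiderPar:

```python
from collections import defaultdict, deque

def partition_metrics(adj, part, k):
    """Edge-cut, load imbalance and subdomain connectivity of a partition.

    adj  : dict vertex -> list of neighbour vertices (undirected graph)
    part : dict vertex -> subdomain id in range(k)
    """
    # Count each undirected edge once; it is cut if its endpoints differ.
    edge_cut = sum(1 for u in adj for v in adj[u]
                   if u < v and part[u] != part[v])
    sizes = defaultdict(int)
    for v in part:
        sizes[part[v]] += 1
    imbalance = max(sizes.values()) / (len(part) / k)  # 1.0 = perfect balance

    def connected(d):
        # BFS restricted to subdomain d must reach all of d's vertices.
        verts = [v for v in part if part[v] == d]
        seen, todo = {verts[0]}, deque([verts[0]])
        while todo:
            u = todo.popleft()
            for w in adj[u]:
                if part[w] == d and w not in seen:
                    seen.add(w)
                    todo.append(w)
        return len(seen) == len(verts)

    unconnected = sum(1 for d in range(k) if not connected(d))
    return edge_cut, imbalance, unconnected

# A 2x3 grid mesh split into two 1x3 halves: 3 cut edges, perfect balance.
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [3, 1, 5], 5: [4, 2]}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(partition_metrics(adj, part, 2))  # -> (3, 1.0, 0)
```

Multilevel partitioners minimise the first number (edge-cut); the point of the abstract is that doing so can silently worsen the other two.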
ISBN: (Print) 9781450334846
High-level programming languages and domain-specific languages can often benefit from the increased power efficiency of heterogeneous computing. OpenCL can serve as a compiler target for portable code generation and runtime management. By using OpenCL as the target platform, compiler writers can focus on more important, higher-level problems in language implementation. Such improved productivity can enable a proliferation of high-level programming languages for heterogeneous computing systems. The C++ programming language provides several high-level, developer-friendly features that are missing from OpenCL. These high-level features support software engineering practices and improve developer productivity. With the advent of OpenCL 2.0 and Heterogeneous System Architecture (HSA), more C++ language constructs can be efficiently mapped and executed on multi-core architectures. It is the compiler writer's job to translate these features into OpenCL constructs without incurring an excessive level of overhead. C++ AMP is a parallel programming extension to C++, and MulticoreWare has contributed to Clamp, an open-source implementation. The compiler is based on Clang/LLVM and can target multiple platforms such as OpenCL, SPIR and HSA. We present some important implementation techniques used in this compiler, and we also show how shared virtual memory and platform atomics allow more generic C++ code to leverage multi-core architectures.
ISBN: (Print) 9781450339100
Understanding and identifying performance problems is difficult for parallel applications, but is an essential part of software development for parallel systems. In addition to the same problems that exist when analysing sequential programs, software development tools for parallel systems must handle the large number of execution engines (cores) that result in different (possibly non-deterministic) schedules across executions. Understanding where exactly a concurrent program spends its time (especially if some aspects of the program paths depend on input data) is the first step towards improving program quality. State-of-the-art profilers, however, aid developers in performance diagnosis by providing hotness information at the level of a class or method (function) and usually report data for just a single program execution. This paper presents a profiling and analysis technique that consolidates execution information from multiple program executions. Currently, our tool's focus is on execution time (CPU cycles), but other metrics (stall cycles for functional units, cache miss rates, etc.) are possible, provided such data can be obtained from the processor's monitoring unit. To detect the location of performance anomalies that are worth addressing, the average amount of time spent inside a code block, along with the statistical range of the minimum and maximum amount of time spent, is taken into account. The technique identifies performance bottlenecks at the fine-grained level of a basic block. It can indicate the probability of such a performance bottleneck appearing during actual program executions. The technique utilises profiling information across a range of inputs and tries to induce performance bottlenecks by delaying random memory accesses. The approach is evaluated by performing experiments on the data compression tool pbzip2, the multi-threaded download accelerator axel, the open-source security scanner Nmap and the Apache httpd web server. An experimental evalu...
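The core consolidation step (merging per-basic-block timings from several runs and flagging blocks with a wide min-max range relative to their average) can be illustrated with a minimal sketch. The function name `consolidate`, the block ids and the "range exceeds twice the average" threshold are our own assumptions, not the paper's actual statistical criterion:

```python
from collections import defaultdict
from statistics import mean

def consolidate(runs):
    """Merge per-basic-block cycle counts collected from several executions.

    runs: list of dicts, each mapping block id -> cycles for one run.
    Returns block -> (avg, min, max, flagged); a block is flagged when its
    observed min-max range spans more than twice its average cost.
    """
    per_block = defaultdict(list)
    for run in runs:
        for blk, cycles in run.items():
            per_block[blk].append(cycles)
    report = {}
    for blk, xs in per_block.items():
        avg, lo, hi = mean(xs), min(xs), max(xs)
        report[blk] = (avg, lo, hi, (hi - lo) > 2 * avg)
    return report

runs = [{"bb1": 100, "bb2": 10},
        {"bb1": 105, "bb2": 400},   # bb2's cost depends heavily on input
        {"bb1": 98,  "bb2": 12}]
rep = consolidate(runs)
print(rep["bb2"][3])  # -> True: the highly variable block is flagged
```

A stable block like `bb1` stays unflagged even though it is "hotter" on average, which is the distinction a single-execution, per-method profiler cannot make.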
While Partitioned Global Address Space (PGAS) programming languages such as UPC/UPC++, CAF, Chapel and X10 provide high-level programming models for facilitating large-scale distributed-memory parallel programming, it ...
ISBN: (Print) 9781450335867
Loop transformations are known to be important for the performance of compute-intensive programs, and are often used to expose parallelism. However, many transformations involving loops obfuscate the code and are cumbersome to apply by hand. The goal of this paper is to explore alternative methods for expressing parallelism that are more friendly to the programmer. In particular, we seek to expose parallelism without significantly changing the original loop structure. We illustrate how clocks in X10 can be used to express some of the traditional loop transformations, in the presence of parallelism, in a manner that we believe to be less invasive. Specifically, parallelism corresponding to one-dimensional affine schedules can be expressed without modifying the original loop structure and/or statements.
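A barrier is a rough stand-in for X10 clocks in other languages. The Python sketch below (our own analogy, not X10 code) shows the idea: workers keep the original loop structure, and the barrier enforces the one-dimensional schedule "all of phase t before any of phase t+1" without any loop skewing or restructuring:

```python
import threading

N, WORKERS = 4, 3
clock = threading.Barrier(WORKERS)   # plays the role of an X10 clock
log = []
lock = threading.Lock()

def worker(wid):
    for t in range(N):               # original loop structure, unchanged
        with lock:
            log.append((t, wid))     # the "statement" executed at phase t
        clock.wait()                 # advance the clock between phases

threads = [threading.Thread(target=worker, args=(w,)) for w in range(WORKERS)]
for th in threads:
    th.start()
for th in threads:
    th.join()

# Within a phase the worker order is arbitrary, but phases never interleave.
phases = [t for t, _ in log]
print(phases == sorted(phases))  # -> True
```

The invasiveness argument is visible here: the only additions to the sequential loop are the spawn of the workers and the `clock.wait()` call; the loop bounds and body are untouched.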
Coupling a database and a parallel-programming framework reduces the I/O overhead between them. However, serious issues arise, such as memory bandwidth limitations, load imbalances, and race conditions. Existin...
Interactive Parallelization Tool (IPT) is a semi-automatic tool that can be used by domain experts and students for transforming certain classes of existing applications into multiple parallel variants. An end-user of...
In this work we discuss the problem of teaching programming for the Intel Xeon Phi architecture. We present a practice-oriented approach accompanied by an extensive practical part. Our method has a distinctive feature of com...
ISBN: (Print) 9781450333528
The growing power of processors allows us to implement increasingly complex multimedia algorithms. However, this processor power is only available if the algorithms are implemented in a way that exploits the multi-core parallelism of these processors. Today, this requires that the skillsets for algorithm development and for parallel programming be tightly combined. By providing a language, compiler and runtime that allow algorithm developers to specify algorithms as a series of data-transforming kernels written in C++, while the parallelization opportunities are built into the compiler and runtime, we hope to alleviate this need for a dual skillset. In this paper, we focus on the performance improvements that our system can achieve by combining language design, compiler knowledge and runtime decisions to overcome performance bottlenecks from fine-grained kernel scheduling and cache-line contention, without adapting the algorithms they implement. Copyright 2015 ACM.
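The kernel-pipeline programming model can be sketched as plain function composition. The `pipeline` helper below is hypothetical (the actual system compiles C++ kernels and makes scheduling and fusion decisions in its compiler and runtime); it only illustrates how the developer's view stays sequential while the stages remain separable units for the runtime:

```python
def pipeline(*kernels):
    """Compose data-transforming kernels into one stage-by-stage pass.

    A compiler/runtime like the one described could fuse adjacent stages
    per data block to avoid fine-grained scheduling overhead and to keep
    intermediate results cache-resident.
    """
    def run(block):
        for kernel in kernels:
            block = kernel(block)
        return block
    return run

# Two toy pixel kernels: brighten, then clamp to the 8-bit range.
scale = lambda pixels: [v * 2 for v in pixels]
clamp = lambda pixels: [min(v, 255) for v in pixels]

process = pipeline(scale, clamp)
print(process([10, 200]))  # -> [20, 255]
```

The algorithm developer writes only `scale` and `clamp`; where and how the stages run in parallel is the runtime's decision, which is the dual-skillset separation the abstract describes.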