The current ubiquity of multi-core processors has brought renewed interest in program parallelization. Logic programs allow studying the parallelization of programs with complex, dynamic data structures with (declarative) pointers in a comparatively simple semantic setting. In this context, automatic parallelizers which exploit and-parallelism rely on notions of independence in order to ensure certain efficiency properties. "Non-strict" independence is a more relaxed notion than the traditional notion of "strict" independence which still ensures the relevant efficiency properties and can allow considerably more parallelism. Non-strict independence cannot be determined solely at run-time ("a priori") and thus global analysis is a requirement. However, extracting non-strict independence information from available analyses and domains is non-trivial. This paper provides, on one hand, an extended presentation of our classic techniques for compile-time detection of non-strict independence based on extracting information from (abstract interpretation-based) analyses using the now well-understood and popular Sharing + Freeness domain. This includes algorithms for combined compile-time/run-time detection which involve special run-time checks for this type of parallelism. In addition, we propose herein novel annotation (parallelization) algorithms, URLP and CRLP, which are specially suited to non-strict independence. We also propose new ways of using the Sharing + Freeness information to optimize how the run-time environments of goals are kept apart during parallel execution. Finally, we also describe the implementation of these techniques in our parallelizing compiler and recall some early performance results. We provide as well an extended description of our pictorial representation of sharing and freeness information. (C) 2009 Elsevier B.V. All rights reserved.
This paper describes a program auto-parallelizer that is based on the component approach to constructing optimizing compilers; the parallelizer is included in the technological chain of gcc. Details of using analytical and optimization components for constructing an auto-parallelizer, together with a parallelization algorithm using the OpenMP library, are considered. Finally, we discuss the auto-parallelizer's performance on a subset of problems from the Spec2006 and NAS Parallel Benchmarks suites.
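The core transformation such a parallelizer performs can be illustrated with a small, hand-written sketch (ours, not the tool's actual output): a loop whose iterations are proved free of cross-iteration dependences is annotated with an OpenMP pragma, which gcc lowers to threaded code when compiled with -fopenmp. The array sizes and loop body are illustrative.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Each iteration writes a distinct a[i] and reads only b[i], c[i],
       so the iterations are independent and safe to run in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[42] = %.1f, max threads = %d\n", a[42], omp_get_max_threads());
    return 0;
}
```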
ISBN (print): 9780769534718
Multicore processors have been adopted for consumer electronics like portable electronics, mobile phones, car navigation systems, digital TVs and games to obtain high performance with low power consumption. The OSCAR automatic parallelizing compiler has been developed to utilize these multicores easily. Also, a new Consumer Electronics Multicore Application Program Interface (API) for using the OSCAR compiler with native sequential compilers for various kinds of multicores from different vendors has been developed in the NEDO (New Energy and Industrial Technology Development Organization) "Multicore Technology for Realtime Consumer Electronics" project with six Japanese IT companies. This paper evaluates the parallel processing performance of multimedia applications using this API by the OSCAR compiler on the FR1000 multicore processor with four VLIW cores, developed by Fujitsu Ltd., and the RP1 multicore processor with four SH-4A cores, jointly developed by Renesas Technology Corp., Hitachi Ltd. and Waseda University. As a result, the parallel code generated by the OSCAR compiler using the API achieves an average speedup of 3.27 on 4 cores against 1 core on the FR1000 multicore, and an average speedup of 3.31 on 4 cores against 1 core on the RP1 multicore.
Discovering the optimum number of processors and the distribution of data on distributed memory parallel computers for a given algorithm is a demanding task. A memetic algorithm (MA) is proposed here to find the best number of processors and the best data distribution method to be used for each stage of a parallel program. A steady-state memetic algorithm is compared with a transgenerational memetic algorithm using different crossover operators and hill-climbing methods. A self-adaptive MA is also implemented, based on a multimeme strategy. All the experiments are carried out on computationally intensive, communication intensive, and mixed problem instances. The MA performs successfully for the illustrative problem instances.
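A compact sketch of the steady-state memetic scheme described above, under heavy assumptions: the cost() model, the stage count, and all constants are made up for illustration (a real objective would come from profiling or a machine model), and the paper's actual operators may differ.

```c
#include <stdio.h>
#include <stdlib.h>

#define STAGES 4
#define POP 20
#define GENS 500
#define MAX_PROCS 64
#define DISTS 3            /* e.g. block, cyclic, block-cyclic */

typedef struct { int procs[STAGES]; int dist[STAGES]; double cost; } Ind;

/* Hypothetical cost model: compute time shrinks with processors,
   communication overhead grows with them. */
static double cost(const Ind *x) {
    double c = 0;
    for (int s = 0; s < STAGES; s++)
        c += 1000.0 / x->procs[s] + 5.0 * x->procs[s] * (x->dist[s] + 1);
    return c;
}

static void randomize(Ind *x) {
    for (int s = 0; s < STAGES; s++) {
        x->procs[s] = 1 + rand() % MAX_PROCS;
        x->dist[s]  = rand() % DISTS;
    }
    x->cost = cost(x);
}

/* Local search (the "memetic" step): hill-climb on processor counts. */
static void hill_climb(Ind *x) {
    for (int s = 0; s < STAGES; s++)
        for (int d = -1; d <= 1; d += 2) {
            Ind t = *x;
            t.procs[s] += d;
            if (t.procs[s] < 1 || t.procs[s] > MAX_PROCS) continue;
            t.cost = cost(&t);
            if (t.cost < x->cost) *x = t;
        }
}

int main(void) {
    Ind pop[POP];
    for (int i = 0; i < POP; i++) randomize(&pop[i]);
    for (int g = 0; g < GENS; g++) {
        /* steady state: pick two parents, one-point crossover,
           refine the child locally, replace the worst individual */
        const Ind *a = &pop[rand() % POP], *b = &pop[rand() % POP];
        Ind child; int cut = rand() % STAGES;
        for (int s = 0; s < STAGES; s++) {
            child.procs[s] = (s < cut ? a : b)->procs[s];
            child.dist[s]  = (s < cut ? a : b)->dist[s];
        }
        child.cost = cost(&child);
        hill_climb(&child);
        int worst = 0;
        for (int i = 1; i < POP; i++)
            if (pop[i].cost > pop[worst].cost) worst = i;
        if (child.cost < pop[worst].cost) pop[worst] = child;
    }
    int best = 0;
    for (int i = 1; i < POP; i++) if (pop[i].cost < pop[best].cost) best = i;
    printf("best cost %.1f\n", pop[best].cost);
    return 0;
}
```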
Optimizing compilers rely on program analysis techniques to detect data dependences between program statements. Data dependence testing is a basic step in detecting loop-level parallelism in numerical programs. Most studies indicate that data dependence tests cannot handle nonlinear-expression array subscripts. This study presents an exact dependence test that can handle quadratic-expression array subscripts precisely. The proposed method detects whether a quadratic equation is monotonically increasing or decreasing, and then reduces the integer solution interval of each variable by repeated projection. When the effective solution interval for any variable shrinks to empty, no integer solutions exist for this quadratic equation; otherwise, all integer solutions can be found, implying that the parallelism of the loop can be exploited. (C) 2007 Elsevier Inc. All rights reserved.
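The paper's own algorithm works by repeated interval projection; as a simplified illustration of the underlying monotonicity idea only, the sketch below (ours, with illustrative coefficients) decides whether a one-variable quadratic subscript equation has an integer solution within a loop's index range by splitting at the vertex and binary-searching each monotone half. No solution in range means the accesses cannot collide and the loop may be parallelized.

```c
#include <stdio.h>

static long f(long a, long b, long c, long x) { return a*x*x + b*x + c; }

/* Binary search for a zero on a range where f is non-decreasing. */
static int zero_inc(long a, long b, long c, long lo, long hi, long *out) {
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2, v = f(a, b, c, mid);
        if (v == 0) { *out = mid; return 1; }
        if (v < 0) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}

/* Binary search for a zero on a range where f is non-increasing. */
static int zero_dec(long a, long b, long c, long lo, long hi, long *out) {
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2, v = f(a, b, c, mid);
        if (v == 0) { *out = mid; return 1; }
        if (v > 0) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}

/* Does a*x^2 + b*x + c = 0 have an integer root with lo <= x <= hi? */
int has_integer_root(long a, long b, long c, long lo, long hi, long *out) {
    if (a == 0) {                               /* degenerate, linear case */
        if (b == 0) return c == 0 ? (*out = lo, 1) : 0;
        if (-c % b != 0) return 0;
        *out = -c / b;
        return lo <= *out && *out <= hi;
    }
    if (a < 0) { a = -a; b = -b; c = -c; }      /* open the parabola upward */
    long vx = -b / (2 * a);                     /* near the vertex: f is
                                                   decreasing before it,
                                                   increasing after it */
    if (lo <= vx && zero_dec(a, b, c, lo, vx < hi ? vx : hi, out)) return 1;
    long start = vx + 1 > lo ? vx + 1 : lo;
    return start <= hi && zero_inc(a, b, c, start, hi, out);
}

int main(void) {
    long x;
    /* subscript equation i^2 - 5i + 6 = 0 for 0 <= i <= 10: roots 2, 3 */
    printf("root? %d (x=%ld)\n", has_integer_root(1, -5, 6, 0, 10, &x), x);
    /* i^2 + 1 = 0 has no integer root: the loop can be parallelized */
    printf("root? %d\n", has_integer_root(1, 0, 1, 0, 10, &x));
    return 0;
}
```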
ISBN (print): 1424407281
Matlab is one of the most popular computer languages for technical and scientific programming. However, until recently, it has been limited to running on uniprocessors. One strategy for overcoming this limitation is to introduce global distributed arrays, with those arrays distributed across the processors of a parallel machine. In this paper, we describe the compilation technology we have designed for Matlab D, a distributed-array extension of Matlab. Our approach is distinguished by a two-phase compilation technology with support for a rich collection of data distributions. By precompiling array operations and communication steps into Fortran plus MPI, the time to compile an application using those operations is significantly reduced. This paper includes preliminary results that demonstrate that this approach can dramatically improve performance, scaling well to at least 32 processors.
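One concrete piece of machinery any distributed-array compiler must generate is ownership and index-translation arithmetic. The sketch below is our own illustration of a plain block distribution (one of many distributions such a system supports), not Matlab D's actual runtime:

```c
#include <stdio.h>

/* Owned global range [lo, hi) of an n-element array on rank p of np;
   the first n % np ranks each hold one extra element. A global index g
   owned by rank p maps to local index g - lo. */
static void block_range(int n, int np, int p, int *lo, int *hi) {
    int base = n / np, rem = n % np;
    *lo = p * base + (p < rem ? p : rem);
    *hi = *lo + base + (p < rem ? 1 : 0);
}

int main(void) {
    int n = 10, np = 4;
    for (int p = 0; p < np; p++) {
        int lo, hi;
        block_range(n, np, p, &lo, &hi);
        printf("rank %d owns global [%d,%d) -> local 0..%d\n",
               p, lo, hi, hi - lo - 1);
    }
    return 0;
}
```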
ISBN (print): 9780769530352
SIMD (Single Instruction Multiple Data) is a processor-architecture classification from Flynn's taxonomy. The concept is that a single instruction operates on multiple units of data simultaneously. Computers that use this processor architecture are known as array processors or vector processors. Most computers in use today are SISD (Single Instruction Single Data), though allowing a single instruction to operate on multiple data can also be applied to a virtual machine that is capable of parallel execution through the use of multi-threading/multi-core processors, or distributed parallel execution on a multi-computer grid. This paper proposes a language structure that applies the SIMD concept to the Java virtual machine. The motive is to reduce the complexity of the code and ease the implementation of parallelization by running a single set of instructions concurrently on an entire collection of objects.
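For comparison only (the paper's construct is a Java language structure, not shown here), the SIMD execution model being lifted to the JVM is the one a vectorizing C compiler already applies to element-wise loops: one operation spread uniformly across a whole collection.

```c
#include <stdio.h>

#define N 8

/* One operation applied uniformly to every element; a vectorizing
   compiler (e.g. gcc at -O3) executes several iterations at once
   using hardware SIMD registers. */
void scale_add(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[N] = {1, 2, 3, 4, 5, 6, 7, 8}, y[N] = {0};
    scale_add(N, 2.0f, x, y);
    for (int i = 0; i < N; i++) printf("%.0f ", y[i]);
    printf("\n");
    return 0;
}
```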
Developing parallel software is far more complex than developing traditional sequential software. An effective approach to dealing with the complexity of parallel software is domain-specific programming at a higher level of abstraction than general-purpose programming languages. In this paper, we focus on the domain of applications based on partial differential equations (PDEs) and provide a formal framework and methods for PDE compilers to generate parallel iterative codes for the domain. We also provide a PDE compiler optimization that minimizes the number of messages between parallel processors. Our framework and methods can be used to build PDE compilers that automatically generate efficient parallel software for PDE-based applications.
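A hand-written sketch (not code generated by the framework) of the kind of target such a PDE compiler emits, and of the message-count concern it optimizes: a 1-D Jacobi iteration where each rank exchanges exactly one batched message per neighbor per sweep instead of one message per boundary value. Sizes, the stencil, and the boundary condition are illustrative.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NLOC 1000   /* interior points per rank (illustrative) */
#define ITERS 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int left  = rank > 0      ? rank - 1 : MPI_PROC_NULL;
    int right = rank < np - 1 ? rank + 1 : MPI_PROC_NULL;

    static double u[NLOC + 2], v[NLOC + 2];   /* +2 ghost cells */
    if (rank == 0) u[0] = 1.0;                /* fixed boundary value */

    for (int it = 0; it < ITERS; it++) {
        /* halo exchange: exactly one message to each neighbor per sweep */
        MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                     &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                     &u[0],        1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 1; i <= NLOC; i++)
            v[i] = 0.5 * (u[i - 1] + u[i + 1]);   /* Jacobi update */
        memcpy(&u[1], &v[1], NLOC * sizeof(double));  /* interior only */
    }
    if (rank == 0) printf("u[1] after %d sweeps: %g\n", ITERS, u[1]);
    MPI_Finalize();
    return 0;
}
```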
Optimizing compilers rely upon program analysis techniques to detect data dependences between program statements. Data dependence information captures the essential ordering constraints of the statements in a program that need to be preserved in order to produce valid optimized and parallel code. Data dependence testing is very important for automatic parallelization, vectorization, and any other code transformation. In this paper, we examine the impact of data dependence analysis in practice. A number of data dependence tests have been proposed in the literature. In each test, there are different trade-offs between accuracy and efficiency. We present an experimental evaluation of several data dependence tests, including the Banerjee test, the I-Test, and the Omega test. We compare these tests in terms of data dependence accuracy, compilation efficiency, effectiveness in parallelization, and program execution performance. We analyze the reasons why a data dependence test can be inexact and we explain how the examined tests handle such cases. We run various experiments using the Perfect Club Benchmarks and the scientific library Lapack. We present the measured accuracy of each test and the reasons for any approximation. We compare these tests in terms of efficiency and we analyze the trade-offs between accuracy and efficiency. We also determine the impact of each data dependence test on the total compilation time. Finally, we measure the number of loops parallelized by each test and we compare the execution performance of each benchmark on a multiprocessor. Our results indicate that the Omega test is more accurate, but also very inefficient in the cases where the other two tests are inaccurate. In general, the cost of the Omega test is high, consuming a significant percentage of the total compilation time. Furthermore, the additional accuracy of the Omega test over the Banerjee test and the I-Test does not improve parallelization or program execution performance.
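To make the dependence-equation setting concrete, here is a sketch of the simplest member of this family, the GCD test (a building block which the I-Test combines with Banerjee-style bounds; it is not one of the three tests the paper evaluates): the equation a*i - b*j = c2 - c1 arising from accesses A[a*i + c1] and A[b*j + c2] has an integer solution iff gcd(a, b) divides c2 - c1.

```c
#include <stdio.h>
#include <stdlib.h>

static long gcd(long x, long y) {
    while (y) { long t = x % y; x = y; y = t; }
    return labs(x);
}

/* May the write A[a*i + c1] and the read A[b*j + c2] touch the same cell?
   (Conservative: ignores loop bounds, which Banerjee-style tests add.) */
int gcd_test_may_depend(long a, long b, long c1, long c2) {
    long g = gcd(a, b);
    if (g == 0) return c1 == c2;          /* both subscripts constant */
    return (c2 - c1) % g == 0;            /* solvable => dependence possible */
}

int main(void) {
    /* for (i) { A[2*i] = ...; ... = A[2*i + 1]; }  -- even vs. odd cells */
    printf("%d\n", gcd_test_may_depend(2, 2, 0, 1));   /* 0: independent */
    printf("%d\n", gcd_test_may_depend(2, 2, 0, 4));   /* 1: may depend  */
    return 0;
}
```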
We describe and evaluate a novel approach for the automatic parallelization of programs that use pointer-based dynamic data structures, written in Java. The approach exploits parallelism among methods by creating an asynchronous thread of execution for each method invocation in a program. At compile time, methods are analyzed to determine the data they access, parameterized by their context. A description of these data accesses is transmitted to a run-time system during program execution. The run-time system utilizes this description to determine when a thread may execute, and to enforce dependences among threads. This run-time system is the main focus of this paper. More specifically, the paper details the representation of data accesses in a method and the framework used by the run-time system to detect and enforce dependences among threads. Experimental evaluation of an implementation of the run-time system on a four-processor Sun multiprocessor indicates that close to ideal speedup can be obtained for a number of benchmarks. This validates our approach.
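A sketch (our simplification in C, not the paper's Java runtime) of the core check such a run-time system performs: each method invocation carries a descriptor of the objects it reads and writes, and two invocations may execute concurrently only if their descriptors show no write/write or read/write overlap. The object ids and descriptor layout are assumptions for illustration.

```c
#include <stdio.h>

#define MAX_ACC 8

typedef struct {
    int reads[MAX_ACC];  int nreads;    /* ids of objects read    */
    int writes[MAX_ACC]; int nwrites;   /* ids of objects written */
} Access;

static int overlaps(const int *a, int na, const int *b, int nb) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i] == b[j]) return 1;
    return 0;
}

/* Two invocations conflict on write/write or read/write overlap;
   if they do not conflict, their threads may run in parallel. */
int conflicts(const Access *x, const Access *y) {
    return overlaps(x->writes, x->nwrites, y->writes, y->nwrites)
        || overlaps(x->writes, x->nwrites, y->reads,  y->nreads)
        || overlaps(x->reads,  x->nreads,  y->writes, y->nwrites);
}

int main(void) {
    Access m1 = { .reads = {1}, .nreads = 1, .writes = {2}, .nwrites = 1 };
    Access m2 = { .reads = {3}, .nreads = 1, .writes = {4}, .nwrites = 1 };
    Access m3 = { .reads = {2}, .nreads = 1, .writes = {5}, .nwrites = 1 };
    printf("m1,m2 conflict? %d\n", conflicts(&m1, &m2));  /* 0: parallel ok */
    printf("m1,m3 conflict? %d\n", conflicts(&m1, &m3));  /* 1: must order */
    return 0;
}
```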