ISBN:
(Print) 9783642195945
Heterogeneous multicores have been attracting much attention as a way to attain high performance while keeping power consumption low in a wide range of areas. However, heterogeneous multicores impose very difficult programming on programmers, and the resulting long application development periods lower product competitiveness. To overcome this situation, this paper proposes a compilation framework that bridges the gap between programmers and heterogeneous multicores. In particular, it describes a compilation framework based on the OSCAR compiler that realizes coarse-grain task parallel processing, data transfer using a DMA controller, and power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. The paper also evaluates the processing performance and the power reduction achieved by the proposed framework on a newly developed 15-core heterogeneous multicore chip named RP-X, which integrates 8 general-purpose processor cores and 3 types of accelerator cores and was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology, and Waseda University. The framework attains speedups of up to 32x for an optical flow program using eight general-purpose processor cores and four DRP (Dynamically Reconfigurable Processor) accelerator cores against sequential execution on a single processor core, and an 80% power reduction for real-time AAC encoding.
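As a rough illustration of what such compiler-generated code can look like, the sketch below stages a task's input into accelerator-local memory through a DMA controller and lowers the host core's clock frequency while the accelerator runs, instead of spin-waiting at full power. All names here (dma_put, dma_wait, fv_set, drp_optical_flow) are hypothetical stand-ins for illustration, not the actual OSCAR-generated interface or RP-X runtime.

/* Hypothetical sketch of compiler-generated coarse-grain task code.
 * dma_put(), dma_wait(), fv_set() and drp_optical_flow() are invented
 * stand-ins, not the real OSCAR API or RP-X runtime. */
#include <stddef.h>

#define FV_FULL 100              /* full clock frequency (percent) */
#define FV_LOW   25              /* reduced frequency while waiting */

extern void dma_put(void *dst, const void *src, size_t bytes);
extern void dma_wait(void);      /* block until the DMA completes */
extern void fv_set(int percent); /* DVFS: set this core's frequency */
extern void drp_optical_flow(float *buf, size_t n); /* accelerator task */

void coarse_grain_task(float *frame, float *drp_buf, size_t n)
{
    /* Move the task's input to accelerator-local memory with the DMAC. */
    dma_put(drp_buf, frame, n * sizeof(float));
    dma_wait();

    /* The host core is idle until the DRP finishes, so drop its
       frequency (or clock-gate it) rather than burn power waiting. */
    fv_set(FV_LOW);
    drp_optical_flow(drp_buf, n);
    fv_set(FV_FULL);
}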
Optimizing inter-processor (PE) communication is crucial for parallelizing compilers targeting message-passing parallel machines to achieve high performance. In this paper, we propose a technique to eliminate redundant inter-PE messages. The technique uses data-flow analysis to find a definition point corresponding to a use point where the definition and the use occur in different PEs. If several read accesses occurring in the same PE use the data defined at the same definition point in another PE, the redundant inter-PE messages are eliminated as follows: only one inter-PE communication is performed, for the earliest read access, and the previously received data are reused for the following reads. To guarantee the consistency of the data, a valid flag and a sent flag are provided for each chunk of received data. The control of these flags is equivalent to the coherence control by self-invalidation in a compiler-aided cache coherence scheme.
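A minimal sketch of the consumer-side bookkeeping this implies: the first read of a remotely defined chunk triggers one receive, later reads reuse the buffered copy, and clearing the flag plays the role of self-invalidation. The chunk layout and the pe_recv primitive are assumptions for illustration; the sender-side sent flag, which symmetrically suppresses duplicate sends, is omitted.

/* Sketch of the valid-flag check a compiler could emit on the
 * consuming PE. pe_recv() and CHUNK_WORDS are illustrative
 * assumptions, not the paper's implementation. */
#include <stdbool.h>

#define CHUNK_WORDS 256

typedef struct {
    double data[CHUNK_WORDS];
    bool   valid;                /* set once the chunk was received */
} chunk_t;

extern void pe_recv(int src_pe, double *buf, int nwords);

/* Every compiler-inserted read of a chunk defined on PE src_pe goes
 * through this helper; only the earliest read pays for a message. */
double read_remote(chunk_t *c, int src_pe, int index)
{
    if (!c->valid) {             /* first use: fetch exactly once */
        pe_recv(src_pe, c->data, CHUNK_WORDS);
        c->valid = true;
    }
    return c->data[index];       /* later uses: no inter-PE message */
}

/* When the producer redefines the chunk, the compiler emits the
 * analog of self-invalidation: the next read refetches. */
void invalidate(chunk_t *c) { c->valid = false; }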
This paper presents a model for an automatically parallelizing compiler based on C++ which consists of compile-time and run-time parallelizing systems. The paper also describes a method for finding both intra-object and inter-object parallelism. The parallelism detection is completely transparent to users.
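The paper targets C++, but the two kinds of parallelism it names can be pictured in plain C: inter-object parallelism runs operations of independent objects concurrently, while intra-object parallelism would split the work inside a single operation. The vec_t type and its vec_scale "method" below are invented for the sketch.

/* Two independent "objects" whose methods run concurrently
 * (inter-object parallelism); splitting the loop inside vec_scale
 * across threads would be intra-object parallelism. */
#include <pthread.h>

typedef struct { double v[1000]; } vec_t;

static void *vec_scale(void *arg)   /* an object's method */
{
    vec_t *self = arg;
    for (int i = 0; i < 1000; i++)
        self->v[i] *= 2.0;
    return 0;
}

int main(void)
{
    static vec_t a, b;              /* no data shared between them */
    pthread_t ta, tb;
    pthread_create(&ta, 0, vec_scale, &a);
    pthread_create(&tb, 0, vec_scale, &b);
    pthread_join(ta, 0);
    pthread_join(tb, 0);
    return 0;
}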
The current influx of networked workstations has prompted people to use this platform as a multiprocessing environment. In addition, tools like the Parallel Virtual Machine (PVM) have fuelled the growth even further. In this work we present the design, and some possible future strategies, of a compilation tool called PACWON for automatically parallelizing sequential programs on a network of workstations (NOW). The sequential programs are written in a subset of C, without pointers and structures. The target language is C embedded with PVM library calls. The automatically parallelized programs are run in a NOW environment. (C) 1998 Elsevier Science B.V. All rights reserved.
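The output format the abstract describes, C with embedded PVM calls, has roughly the master/worker shape sketched below. The PVM 3 calls are the library's standard API; the doubled-integer "work" and the assumption that this binary is installed under the task name "worker" in PVM's search path are invented for the example.

/* Rough shape of code a tool like PACWON could emit: plain C with
 * PVM 3 calls. The doubled-integer work is an invented example. */
#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int tid, n = 21;

    if (pvm_parent() == PvmNoParent) {      /* master side */
        pvm_spawn("worker", 0, PvmTaskDefault, "", 1, &tid);
        pvm_initsend(PvmDataDefault);       /* pack and send input */
        pvm_pkint(&n, 1, 1);
        pvm_send(tid, 1);
        pvm_recv(tid, 2);                   /* wait for the result */
        pvm_upkint(&n, 1, 1);
        printf("result = %d\n", n);
    } else {                                /* spawned worker side */
        pvm_recv(-1, 1);
        pvm_upkint(&n, 1, 1);
        n *= 2;                             /* the parallelized work */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_send(pvm_parent(), 2);
    }
    pvm_exit();
    return 0;
}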
ISBN:
(Print) 9781467352857
The purpose of this article is to design and implement a compiler for parallelizing Java applications that follow the divide-and-conquer pattern. The compiler is built around the Java ForkJoin framework, which is directly integrated into Java 1.7 and imported as an archive library in Java 1.6 and 1.5. The compiler aims to make the parallelization of recursive applications easier and less error-prone. While the Java ForkJoin framework exposes two user-level performance parameters, the number of threads and the threshold, our compiler introduces a third one, MaxDepth: the maximum recursion depth beyond which only sequential execution is performed. This allows balancing between fine-grain and coarse-grain parallelism. Experimental results are presented and discussed.
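The MaxDepth cutoff generalizes beyond ForkJoin; the sketch below transplants the idea into C with OpenMP tasks (the paper itself emits Java ForkJoin code): above MAX_DEPTH the recursion stops spawning tasks and falls back to plain sequential calls, trading fine-grain parallelism against task overhead. The Fibonacci workload is an invented example.

/* The MaxDepth idea rendered with OpenMP tasks instead of Java
 * ForkJoin: deep recursion levels run sequentially. */
#include <stdio.h>

#define MAX_DEPTH 4   /* tuning knob, like the paper's MaxDepth */

static long fib(int n, int depth)
{
    long a, b;
    if (n < 2) return n;
    if (depth >= MAX_DEPTH)              /* coarse grain exhausted: */
        return fib(n - 1, depth) + fib(n - 2, depth); /* go sequential */

    #pragma omp task shared(a)
    a = fib(n - 1, depth + 1);
    #pragma omp task shared(b)
    b = fib(n - 2, depth + 1);
    #pragma omp taskwait
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single
    r = fib(30, 0);
    printf("fib(30) = %ld\n", r);
    return 0;
}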
ISBN:
(Print) 1892512459
When loops in sequential programs written in procedural programming languages are parallelized, automatic parallelizing compilers must perform data dependence analysis in order to preserve the constraints imposed by the data reference order. Although loops with dependences, as determined by a dependence analyzer, cannot in general be parallelized as-is, some can be parallelized after applying appropriate loop restructuring optimizations. This paper deals with the design and implementation of the loop restructuring feature of our automatic parallelizing compiler, MIRAL, and shows the evaluation results of several pilot studies obtained by hand-compiling some test programs.
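One classic restructuring of the kind such a feature applies is scalar privatization (a limiting form of scalar expansion), sketched below on an invented loop: the scalar t carries output and anti dependences across iterations, and giving each iteration its own copy removes them, after which the loop parallelizes.

/* Before: t is reused across iterations, so a dependence analyzer
 * reports loop-carried output/anti dependences on it. */
void before(float *a, const float *b, const float *c, int n)
{
    float t;
    for (int i = 0; i < n; i++) {
        t = b[i] + c[i];
        a[i] = t * t;
    }
}

/* After privatization: each iteration owns its t, the storage
 * dependence disappears, and the loop runs in parallel. */
void after(float *a, const float *b, const float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float t = b[i] + c[i];   /* private per iteration */
        a[i] = t * t;
    }
}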
Multithreading support seems to be the most obvious way for operating systems to help programmers take advantage of parallelism. Although multithreading is powerful on many multiprocessors, we sometimes still lack good parallelizing compilers to help programmers exploit parallelism and gain performance benefits. In this paper, a model of a FORTRAN parallelizing compiler on multithreading OSF/1 is first proposed and then generalized so that it is useful in constructing a parallelizing compiler for a particular language, to generate insight into the development of a high-performance parallelizing compiler.
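What such a compiler emits can be pictured as the fan-out below: a sequential Fortran-style loop rewritten as chunks handed to OS threads. POSIX threads stand in here for OSF/1's thread interface, and the DAXPY-style loop body is an invented example.

/* Sketch of compiler-emitted multithreaded code for one parallel
 * loop; pthreads stand in for OSF/1 threads. */
#include <pthread.h>

#define N 1024
#define NTHREADS 4

static double x[N], y[N];

typedef struct { int lo, hi; } range_t;

static void *chunk(void *arg)       /* one thread's loop share */
{
    range_t *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        y[i] = 2.0 * x[i] + y[i];
    return 0;
}

void parallel_loop(void)
{
    pthread_t tids[NTHREADS];
    range_t r[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = t * (N / NTHREADS);       /* block distribution */
        r[t].hi = (t + 1) * (N / NTHREADS);
        pthread_create(&tids[t], 0, chunk, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)      /* join = loop's end */
        pthread_join(tids[t], 0);
}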
ISBN:
(Print) 9780769545165
Effectively utilizing the compute power of modern multi-core machines is a challenging task for a programmer. Automated extraction of shared memory parallelism via powerful compiler transformations and optimizations is one means to that goal. However, the effectiveness of such transformations is tied to detailed characteristics of the target computer system. In this paper, we describe an automated system for capturing such computer system characteristics, based on prior work on various parts of the overall problem. The characteristics measured include the number of compute elements available to run threads, multiple memory hierarchy parameters, and functional unit latencies and bandwidths. We show experimental results on a wide range of compute platforms that validate the effectiveness of the overall approach.
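Two of the listed measurements can be sketched compactly: counting compute elements with a POSIX query, and exposing memory hierarchy levels by timing a pointer chase over growing working sets, where latency steps mark cache boundaries. This is a minimal sketch; a real characterization tool would repeat runs, pin threads, and post-process the timings.

/* Core count via sysconf, cache levels via a dependent-load chase. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static volatile size_t sink;     /* keeps the chase from being elided */

static double chase_ns(size_t words)  /* avg latency per dependent load */
{
    size_t *buf = malloc(words * sizeof *buf);
    for (size_t i = 0; i < words; i++) buf[i] = i;
    /* Sattolo's algorithm: a random single-cycle permutation, so each
       load depends on the previous one and prefetchers get no help. */
    for (size_t i = words - 1; i > 0; i--) {
        size_t j = rand() % i, t = buf[i];
        buf[i] = buf[j]; buf[j] = t;
    }
    struct timespec t0, t1;
    size_t p = 0, iters = 5 * 1000 * 1000;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) p = buf[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;
    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9
            + (t1.tv_nsec - t0.tv_nsec)) / iters;
}

int main(void)
{
    printf("compute elements: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    for (size_t kb = 16; kb <= 16384; kb *= 2)  /* sweep working set */
        printf("%6zu KB: %.2f ns/load\n",
               kb, chase_ns(kb * 1024 / sizeof(size_t)));
    return 0;
}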
ISBN:
(Print) 9783030993726; 9783030993719
With an increasing number of shared memory multicore processor architectures, there is a requirement for supporting multiple architectures in automatic parallelizing compilers. The OSCAR (Optimally Scheduled Advanced Multiprocessor) automatic parallelizing compiler is able to parallelize many different sequential programs, such as scientific applications, embedded real-time applications, multimedia applications, and more. The OSCAR compiler's features include coarse-grain task parallelization with earliest execution condition analysis, analyzing both data and control dependencies, data locality optimizations over different loop nests with data dependencies, and the ability to generate parallelized code using the OSCAR API 2.1. The OSCAR API 2.1 is compatible with OpenMP for SMP multicores, with additional directives for power control and support for heterogeneous multicores. This allows a C or Fortran compiler with OpenMP support to generate parallel machine code for the target multicore. Additionally, using the OSCAR API analyzer allows a sequential-only compiler without OpenMP support to generate machine code for each core separately, which is then linked into one parallel application. Overall, only minor configuration changes to the OSCAR compiler are needed to run and optimize OSCAR compiler-generated code on a specific platform. This paper evaluates the performance of OSCAR compiler-generated code on different modern SMP multicore processors, including Intel and AMD x86 processors, an Arm processor, and a RISC-V processor, using scientific and multimedia benchmarks in C and Fortran. The results show promising speedups on all platforms, such as a speedup of 7.16 for the swim program of the SPEC2000 benchmarks on an 8-core Intel x86 processor, a speedup of 9.50 for the CG program of the NAS parallel benchmarks on 8 cores of an AMD x86 processor, a speedup of 3.70 for the BT program of the NAS parallel benchmarks on a 4-core RISC-V processor, and a speedup of 2.64 for …
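A hedged sketch of the code shape this workflow implies: the parallelizer emits portable C in which one OpenMP parallel region hosts statically scheduled coarse-grain tasks per thread, and any OpenMP-capable native compiler lowers it to machine code. The macro_task_* bodies and the id-based dispatch below are illustrative, not literal OSCAR output.

/* Illustrative shape of parallelizer-emitted portable code: one
 * parallel region, coarse-grain tasks statically mapped by thread id.
 * macro_task_A/B/C are invented placeholders. */
#include <omp.h>

extern void macro_task_A(void);   /* coarse-grain tasks found by the */
extern void macro_task_B(void);   /* compiler's dependence analysis  */
extern void macro_task_C(void);

void generated_main_region(void)
{
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            macro_task_A();       /* core 0's static schedule */
            macro_task_C();
        } else {
            macro_task_B();       /* core 1's static schedule */
        }
    }                             /* implicit barrier joins the cores */
}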
The parallelizing compiler for the B-HIVE loosely-coupled multiprocessor system uses a medium-grain model to minimize communication overhead. A medium-grain model is shown to be an optimal way of merging fine-grain operations into parallel tasks such that the parallelism obtained at the fine-grain level is retained while communication overhead is decreased. A new communication model is introduced in this paper, allowing additional overlap between computation and communication. Simulation results indicate that the medium-grain communication model shows promise for automatic parallelization on a loosely-coupled multiprocessor system.
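The extra overlap the new communication model allows can be pictured with nonblocking message passing: start the transfer, compute on data that does not depend on it, and block only when the incoming message is actually needed. MPI is used below as a modern stand-in for B-HIVE's primitives, and the boundary/interior work split is invented.

/* Overlapping computation with communication; MPI nonblocking calls
 * stand in for B-HIVE's primitives, the work split is invented. */
#include <mpi.h>

void step(double *interior, int n_in, double *boundary, int n_b, int peer)
{
    MPI_Request sreq, rreq;
    double recv_buf[64];

    /* Start the transfer early... */
    MPI_Isend(boundary, n_b, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &sreq);
    MPI_Irecv(recv_buf, 64, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &rreq);

    /* ...and compute on data that does not need the message while it
       is in flight, hiding the communication latency. */
    for (int i = 0; i < n_in; i++)
        interior[i] *= 0.5;

    /* Block only when the incoming data is actually required. */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    for (int i = 0; i < n_b && i < 64; i++)
        boundary[i] += recv_buf[i];
}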