We present a classification scheme for array language primitives that quantifies the variation in parallelism and data locality that results from the fusion of any two primitives. We also present an algorithm based on this scheme that efficiently determines when it is beneficial to fuse any two primitives. Experimental results show that five LINPACK routines achieve a 50% performance improvement from the fusion of array operators.
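As a hedged illustration (not taken from the paper), the following C sketch shows what fusing two array primitives means at the loop level: the unfused version makes two passes over memory through a temporary array, while the fused version makes a single pass with better data locality and the same degree of data parallelism.

    #include <stddef.h>

    /* Unfused: two array primitives, two passes over memory,
     * with a temporary array between them. */
    void scale_then_add_unfused(const double *a, const double *b,
                                double *tmp, double *out, size_t n) {
        for (size_t i = 0; i < n; i++)      /* primitive 1: scale */
            tmp[i] = 2.0 * a[i];
        for (size_t i = 0; i < n; i++)      /* primitive 2: elementwise add */
            out[i] = tmp[i] + b[i];
    }

    /* Fused: one pass, no temporary, improved locality, unchanged parallelism. */
    void scale_then_add_fused(const double *a, const double *b,
                              double *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = 2.0 * a[i] + b[i];
    }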
This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in Intel (R) C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup of up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT-enabled single-CPU system and a 1.99 speedup on an HT-enabled dual-CPU system for the audio visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for the two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs. (c) 2005 Elsevier B.V. All rights reserved.
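A minimal, generic C/OpenMP sketch of nested parallelism is given below; it is not the Intel framework itself, and the thread counts and printed labels are illustrative only (e.g., an outer team per video slice and an inner team per macroblock row).

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_set_nested(1);              /* enable nested parallel regions */
        omp_set_max_active_levels(2);   /* allow two levels of parallelism */

        #pragma omp parallel num_threads(2)        /* outer level */
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(2)    /* inner level */
            {
                int inner = omp_get_thread_num();
                printf("outer %d, inner %d\n", outer, inner);
            }
        }
        return 0;
    }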
Barrier MIMDs are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMDs in that the need for run-time synchronization is significantly reduced. This work considers the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. We concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem.
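The following C sketch (an illustration of the constant-bound case, not code from the paper) shows the basic coalescing transformation and how the original loop indices are recovered from the coalesced index by division and remainder.

    #define N 4
    #define M 3

    /* Original doubly nested loop over constant bounds N x M. */
    void nested(int a[N][M]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                a[i][j] = i + j;
    }

    /* Coalesced form: a single loop of N*M iterations; the original
     * indices (i, j) are recovered from the coalesced index k. */
    void coalesced(int a[N][M]) {
        for (int k = 0; k < N * M; k++) {
            int i = k / M;
            int j = k % M;
            a[i][j] = i + j;
        }
    }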
ISBN (Print): 9781538674666
Programming language constructs generally operate on data words, and so does most compiler analysis and transformation. However, individual word-level operations often harbor pointless, yet resource and power hungry, lower-level operations. By transforming complete programs into gate-level operations on individual bits, and optimizing operations at that level, it is possible to dramatically reduce the total amount of work needed to execute the program's algorithm. This gate-level representation can be in terms of any complete set of logic gate types; earlier work targeted conventional multiplexor gates, but the work reported here centers on targeting CSWAP (Fredkin) gates without fanout - a form that can be implemented on a quantum computer. This paper will overview the approach, describe the current state of the prototype compiler, and suggest some ways in which compiler automatic parallelization technology might be extended to allow ordinary programs to take advantage of the unique properties of quantum computers.
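To make the gate concrete, here is a small classical C model of the Fredkin (CSWAP) gate, included as an illustrative assumption rather than anything from the prototype compiler: when the control bit is set, the two target bits are swapped; otherwise they pass through unchanged. The gate is reversible and conserves the number of 1 bits, which is what makes it a suitable primitive for the gate-level form described above.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { bool c, x, y; } cswap_t;   /* control and two targets */

    /* Controlled swap: if c is 1, exchange x and y; otherwise pass through. */
    static cswap_t cswap(cswap_t in) {
        cswap_t out = in;
        if (in.c) { out.x = in.y; out.y = in.x; }
        return out;
    }

    int main(void) {
        cswap_t r = cswap((cswap_t){ .c = true, .x = false, .y = true });
        printf("c=%d x=%d y=%d\n", r.c, r.x, r.y);   /* prints c=1 x=1 y=0 */
        return 0;
    }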
ISBN (Print): 9780769504254
Automatic parallelization of general-purpose programs is still not possible in general in the presence of irregular data structures and complex control flow. One promising strategy is thread-level data speculation (TLDS). Although TLDS removes the need to prove computations independent statically, studies show that applying TLDS blindly to programs with limited speculative parallelism may lead to performance degradation. Therefore, a promising approach is to combine TLDS with strong compiler analyses. The compiler can provide a guideline for where to speculate by "lazily" detecting some dependences and leaving the more dynamic dependences to be detected at runtime. Furthermore, transformations can be applied to eliminate some of the dependences detected by the compiler to enhance speculative parallelism in the program. This paper proposes compiler techniques to implement this approach. In particular, we focus on general-purpose Java programs that make extensive use of containers, i.e., general-purpose aggregate data structures.
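As a hedged C illustration of the kind of loop TLDS targets (the paper itself considers Java programs with containers), the sketch below shows a loop whose cross-iteration dependences exist only for some runtime contents of an index array, so the compiler cannot prove them absent statically; under TLDS each iteration would run speculatively and the runtime would squash and re-execute iterations whose accesses actually conflict.

    #include <stddef.h>

    /* Whether iterations conflict depends on the runtime contents of idx[]:
     * a[idx[i]] may or may not alias a[idx[j]] for i != j, so independence
     * cannot be proven statically. This is the pattern speculation exploits. */
    void update(double *a, const size_t *idx, const double *delta, size_t n) {
        for (size_t i = 0; i < n; i++) {
            a[idx[i]] += delta[i];
        }
    }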