Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some ...
详细信息
Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or execute ad-hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers. In this article, we present a so-called basteln (an abbreviation for backward slicing of tiled loop nests) strategy in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests by preserving their parallelism/tilability and next performs rectangular/parallelogram tiling to the output groups that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upward exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm for enhancing data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computations and the fusion of independent groups, exhibiting a general applicability. We integrate the basteln strategy into two optimizing compilers, with one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8x over cuBLAS/cuDNN and 1.1x over TVM on GPU when used to optimize deep learning models
The role of computer algebra systems (CASs) is not limited to analyzing and solving mathematical and physical problems. They have also been used as tools in the development process of computer programs, starting from ...
详细信息
The role of computer algebra systems (CASs) is not limited to analyzing and solving mathematical and physical problems. They have also been used as tools in the development process of computer programs, starting from the specification and ending with the coding and testing phases. In this way one can exploit their powerful mathematical capacity during the development phases and, by the other way, take advantage of the speed performance of languages such as FORTRAN or C in the implementation. Among the mathematical features of CASs there are transformations allowing one to optimize the final code instructions. In this paper we show some kind of optimizations that can be done on new or existing algorithms, by extending some techniques that compilers apply currently to optimize the machine code. The results show that the CPU time taken by the optimized code is reduced by a factor that can reach 5. The optimizations are performed with a package built on a well known CAS: Mathematica. (c) 2005 Elsevier B.V. All rights reserved.
optimizing compilers have become an essential component in achieving high levels of performance. Various simple and sophisticated optimizations are implemented at different stages of compilation to yield significant i...
详细信息
optimizing compilers have become an essential component in achieving high levels of performance. Various simple and sophisticated optimizations are implemented at different stages of compilation to yield significant improvements, but little work has been done in characterizing the effectiveness of optimizers, or in understanding where most of this improvement comes from. In this paper we study the performance impact of optimization in the context of our methodology for CPU performance characterization based on the abstract machine model. The model considers all machines to be different implementations of the same high level language abstract machine;in previous research, the model has been used as a basis to analyze machine and benchmark performance. In this paper, we show that our model can be extended to characterize the performance improvement provided by optimizers and to predict the run time of optimized programs, and measure the effectiveness of several compilers in implementing different optimization techniques.
Historically, compilers have operated by applying a fixed set of optimizations in a predetermined order. We call such an ordered list of optimizations a compilation sequence. This paper describes a prototype system th...
详细信息
Historically, compilers have operated by applying a fixed set of optimizations in a predetermined order. We call such an ordered list of optimizations a compilation sequence. This paper describes a prototype system that uses biased random search to discover a program-specific compilation sequence that minimizes an explicit, external objective function. The result is a compiler framework that adapts its behavior to the application being compiled, to the pool of available transformations, to the objective function, and to the target machine. This paper describes experiments that attempt to characterize the space that the adaptive compiler must search. The preliminary results suggest that optimal solutions are rare and that local minima are frequent. If this holds true, biased random searches, such as a genetic algorithm, should find good solutions more quickly than simpler strategies, such as hill climbing.
compilers employ many aggressive code transformations to achieve highly optimized code. However, because of complex target architectures and unpredictable optimization interactions, these transformations may not alway...
详细信息
ISBN:
(纸本)9780769535760
compilers employ many aggressive code transformations to achieve highly optimized code. However, because of complex target architectures and unpredictable optimization interactions, these transformations may not always be beneficial. Current analysis methods measure performance at the application level and ignore optimization effects at the function and loop level. To better measure and understand these effects, we present OptiScope, a compiler independent tool to identify performance opportunities by comparing programs built with different compilers or optimization fags. The analysis includes hundreds of different metrics and uses a novel loop correlation technique for binary programs (produced from the same source by different compilers) to isolate measurements to specific regions. We present several case studies using OptiScope to identify key differences between different compiler suites, versions, and target architectures. The examples demonstrate performance improvement opportunities between 32.5% to 893% on select regions of SPEC 2006 benchmarks.
Historically, compilers have operated by applying a fixed set of optimizations in a predetermined order. We call such an ordered list of optimizations a compilation sequence. This paper describes a prototype system th...
详细信息
Historically, compilers have operated by applying a fixed set of optimizations in a predetermined order. We call such an ordered list of optimizations a compilation sequence. This paper describes a prototype system that uses biased random search to discover a program-specific compilation sequence that minimizes an explicit, external objective function. The result is a compiler framework that adapts its behavior to the application being compiled, to the pool of available transformations, to the objective function, and to the target machine. This paper describes experiments that attempt to characterize the space that the adaptive compiler must search. The preliminary results suggest that optimal solutions are rare and that local minima are frequent. If this holds true, biased random searches, such as a genetic algorithm, should find good solutions more quickly than simpler strategies, such as hill climbing.
The authors discuss the user programming model for the scalable processor architecture (SPARC) and the optimizing compilers for the SPARC-based Sun-4 workstations. They are concerned with two broad areas: how the comp...
详细信息
The authors discuss the user programming model for the scalable processor architecture (SPARC) and the optimizing compilers for the SPARC-based Sun-4 workstations. They are concerned with two broad areas: how the compilers use the architecture and the design of the compilers themselves. They discuss the registers, synthesized instructions, tagged data support, the compilers, the global and peephole optimizers, and compiler performance analysis.< >
The first optimizing compiler was developed at IBM in order to prove that high level language programming could be as efficient as hand-coded machine language. Computer architecture and compiler optimization interacte...
详细信息
The first optimizing compiler was developed at IBM in order to prove that high level language programming could be as efficient as hand-coded machine language. Computer architecture and compiler optimization interacted through a feedback loop, from the high-level language computer architectures of the 1970s to the RISC machines of the 1980s. In the supercomputing community, the availability of effective vectorizing compilers delivered easy-to-use performance in the 1980s to the present. These compilers were successful at least in part because they could predict poor performance spots in the program and report these to users. This fostered a feedback loop between programmers and compilers to develop high performance programs. Future optimizing compilers for high performance computers and supercomputers will have to take advantage of both feedback loops.
Because they are interpreted, Java executables run slower than their compiled counterparts. The native executable translation (NET) compiler's objective is to optimize the translation of Java bytecode to native ma...
详细信息
Because they are interpreted, Java executables run slower than their compiled counterparts. The native executable translation (NET) compiler's objective is to optimize the translation of Java bytecode to native machine code so that it runs nearly as fast as native code generated directly from a source. The article presents some preliminary results for several large application programs and standard benchmarks. It compares the NET-compiled code performance with Sun's Java VM, Microsoft's Java just-in-time compiler, and equivalent C and C++ programs directly compiled. The results show that the optimizing NET compiler is capable of achieving better performance than the two other bytecode execution methods, in some cases achieving speeds comparable to directly compiled native code.
暂无评论