An effective data-parallel programming environment will use a variety of tools that support the development of efficient data-parallel programs while insulating the programmer from the intricacies of the explicitly parallel code.
Modern optimizing compilers use several passes over a program's intermediate representation to generate good code. Many of these optimizations exhibit a phase-ordering problem. Getting the best code may require iterating optimizations until a fixed point is reached. Combining these phases can lead to the discovery of more facts about the program, exposing more opportunities for optimization. This article presents a framework for describing optimizations. It shows how to combine two such frameworks and how to reason about the properties of the resulting framework. The structure of the framework provides insight into when a combination yields better results. To make the ideas more concrete, this article presents a framework for combining constant propagation, value numbering, and unreachable-code elimination. It is an open question as to what other frameworks can be combined in this way.
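The interaction the abstract describes can be seen in a toy Python sketch (this is an illustration of the general idea, not the paper's actual framework): when constant propagation discovers that a branch condition folds to a constant, only one successor block is marked reachable, and skipping the unreachable block in turn keeps its assignments from polluting the constant facts. The IR shape and names here are hypothetical, and the sketch deliberately omits the lattice meet/join a real analysis needs when two reachable paths assign different constants.

```python
# Toy combined constant propagation + unreachable-code elimination.
# Each block maps to (assignments, branch), where branch is either None
# or (cond_var, true_target, false_target). All names are illustrative.
blocks = {
    "entry": ([("x", 1)], ("x", "then", "else")),  # x = 1; if x: then else: else
    "then":  ([("y", 2)], None),                   # y = 2
    "else":  ([("y", 3)], None),                   # never reached once x folds to 1
}

def analyze(blocks, entry="entry"):
    consts, reachable, work = {}, set(), [entry]
    while work:
        b = work.pop()
        if b in reachable:
            continue
        reachable.add(b)
        stmts, branch = blocks[b]
        for var, val in stmts:
            consts[var] = val           # record the constant fact
        if branch:
            cond, t, f = branch
            if cond in consts:
                # Condition is a known constant: only one arm is reachable,
                # so the dead arm's assignments never reach `consts`.
                work.append(t if consts[cond] else f)
            else:
                work.extend([t, f])     # unknown condition: both arms live
    return consts, reachable

consts, reachable = analyze(blocks)
```

Run separately, each phase would be weaker: constant propagation alone would see both assignments to y, and dead-code elimination alone could not prove the branch dead. Combined, the analysis concludes y == 2.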
PC hardware doubles in processing power every two years, or with each new generation, at approximately constant price. But software has not kept pace. Sixteen-bit code developed in the 1980s or early '90s may be slowed two to 20 times by I/O bottlenecks like VGA graphics, artificial data dependencies, poor memory use, obsolete compilers and libraries, and a host of other factors. Software can be designed to scale more readily with greater hardware power, but programmers typically do not profile their code unless it runs "too slowly." Modern compilers produce excellent executables, but developers must choose the best settings of compiler switches, profilers, and optimized runtime libraries. They must also understand the intricacies and idiosyncrasies of the target hardware, in this case the Intel 486, Pentium, and Pentium Pro processors, and the new MMX technology. They must also consider what types of algorithms lend themselves to optimization and what code optimization techniques are most effective. We consider each of these issues before describing a profiling tool called VTune.
It is demonstrated that optimization techniques incorporated within a silicon compiler for read-only memories (ROMs) can achieve significant yield, power, and speed improvements by minimizing the number of transistors, drains, and metal interconnections in the ROM. Transistor minimization adopts a heuristic solution to the NP-complete graph partitioning problem with a powerful technique applicable to various ROM design styles and technologies. If diffusion mask personalization is permitted, the design can be further improved by solving the traveling salesman problem to minimize transistor source/drain regions. In table look-up ROMs compiled for 3-µm and 1.2-µm CMOS with diffusion mask programming, the compiler eliminated over 45% of the transistors and drains. Test results show that 3-µm CMOS ROMs have access times between 50 and 70 ns. ROMs with 1.2-µm features achieve simulated access times below 20 ns. A simple interface allows the optimizing compiler to work easily with other CAD tools such as microcode assemblers.
This embedded tool suite lets users make architectural changes to a programmable DSP core on three levels and supports designer-defined instructions and computation units. The entire system is based on configurability through a file-based resource description that drives all the design tools.
Focuses on the classification scheme and retrieval problem of computer software for reusability in Japan. Development of the technique for reusing software components; steps of code reuse; levels of reuse of components.
The author discusses the bottlenecks that impair performance of a computer system and discusses the success of the RISC (reduced-instruction-set computer) approach. He attributes it, at least in part, to the fact that all the seminal work on the RISC chips was carried out in close conjunction with a strong compiler team. He discusses issues that designers of computer systems must consider and examines trends that will affect the optimum design points for future systems. The author then addresses what he refers to as 'soggy software', i.e. the slow pace of progress in software development as compared to hardware, identifying standardization and reuse as necessary components of any solution to the problem.
The Fortran I compiler was the first demonstration that it is possible to automatically generate efficient machine code from high-level languages. It has thus been enormously influential. This article presents a brief description of the techniques used in the Fortran I compiler for the parsing of expressions, loop optimization, and register allocation.
This paper describes a tiling technique that can be used by application programmers and optimizing compilers to obtain I/O-efficient versions of regular scientific loop nests. Due to the particular characteristics of I/O operations, a straightforward extension of the traditional tiling method to I/O-intensive programs may result in poor I/O performance. Therefore, the technique presented in this paper adapts iteration space tiling for I/O-performing loop nests to deliver high I/O performance. The generated code results in huge savings in the number of I/O calls as well as the volume of data transferred between the disk subsystem and main memory. Our experimental results on the IBM SP-2 distributed-memory message-passing multiprocessor demonstrate that the reduction in these two parameters, namely, the number of I/O calls and the transferred data volume, can lead to a marked decrease in overall execution times of I/O-intensive loop nests. In a number of loop nests extracted from several benchmarks and math libraries, we were able to improve the execution times by an average of 42.5% for one data set and by an average of 47.4% for another.
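The core effect the abstract claims, fewer and larger I/O requests after tiling, can be illustrated with a toy Python model (not the paper's algorithm; `read_row_chunk`, the array size N, and the tile size B are all hypothetical stand-ins, and the I/O is simulated by a counter rather than real disk reads):

```python
# Toy model: count simulated I/O calls for element-at-a-time access
# versus B x B tiled access over an N x N row-major array on disk.
N, B = 8, 4
io_calls = 0

def read_row_chunk(i, j0, j1):
    """Stand-in for a real read(): one I/O call per contiguous chunk of row i."""
    global io_calls
    io_calls += 1
    return [0.0] * (j1 - j0)

# Untiled access pattern: one tiny read per element -> N*N calls.
io_calls = 0
for i in range(N):
    for j in range(N):
        read_row_chunk(i, j, j + 1)
naive_calls = io_calls

# Tiled access pattern: one read per row segment of each B x B tile
# -> (N/B) * (N/B) * B calls, each B elements wide.
io_calls = 0
for ii in range(0, N, B):
    for jj in range(0, N, B):
        for i in range(ii, ii + B):
            read_row_chunk(i, jj, jj + B)
tiled_calls = io_calls
```

With N = 8 and B = 4 the untiled loop issues 64 one-element reads while the tiled loop issues 16 four-element reads, moving the same data in far fewer calls, which is the kind of saving the paper measures at much larger scales.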
Basic block reordering is an important step for profile-guided binary optimization. The state-of-the-art goal for basic block reordering is to maximize the number of fall-through branches. However, we demonstrate that such orderings may impose suboptimal performance on instruction and I-TLB caches. We propose a new algorithm that relies on a model combining the effects of fall-through and caching behavior. Because the details of modern processor caching are quite complex and often unknown, we show how to use machine learning in selecting parameters that best trade off different caching effects to maximize binary performance. An extensive evaluation on a variety of applications, including Facebook production workloads, the open-source compilers Clang and GCC, and SPEC CPU benchmarks, indicates that the new method outperforms existing block reordering techniques, improving the resulting performance of applications with large code size. We have open-sourced the code of the new algorithm as part of a post-link binary optimization tool, BOLT.
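The fall-through-maximizing baseline that this abstract improves upon can be sketched as a greedy chain-merging heuristic in Python (a simplified illustration in the spirit of classic profile-guided layout, not BOLT's cache-aware model; the CFG, edge weights, and block names are invented for the example):

```python
# Greedy fall-through chaining: visit profiled branch edges from hottest to
# coldest, and merge the chain ending at the edge's source with the chain
# starting at its target, so the hot edge becomes a fall-through in the layout.
edges = {("A", "B"): 100, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
blocks = ["A", "B", "C", "D"]

chains = {b: [b] for b in blocks}   # every block starts in its own chain
head = {b: b for b in blocks}       # block -> head of the chain containing it

for (src, dst), _w in sorted(edges.items(), key=lambda e: -e[1]):
    cs, cd = head[src], head[dst]
    if cs == cd:
        continue                    # already in one chain; merging would cycle
    # Merge only if src ends its chain and dst begins its chain,
    # so dst can literally fall through from src in the final layout.
    if chains[cs][-1] == src and chains[cd][0] == dst:
        chains[cs].extend(chains[cd])
        for b in chains[cd]:
            head[b] = cs
        del chains[cd]

layout = [b for chain in chains.values() for b in chain]
```

On this toy profile the heuristic emits the layout A, B, D, C, realizing the two hottest edges (weights 100 and 90) as fall-throughs. The paper's point is that maximizing this count alone ignores I-cache and I-TLB locality, which its learned, cache-aware model accounts for.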