High-level synthesis (HLS) allows hardware to be produced directly from a behavioral description in C/C++, thus accelerating the design process. Loop pipelining is a key transformation of HLS, as it improves the throughput of the design at the price of a small hardware overhead. However, for small loops, its use often results in poor hardware utilization due to the pipeline latency overhead. Overlapping the iterations of the whole loop nest, instead of only those of the innermost loop, is a way to overcome this difficulty, but currently available techniques are restricted to perfectly nested loops with constant bounds, involving only uniform dependences. Using the polyhedral model, we extend the applicability of the nested-loop-pipelining transformation by proposing a new legality check and a new loop correction technique, called polyhedral bubble insertion. This method was implemented in a source-to-source compiler targeting HLS, and results on benchmark kernels show that polyhedral bubble insertion is effective in practice on a much larger class of loop nests.
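To make the latency-overhead problem concrete, below is a minimal HLS-style C++ sketch; it assumes Vivado/Vitis HLS pragma conventions and a placeholder loop body, and is not code from the paper. Pipelining only a short innermost loop pays the pipeline fill/drain latency on every outer iteration, while pipelining at the nest level amortizes that cost; polyhedral bubble insertion extends this nest-level overlapping to loops that pragma-based approaches cannot handle (imperfect nests, parametric bounds, non-uniform dependences).

// Hypothetical HLS kernel (not taken from the paper); pragma syntax follows
// Vivado/Vitis HLS conventions.
void scale_inner(const float a[64][4], float b[64][4]) {
  for (int i = 0; i < 64; ++i) {
    for (int j = 0; j < 4; ++j) {
#pragma HLS PIPELINE II=1
      // With a trip count of only 4, the pipeline fill/drain latency is paid
      // on every outer iteration and dominates the schedule.
      b[i][j] = 2.0f * a[i][j];
    }
  }
}

void scale_nest(const float a[64][4], float b[64][4]) {
  for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
    // Pipelining the outer loop overlaps iterations of the whole nest, so the
    // latency is amortized over 64*4 iterations instead of 4.
    for (int j = 0; j < 4; ++j) {
      b[i][j] = 2.0f * a[i][j];
    }
  }
}

The legality check proposed in the paper determines when such whole-nest overlapping respects all dependences; where it does not, the correction step inserts "bubbles" so that the pipelined schedule stays correct.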
Performance and scalability optimization of large HPC applications is currently a labor-intensive, manual process with very low productivity. Major difficulties come from the disaggregated environment for HPC application development: the compiler is only involved in local decisions (core or multithreaded domain), while a library-based, communication-oriented programming model realizes whole-machine parallelism. Realizing any major global change in such a disaggregated environment is very difficult and involves changing large portions of the source code. We present semi-automated techniques, based on structural analysis and rewriting, for performing global transformations on an HPC application source code. We present two case studies using the Self-Consistent Field (SCF) standalone benchmark as well as the Coupled Cluster (CCSD) module (2.9 million lines of Fortran code), a key module of the NWChem computational chemistry application. We demonstrate how structural rewriting techniques can be used to automate transformations that affect multiple sections of the application's source code. We show that the transformations can be applied in a systematic fashion across the source code bases with minimal manual effort. These transformations improve the scalability of the SCF benchmark by more than two orders of magnitude and the performance of the full CCSD module by a factor of four.
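As a rough, hypothetical illustration of the "apply one change everywhere" aspect (the rule, file extension, and routine names below are invented; the paper's tooling rewrites program structure rather than raw text), here is a C++17 sketch that walks a source tree and applies a single rewrite rule to every Fortran file it finds:

// Hypothetical sketch: apply one textual rewrite rule across an entire source
// tree. Structural rewriting, as described in the paper, matches program
// structure rather than regexes; this only shows the systematic application.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: rewrite <src-root>\n"; return 1; }
  // Placeholder rule: rename a hypothetical routine old_gather -> new_gather.
  const std::regex rule(R"(\bold_gather\b)");
  for (const auto& entry : fs::recursive_directory_iterator(argv[1])) {
    if (!entry.is_regular_file() || entry.path().extension() != ".F") continue;
    std::ifstream in(entry.path());
    std::stringstream buf;
    buf << in.rdbuf();
    in.close();
    std::ofstream out(entry.path());
    out << std::regex_replace(buf.str(), rule, "new_gather");
  }
  return 0;
}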
ISBN (print): 9783642152764
Large and complex systems of ordinary differential equations (ODEs) arise in diverse areas of science and engineering, and pose special challenges on a streaming processor owing to the large amount of state they manipulate. We describe a set of domain-specific source transformations on CUDA C that improved performance by 6.7x on a system of ODEs arising in cardiac electrophysiology running on the nVidia GTX-295, without requiring expert knowledge of the GPU. Our transformations should apply to a wide range of reaction-diffusion systems.
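As one hedged example of the kind of transformation that matters for large ODE state on a streaming processor (the cell model and the choice of transformation are assumptions for illustration, not the paper's actual set), converting the per-cell state from an array-of-structs to a struct-of-arrays layout makes consecutive cells touch consecutive addresses, the access pattern GPUs coalesce well:

#include <cstddef>
#include <vector>

// Hypothetical two-variable cell model; real cardiac models carry tens of
// state variables per cell, which is why layout and state volume matter.
struct CellAoS { double v, w; };  // array-of-structs layout, shown for contrast only

struct StateSoA {                 // struct-of-arrays: each variable is
  std::vector<double> v, w;       // contiguous across all cells
  explicit StateSoA(std::size_t n) : v(n), w(n) {}
};

// Explicit-Euler step over the SoA layout with placeholder kinetics.
void step(StateSoA& s, double dt) {
  for (std::size_t i = 0; i < s.v.size(); ++i) {
    const double dv = s.w[i] - s.v[i] * s.v[i] * s.v[i];
    const double dw = -0.1 * s.w[i] + 0.05 * s.v[i];
    s.v[i] += dt * dv;
    s.w[i] += dt * dw;
  }
}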
ISBN (print): 1595936327
We build on prior work on intra-array memory reuse, for which a general theoretical framework was proposed based on lattice theory. Intra-array memory reuse is a way of reducing the size of a temporary array by folding, thanks to affine mappings and modulo operations, reusing memory locations when they contain a value not used later. We describe the algorithms needed to implement such a strategy. Our implementation has two parts. The first part, Bee, uses the source-to-source transformer ROSE to extract from the program all necessary information on the lifetime of array elements and to generate the code after memory reduction. The second part, Cl@k, is a stand-alone mathematical tool dedicated to optimizations on polyhedra, in particular the computation of successive minima and the computation of good admissible lattices, which are the basis for lattice-based memory reuse. Both tools are developed in C++ and use linear programming and polyhedra manipulations. They can be used either for embedded program optimizations, e.g., to limit memory expansion introduced for parallelization, or in high-level synthesis, e.g., to design memories between communicating hardware accelerators.
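A minimal hand-written example of the folding the abstract describes (the modulo mapping here is chosen by hand; Bee and Cl@k derive such mappings automatically and handle the general lattice case): when each tmp[i] is last read at iteration i+1, two memory cells suffice for the whole temporary array.

#include <cstddef>
#include <vector>

// Before folding: a full temporary array of size n.
std::vector<double> before(const std::vector<double>& a) {
  const std::size_t n = a.size();
  std::vector<double> tmp(n), out(n);
  for (std::size_t i = 0; i < n; ++i) tmp[i] = 2.0 * a[i];               // produce
  for (std::size_t i = 0; i + 1 < n; ++i) out[i] = tmp[i] + tmp[i + 1];  // consume
  return out;
}

// After folding with the mapping tmp[i] -> tmp[i % 2]: tmp[i] is dead once
// iteration i+1 has read it, so two cells are reused for the whole array.
std::vector<double> after(const std::vector<double>& a) {
  const std::size_t n = a.size();
  std::vector<double> out(n);
  double tmp[2];
  for (std::size_t i = 0; i < n; ++i) {
    tmp[i % 2] = 2.0 * a[i];                                  // produce into folded cell
    if (i >= 1) out[i - 1] = tmp[(i - 1) % 2] + tmp[i % 2];   // consume both live values
  }
  return out;
}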
ISBN (print): 9780769543475
CheckPointer is a memory access validator for checking spatial and temporal pointer usage errors in multi-threaded applications by tracking metadata and validating pointer dereferences at run time. The tool uses source-to-source transformations implemented with DMS to instrument the source code of the application to be validated with metadata checks. Libraries available only in binary form are handled by using function wrappers that check metadata immediately before calling a library function and update metadata as necessary immediately after the library function returns.
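A rough sketch of the wrapper idea for a binary-only library routine (the metadata table and helper names below are invented for illustration and are not CheckPointer's actual instrumentation, which is generated with DMS): before delegating to the real memcpy, the wrapper checks that both pointers lie inside tracked allocations large enough for the copy.

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <map>

// Hypothetical metadata table: allocation base address -> allocation size.
// A real checker tracks much more (lifetimes, per-pointer bounds, threads).
static std::map<const void*, std::size_t> g_allocs;

void* tracked_malloc(std::size_t n) {  // record metadata at allocation time
  void* p = std::malloc(n);
  if (p) g_allocs[p] = n;
  return p;
}

static bool in_bounds(const void* p, std::size_t n) {
  auto it = g_allocs.upper_bound(p);
  if (it == g_allocs.begin()) return false;   // no allocation at or below p
  --it;                                       // nearest allocation base <= p
  const char* base = static_cast<const char*>(it->first);
  const char* ptr  = static_cast<const char*>(p);
  return ptr + n <= base + it->second;        // access stays inside the block
}

// Wrapper called in place of a direct call to the binary-only memcpy.
void* checked_memcpy(void* dst, const void* src, std::size_t n) {
  if (!in_bounds(dst, n) || !in_bounds(src, n)) {
    std::fprintf(stderr, "memcpy: out-of-bounds access of %zu bytes\n", n);
    std::abort();
  }
  return std::memcpy(dst, src, n);
}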
We describe a compilation system for the concurrent programming language Program Composition Notation (PCN). This notation provides a single-assignment programming model that permits concurrent-programming concerns such as decomposition, communication, synchronization, mapping, granularity, and load balancing to be addressed separately in a design. PCN is also extensible with programmer-defined operators, allowing common abstractions to be encapsulated and reused in different contexts. The compilation system incorporates a concurrent-transformation system that allows abstractions to be defined through concurrent source-to-source transformations; these convert programmer-defined operators into a core notation. Run-time system techniques allow the core notation to be compiled into a simple concurrent abstract machine which can be implemented in a portable fashion using a run-time library. The abstract machine provides a uniform treatment of single-assignment and mutable data structures, allowing data sharing between concurrent and sequential program segments and permitting integration of sequential C and Fortran code into concurrent programs. This compilation system forms part of a program development toolkit that operates on a wide variety of networked workstations, multicomputers, and shared-memory multiprocessors. The toolkit has been used both to develop substantial applications and to teach introductory concurrent-programming classes, including a freshman course at Caltech.
Describes a technique for associating rewrite rules with grammar productions so that many high-level transformations of a source file can be generated easily. The approach is shown to be powerful enough to deal with a wide class of problems arising from practical applications, and a notation, with associated language-processing tools, is constructed for annotating a parse tree with transformation rules.
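A toy illustration of attaching a rewrite rule to a production (the tree representation and the rule are invented here, not the paper's notation): an "add" node whose right operand is the literal 0 rewrites to its left operand, and the rule fires wherever the production matched during a bottom-up walk.

#include <iostream>
#include <memory>
#include <string>

// Tiny expression tree: leaves are variables or integer literals,
// interior nodes come from the "add" production.
struct Node {
  std::string op;                 // "var", "int", or "add"
  std::string name;               // for "var"
  int value = 0;                  // for "int"
  std::shared_ptr<Node> lhs, rhs; // for "add"
};

using P = std::shared_ptr<Node>;

P var_(std::string n) { auto p = std::make_shared<Node>(); p->op = "var"; p->name = std::move(n); return p; }
P int_(int v)         { auto p = std::make_shared<Node>(); p->op = "int"; p->value = v; return p; }
P add_(P a, P b)      { auto p = std::make_shared<Node>(); p->op = "add"; p->lhs = a; p->rhs = b; return p; }

// Rewrite rule attached to the "add" production:  E + 0  =>  E.
// Applied bottom-up so nested occurrences are simplified first.
P rewrite(P n) {
  if (n->op == "add") {
    n->lhs = rewrite(n->lhs);
    n->rhs = rewrite(n->rhs);
    if (n->rhs->op == "int" && n->rhs->value == 0) return n->lhs;
  }
  return n;
}

void print(const P& n) {
  if (n->op == "var") std::cout << n->name;
  else if (n->op == "int") std::cout << n->value;
  else { std::cout << "("; print(n->lhs); std::cout << " + "; print(n->rhs); std::cout << ")"; }
}

int main() {
  P e = add_(add_(var_("x"), int_(0)), var_("y"));  // (x + 0) + y
  print(rewrite(e));                                // prints: (x + y)
  std::cout << "\n";
}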