While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model...
详细信息
ISBN:
(纸本)9781595938602
While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming models execute as a set of independent actors that explicitly communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding and partitioning step based on integer linear programming is presented that unfolds data parallel actors as needed and maximally packs actors onto cores. Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores. To facilitate experimentation, a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented. For a range of streaming applications, a geometric mean speedup of 14.7x is achieved on a 16-core Cell platform compared to a single core.
We present a new pointer and escape analysis. Instead of analyzing the whole program, the algorithm incrementally analyzes only those parts of the program that may deliver useful results. An analysis policy monitors t...
详细信息
ISBN:
(纸本)9781581134148
We present a new pointer and escape analysis. Instead of analyzing the whole program, the algorithm incrementally analyzes only those parts of the program that may deliver useful results. An analysis policy monitors the analysis results to direct the incremental investment of analysis resources to those parts of the program that offer the highest expected optimization return. Our experimental results show that almost all of the objects are allocated at a small number of allocation sites and that an incremental analysis of a small region of the program surrounding each site can deliver almost all of the benefit of a whole-program analysis. Our analysis policy is usually able to deliver this benefit at a fraction of the whole-program analysis cost.
Modulo scheduling algorithms based on optimal solvers have been proposed to investigate and tune the performance of modulo scheduling heuristics. While recent advances have broadened the scope for which the optimal ap...
详细信息
Modulo scheduling algorithms based on optimal solvers have been proposed to investigate and tune the performance of modulo scheduling heuristics. While recent advances have broadened the scope for which the optimal approach is applicable, this approach increasingly suffers from large execution times. In this paper, we propose a more efficient formulation of the modulo scheduling space that significantly decreases the execution time of solvers based on integer linear programs. For example, the total execution time is reduced by a factor of 8.6 when 782 loops from the Perfect Club, SPEC, and Livermore Fortran Kernels are scheduled for minimum register requirements using the more efficient formulation instead of the traditional formulation. Experimental evidence further indicates that significantly larger loops can be scheduled under realistic machine constraints.
Traditional data-oriented programminglanguages such as dataflow languages and stream languages provide a natural abstraction for parallel programming. In these languages, a developer focuses on the flow of data throu...
详细信息
ISBN:
(纸本)9781450300193
Traditional data-oriented programminglanguages such as dataflow languages and stream languages provide a natural abstraction for parallel programming. In these languages, a developer focuses on the flow of data through the computation and these systems free the developer from the complexities of low-level, thread-oriented concurrency primitives. This simplification comes at a cost traditional data-oriented approaches restrict the mutation of state and, in practice, the types of data structures a program can effectively use. Bamboo borrows from work in typestate and software transactions to relax the traditional restrictions of data-oriented programming models to support mutation of arbitrary data structures. We have implemented a compiler for Bamboo which generates code for the TILEPro64 many-core processor. We have evaluated this implementation on six benchmarks: Tracking, a feature tracking algorithm from computer vision;KMeans, a K-means clustering algorithm;Monte Carlo, a Monte Carlo simulation;Filter Bank, a multi-channel filter bank;Fractal, a Mandelbrot set computation;and Series, a Fourier series computation. We found that our compiler generated implementations that obtained speedups ranging from 26.2 x to 61.6 x when executed on 62 cores.
Study shows that it is indeed possible to use declarative formulations in practical systems, when combined judiciously with the appropriate tools of evaluations. This paper describes the implementation of logical form...
详细信息
Study shows that it is indeed possible to use declarative formulations in practical systems, when combined judiciously with the appropriate tools of evaluations. This paper describes the implementation of logical formulations of two existing analysis techniques: groundness analysis of logic programs and strictness analysis of logic programs. The XSB system is used is used as the evaluation tool. Experimental evidence shows that the resultant groundness and strictness analysis system are practical in terms of both time and space.
Conventional dataflow analysis computes information about what facts may or will not hold during the execution of a program. Sometimes it is useful, for program optimization, to know how often or with what probability...
详细信息
Conventional dataflow analysis computes information about what facts may or will not hold during the execution of a program. Sometimes it is useful, for program optimization, to know how often or with what probability a fact holds true during program execution. In this paper, we provide a precise formulation of this problem for a large class of dataflow problems - the class of finite bi-distributive subset problems. We show how it can be reduced to a generalization of the standard dataflow analysis problem, one that requires a sum-over-all-paths quantity instead of the usual meet-overall-paths quantity. We show that Kildall's result expressing the meet-over-all-paths value as a maximal-fixed-point carries over to the generalized setting, We then outline ways to adapt the standard dataflow analysis algorithms to solve this generalized problem, both in the intraprocedural and the interprocedural case.
Over the last five years, graphics cards have become a tempting target for scientific computing, thanks to unrivaled peak performance, often producing a runtime speed-up of x10 to x25 over comparable CPU solutions. Ho...
详细信息
ISBN:
(纸本)9781450312059
Over the last five years, graphics cards have become a tempting target for scientific computing, thanks to unrivaled peak performance, often producing a runtime speed-up of x10 to x25 over comparable CPU solutions. However, this increase can be difficult to achieve, and doing so often requires a fundamental rethink. This is especially problematic in scientific computing, where experts do not want to learn yet another architecture. In this paper we develop a method for automatically parallelising recursive functions of the sort found in scientific papers. Using a static analysis of the function dependencies we identify sets - partitions - of independent elements, which we use to synthesise an efficient GPU implementation using polyhedral code generation techniques. We then augment our language with DSL extensions to support a wider variety of applications, and demonstrate the effectiveness of this with three case studies, showing significant performance improvement over equivalent CPU methods, and similar efficiency to hand-tuned GPU implementations.
In this paper we describe the design of a global machine independent low level optimizer for the Karlsruhe Ada Compiler. We give a short overview on the optimizations and data structures used in the optimizer as well ...
详细信息
This paper describes a method for compiling programs using interprocedural register allocation. A strategy for handling programs built from multiple modules is presented, as well as algorithms for global variable prom...
详细信息
ISBN:
(纸本)9780897913645
This paper describes a method for compiling programs using interprocedural register allocation. A strategy for handling programs built from multiple modules is presented, as well as algorithms for global variable promotion and register spill code motion. These algorithms attempt to address some of the shortcomings of previous interprocedural register allocation strategies. Results are given for an implementation on a single register file RISC-based architecture.
暂无评论