ISBN: 9781538661956 (print)
Loop bounds are often unknown until run time, making it difficult to analyze non-functional properties such as latency at compile time. Similarly, static allocations of processing resources to loop computations might be too conservative with respect to given performance requirements, or suboptimal with respect to energy consumption. To still satisfy requirements when accelerating loop nests under this uncertainty of loop bounds, we formalize and propose an approach to run-time requirement enforcement: at run time, select a mapping among a set of candidates that satisfies a given set of requirements while optimizing secondary objectives. Because the candidate search space of suitable mappings might be prohibitively large to evaluate at run time, we further introduce two approaches to reduce its cardinality: 1) architecture-specific reduction by solving for parts of the mapping from the requirements, and 2) design-time reduction by finding a k-subset of mappings that maximizes the number of loop bounds for which the requirements are satisfied. We implemented our proposed run-time requirement enforcement techniques for a representative class of programmable processor array architectures called tightly coupled processor arrays (TCPAs) and evaluate them in a case study, which shows that given latency requirements can be satisfied while saving up to 10% in energy.
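The run-time selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Mapping` class, the `select_mapping` function, and the toy latency and energy models are all hypothetical and chosen only to show the idea of picking, once the loop bound is known, a candidate that meets a latency requirement while minimizing energy as a secondary objective.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Mapping:
    """Hypothetical candidate mapping: a name plus simple cost models."""
    name: str
    latency: Callable[[int], float]  # loop bound -> latency (arbitrary units)
    energy: Callable[[int], float]   # loop bound -> energy (arbitrary units)


def select_mapping(candidates: Iterable[Mapping], n: int,
                   max_latency: float) -> Optional[Mapping]:
    """Return the lowest-energy candidate that meets the latency bound for n."""
    feasible = [m for m in candidates if m.latency(n) <= max_latency]
    if not feasible:
        return None  # no candidate satisfies the requirement for this bound
    return min(feasible, key=lambda m: m.energy(n))


# Toy models (assumed, not from the paper): more processing elements (PEs)
# reduce latency but increase energy.
CANDIDATES = [
    Mapping("2-PE", latency=lambda n: n / 2 + 4, energy=lambda n: 1.0 * n + 2),
    Mapping("4-PE", latency=lambda n: n / 4 + 6, energy=lambda n: 1.1 * n + 4),
    Mapping("8-PE", latency=lambda n: n / 8 + 10, energy=lambda n: 1.3 * n + 8),
]

if __name__ == "__main__":
    for bound, limit in [(16, 20), (64, 30), (64, 20), (64, 10)]:
        chosen = select_mapping(CANDIDATES, bound, limit)
        print(bound, limit, chosen.name if chosen else None)
```

Keeping the candidate list small matters because the selection runs at run time; the two reduction techniques named in the abstract address exactly that, and are not modeled here.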
The most efficient branch predictors exploit both global branch history and local history, but local history predictors introduce major design challenges. Drawing from recent work on multidimensional branch prediction, the authors introduce a practical, cost-effective mechanism for overcoming the challenges of managing local histories: the innermost-loop iteration counter.
ISBN: 9781467392860 (print)
Technology scaling has initiated two distinct trends that are likely to continue into the future: first, increased parallelism in hardware, and second, the increasing performance and energy cost of communication relative to computation. Both trends call for the development of compiler and runtime systems that automatically parallelize programs and reduce communication in parallel computations, achieving high performance in an energy-efficient fashion. In this paper, we propose the design of an integrated compiler and runtime system that auto-parallelizes loop nests to clusters, and a novel communication avoidance method that reduces data movement between processors. Communication minimization is achieved via data replication: data is replicated so that a larger share of the whole data set may be mapped to a processor and hence non-local memory accesses are reduced. Experiments on a number of benchmarks show the effectiveness of the approach.
ISBN: 9781467382007 (print)
Despite their popularity, distributed programs remain a major challenge for software verification. The need for methods that assure safe interactions in such software systems is recognized. In the last few years, several new approaches have been proposed to solve the problem. Recent works have focused on developing behavioral type systems to enforce the correct implementation of protocols, but these type systems are developed for languages with first-class primitives for linear communication channels and communication-oriented control flow. For general-purpose programming languages (GPLs), it is generally difficult to guarantee the correct implementation of a protocol. In this paper, we present an automated verification mechanism that ensures the correctness of a protocol implementation with respect to a session type specification. To support automatic verification, we design an entailment checking procedure that can handle the verification of a general-purpose imperative programming language. Our theory establishes a framework for semantically precise enforcement of protocols by extending the Separation Logic static analysis technique with a protocol verification mechanism.
ISBN: 9781479989379 (print)
In this paper, we present a compilation flow for HPC kernels on the REDEFINE coarse-grain reconfigurable architecture (CGRA). REDEFINE is a scalable macro-dataflow machine in which the compute elements (CEs) communicate through messages. REDEFINE offers the ability to exploit a high degree of coarse-grain and pipeline parallelism. The CEs in REDEFINE are enhanced with reconfigurable macro data-paths called HyperCells that enable exploitation of fine-grain and pipeline parallelism at the level of basic instructions in static dataflow order. Application kernels that exhibit regularity in computations and memory accesses, such as affine loop nests, benefit from the architecture of HyperCell [1], [2]. The proposed compilation flow aims at exposing a high degree of parallelism in loop nests in HPC application kernels using polyhedral analysis, and generates meta-data to effectively utilize the computational resources in HyperCells. Memory is explicitly managed with the compiler's assistance. We address compilation challenges such as partitioning with load balancing, mapping and scheduling of computations, and management of operand data when targeting multiple HyperCells in the REDEFINE architecture. The proposed solution scales well, meeting the performance objectives of HPC.
ISBN: 9781479989379 (print)
Many embedded applications such as multimedia, signal processing and wireless communications exhibit streaming processing behavior. In order to take full advantage of modern multi- and many-core embedded platforms, these applications have to be parallelized by describing them in a given parallel Model of Computation (MoC). One of the most prominent MoCs is the Kahn Process Network (KPN), as it can express multiple forms of parallelism and is suitable for efficient mapping and scheduling onto parallel embedded platforms. However, describing streaming applications manually as a KPN is a challenging task, especially since they spend most of their execution time in loops with an unbounded number of iterations. These loops are in several cases implemented as while loops, which are difficult to analyze. In this paper, we present an approach to guide the derivation of KPNs from embedded streaming applications dominated by multiple types of while loops. We evaluate the applicability of our approach on a commercial embedded platform with eight DSP cores using realistic benchmarks. Results measured on the platform show that we are able to speed up sequential benchmarks by an average factor of up to 4.3x and, in the best case, by up to 7.7x. Additionally, to evaluate the effectiveness of our approach, we compare it against a state-of-the-art parallelization framework.
ISBN: 9788360810668 (print)
We present the source-to-source TRACO compiler, which increases program locality and parallelizes arbitrarily nested loop sequences in numerical applications. Algorithms for generating tiled code and extracting synchronization-free slices composed of tiles are presented. Parallelism of arbitrarily nested loops is obtained by creating a kernel of computations represented in the OpenMP standard to be executed independently on many CPUs. We consider benchmarks typical of compute-intensive sequences of algebra operations or of numerical computations in industry and engineering. The speed-up of programs generated by TRACO is discussed. Related compilers and techniques are considered. Future work is outlined.
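As a rough illustration of what "tiled code" means in general, the following sketch enumerates a two-dimensional iteration space one rectangular tile at a time. It does not reproduce TRACO's algorithms or its OpenMP output; the function name `tiled_points` and the fixed square tile size are assumptions made for the example, and real tiling is only legal when the loop's data dependences permit the reordering.

```python
def tiled_points(n: int, m: int, tile: int):
    """Yield (i, j) over an n x m iteration space, one tile x tile block at a time."""
    for ii in range(0, n, tile):            # loop over tile origins (rows)
        for jj in range(0, m, tile):        # loop over tile origins (columns)
            # Intra-tile loops; min() clips partial tiles at the borders.
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    yield (i, j)


if __name__ == "__main__":
    # The tiled order visits exactly the same points as the original nest,
    # only grouped so that nearby iterations run close together in time.
    original = [(i, j) for i in range(5) for j in range(7)]
    assert sorted(tiled_points(5, 7, 2)) == original
    print(list(tiled_points(4, 4, 2))[:4])
```

Tiles that share no dependences with one another are the kind of unit that could be handed to separate threads, which is the role the abstract assigns to synchronization-free slices.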
ISBN: 9781479937523 (print)
EKEKO is a Clojure library for applicative logic meta-programming against an Eclipse workspace. EKEKO has been applied successfully to answering program queries (e.g., "does this bug pattern occur in my code?"), to analyzing project corpora (e.g., "how often does this API usage pattern occur in this corpus?"), and to transforming programs (e.g., "change occurrences of this pattern as follows") in a declarative manner. These applications rely on a seamless embedding of logic queries in applicative expressions. While the former identify source code of interest, the latter associate error markers with, compute statistics about, or rewrite the identified source code snippets. In this paper, we detail the logic and applicative aspects of the EKEKO library. We also highlight key choices in their implementation. In particular, we demonstrate how a causal connection with the Eclipse infrastructure enables building development tools interactively on the Clojure read-eval-print loop.
ISBN: 9781450330510 (print)
For loop accelerators such as coarse-grained reconfigurable architectures (CGRAs) and GP-GPUs, nested loops represent an important source of parallelism. Existing solutions for mapping nested loops onto CGRAs, however, are either designed for perfectly nested loops only, or are expensive and inflexible. Efficient CGRA mapping of imperfect loops with arbitrary nesting depth remains a challenge. In this paper, we propose a compiler-hardware cooperative approach that is flexible and yet able to generate efficient mappings for imperfect nested loops. It is based on loop flattening, but to mitigate the negative impact of flattening we combine loop fission with a lightweight architecture extension designed to accelerate common operation patterns that appear frequently in flattened loops. Our experimental results using imperfect loops from the multimedia and DSP domains demonstrate that our special operations can cover a large portion of nested loop operations, improve performance of nested loops by nearly 30% over using loop flattening only, and achieve near-ideal executions on CGRAs for imperfect loops.
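Loop flattening, the transformation this abstract builds on, can be sketched on a small imperfect nest. The example below is a generic illustration and not the paper's compiler or hardware extension: the function names are hypothetical, the input is assumed to be a rectangular list of rows, and the guards on `j` show one kind of extra work that flattening an imperfect nest can introduce.

```python
def row_sums_nested(a):
    """Imperfect nest: statements sit both outside and inside the inner loop."""
    out = []
    for row in a:
        acc = 0              # prologue, executed once per outer iteration
        for x in row:
            acc += x         # inner-loop body
        out.append(acc)      # epilogue, executed once per outer iteration
    return out


def row_sums_flattened(a):
    """Same computation as a single loop over n * m iterations."""
    n = len(a)
    m = len(a[0]) if n else 0    # assumes all rows have the same length
    out = [0] * n
    acc = 0
    for k in range(n * m):
        i, j = divmod(k, m)      # recover the original loop indices
        if j == 0:
            acc = 0              # guarded prologue
        acc += a[i][j]
        if j == m - 1:
            out[i] = acc         # guarded epilogue
    return out


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6]]
    assert row_sums_nested(data) == row_sums_flattened(data)
    print(row_sums_flattened(data))
```

The index recovery and the two guards run on every iteration of the flattened loop, even though the prologue and epilogue are needed only once per row.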