In this paper, we present an error estimation and propagation technique which targets high-level design abstraction through considering data-flow graph (DFG) representation of approximate circuits. The proposed techni...
详细信息
ISBN:
(纸本)9781665494663
In this paper, we present an error estimation and propagation technique which targets high-level design abstraction through considering data-flow graph (DFG) representation of approximate circuits. The proposed technique is utilized for pruning different combinations of exact versus approximate realizations of various operations in the DFG for a design space exploration (DSE) framework The technique relies on the output error estimation of the arithmetic modules of a high-level library by dividing the input range to intervals and then considering different combinations of these intervals (cluster-based estimation of output error). Additionally, the estimation considers the error of the operands. The error for each combination is stored in a look up table (LUT). The efficacy of the proposed method is assessed for two image processing applications. Simulation results show that the framework can efficiently generate the Pareto frontier in the trade-off space of accuracy versus energy efficiency for the two applications.
This article introduces GraphCL, an automated system for seamlessly mapping multi-kernel applications to multiple computing devices. GraphCL, consists of a C ++ API and a runtime that abstracts and simplifies the exec...
详细信息
ISBN:
(纸本)9781665469586
This article introduces GraphCL, an automated system for seamlessly mapping multi-kernel applications to multiple computing devices. GraphCL, consists of a C ++ API and a runtime that abstracts and simplifies the execution of multi-kernel applications on heterogeneous platforms across multiple devices. The GraphCL approach has three steps. First, the application designer provides a kernel graph. In the second phase, GraphCL computes the execution schedule. After the schedule has been computed, the runtime uses the execution schedule to enqueue in parallel the processing for all system processors. GraphCL takes the kernel dependencies and the processor performance differences into account during the schedule calculation process. By deciding on the schedule, GraphCL transparently manages the order of execution and data transfers for each processor. On two asymmetric workstations, GraphCL achieves an average acceleration of 1.8x compared to the fastest device. GraphCL achieves also for the set of multi-kernel benchmarks an average 24.5% energy reduction compared to the lazy partition heuristic. that uses all the system processors without considering their power usage.
In this paper, we propose a framework, which is called EGAN, for exploring the trade-off between accuracy and energy efficiency in hardware implementation of error resilient applications. EGAN automatically extracts t...
详细信息
ISBN:
(纸本)9781728142074
In this paper, we propose a framework, which is called EGAN, for exploring the trade-off between accuracy and energy efficiency in hardware implementation of error resilient applications. EGAN automatically extracts the Pareto frontier (PF) of approximate implementations of an error resilient application based on the dataflow graph (DFG) of the application as well as the accuracy and energy consumption of the available approximate/exact components. The framework explores different implementation configurations heuristically to find the best energy efficient implementation of the input application under various output accuracies. The proposed framework, which works by generating some random configurations, clustering them and suggesting some neighboring configurations, reduces the search space considerably. As a result, EGAN achieves a significant reduction in the number of explored configurations compared to the exhaustive (exact) approach while achieving near-optimal results. The efficacy of the proposed framework is assessed using three DSP applications consisting of Sobel edge detector, Finite Inverse Response (FIR) filter and Discrete Cosine Transform (DCT). The studies show that in the worst-case (DCT application with 42 components) EGAN takes 89 hours to extract the PF whereas the exact approach takes 5 million years.
With the arising amounts of devices and data that Internet of Things systems are processing nowadays, solutions for computational applications are in high demand. Many concepts targeting more efficient data processing...
详细信息
ISBN:
(纸本)9783030227449;9783030227432
With the arising amounts of devices and data that Internet of Things systems are processing nowadays, solutions for computational applications are in high demand. Many concepts targeting more efficient data processing are arising and among them edge and fog computing are the ones gaining significant interest since they reduce cloud load. In consequence Internet of Things systems are becoming more and more diverse in terms of architecture. In this paper we present Fogflow model and execution environment allowing for organization of data-flow applications to be run on the heterogeneous environments. We propose unified interface for data-flow creation, graph model and we evaluate our concept in the use case of production line model that mimic real-world factory scenario.
In high-level synthesis for real-time systems, it typically employs heterogeneous functional-unit types to achieve high-performance and low-cost designs. In the design phase, it is critical to determine which function...
详细信息
In high-level synthesis for real-time systems, it typically employs heterogeneous functional-unit types to achieve high-performance and low-cost designs. In the design phase, it is critical to determine which functional-unit type to be mapped for each operation in a given application such that the total cost is minimized while the deadline can be met. For a path or tree structured application, existing approaches can obtain the minimum-cost assignment, called "optimal assignment", under which the resultant system satisfies a given timing constraint. However, it is still an open question whether there exist efficient algorithms to obtain the optimal assignment for the directed acyclic graph (DAG), or more generally, the data-flow graph with cycles (cyclic DFG). For DAGs, by analyzing the property of the problem, this paper designs an efficient algorithm to obtain the optimal assignments. For cyclic DFGs, we approach this problem with the combination of retiming technique to thoroughly explore the design space. We formulate a Mixed Integer Linear Programming (MILP) model to give the optimal solution. But because of the high degree of its time complexity, we devise a practical algorithm to obtain near-optimal solutions within a minute. Experimental results show the effectiveness of our algorithms. Specifically, compared with existing techniques, we can achieve 25.70 and 30.23 percent reductions in total cost on DAGs and cyclic DFGs, respectively.
Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task and data parallelism, and combining these two ( ...
详细信息
Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task and data parallelism, and combining these two ( also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task and data parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks, and the intertask data communication volumes. A locality-conscious scheduling strategy is used to improve intertask data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications and synthetic graphs shows that our algorithm consistently generates schedules with a lower makespan as compared to Critical Path Reduction (CPR) and Critical Path and Allocation (CPA), two previously proposed scheduling algorithms. Our algorithm also produces schedules that have a lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.
In the paper "Static Rate-Optimal Scheduling of Iterative data-flow Programs via Optimum Unfolding" [1], the authors stated a perfect-rate property on unfolded data-flow graphs (DFGs) This will be disproven ...
详细信息
In the paper "Static Rate-Optimal Scheduling of Iterative data-flow Programs via Optimum Unfolding" [1], the authors stated a perfect-rate property on unfolded data-flow graphs (DFGs) This will be disproven in the following by a simple counter example.
Code selection is an important task in code generation for programmable processors, where the goal is to find an efficient mapping of machine-independent intermediate code to processor-specific machine instructions. T...
详细信息
Code selection is an important task in code generation for programmable processors, where the goal is to find an efficient mapping of machine-independent intermediate code to processor-specific machine instructions. Traditional approaches to code selection are based on tree parsing, which enables fast and optimal code selection for intermediate code given as a set of data-flow trees. While this approach is generally useful in compilers for general-purpose processors, it may lead to poor code quality in the case of embedded processors. The reason is that the special architectural features of embedded processors require performing code selection on data-flow graphs, which are a more general representation of intermediate code. In this paper, we present data-flow graph-based code selection techniques for two architectural families of embedded processors: media processors with support for SIMD instructions and fixed-point DSPs with irregular data paths. Both techniques exploit the fact that, in the area of embedded systems, high code quality is a much more important goal than high compilation speed. We demonstrate that certain architectural features can only be utilized by graph-based code selection, while in other cases this approach leads to a significant increase in code quality as compared to tree-based code selection.
Loop scheduling is an important problem in parallel processing. The retiming technique reorganizes an iteration;the unfolding technique schedules several iterations together. We combine these two techniques to obtain ...
详细信息
Loop scheduling is an important problem in parallel processing. The retiming technique reorganizes an iteration;the unfolding technique schedules several iterations together. We combine these two techniques to obtain a static schedule with a reduced average computation time per iteration. We first prove that the order of retiming and unfolding is immaterial for scheduling a data-flow graph (DFG). From this nice property, we present a polynomial-time algorithm on the original DFG, before unfolding, to find the minimum-rate static schedule for a given unfolding factor. For the case of a unit-time DFG, efficient checking and retiming algorithms are presented.
Driven by the 'side-effect' environment of sequential von Neumann computing, Input/Output operations have evolved as state operations on shared files. In parallel programs, if multiple instances of an I/O-perf...
详细信息
Driven by the 'side-effect' environment of sequential von Neumann computing, Input/Output operations have evolved as state operations on shared files. In parallel programs, if multiple instances of an I/O-performing process execute concurrently, either the user or the system must synchronize any accesses to shared files. data-flow principles of execution provide an elegant way to ensure at runtime that instructions can be executed asynchronously in a parallel environment. However, while the conventional von Neumann model of interpretation inherited a rigid ordering of instructions, it is the very asynchronous character of the data-flow model of execution which introduces conflicts when 'state' tasks (such as I/O operations) must share common data objects. The scheme presented in this paper, Logical Serialization with Distributed File Pointers (LS-DFP), introduces the two basic I/O operations read and write into the dynamic data-flow graph. However, sequencing I/O operations on the same file based on the availability of data as in 'conventional' data-flow is not possible because the name of the file becomes available simultaneously to ail operations at program initiation. To impose an order, we sequentialize, (logically serialize-LS), the operations according to their lexicographical ordering. Furthermore, several optimizations are introduced that allow the distributed execution of these I/O operations with the use of Distributed File Pointers (DFP). Thus, the LS-DFP scheme can utilize the full level of parallelism of the dynamic data-flow principles of execution.
暂无评论