This paper proposes an efficient implementation of a third-order differentiator based on the lattice wave digital filter (LWDF). For third order Lattice wave digital differentiator (LWDD), dataflow gra...
Dynamic dataflow allows simultaneous execution of instructions in different iterations of a loop, boosting parallelism exploitation. In this model, operands are tagged with their associated instance number, which is incremented as they go through the loop. Instruction execution is triggered when all input operands with the same tag become available. However, this traditional tagging mechanism often requires the generation of several control instructions to manipulate tags and guarantee the correct match. To address this problem, this work presents three dataflow loop optimisation techniques. Stack-tagged dataflow is a tagging mechanism that uses stacks of tags to reduce control overheads in dataflow. Because nested loops may increase the overhead of stack-tag comparison, tag resetting can be used to set the tag to zero whenever it is safe, reducing the stack depth by one level. Finally, loop skipping further avoids stack comparison overhead in loops when the number of iterations can be determined by the compiler. Experimental results show the overheads, drawbacks and benefits of the three optimisations presented. Moreover, the results suggest that a hybrid compiling approach can be used to get the best performance from each technique.
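The basic tag-matching rule described in this abstract (an instruction fires once all of its input operands carrying the same tag are available) can be illustrated with a minimal Python sketch. The class `MatchingStore`, the function `run_add`, and the token layout `(tag, port, value)` are hypothetical names chosen for illustration; the sketch models only plain instance-number tags, not the stack-tagged, tag-resetting, or loop-skipping techniques of the paper.

```python
from collections import defaultdict


class MatchingStore:
    """Holds operands until every input port of one instruction has a value for a tag."""

    def __init__(self, arity):
        self.arity = arity                    # number of input ports (assumed fixed)
        self.waiting = defaultdict(dict)      # tag -> {port: value}

    def receive(self, tag, port, value):
        """Store one operand; return the full operand tuple when the tag is complete."""
        slots = self.waiting[tag]
        slots[port] = value
        if len(slots) == self.arity:
            del self.waiting[tag]
            return tuple(slots[p] for p in range(self.arity))
        return None


def run_add(tokens):
    """Feed (tag, port, value) tokens to a two-input add and list the firings in order."""
    store = MatchingStore(arity=2)
    fired = []
    for tag, port, value in tokens:
        operands = store.receive(tag, port, value)
        if operands is not None:
            fired.append((tag, operands[0] + operands[1]))
    return fired


# Operands of iterations 0 and 1 arrive interleaved; iteration 1 completes first.
tokens = [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]
result = run_add(tokens)
print(result)  # [(1, 22), (0, 11)]
```

Iteration 1 fires before iteration 0 here because its second operand arrives earlier, which is the out-of-order overlap between loop iterations that tagging makes safe.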
Most system engineers today use graphical representations of a system to communicate its functional and data requirements. The most commonly used representations are the Function flow Block Diagram (FFBD), dataflow D...
ISBN: 9781509015580 (print)
For modern complex designs, it is impossible to fully specify design behavior, and only feasible to verify functionally meaningful scenarios. Hardware Trojans that modify only unspecified functionality cannot be detected using existing verification methodologies and Trojan detection strategies. We propose a detection methodology for these Trojans by 1) precisely defining "suspicious" unspecified functionality in terms of information leakage, and 2) formulating detection as a satisfiability problem that can take advantage of the recent advances in both Boolean and satisfiability modulo theory (SMT) solvers. The formulated detection procedure can be applied to a gate-level design using commercial equivalence checking tools, or directly to the Verilog/VHDL code by reasoning about the satisfiability of SMT expressions built from traversing the data-flow graph. We demonstrate the effectiveness of our approach on an adder coprocessor and a UART communication controller infected with Trojans that process information leaked from the on-chip bus during idle cycles using signals with only partially specified behavior.
ISBN: 9781538624319 (print)
Streaming applications are an important class of applications in real-time embedded systems, which usually run under restricted resource constraints and with real-time requirements. They are often modeled with synchronous data flow graphs (SDFGs) or cyclo-static data flow graphs (CSDFGs) at the design stage. A proper analysis of the models gives a predictable design for a system. In this paper, we focus on the throughput analysis of (C)SDFGs, taking into account memory constraints. Memory-related analysis needs to choose a memory abstraction that decides when the space of consumed data is released and when the required space is claimed. Different memory abstractions may lead to different achievable throughputs. The existing techniques, however, consider only one particular abstraction. If a model is implemented according to other abstractions, the analysis result may not truly evaluate the performance of the system. In this paper, we present a novel unified framework for throughput analysis of memory-constrained (C)SDFGs for different abstractions, aiming to provide evaluations that match the corresponding implementations. Our methods are exact. Experiments are carried out on several models of real streaming applications and hundreds of synthetic graphs to evaluate the effects and performance of our methods.
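To make concrete the claim that different memory abstractions can yield different achievable throughputs, here is a toy Python sketch. It is not the paper's framework: it assumes a hypothetical two-actor pipeline A -> B with one token produced and consumed per firing, a producer that claims buffer space when it starts firing, and a consumer that releases the space either when it starts or when it finishes firing. The function name `steady_period` and all parameters are illustrative assumptions.

```python
def steady_period(t_a, t_b, capacity, release_at_start, firings=50):
    """Return the steady-state time between consecutive completions of actor B.

    t_a, t_b         -- execution times of producer A and consumer B
    capacity         -- number of token slots in the channel buffer
    release_at_start -- True: B frees a slot when it starts firing;
                        False: B frees a slot when it finishes firing
    """
    a_end, b_start, b_end = [], [], []
    for k in range(firings):
        # A is sequential: it cannot start before its previous firing ends.
        ready = a_end[k - 1] if k > 0 else 0
        # A must also claim a free slot before it starts firing.
        if k >= capacity:
            j = k - capacity
            freed = b_start[j] if release_at_start else b_end[j]
            ready = max(ready, freed)
        a_end.append(ready + t_a)
        # B needs the k-th token and must have finished its previous firing.
        begin = max(a_end[k], b_end[k - 1] if k > 0 else 0)
        b_start.append(begin)
        b_end.append(begin + t_b)
    return b_end[-1] - b_end[-2]


period_release_at_end = steady_period(2, 3, capacity=1, release_at_start=False)
period_release_at_start = steady_period(2, 3, capacity=1, release_at_start=True)
print(period_release_at_end, period_release_at_start)  # 5 3
```

With a single buffer slot, releasing space at the end of the consumer's firing serialises the two actors (period 5), while releasing at the start lets them overlap (period 3), so the throughput, taken as the reciprocal of the period, depends on the abstraction chosen.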
ISBN: 9783642241536; 9783642241543 (print)
This paper presents a toolbox for the automatic generation of asynchronous circuits starting from a dataflow graph description. The toolbox consists of a scheduling tool and a code generation tool. We use the same traditional scheduling algorithms as for synchronous circuits, but have replaced the implied synchronous controller with an asynchronous distributed control network. The control circuit allows for true asynchronous operation of all digital resources and, as a result of its scalable distributed topology, allows unlimited resource sharing. The distributed controllers can be created by connecting a small number of pre-designed sub-controllers, which are presented in this paper. Prototype IP blocks of these sub-controller circuits have been designed in a 90 nm ASIC design process. Our toolbox is capable of generating large, complex asynchronous solutions with up to 20 percent power savings and latency at least as good as that of synchronous solutions.
All-Reduce is a collective-combine operation frequently utilised in synchronous parameter updates in parallel machine learning algorithms. The performance of this operation - and subsequently of the algorithm itself -...
The Software/Hardware Implementation and Research Architecture (SHIRA) is a C to hardware toolchain developed by the Computer Architecture Research Group (CARG) of the University of Ottawa. The framework and algorithm...
ISBN: 9781509015597 (print)
In the approaching era of IoT, flexible and low-power accelerators have become essential to meet aggressive energy efficiency targets. During the last few decades, Coarse Grain Reconfigurable Arrays (CGRAs) have demonstrated high energy efficiency as accelerators, especially for high-performance streaming applications. While existing CGRAs mostly rely on partial and full predication techniques to support conditional branches, inefficient architecture and mapping support for handling control flow limits the use of CGRAs to accelerating either inner loop bodies only or transformed loops specifically adapted to the target CGRA. This paper proposes a novel CGRA architecture with support for jump and conditional jump instructions and a lightweight global synchronization mechanism to enable complete Control Data Flow Graph (CDFG) mapping in an ultra-low-power environment. The architecture is coupled with a complete design flow that efficiently maps applications with heavy control flow starting from a generic C language description. The proposed mapping approach reduces the impact of the wasteful instruction issues of conventional predication approaches, providing an average energy improvement of 1.44× and 1.6× compared to state-of-the-art partial and full predication techniques, respectively. Moreover, the proposed method achieves an average speed-up of up to 21× and an energy improvement of up to 50.42× when executing applications with heavy control flow, relative to sequential execution on a low-power embedded CPU, demonstrating its suitability for next-generation IoT applications.
ISBN: 9783800744435 (print)
Field Programmable Gate Arrays (FPGAs) have proven to be among the most suitable architectures for image processing applications. However, accelerating algorithms using FPGAs is a time-consuming task and requires expertise. Although recent advances in High-Level Synthesis (HLS) promise to solve this problem, today's HLS tools require apt hardware descriptions of algorithms to be able to provide favorable implementations. One solution is to develop highly parameterizable and optimized HLS libraries for the fundamental image processing components. Another solution is to provide a higher level of abstraction in the form of a Domain-Specific Language (DSL) and a corresponding efficient back end for hardware design. In this paper, we provide a highly efficient and parameterizable C++ library for image processing applications, which would be the cornerstone for both approaches. In our library, nodes of a stream-based dataflow graph can be described as C++ objects for specified functions, and the whole application can be efficiently parallelized just by defining a global constant as the parallelization factor. Moreover, the key hardware design elements, i.e., line buffers and sliding windows with different border handling patterns, can be utilized individually to ease the design of more complicated applications.
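The idea of a sliding window with selectable border handling, mentioned at the end of this abstract, can be sketched in software as follows. This is a behavioural Python illustration only, not the paper's C++ HLS library or its API: the function names `border_index` and `sliding_windows` and the mode names `"clamp"` and `"mirror"` are assumptions made for the example, and no line buffer hardware is modelled.

```python
def border_index(i, n, mode):
    """Map a possibly out-of-range index i onto the valid range [0, n)."""
    if 0 <= i < n:
        return i
    if mode == "clamp":
        # Repeat the nearest edge pixel.
        return 0 if i < 0 else n - 1
    if mode == "mirror":
        # Reflect about the edge pixel without repeating it (valid while |overshoot| < n).
        return -i if i < 0 else 2 * (n - 1) - i
    raise ValueError(f"unknown border mode: {mode}")


def sliding_windows(image, radius=1, mode="clamp"):
    """Yield one (2*radius+1) x (2*radius+1) window per pixel, in raster order."""
    height, width = len(image), len(image[0])
    offsets = range(-radius, radius + 1)
    for y in range(height):
        for x in range(width):
            yield [
                [
                    image[border_index(y + dy, height, mode)][border_index(x + dx, width, mode)]
                    for dx in offsets
                ]
                for dy in offsets
            ]


image = [[1, 2],
         [3, 4]]
first_clamp = next(sliding_windows(image, mode="clamp"))
first_mirror = next(sliding_windows(image, mode="mirror"))
print(first_clamp)   # [[1, 1, 2], [1, 1, 2], [3, 3, 4]]
print(first_mirror)  # [[4, 3, 4], [2, 1, 2], [4, 3, 4]]
```

Separating the index mapping from the window generator mirrors the abstract's point that border handling is a selectable pattern: a filter built on `sliding_windows` needs no changes when the border mode changes.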