ISBN (print): 076951992X
We introduce another view of group theory in the field of interconnection networks. With this approach it is possible to specify application-specific network topologies for permutation data transfers. Routing for the data transfers is generated, and all possible permutation data transfers are guaranteed. We illustrate the approach using a SIMD DSP.
ISBN (print): 9781728171470
Due to the increase in the use of large Deep Neural Networks (DNNs) over the years, specialized hardware accelerators such as the Tensor Processing Unit and Eyeriss have been developed to accelerate the forward pass of the network. The essential component of these devices is an array processor composed of multiple individual compute units that efficiently execute multiply-and-accumulate (MAC) operations. As the size of this array limits how much of a single DNN layer can be processed at once, the computation is performed serially in several batches, adding extra compute cycles along both axes. In practice, due to the mismatch between matrix and array sizes, the computation does not map exactly onto the array. In this work, we address the issue of minimizing processing cycles on the array by adjusting the DNN model parameters using a structured, hardware-array-dependent optimization. We introduce two techniques in this paper: Array-Aware Training (AAT) for efficient training and Array-Aware Pruning (AAP) for efficient inference. Weight pruning removes redundant parameters from the network to decrease its size. The key idea behind pruning in this paper is to adjust the model parameters (the weight matrix) so that the array is fully utilized in each computation batch. Our goal is to compress the model based on the size of the array so as to reduce the number of computation cycles. We observe that both proposed techniques achieve accuracy similar to that of the original network while saving a significant number of processing cycles (75%).
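The batch-count argument behind array-aware pruning can be sketched as follows. This is a minimal illustration of the idea, not the paper's AAP algorithm: the function names, the L1-norm ranking, and the column-only pruning are assumptions made for the example.

```python
import math

def compute_batches(rows, cols, a):
    # Number of serial passes needed to map a rows x cols weight
    # matrix onto an a x a MAC array.
    return math.ceil(rows / a) * math.ceil(cols / a)

def array_aware_prune(w, a):
    # Drop the lowest-L1-norm columns so the column count becomes
    # a multiple of the array dimension, eliminating the partially
    # filled final batch along that axis.
    cols = len(w[0])
    keep = cols - (cols % a)
    by_norm = sorted(range(cols),
                     key=lambda j: sum(abs(row[j]) for row in w))
    drop = set(by_norm[:cols - keep])
    return [[row[j] for j in range(cols) if j not in drop] for row in w]
```

For a 3x5 weight matrix on a 4x4 array, pruning one column removes a whole batch: the matrix then maps onto the array in a single pass.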
ISBN (print): 0818629673
This paper describes a design framework for developing application-specific serial array circuits. Starting from a description of the state-transition logic or a fully-parallel architecture, correctness-preserving transformations are employed to derive a wide range of implementations with different space-time trade-offs. The approach has been used in synthesizing designs based on Field-Programmable Gate Arrays, and is illustrated by the development of a number of circuits, including sorters and convolvers.
ISBN (print): 9781479919253
FPGA-based soft processors customized for operations on sparse graphs can deliver significant performance improvements over conventional organizations (ARMv7 CPUs) for bulk synchronous sparse graph algorithms. We develop a stripped-down soft processor ISA to implement specific repetitive operations on graph nodes and edges that are commonly observed in sparse graph computations. In the processing core, we provide hardware support for rapidly fetching and processing the state of local graph nodes and edges through spatial address generators and zero-overhead loop iterators. We interconnect a 2D array of these lightweight processors with a packet-switched network-on-chip to enable fine-grained operand routing along the graph edges, and provide custom send/receive instructions in the soft processor. We develop the processor RTL using Vivado High-Level Synthesis and also provide an assembler and compilation flow to configure the processor instruction and data memories. We outperform a Microblaze (100 MHz on a Zedboard) and a NIOS-II/f (100 MHz on a DE2-115) by 6x (single-processor design), and the ARMv7 dual-core CPU on Zynq SoCs by as much as 10x on the Xilinx ZC706 board (100-processor design), across a range of matrix datasets.
ISBN (print): 0818629673
This paper presents a special-purpose linear array processor architecture for determining the longest common subsequences (LCS) of two sequences. The algorithm uses a systolic, pipelined architecture suitable for VLSI implementation. The algorithms are also suitable for implementation on parallel machines. We first develop a 'greedy' algorithm to determine some of the LCS and then propose a generalization to determine all LCS of the given pair of sequences. Earlier hardware algorithms [Lipton and Lopresti, 85; Mukherjee, 89] were concerned only with determining the length of the LCS or the edit distance of two sequences.
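The recurrence that such a linear systolic array evaluates is the standard LCS dynamic program. A plain sequential sketch, not the paper's hardware algorithm, and recovering only one LCS rather than all of them, might look like:

```python
def lcs(x, y):
    # Classic dynamic-programming recurrence that a linear systolic
    # array evaluates with one processing element per character of y,
    # computing anti-diagonals of the table in parallel.
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    # Backtrack to recover one LCS; enumerating all LCS requires the
    # generalized algorithm described in the abstract.
    out, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif d[i - 1][j] >= d[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```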
In this paper we present a technique and formal model for the optimal synthesis of specialized heterogeneous multiprocessors, given task flow graphs to be executed in a pipelined (periodic) fashion. SOS is a formal approach to system synthesis using mixed integer-linear programming, ensuring optimality of the final solutions. SOS was extended to cover the pipelined design style. The extensions were made while trying to avoid a considerable increase in computation time over the non-pipelined case. They include new binary variables as well as new constraints used to ensure numerical convergence. The present tool supports minimization of parameters such as initiation rate, latency, and cost.
ISBN (print): 9781424469673
Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g., a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize the throughput of a systolic array subject to the area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2x faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2x speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4-5x faster than the currently used latency-optimized array. These novel arrays are 70-172x faster than a software baseline.
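The central claim, that throughput is the inverse of the maximum number of iteration vectors mapped to any one processor, can be illustrated with a brute-force count over a rectangular iteration domain. This is a simplification: the paper works with general polyhedral domains, and the allocation function here is a stand-in for the array's projection.

```python
from collections import Counter
from itertools import product

def array_throughput(domain_bounds, allocate):
    # For each iteration vector in a rectangular domain, apply the
    # allocation (projection) function to find its processor, then
    # take the maximum per-processor load; throughput is its inverse.
    load = Counter()
    for point in product(*(range(b) for b in domain_bounds)):
        load[allocate(point)] += 1
    return 1.0 / max(load.values())

# Projecting a 10x4 domain along the first axis yields 4 processors,
# each executing 10 iteration vectors, so throughput is 1/10.
```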
This paper presents the performance evaluation of a fast third-order Volterra digital filtering algorithm mapped onto an AT&T DSP-3 parallel processor. Five different implementations are considered. Speed-up results indicate that the 'time-skewing' method is currently the fastest. An application to nonlinear communication channel equalization using a 64-QAM signal constellation is presented.
ISBN (print): 0769526829
There have been some interesting technology developments in the area of configurable and extensible processors in the last few years. This paper outlines some of the most recent technologies that we have been developing at Tensilica, including the role of fixed implementations, automatic generation of application-oriented configurations, and design methodologies including fast functional simulation. It also discusses some of our future evolution, in particular the move from a single-processor focus to a multi-processor SoC (MPSoC) focus. We conclude by outlining some of the research problems that we find most interesting.
Throughput has traditionally been recognized as the dominant performance metric for implementations of application-specific computations. However, applications such as embedded controllers increasingly impose constraints on both throughput and latency as important measures of speed. Although throughput alone can be arbitrarily improved for several classes of systems using previously published techniques, none of those approaches is effective when latency constraints are considered. DSP, communications, and control systems are often either linear or have subsystems that are linear. Recently, an optimal technique for the simultaneous optimization of throughput and latency of linear computations was introduced in [Sri94]. However, in many cases this technique introduces significant area overhead. In this paper we apply certain key aspects of that technique (on-arrival processing and maximally fast implementation of linear computations) together with an exploration of state-space based transformations to develop four synthesis techniques that generate high-throughput, low-latency, low-area, and low-power application-specific processors for the special case of single-input linear computations. The new transformation techniques can also be used to increase implementation efficiency while achieving the same latency and throughput as the original design; we obtained large improvements in area and power on many benchmarks when using the proposed transformations in this alternate role.
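One well-known transformation in this family is block (look-ahead) processing of a single-input linear state update x[n+1] = A x[n] + b u[n]: precomputing A^2 and A b lets an implementation consume two input samples per iteration. The sketch below illustrates only that algebra; it is not the maximally fast construction of [Sri94], and all names are ours.

```python
def matvec(A, v):
    # Dense matrix-vector product on plain Python lists.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matmul(A, B):
    # Dense matrix-matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def step(A, b, x, u):
    # One sample of the single-input linear computation:
    # x[n+1] = A x[n] + b u[n].
    return [ax + bi * u for ax, bi in zip(matvec(A, x), b)]

def blocked_step(A, b, x, u0, u1):
    # Two samples at once via look-ahead:
    # x[n+2] = A^2 x[n] + (A b) u[n] + b u[n+1].
    # A^2 and A b would be precomputed constants in hardware,
    # trading area for a shorter critical recurrence.
    A2, Ab = matmul(A, A), matvec(A, b)
    return [a + abi * u0 + bi * u1
            for a, abi, bi in zip(matvec(A2, x), Ab, b)]
```

Two applications of `step` and one application of `blocked_step` produce the same state, which is the property such transformations preserve while changing the throughput/area trade-off.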