ISBN (print): 9798331530082; 9798331530075
Processor-in-Memory (PIM) overlays and alternative reconfigurable tile fabrics have been proposed to eliminate the von Neumann bottleneck and enable processing performance to scale with BRAM capacity. The performance of these FPGA-based PIM architectures has been limited by a reduction in the BRAMs' maximum clock frequencies and by less-than-ideal scaling of processing elements with increased BRAM capacity. This paper presents IMAGine, an In-Memory Accelerated GEMV engine: a PIM-array accelerator that clocks at the maximum frequency of the BRAM and scales to 100% of the available BRAMs. Comparative analyses show faster execution than existing PIM-based GEMV engines on FPGAs, with a 2.65x to 3.2x faster clock. An AMD Alveo U55 implementation is presented that achieves a system clock speed of 737 MHz, providing 64K bit-serial multiply-accumulate (MAC) units for GEMV operation. This establishes IMAGine as the fastest PIM-based GEMV overlay, outperforming even the custom PIM-based FPGA accelerators reported to date. Additionally, it surpasses TPU v1-v2 and Alibaba Hanguang 800 in clock speed while offering an equal or greater number of MAC units.
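To make the term "bit-serial multiply-accumulate" concrete, the following is a minimal software sketch of shift-and-add MAC used inside a GEMV loop. It is not IMAGine's processing-element design: the function names `bit_serial_mac` and `gemv`, the unsigned operands, and the 8-bit default width are all assumptions chosen for illustration.

```python
def bit_serial_mac(acc, a, b, bits=8):
    """Add a*b to acc by scanning the multiplier b one bit at a time.

    Assumes unsigned operands and that b fits in `bits` bits
    (both are illustrative assumptions).
    """
    for i in range(bits):
        if (b >> i) & 1:      # bit i of the multiplier is set
            acc += a << i     # add the multiplicand shifted by i
    return acc


def gemv(matrix, vector, bits=8):
    """Compute y = M x, forming every product with bit_serial_mac."""
    result = []
    for row in matrix:
        acc = 0
        for a, b in zip(row, vector):
            acc = bit_serial_mac(acc, a, b, bits)
        result.append(acc)
    return result


y = gemv([[1, 2, 3], [4, 5, 6]], [7, 8, 9])
print(y)  # [50, 122]
```

In hardware, each matrix row would typically map to its own MAC unit so that all rows advance in lockstep; the sequential loop here only models the arithmetic.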
ISBN (print): 9798350393613
JPEG is a widely used image compression algorithm. A 31-core JPEG encoder and a scalable family of JPEG encoders were developed for a fine-grain many-core processor array and are measured on the 32 nm KiloCore chip. These implementations are compared against JPEG implementations on three high-level-language-programmable chip platforms: an Nvidia A100 GPU with an Intel Platinum 8168 (nvJPEG), a TI C66x embedded processor, and an Intel i9-9900 processor (libjpeg-turbo). In addition, an HDL-configurable Xilinx Zynq-7000 FPGA (VISENGI) was compared. All results are scaled to 32 nm CMOS, and throughput per area, energy per megapixel encoded, and the comprehensive energy × delay metrics are compared. The KiloCore designs achieve the lowest chip area and the highest normalized throughput per chip area. Among the chips that are programmable in a high-level programming language, the KiloCore achieves up to 4.33x, 31.4x, and 46.4x lower energy dissipation and up to 4.81x, 7,845x, and 14,510x lower energy × delay than the TI, Intel, and Nvidia chips, respectively.
ISBN (print): 0964345692
It is known that by parallelizing the Four Russians' algorithm, whose sequential time complexity is $O(N^3/\log N)$, the product of two $N \times N$ Boolean matrices can be calculated in constant time on a linear array with a reconfigurable pipelined bus system (LARPBS) with $N^3/\log N$ processors, where all communications and computations are performed on the bit level. In this paper, we consider the scalability of the above parallelization. We show that the product of two $N \times N$ Boolean matrices can be calculated in $O(N^3/(p \log p))$ time on a $p$-processor LARPBS, for all $1 \le p \le N^3/\log N$. Such a parallelization is cost-optimal and achieves linear speedup if $p$ is at least $\Omega(N^\epsilon)$ for any $\epsilon > 0$. Our scalable parallelization of the Four Russians' algorithm is applied to solve a number of graph theory problems.
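For readers unfamiliar with the sequential starting point, here is a minimal sketch of the Four Russians' idea for Boolean matrix multiplication: rows of the second matrix are grouped into blocks of about $\log_2 N$ rows, the OR of every subset of each block is tabulated once, and each row of the first matrix then needs only one table lookup per block. This is the textbook sequential technique, not the LARPBS parallelization; the function name, the bitmask row packing, and the block-size choice are assumptions made for this sketch.

```python
import math


def four_russians_bool_matmul(A, B):
    """Boolean product of two N x N 0/1 matrices given as lists of lists."""
    n = len(A)
    t = max(1, int(math.log2(n))) if n > 1 else 1   # block size ~ log2 N
    # Pack each row of B into an integer bitmask: bit j holds B[k][j].
    b_rows = [sum(bit << j for j, bit in enumerate(row)) for row in B]
    c_rows = [0] * n
    for start in range(0, n, t):
        block = b_rows[start:start + t]
        # table[s] = OR of the block rows selected by the set bits of s.
        table = [0] * (1 << len(block))
        for s in range(1, len(table)):
            low = s & -s                             # lowest set bit of s
            table[s] = table[s ^ low] | block[low.bit_length() - 1]
        for i in range(n):
            # Lookup index built from A[i][start : start + t].
            s = sum(bit << k for k, bit in enumerate(A[i][start:start + t]))
            c_rows[i] |= table[s]
    # Unpack the bitmask rows back into 0/1 lists.
    return [[(c_rows[i] >> j) & 1 for j in range(n)] for i in range(n)]


A = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
B = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(four_russians_bool_matmul(A, B))  # [[0, 1, 1], [0, 0, 0], [1, 0, 0]]
```

Each table entry is derived from a smaller entry with a single OR, which is what keeps the preprocessing cost per block proportional to the table size.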
ISBN (print): 9780889866386
As new genes are sequenced, it is common for molecular biologists to compare the new gene's DNA to known sequences. One simple form of DNA sequence comparison is done by solving the Longest Common Subsequence (LCS) problem. In this paper, we propose a parallel algorithm and specialized FPGA-based processor (the associative ASC processor with reconfigurable 2D mesh) to solve the exact and approximate match LCS problems. This solution uses inexpensive hardware and can be reconfigured as new analysis techniques are developed, making it particularly attractive for processing biosequences.
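As a reference point for the exact-match LCS problem mentioned above, the standard sequential dynamic-programming solution can be sketched as follows. It is not the parallel ASC algorithm from the paper; the function name `lcs` and the sample sequences are illustrative assumptions.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("AGGTAB", "GXTXAYB"))  # GTAB
```

Cells on the same anti-diagonal of `table` depend only on earlier anti-diagonals, which is the property that array-style hardware can exploit to fill many cells at once.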
ISBN (print): 9781467302890
This paper presents a demonstration system that uses a low-power SCAMP-5 256×256 vision chip to locate and count multiple objects moving at high speed along arbitrary trajectories. The hardware consists of a SCAMP-5 IC, its power supply system, and a Xilinx Spartan3 controller. At 100,000 fps, the SCAMP-5 chip can locate and read out the coordinates of a single closed-shaped object amongst clutter. At 25,000 fps, the IC can read out the coordinates of 5 objects.
ISBN (print): 9780769550886
In a multiprocessor array, some processing elements (PEs) fail to function normally due to hardware defects or soft faults caused by overheating, overload, or occupancy by other running applications. Fault-tolerant reconfiguration reorganizes fault-free PEs into a logical topology by changing the interconnection among PEs. This paper develops an efficient heuristic algorithm, denoted CLA, to construct a maximum logical array (MLA) with short interconnects under flexible rerouting schemes. In CLA, two MLAs are generated using an existing algorithm, FLX, and are then used to produce the target logical array. The middle column of the target logical array is generated by forming the straightest column in an area bounded by two logical columns of the two MLAs. Other columns are generated by forming compact columns in the corresponding areas. The problem of finding a compact logical column in a given area is solved by modeling it as a shortest-path problem on a weighted directed graph in which both vertices and edges carry nonnegative costs. Experimental results validate the efficiency of the proposed algorithm. For 128 x 128 host arrays with 40% unavailable PEs, the proposed approach improves on the existing algorithm by up to 44% in terms of interconnection length. In addition, the improvement grows with increasing fault density, implying that CLA is more scalable than the existing algorithm.
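The shortest-path formulation with costs on both vertices and edges can be illustrated with a generic Dijkstra variant. This sketch does not reproduce the graph that CLA builds from the host array, which the abstract does not specify; the function name, the adjacency-list format, and the toy graph are assumptions.

```python
import heapq


def shortest_path(graph, vertex_cost, source, target):
    """Cheapest path where cost = sum of vertex costs + sum of edge costs.

    graph: {u: [(v, edge_cost), ...]}; all costs must be nonnegative.
    Returns (total_cost, path), or (inf, []) if target is unreachable.
    """
    dist = {source: vertex_cost[source]}
    prev = {}
    heap = [(dist[source], source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w + vertex_cost[v]   # pay for the edge and for entering v
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


graph = {"s": [("a", 1), ("b", 4)], "a": [("t", 5)], "b": [("t", 1)]}
vertex_cost = {"s": 0, "a": 2, "b": 0, "t": 1}
print(shortest_path(graph, vertex_cost, "s", "t"))  # (6, ['s', 'b', 't'])
```

Folding each vertex cost into the relaxation step avoids having to split every vertex into an in-node and an out-node, and Dijkstra remains valid because all costs are nonnegative.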
ISBN (print): 9780889866386
For search-intensive applications such as data mining and bioinformatics, a SIMD processor array on a chip may be an effective architecture, and if the application is control-intensive, a Multiple SIMD (MSIMD) architecture may further increase processor utilization. In this paper, we describe the implementation of an associative MSIMD architecture on the MASC processor. The MASC processor, implemented using FPGAs, is easily scalable and dynamically assigns tasks to Processing Elements as the program executes.
This paper presents an efficient IrisCode classifier, built from phase features, that uses AdaBoost to select the bandwidths of the Gabor wavelets. The final iris classifier consists of a weighted contribution of weak classifiers. As weak classifiers we use three-split decision trees that identify a candidate based on the Levenshtein distance between phase vectors of the respective iris images. Our experiments show that the Levenshtein distance discriminates better than the Hamming distance when comparing IrisCodes. Our process also differs from existing methods because the wavelengths of the Gabor filters used, and their final weights in the decision function, are chosen from the robust final classifier instead of being fixed and/or limited by the programmer, thus yielding higher iris recognition rates. A pyramidal strategy for cascading filters with increasing complexity makes the system suitable for real-time operation. We have designed a processor array to accelerate the computation of the Levenshtein distance. The processing elements are simple basic cells, interconnected by relatively short paths, which makes the array suitable for a VLSI implementation.
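The difference between the two distances can be seen in a small sketch: the Levenshtein distance tolerates insertions and deletions, so a code that is merely shifted by one position stays close, while the Hamming distance counts every misaligned position. The bit strings below are made-up examples, not IrisCode data, and the function names are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


code = "0110100"
shifted = "1101000"                  # same pattern shifted left by one position
print(hamming(code, shifted))        # 4
print(levenshtein(code, shifted))    # 2
```

Each row of the table depends only on the previous row, and cells along an anti-diagonal are mutually independent, which is why the computation maps naturally onto an array of simple, locally connected cells.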
A simple parallel algorithm for generating N-ary reflected Gray codes is presented. The algorithm is derived from the pattern of N-ary reflected Gray codes and runs on a linear processor array with a reconfigurable bus system, that is, a bus system whose configuration can be changed dynamically. Recently, processor arrays with reconfigurable bus systems have been used to solve many problems in constant time. Experimental reconfigurable chips already exist.
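The reflection pattern that the algorithm exploits can be shown with a short sequential construction: the codewords for k digits are built from those for k-1 digits, traversed forward under even leading digits and backward under odd ones. This is a plain recursive sketch, not the parallel reconfigurable-bus algorithm; the function name and the tuple representation are assumptions.

```python
def nary_reflected_gray(n, k):
    """All k-digit base-n reflected Gray codewords, in order, as tuples."""
    if k == 0:
        return [()]
    shorter = nary_reflected_gray(n, k - 1)
    codes = []
    for lead in range(n):
        # Reflect (reverse) the shorter sequence under every odd leading digit.
        block = shorter if lead % 2 == 0 else shorter[::-1]
        codes.extend((lead,) + tail for tail in block)
    return codes


for word in nary_reflected_gray(3, 2):
    print(word)
# (0, 0) (0, 1) (0, 2) (1, 2) (1, 1) (1, 0) (2, 0) (2, 1) (2, 2), one per line
```

Because the tail sequence reverses direction at each change of the leading digit, consecutive codewords always differ in exactly one digit, and that digit changes by one.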
An $O(1)$-time algorithm for string matching is designed on a two-dimensional $(n-m+1) \times n$ processor array with a reconfigurable bus system, where $n$ and $m$ are the lengths of the text and the pattern, respectively.
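One plausible reading of the array dimensions is that each of the n-m+1 rows tests one alignment of the pattern against the text, with the cells of a row comparing individual characters and the row reporting a match only if all of its comparisons agree. The sequential sketch below models that mapping; the mapping itself, the function name, and the test strings are assumptions, since the abstract gives no further detail.

```python
def match_positions(text, pattern):
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    hits = []
    for i in range(n - m + 1):            # one array row per alignment i
        # Cells j = i .. i+m-1 of row i each compare one character pair.
        row = [text[j] == pattern[j - i] for j in range(i, i + m)]
        if all(row):                      # row-wide AND of the comparisons
            hits.append(i)
    return hits


print(match_positions("abracadabra", "abra"))  # [0, 7]
```

On the processor array every comparison would happen simultaneously and the reconfigurable bus would combine each row's results, whereas this model simply visits the same cells one after another.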