Search Results - Inner Mongolia University Library

34th International Conference on Field-Programmable Logic and Applications (FPL)

Authors: Kabir, M. D. Arafat; Kamucheka, Tendayi; Fredricks, Nathaniel; Mandebi, Joel; Bakos, Jason; Huang, Miaoqing; Andrews, David. Affiliations: Univ Arkansas Dept Elect Engn & Comp Sci Fayetteville AR 72701 USA; Univ South Carolina Dept Comp Sci & Engn Columbia SC USA; Adv Micro Devices Inc AMD Santa Clara CA USA

ISBN: (print) 9798331530082; 9798331530075

Processor-in-Memory (PIM) overlays and alternative reconfigurable tile fabrics have been proposed to eliminate the von Neumann bottleneck and enable processing performance to scale with BRAM capacity. The performance of these FPGA-based PIM architectures has been limited by a reduction of the BRAM's maximum clock frequency and by less-than-ideal scaling of processing elements with increased BRAM capacity. This paper presents IMAGine, an In-Memory Accelerated GEMV engine, a PIM-array accelerator that clocks at the maximum frequency of the BRAM and scales to 100% of the available BRAMs. Comparative analyses show execution speedups over existing PIM-based GEMV engines on FPGAs and a 2.65x to 3.2x faster clock. An AMD Alveo U55 implementation is presented that achieves a system clock speed of 737 MHz, providing 64K bit-serial multiply-accumulate (MAC) units for GEMV operation. This establishes IMAGine as the fastest PIM-based GEMV overlay, outperforming even the custom PIM-based FPGA accelerators reported to date. Additionally, it surpasses TPU v1-v2 and Alibaba Hanguang 800 in clock speed while offering an equal or greater number of MAC units.

Keywords: Processing-in-Memory System Design Block RAM GEMV engine processor array

A Scalable JPEG Encoder on a Many-Core Array

16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Authors: Abbott, Thomas; Baas, Bevan. Affiliation: Univ Calif Davis ECE Dept Davis CA 95616 USA

ISBN: (print) 9798350393613

JPEG is a widely used image compression algorithm. A 31-core JPEG encoder and a scalable family of JPEG encoders were developed for a fine-grain many-core processor array and are measured on the 32 nm KiloCore chip. These implementations are compared against JPEG implementations on three high-level-language-programmable chip platforms: an Nvidia A100 GPU with an Intel Platinum 8168 (nvJPEG), a TI C66x embedded processor, and an Intel i9-9900 processor (libjpeg-turbo). In addition, an HDL-configurable Xilinx Zynq-7000 FPGA (VISENGI) was compared. All results are scaled to 32 nm CMOS, and throughput per area, energy per megapixel encoded, and the comprehensive energy x delay metrics are compared. The KiloCore designs achieve the lowest chip area and the highest normalized throughput per chip area. Among the chips that are programmable in a high-level programming language, the KiloCore achieves up to 4.33x, 31.4x, and 46.4x lower energy dissipation and up to 4.81x, 7,845x, and 14,510x lower energy x delay than the TI, Intel, and Nvidia platforms, respectively.

Keywords: many-core JPEG encoder processor array

Scalable boolean matrix multiplication with applications on optical buses

5th Joint Conference on Information Sciences (JCIS 2000)

Author: Li, KQ. Affiliation: SUNY Coll New Paltz Dept Math & Comp Sci New Paltz NY 12561 USA

ISBN: (print) 0964345692

It is known that by parallelizing the Four Russians' algorithm, whose sequential time complexity is $O(N^3/\log N)$, the product of two $N \times N$ boolean matrices can be calculated in constant time on a linear array with a reconfigurable pipelined bus system (LARPBS) with $N^3/\log N$ processors, where all communications and computations are performed on the bit level. In this paper, we consider the scalability of the above parallelization. We show that the product of two $N \times N$ boolean matrices can be calculated in $O(N^3/(p \log p))$ time on a $p$-processor LARPBS, for all $1 \le p \le N^3/\log N$. Such a parallelization is cost-optimal and achieves linear speedup if $p$ is at least $\Omega(N^\epsilon)$ for any $\epsilon > 0$. Our scalable parallelization of the Four Russians' algorithm is applied to solve a number of graph theory problems.

Keywords: Boolean matrix multiplication optical bus processor array reconfigurability scalability strong component transitive closure
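
The cost-optimality claim in the abstract above can be checked with a short calculation. This is a sketch that takes the stated bounds at face value and uses the sequential $O(N^3/\log N)$ bound as the baseline; it is not reproduced from the paper.

```latex
\begin{align*}
\text{cost} &= p \cdot O\!\left(\frac{N^3}{p \log p}\right)
             = O\!\left(\frac{N^3}{\log p}\right), \\
p = \Omega(N^{\epsilon}) &\implies \log p = \Omega(\epsilon \log N)
  \implies \text{cost} = O\!\left(\frac{N^3}{\log N}\right), \\
\text{speedup} &= \frac{N^3/\log N}{N^3/(p \log p)}
                = \frac{p \log p}{\log N} = \Theta(p),
  \quad \text{since } \epsilon \log N \lesssim \log p \le 3 \log N .
\end{align*}
```

The upper bound $\log p \le 3 \log N$ follows from $p \le N^3/\log N$, so for fixed $\epsilon$ the ratio $\log p / \log N$ stays between two constants.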

Solving the longest common subsequence (LCS) problem using the associative ASC processor with reconfigurable 2D mesh

18th IASTED International Conference on Parallel and Distributed Computing and Systems

Authors: Virdi, Sabegh Singh; Wang, Hong; Walker, Robert A. Affiliation: Kent State Univ Dept Comp Sci Kent OH 44242 USA

ISBN: (print) 9780889866386

As new genes are sequenced, it is common for molecular biologists to compare the new gene's DNA to known sequences. One simple form of DNA sequence comparison is done by solving the Longest Common Subsequence (LCS) problem. In this paper, we propose a parallel algorithm and specialized FPGA-based processor (the associative ASC processor with reconfigurable 2D mesh) to solve the exact and approximate match LCS problems. This solution uses inexpensive hardware and can be reconfigured as new analysis techniques are developed, making it particularly attractive for processing biosequences.

Keywords: SIMD associative computing processor array longest common subsequence sequence analysis biosequences
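
For readers unfamiliar with the LCS problem named in this record, the following is a minimal sketch of the textbook sequential dynamic-programming solution for the exact-match case. It is not the parallel associative algorithm the paper proposes, which the abstract does not describe; the function name `lcs_length` is illustrative.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Textbook row-by-row dynamic programme; only the previous row is kept,
    so memory use is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]  # column 0: LCS with an empty prefix of b is 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either sequence, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, `lcs_length("AGGTAB", "GXTXAYB")` returns 4, corresponding to the common subsequence `GTAB`.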

Locating High Speed Multiple Objects using a SCAMP-5 Vision-Chip

13th International Workshop on Cellular Nanoscale Networks and their Applications (CNNA)

Authors: Carey, Stephen J.; Barr, David R. W.; Wang, Bin; Lopich, Alexey; Dudek, Piotr. Affiliation: Univ Manchester Sch Elect Engn & Elect Manchester M13 7PL Lancs England

ISBN: (print) 9781467302890

Presented in this paper is a demonstration system that uses a low-power SCAMP-5 256x256 vision-chip to locate and count multiple objects moving at high speed along arbitrary trajectories. The hardware consists of a SCAMP-5 IC, its power supply system and a Xilinx Spartan3 controller. At 100,000 fps, the SCAMP-5 chip can locate and read out the coordinates of a single closed-shaped object amongst clutter. At 25,000 fps, the IC can read out the coordinates of 5 objects.

Keywords: Vision Chip processor array SIMD Smart Sensors

Constructing Compact Logical Arrays under Flexible Rerouting Schemes

15th IEEE International Conference on High Performance Computing and Communications (HPCC) /11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC)

Authors: Jiang, Guiyuan; Wu, Jigang; Sun, Jizhou; Gao, Yiyi. Affiliations: Tianjin Univ Sch Comp Sci & Technol Tianjin 300072 Peoples R China; Tianjin Polytech Univ Sch Comp Sci & Software Engn Tianjin 300387 Peoples R China

ISBN: (print) 9780769550886

In a multiprocessor array, some processing elements (PEs) fail to function normally due to hardware defects or soft faults caused by overheating, overload or occupancy by other running applications. Fault-tolerant reconfiguration reorganizes fault-free PEs into a logical topology by changing the interconnection among PEs. This paper develops an efficient heuristic algorithm, denoted as CLA, to construct a maximum logical array (MLA) with short interconnects under flexible rerouting schemes. In CLA, two MLAs are generated using an existing algorithm, FLX, and are then used to produce the target logical array. The middle column of the target logical array is generated by forming the straightest column on an area bounded by two logical columns of the two MLAs. Other columns are generated by forming compact columns on relative areas. The problem of finding a compact logical column on a given area is solved by modeling it as a shortest path problem on a weighted directed graph in which both vertices and edges are associated with nonnegative costs. Experimental results validate the efficiency of the proposed algorithm. For 128 x 128 host arrays with 40% unavailable PEs, the proposed approach improves on the existing algorithm by up to 44% in terms of interconnection length. In addition, the improvement increases with increasing fault density, implying that CLA is more scalable than the existing algorithm.

Keywords: processor array reconfiguration compact array interconnection length interconnection networks
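
The abstract above reduces column construction to a shortest path problem in which both vertices and edges carry nonnegative costs. A minimal sketch of one standard way to handle that is shown here: Dijkstra's algorithm with each vertex cost charged when the vertex is entered. The graph encoding and the name `shortest_path_cost` are assumptions for illustration; the paper's actual graph model and algorithm are not given in the abstract.

```python
import heapq


def shortest_path_cost(edges, vertex_cost, source, target):
    """Minimum total cost from source to target, counting vertex and edge costs.

    edges: dict mapping a vertex to a list of (neighbor, edge_cost) pairs.
    vertex_cost: dict mapping every vertex to its nonnegative cost.
    Returns None if target is unreachable.
    """
    dist = {source: vertex_cost[source]}
    heap = [(dist[source], source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a cheaper route to u was already found
        for v, w in edges.get(u, []):
            # Entering v costs the edge weight plus v's own vertex cost.
            nd = d + w + vertex_cost[v]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None
```

Folding the vertex cost into the relaxation step keeps all effective weights nonnegative, so the usual correctness argument for Dijkstra's algorithm still applies.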

Implementing a multiple-instruction-stream associative MASC processor

18th IASTED International Conference on Parallel and Distributed Computing and Systems

Authors: Wang, Hong; Walker, Robert A. Affiliations: Univ Toledo Dept Engn Technol 2801 W Bancroft St Toledo OH 43606 USA; Kent State Univ Dept Comp Sci Kent OH 44242 USA

ISBN: (print) 9780889866386

For search-intensive applications such as data mining and bioinformatics, a SIMD processor array on a chip may be an effective architecture, and if the application is control-intensive, a Multiple SIMD (MSIMD) architecture may further increase processor utilization. In this paper, we describe the implementation of an associative MSIMD architecture on the MASC processor. The MASC processor, implemented using FPGAs, is easily scalable and dynamically assigns tasks to Processing Elements as the program executes.

Keywords: SIMD multiple SIMD (MSIMD) associative computing processor array processor design FPGA

IRIS RECOGNITION USING ADABOOST AND LEVENSHTEIN DISTANCES

INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2012, Vol. 26, No. 2, pp. 1266001-1266001

Authors: Climent, Joan; Hexsel, Roberto A. Affiliations: Univ Politecn Cataluna Comp Engn & Automat Control Dept Barcelona Spain; Univ Fed Parana UFPR Dept Informat BR-80060000 Curitiba Parana Brazil

This paper presents an efficient IrisCode classifier built from phase features, which uses AdaBoost to select the bandwidths of Gabor wavelets. The final iris classifier consists of a weighted contribution of weak classifiers. As weak classifiers we use three-split decision trees that identify a candidate based on the Levenshtein distance between phase vectors of the respective iris images. Our experiments show that the Levenshtein distance discriminates better than the Hamming distance when comparing IrisCodes. Our process also differs from existing methods because the wavelengths of the Gabor filters used, and their final weights in the decision function, are chosen from the robust final classifier instead of being fixed and/or limited by the programmer, thus yielding higher iris recognition rates. A pyramidal strategy for cascading filters with increasing complexity makes the system suitable for real-time operation. We have designed a processor array to accelerate the computation of the Levenshtein distance. The processing elements are simple basic cells, interconnected by relatively short paths, which makes the array suitable for a VLSI implementation.

Keywords: Iris recognition AdaBoost biometrics Levenshtein distance string matching processor array
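
To make the comparison between Levenshtein and Hamming distance in this abstract concrete, here is a minimal sketch of both measures on plain strings. It is the textbook sequential formulation, not the authors' processor-array design, and treating codes as character strings is an assumption made for illustration.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(cs != ct for cs, ct in zip(s, t))


def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions that turn s into t (two-row dynamic programme)."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

A one-position shift illustrates the difference: for `"0110110"` and `"1101100"` the Hamming distance is 4, while the Levenshtein distance is 2 (delete the leading character, append one at the end).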

A PARALLEL ALGORITHM TO GENERATE N-ARY REFLECTED GRAY CODES IN A LINEAR ARRAY WITH RECONFIGURABLE BUS SYSTEM

Parallel Processing Letters, 1993, Vol. 3, No. 2, pp. 157-164

Authors: P. THANGAVEL; V.P. MUTHUSWAMY. Affiliations: Department of Mathematics Bharathidasan University Tiruchirapalli-620 024 India; Department of Mathematics and Computer Applications Regional Engineering College Tiruchirapalli-620 015 India

A simple parallel algorithm for generating N-ary reflected Gray codes is presented. The algorithm is derived from the pattern of N-ary reflected Gray codes. The algorithm runs on a linear processor array with a reconfigurable bus system. A reconfigurable bus system is a bus system whose configuration can be dynamically changed. Recently, processor arrays with reconfigurable bus systems have been used to solve many problems in constant time. Experimental reconfigurable chips already exist.

Keywords: processor array reconfigurable bus system Gray code
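
As background for this record, the following minimal sketch generates N-ary reflected Gray codes sequentially from the digit pattern: each digit counts up and down alternately as the more significant digits advance. It is not the parallel reconfigurable-bus algorithm of the paper, which the abstract does not spell out; the function names are illustrative.

```python
def reflected_gray_codeword(index: int, base: int, digits: int) -> tuple:
    """Codeword number `index` of the base-`base` reflected Gray code,
    most significant digit first."""
    word = []
    for i in range(digits):
        d = (index // base**i) % base
        # Digit i runs backwards whenever the block it belongs to is odd-numbered.
        if (index // base ** (i + 1)) % 2 == 1:
            d = base - 1 - d
        word.append(d)
    return tuple(reversed(word))


def reflected_gray_code(base: int, digits: int) -> list:
    """All base**digits codewords in Gray order."""
    return [reflected_gray_codeword(x, base, digits)
            for x in range(base**digits)]
```

With `base=3` and `digits=2` the sequence is 00, 01, 02, 12, 11, 10, 20, 21, 22: successive codewords differ in exactly one digit, by exactly one. Because each codeword depends only on its own index, the codewords can in principle be computed independently of one another.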

An O(1) time algorithm for string matching

International Journal of Computer Mathematics, 1992, Vol. 42, No. 3-4, pp. 185-191

Author: Gen-Huey Chen

An O(1) time algorithm for string matching is designed on a two-dimensional (n-m+1) x n processor array with a reconfigurable bus system, where n and m are the lengths of the text and pattern, respectively.

Keywords: processor array reconfigurable bus system string matching F.2.2 G.1.0
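
The array dimensions in this abstract suggest one row per candidate alignment of the pattern against the text. The sketch below simulates that layout sequentially, under the assumption (not stated in the abstract) that cell (i, j) compares text position j with pattern position j - i and that each row combines its results with a logical AND. On the actual hardware those steps would run in parallel, with the bus system doing the combining; the name `match_positions` is illustrative.

```python
def match_positions(text: str, pattern: str) -> list:
    """Start indices at which pattern occurs in text.

    Sequential simulation of an (n - m + 1) x n grid: row i tests the
    alignment that places the pattern at text position i.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    hits = []
    for i in range(n - m + 1):  # one row per candidate alignment
        # Each comparison is independent of the others in the row.
        row = [text[j] == pattern[j - i] for j in range(i, i + m)]
        if all(row):  # row-wide AND of the comparison bits
            hits.append(i)
    return hits
```

For example, `match_positions("abracadabra", "abra")` returns `[0, 7]`.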
