Field-programmable gate arrays (FPGAs) utilize multiple programmable elements and non-programmable blocks. After synthesizing an input Hardware Description Language (HDL) design into a circuit, optimizations are used to di...
ISBN:
(Print) 9781450361378
In this paper we describe Xilinx's Versal Adaptive Compute Acceleration Platform (ACAP). ACAP is a hybrid compute platform that tightly integrates traditional FPGA programmable fabric, software-programmable processors, and software-programmable accelerator engines. ACAP improves over the programmability of traditional reconfigurable platforms by introducing newer compute models in the form of software-programmable accelerators and by separating the data-movement architecture from the compute architecture. The Versal architecture includes a host of new capabilities, including a chip-pervasive programmable Network-on-Chip (NoC), Imux registers, a compute shell, more advanced SSIT, adaptive deskew of global clocks, faster configuration, and other new programmable elements, as well as enhancements to the CLB and interconnect. We discuss these architectural developments and highlight their key motivations and differences in relation to traditional FPGA architectures.
ISBN:
(Print) 9781450356145
A frame object detection problem consists of two problems: one is a regression problem for spatially separated bounding boxes, and the second is the classification of the associated objects, both within a real-time frame rate. Object detection is widely used in embedded systems such as robotics, autonomous driving, security, and drones, all of which require high performance and low power consumption. This paper implements the YOLO (You Only Look Once) object detector, which is fast and accurate, on an FPGA. It is based on a deep convolutional neural network (CNN), which dominates both the performance and the area. However, an object detector based on a CNN consists of a bounding-box prediction (regression) and a class estimation (classification), so a conventional fully binarized CNN fails to recognize objects in most cases. In this paper, we propose a lightweight YOLOv2, which consists of a binarized CNN for feature extraction and parallel support vector regression (SVR) for both classification and localization. To our knowledge, this is the first time binarized CNNs have been successfully used in object detection. We implement a pipeline-based architecture for the lightweight YOLOv2 on the Xilinx ZCU102 board, which carries a Xilinx Zynq UltraScale+ MPSoC. The implemented object detector achieved 40.81 frames per second (FPS). Compared with the ARM Cortex-A57, it was 177.4 times faster, dissipated 1.1 times more power, and its performance per watt was 158.9 times better. Compared with the NVIDIA Pascal embedded GPU, it was 27.5 times faster, dissipated 1.5 times less power, and its performance per watt was 42.9 times better. Thus, our method is suitable as a frame object detector for an embedded vision system.
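As a concrete illustration of why the parallel SVR head maps well to hardware: at inference time a bank of linear SVRs is just one matrix product, so every regressor for every grid cell can run concurrently. The sketch below is a minimal Python/NumPy rendering under assumed shapes (a 13x13 grid, 256-dimensional features, 5 boxes with 4 coordinates and 20 classes); it is not the authors' implementation.

import numpy as np

# Hypothetical shapes: 13x13 grid, 256-dim features per cell,
# 5 boxes x (4 box coordinates + 20 class scores) outputs per cell.
GRID, FEAT, OUT = 13 * 13, 256, 5 * (4 + 20)

rng = np.random.default_rng(0)
features = rng.standard_normal((GRID, FEAT))  # stand-in for binarized-CNN features
W = rng.standard_normal((FEAT, OUT)) * 0.01   # stand-in for trained SVR weights
b = np.zeros(OUT)                             # SVR biases

# Every linear SVR is one dot product, so the whole bank is one matrix
# multiply -- the data parallelism a pipelined FPGA design can exploit.
predictions = features @ W + b                # shape (169, 120)
print(predictions.shape)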
ISBN:
(Print) 9781450356145
Three-dimensional convolutional neural networks (3D CNNs) are used effectively in many computer vision applications. Most previous work in this area has concentrated on designing and optimizing accelerators for 2D CNNs, with few attempts made to accelerate 3D CNNs on FPGAs. We find accelerating 3D CNNs on FPGAs to be challenging due to their high computational complexity and storage demands. More importantly, although the computation patterns of 2D and 3D CNNs are analogous, the conventional approaches adopted for accelerating 2D CNNs may be unfit for 3D CNN acceleration. In this paper, in order to accelerate 2D and 3D CNNs within a uniform framework, we propose a uniform template-based architecture that uses templates based on the Winograd algorithm to enable fast development of 2D and 3D CNN accelerators. Furthermore, we develop a uniform analytical model to facilitate efficient design-space exploration of 2D and 3D CNN accelerators based on our architecture. Finally, we demonstrate the effectiveness of the template-based architecture by implementing accelerators for real-life 2D and 3D CNNs (VGG16 and C3D) on multiple FPGA platforms. On an S2C VUS440, we achieve up to 1.13 TOPS and 1.11 TOPS under low resource utilization for VGG16 and C3D, respectively. End-to-end comparisons with CPU and GPU solutions demonstrate that our implementation of C3D achieves gains of up to 13x and 60x in performance and energy relative to a CPU solution, and a 6.4x energy-efficiency gain over a GPU solution.
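To make the Winograd template concrete: the 1D minimal-filtering algorithm F(2,3), a standard building block of such accelerators, computes two outputs of a 3-tap filter with 4 multiplications instead of 6. Below is a minimal NumPy sketch using the standard transform matrices; it is illustrative only, not the paper's templates.

import numpy as np

# Standard F(2,3) Winograd transforms: Y = A^T [(G g) * (B^T d)].
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # one input tile of 4 samples
g = np.array([0.5, 0.25, 0.125])     # 3-tap filter

winograd = AT @ ((G @ g) * (BT @ d))            # 4 elementwise multiplies
direct = np.convolve(d, g[::-1], mode='valid')  # reference: 6 multiplies
print(winograd, direct)                          # identical: [1.375 2.25]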
ISBN:
(Print) 9781450356145
We propose a framework to generate highly efficient accelerators for CNN inference on FPGAs. Our framework consists of multiple algorithmic optimizations for computation-complexity and communication-volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic Verilog generation. The algorithmic optimizations improve the throughput of frequency-domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique is well known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves on OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP, used in conjunction with OaA, enables us to choose a fixed FFT size at design time and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency-domain loop-tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device through fast design-space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size, and external memory bandwidth in a device coefficient, and we identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable Verilog. We demonstrate the framework by generating high-throughput accelerators for state-of-the-art CNN models on the Intel HARP heterogeneous platform. Using our framework, we achieve throughputs of 780.6 GOPS, 669.1 GOPS, and 552.1 GOPS for AlexNet, VGG16, and FCN-16s, respectively. These correspond to 6.8x (AlexNet) and 4.9x (VGG16) improvements over state-of-the-art implementations.
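For reference, the Overlap-and-Add baseline that CaP improves on works as follows in 1D: the input is cut into blocks of L = N - K + 1 samples so that each block's linear convolution with the K-tap kernel fits in a fixed N-point FFT, and the per-block results are summed at overlapping offsets. A minimal NumPy sketch with an assumed N = 16 (the paper's 2D, hardware version differs):

import numpy as np

def overlap_add(x, h, N=16):
    K = len(h)
    L = N - K + 1                    # samples consumed per block
    H = np.fft.rfft(h, N)            # kernel transformed once, reused for all blocks
    y = np.zeros(len(x) + K - 1)
    for start in range(0, len(x), L):
        block = x[start:start + L]
        Y = np.fft.rfft(block, N) * H          # pointwise product in frequency domain
        out_len = len(block) + K - 1           # linear-convolution length of this block
        y[start:start + out_len] += np.fft.irfft(Y, N)[:out_len]  # overlap-add the tail
    return y

x = np.random.default_rng(1).standard_normal(100)
h = np.array([0.25, 0.5, 0.25])
print(np.allclose(overlap_add(x, h), np.convolve(x, h)))  # True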
ISBN:
(Print) 9781450356145
Recently, significant accuracy improvements have been achieved for acoustic recognition systems by increasing the model size of Long Short-Term Memory (LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads to inefficient designs on FPGAs due to limited on-chip resources. Previous work proposes a pruning-based compression technique to reduce the model size and thus speed up inference on FPGAs. However, the random nature of pruning transforms the dense matrices of the model into highly unstructured sparse ones, which leads to unbalanced computation and irregular memory accesses and thus hurts overall performance and energy efficiency. In contrast, we propose a structured compression technique that not only reduces the LSTM model size but also eliminates the irregularities in computation and memory accesses. This approach employs block-circulant rather than sparse matrices to compress the weight matrices, reducing the storage requirement from O(k^2) to O(k). The Fast Fourier Transform algorithm is utilized to further accelerate inference by reducing the computational complexity from O(k^2) to O(k log k). The datapath and activation functions are quantized to 16 bits to improve resource utilization. More importantly, we propose a comprehensive framework called C-LSTM to automatically optimize and implement a wide range of LSTM variants on FPGAs. According to the experimental results, C-LSTM achieves up to 18.8X and 33.5X gains in performance and energy efficiency, respectively, compared with the state-of-the-art LSTM implementation under the same experimental setup, with very small accuracy degradation.
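The complexity reduction rests on a standard property of circulant matrices: a k x k circulant block is fully determined by its first column c, and multiplying it by a vector is a circular convolution, computable with FFTs in O(k log k). A minimal NumPy sketch of this trick (illustrative only, not the C-LSTM datapath):

import numpy as np

def circulant_matvec(c, x):
    # Circulant-by-vector product via FFT: O(k log k) instead of O(k^2),
    # and only the k-element first column c is stored instead of k^2 weights.
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(x), len(c))

k = 8
rng = np.random.default_rng(2)
c, x = rng.standard_normal(k), rng.standard_normal(k)

# Dense reference: column j of the circulant matrix is c rolled down by j.
C = np.stack([np.roll(c, j) for j in range(k)], axis=1)
print(np.allclose(circulant_matvec(c, x), C @ x))  # True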
ISBN:
(Print) 9781450356145
Sparse matrix factorization using Stochastic Gradient Descent (SGD) is a popular technique for deriving latent features from observations. SGD is widely used for Collaborative Filtering (CF), itself a well-known machine learning technique for recommender systems. In this paper, we develop an FPGA-based accelerator, FASTCF, to accelerate the SGD-based CF algorithm. FASTCF consists of parallel, pipelined processing units that concurrently process distinct user ratings by accessing a shared on-chip buffer. We design FASTCF through a holistic analysis of the specific design challenges in accelerating SGD-based CF on FPGAs. Based on this analysis, we develop a bipartite graph processing approach with a novel 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of on-chip feature-vector data to significantly accelerate the processing of this bipartite graph. First, we develop a fast heuristic to partition the input graph into induced subgraphs; this enables FASTCF to efficiently buffer vertex data for reuse and completely hide communication overhead. Second, we partition all the edges of each subgraph into matchings to extract the maximum parallelism. Third, we schedule the execution of the edges inside each matching so as to reduce concurrent memory-access conflicts on the shared on-chip buffer. Compared with non-optimized baseline designs, the hierarchical partitioning approach results in up to 60x data-dependency reduction, 4.2x bank-conflict reduction, and 15.4x speedup. We implement FASTCF on a state-of-the-art FPGA and evaluate its performance using three large real-life datasets. Experimental results show that FASTCF sustains a high throughput of up to 217 billion floating-point operations per second (GFLOPS). Compared with state-of-the-art multi-core and GPU implementations, FASTCF demonstrates 13.3x and 12.7x speedups, respectively.
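For context, the per-rating SGD update that FASTCF parallelizes is short: given an observed rating r for user u and item v, refine the two latent feature vectors against the prediction error. Ratings whose users and items are all distinct (a matching, in the paper's terms) touch disjoint vectors and can safely update concurrently. A minimal NumPy sketch with assumed hyperparameters (lr, reg) and dimensions:

import numpy as np

def sgd_step(P, Q, u, v, r, lr=0.05, reg=0.02):
    e = r - P[u] @ Q[v]                      # prediction error for this rating
    pu, qv = P[u].copy(), Q[v].copy()        # use old values for both updates
    P[u] += lr * (e * qv - reg * pu)         # gradient step on user features
    Q[v] += lr * (e * pu - reg * qv)         # gradient step on item features
    return e

rng = np.random.default_rng(3)
P = rng.standard_normal((100, 16)) * 0.1     # user latent-feature vectors
Q = rng.standard_normal((50, 16)) * 0.1      # item latent-feature vectors
for _ in range(500):
    sgd_step(P, Q, u=7, v=3, r=4.0)
print(P[7] @ Q[3])                           # converges near the rating 4.0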
ISBN:
(Print) 9781450361378
Field-programmable gate arrays (FPGAs) are becoming a promising choice as a heterogeneous computing component for scientific computing now that floating-point-optimized architectures are being added to current FPGAs. Maturing high-level synthesis (HLS) tools, such as the Intel FPGA SDK for OpenCL, provide a streamlined design flow that facilitates developing parallel applications on FPGAs. In this paper, we evaluate and optimize the OpenCL implementations of three nuclear reactor simulation applications (XSBench, RSBench, and the SimpleMOC kernel) on a heterogeneous computing platform that consists of a general-purpose CPU and an FPGA. We introduce the applications and describe their OpenCL implementations and optimization methods on an Arria 10-based FPGA platform. Compared with the baseline kernel implementations, our optimizations increase the performance of the three kernels by factors of 35, 295, and 102, respectively. We compare the performance, power, and performance per watt of the three applications on an Intel Xeon 16-core CPU, an NVIDIA Tesla K80 GPU, and an Intel Arria 10 GX1150 FPGA. The performance per watt on the FPGA is competitive. For XSBench, the performance per watt on the FPGA is 1.43X higher than on the CPU and 2.58X lower than on the GPU. For RSBench, it is 3.6X higher than on the CPU and 5.8X lower than on the GPU. For the SimpleMOC kernel, it is 1.74X higher than on the CPU and 1.65X lower than on the GPU.
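For clarity on the comparison metric: performance per watt is simply measured throughput divided by measured power, and the ratios quoted above are ratios of that quantity across devices. A trivial Python sketch with made-up placeholder numbers (not the paper's measurements):

def perf_per_watt(throughput, power_watts):
    # Throughput in any fixed unit (e.g., lookups/s) divided by average power.
    return throughput / power_watts

# Placeholder values only -- a device with lower raw throughput can still
# win on performance per watt if its power draw is low enough.
cpu = perf_per_watt(throughput=100.0, power_watts=150.0)
fpga = perf_per_watt(throughput=50.0, power_watts=25.0)
print(f"FPGA/CPU perf-per-watt ratio: {fpga / cpu:.2f}x")  # 3.00x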
ISBN:
(Print) 9781450356145
General Matrix to Matrix multiplication (GEMM) is the cornerstone for a wide gamut of applications in high-performance computing (HPC), scientific computing (SC), and, more recently, deep learning. In this work, we present a customizable matrix-multiplication framework for the Intel HARPv2 CPU+FPGA platform that supports both traditional single-precision floating-point and reduced-precision workloads. Our framework supports arbitrary-size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software, and (2) a highly customizable hardware template. The API provides both compile-time and runtime options for controlling key aspects of the hardware template, including dynamic precision switching; interleaving and block-size control; and fused deep-learning-specific operations. The framework currently supports single-precision floating point (FP32); 16-, 8-, 4-, and 2-bit integer and fixed point (INT16, INT8, INT4, INT2); and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, and BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by the optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet, and ResNet), we illustrate that reduced-precision representations such as binary achieve the best performance, and that HARPv2 enables fine-grained partitioning of computations over both the Xeon and the FPGA. We observe up to 50x improvement in execution time compared with single-precision floating point, and runtime configuration options can improve the efficiency of certain layers in AlexNet by up to 4x, achieving an overall 1.3x improvement over the entire network.
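To illustrate the two central ideas, tiling and reduced precision, in software terms: a blocked GEMM accumulates one tile at a time (mirroring what a hardware template does spatially), and an INT8 path quantizes the operands, multiplies with wide accumulation, and rescales. The NumPy sketch below assumes a naive symmetric quantization scheme and block size; the framework's actual hardware datapath is far more elaborate.

import numpy as np

def gemm_blocked(A, B, block=64):
    # Tiled matrix multiply: accumulate C one (block x block) tile at a time.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, block):
        for j in range(0, N, block):
            for k in range(0, K, block):
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
    return C

def quantize_int8(X):
    # Naive symmetric quantization (an assumption, not the framework's scheme).
    scale = np.abs(X).max() / 127.0
    return np.round(X / scale).astype(np.int8), scale

rng = np.random.default_rng(4)
A, B = rng.standard_normal((128, 128)), rng.standard_normal((128, 128))
Aq, sa = quantize_int8(A)
Bq, sb = quantize_int8(B)
# INT8 multiplies with 32-bit accumulation, rescaled back to floating point.
C_int8 = (Aq.astype(np.int32) @ Bq.astype(np.int32)) * (sa * sb)
print(np.abs(C_int8 - gemm_blocked(A, B)).max())  # small quantization error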
ISBN:
(Print) 9781450361378
The reconfigurability, energy efficiency, and massive parallelism of FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-the-art implementations seldom consider the balance between high computational throughput and the ability of the memory subsystem to support it. In this paper, we implement a framework on FPGAs that combines sparse Winograd convolution, clusters of small-scale systolic arrays, and a tailored recursive Z-Morton memory layout. We also provide an analytical model for the general Winograd convolution algorithm as a design reference. Experimental results on various CNN models show that our design achieves very high compute-resource utilization, 20x~30x energy efficiency, and more than 5x speedup compared with a dense implementation.
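For background on the layout choice: a Z-Morton (Z-order) index interleaves the bits of the row and column coordinates, so tiles that are close in 2D stay close in linear memory, which suits the recursive, tiled access pattern of Winograd convolution. A minimal Python sketch of the encoding (illustrative; the paper's tailored layout adds more structure on top of it):

def morton_encode(row, col, bits=16):
    # Interleave the bits of (row, col): row bits land at odd positions,
    # col bits at even positions of the linear index.
    z = 0
    for i in range(bits):
        z |= ((row >> i) & 1) << (2 * i + 1)
        z |= ((col >> i) & 1) << (2 * i)
    return z

# Prints the classic 4x4 Z-order pattern:
#  0  1  4  5
#  2  3  6  7
#  8  9 12 13
# 10 11 14 15
for r in range(4):
    print([morton_encode(r, c) for c in range(4)])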