ISBN: (Print) 9780769536415
Any communication model can be characterized by its locality properties, and any topology has intrinsic, structural locality characteristics. Spatial locality of processing cores in a multi-core chip can be exploited to gain computational efficiency in a network on chip (NoC). In this paper we propose a new criterion for performance evaluation of NoC architectures, based on the concept of group locality in an interconnection network: the "lower layer complete connect". TriBA, a new idea in multi-core architectures and a direct interconnection network (DIN), is compared with a 2D mesh on a single-chip multi-core architecture. TriBA consists of a 2D grid of small, programmable processing units, each physically connected to its three neighbors, so that the advantageous features of group locality can be fully and efficiently exploited to get the maximum out of an on-chip interconnection of cores. Cores on the same chip are connected via a triplet-based hierarchical interconnection network (thIN), which has a simple topology and a computing-locality characteristic. We have modeled the execution of dense-matrix and sorting algorithms on an on-chip multi-core interconnection network. Our results show that the triplet-based interconnection architecture has strong spatial-locality characteristics in comparison to the conventional 2D mesh; its computational efficiency is remarkable when the number of processing cores increases substantially.
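The abstract's notion of "group locality" can be illustrated with a toy model: cores labelled by base-3 digit strings, where communication cost grows with the level of the smallest triplet group containing both endpoints. The labelling scheme and cost metric here are illustrative assumptions, not the exact thIN wiring.

```python
def group_distance(a, b):
    """Hierarchical distance between two cores in a level-k triplet
    hierarchy: cores are labelled by base-3 digit strings, and the cost
    of a message is taken as the level of the smallest triplet group
    that contains both cores (0 = same core, 1 = same lowest-level
    triplet). A toy model of 'group locality' only.
    """
    assert len(a) == len(b)
    for i, (da, db) in enumerate(zip(a, b)):
        if da != db:
            return len(a) - i   # first differing digit fixes the group level
    return 0

# Cores "000" and "001" share the lowest-level triplet, while "000"
# and "200" only meet at the top-level group of the hierarchy.
```

Under this model, algorithms that keep communication within low-level triplets (as in the paper's dense-matrix and sorting workloads) incur mostly distance-1 traffic, which is the intuition behind the 2D-mesh comparison.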
ISBN: (Print) 9781424445523
Multi-field packet classification is a critical function that enables network routers to support a variety of applications such as firewall processing, Quality of Service differentiation, traffic billing, and other value-added services. The explosive growth of Internet traffic requires that future packet classifiers be implemented in hardware. However, most existing packet classification algorithms need a large amount of memory, which inhibits efficient hardware implementation. This paper exploits modern FPGA technology and presents a partitioning-based parallel architecture for scalable and high-speed packet classification. We propose a coarse-grained independent-sets algorithm and combine it seamlessly with the cross-producting scheme. After partitioning the original rule set into several coarse-grained independent sets and applying the cross-producting scheme to the remaining rules, the memory requirement is dramatically reduced. Our FPGA implementation results show that our architecture can store 10K real-life rules in a single state-of-the-art FPGA while consuming a small amount of on-chip resources. Post place-and-route results show that the design sustains 90 Gbps throughput for minimum-size (40-byte) packets, which is more than twice the current backbone network link rate.
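The partitioning idea can be sketched in miniature: within an "independent set", the chosen field ranges are pairwise disjoint, so one range lookup per set suffices, and only the leftover rules need the more memory-hungry cross-producting stage. This one-field greedy sketch is an assumption about the flavor of the algorithm, not the paper's actual multi-field procedure.

```python
def coarse_grained_independent_sets(rules, max_sets=4):
    """Greedy sketch of partitioning classifier rules into independent
    sets: within a set, the field ranges are pairwise disjoint, so a
    single range search per set finds at most one match. Rules that fit
    no set are left over for cross-producting. `rules` is a list of
    (lo, hi) ranges on a single field -- a simplification of real
    5-tuple rules.
    """
    sets = [[] for _ in range(max_sets)]
    leftover = []
    for lo, hi in sorted(rules):
        for s in sets:
            # processed in sorted order, so disjointness only needs a
            # check against the last range placed in the set
            if not s or s[-1][1] < lo:
                s.append((lo, hi))
                break
        else:
            leftover.append((lo, hi))
    return sets, leftover
```

Keeping `max_sets` small bounds the number of parallel lookup pipelines, which matches the abstract's goal of modest on-chip resource use.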
ISBN: (Print) 9783642030949
In this paper, we propose a fast and flexible sorting algorithm with CUDA. The proposed algorithm is much more practical than previous GPU-based sorting algorithms, as it is able to handle the sorting of elements represented by integers, floats, and structures. Meanwhile, our algorithm is optimized for the modern GPU architecture to obtain high performance. We use different strategies for sorting disorderly lists and nearly-sorted lists to make the algorithm adaptive. Extensive experiments demonstrate that our algorithm has higher performance than previous GPU-based sorting algorithms and can support real-time applications.
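The adaptive idea can be modeled on the CPU: estimate how sorted the input already is, then pick a strategy that is cheap on nearly-sorted data. The 0.9 threshold and the two concrete strategies below are placeholders for the paper's GPU kernels.

```python
def adaptive_sort(xs, threshold=0.9):
    """Toy model of an adaptive sorting strategy: measure the fraction
    of adjacent pairs already in order, and use insertion sort (cheap
    on nearly-sorted input) above the threshold, falling back to a
    general-purpose sort otherwise. Stands in for the paper's choice
    between GPU kernels for nearly-sorted and disorderly lists.
    """
    if len(xs) < 2:
        return list(xs)
    ordered = sum(a <= b for a, b in zip(xs, xs[1:]))
    if ordered / (len(xs) - 1) >= threshold:
        out = list(xs)
        for i in range(1, len(out)):        # insertion sort: O(n + inversions)
            j, key = i, out[i]
            while j > 0 and out[j - 1] > key:
                out[j] = out[j - 1]
                j -= 1
            out[j] = key
        return out
    return sorted(xs)                        # disorderly: full sort
```

On a GPU the two branches would instead select between, say, a merge-based cleanup pass and a full radix-style sort, but the dispatch logic is the same shape.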
ISBN: (Print) 9780769536804
The need to exploit multi-core systems for parallel processing has revived the concept of dataflow. In particular, Dataflow Multithreading architectures have proven to be good candidates for these systems. In this work we propose an abstraction layer that enables compiling and running a program written for an abstract Dataflow Multithreading architecture on different implementations. More specifically, we present a set of compiler directives that provide the programmer with the means to express most types of dependencies between code segments. In addition, we present the corresponding toolchain that transforms this code into a form that can be compiled for different implementations of the model. As a case study for this work, we present the usage of the toolchain for the TFlux and DTA architectures.
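The execution model the directives target can be sketched as a dependency-counting scheduler: a code segment fires once every segment it depends on has completed. This is a generic dataflow-multithreading sketch, not the TFlux or DTA runtime itself.

```python
from collections import defaultdict, deque

def dataflow_run(segments, deps):
    """Minimal sketch of dataflow-multithreading scheduling: a code
    segment becomes ready once all its prerequisites have fired, which
    is the property directive-annotated threads express in models such
    as TFlux and DTA. `segments` maps name -> callable; `deps` maps
    name -> list of prerequisite names. Returns the firing order.
    """
    count = {s: len(deps.get(s, [])) for s in segments}
    consumers = defaultdict(list)
    for s, ds in deps.items():
        for d in ds:
            consumers[d].append(s)
    ready = deque(s for s, c in count.items() if c == 0)
    order = []
    while ready:
        s = ready.popleft()
        segments[s]()                 # a real runtime dispatches to a core here
        order.append(s)
        for c in consumers[s]:
            count[c] -= 1
            if count[c] == 0:
                ready.append(c)
    return order
```

A hardware or software implementation differs mainly in where the counters live (synchronization memory vs. plain variables), which is exactly the variability the proposed abstraction layer hides.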
An efficient GPU-based sorting algorithm is proposed in this paper, together with a merging method on graphics devices. The proposed sorting algorithm is optimized for the modern GPU architecture and is capable of sorting elements represented by integers, floats, and structures, while the new merging method gives a way to merge two ordered lists efficiently on the GPU without using slow atomic functions or uncoalesced memory reads. Adaptive strategies are used for sorting disorderly or nearly-sorted lists, and large or small lists. The current implementation is on NVIDIA CUDA with multi-GPU support, and is being migrated to the newly born Open Computing Language (OpenCL). Extensive experiments demonstrate that our algorithm has better performance than previous GPU-based sorting algorithms and can support real-time applications.
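One common way to merge two sorted lists without atomics, likely similar in spirit to the paper's method, is to pre-compute split points by binary search so that each thread block merges an independent chunk. A sequential sketch of that decomposition:

```python
import bisect

def partitioned_merge(a, b, parts=4):
    """Merge two sorted lists in independent chunks: split `a` evenly,
    then binary-search the matching split points in `b`, so each chunk
    of the output can be produced without coordination -- the kind of
    decomposition that lets GPU thread blocks merge without atomic
    operations. Sketch only; the paper's exact scheme may differ.
    """
    if not a:
        return list(b)
    bounds_a = [i * len(a) // parts for i in range(parts + 1)]
    bounds_b = ([0]
                + [bisect.bisect_left(b, a[i]) for i in bounds_a[1:-1]]
                + [len(b)])
    out = []
    for p in range(parts):
        # each chunk is disjoint in value range, so chunks concatenate
        # into a fully sorted result
        chunk = a[bounds_a[p]:bounds_a[p + 1]] + b[bounds_b[p]:bounds_b[p + 1]]
        out.extend(sorted(chunk))
    return out
```

On a GPU, each `p` becomes one thread block writing to a precomputed output offset, so all global-memory writes are contiguous (coalesced) and no two blocks touch the same output range.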
ISBN: (Print) 9781424445523
We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing the sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words into every memory operation, and we leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at the rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
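The word-packing trick can be illustrated concretely: convert values to 16-bit fixed point and carry two per 32-bit memory word, doubling the operands per transfer. The Q8.8 format below is an assumption; the abstract only says "low precision".

```python
def pack_words(values, frac_bits=8):
    """Sketch of the bandwidth trick from the abstract: quantize each
    value to signed 16-bit fixed point (Q8.8 assumed) and pack two per
    32-bit memory word, so every memory transfer carries twice as many
    operands.
    """
    fixed = [int(round(v * (1 << frac_bits))) for v in values]
    packed = []
    for i in range(0, len(fixed), 2):
        lo = fixed[i] & 0xFFFF
        hi = (fixed[i + 1] & 0xFFFF) if i + 1 < len(fixed) else 0
        packed.append((hi << 16) | lo)
    return packed

def unpack_words(packed, frac_bits=8):
    """Inverse of pack_words: split each 32-bit word into two 16-bit
    halves, sign-extend, and rescale to floats."""
    out = []
    for w in packed:
        for half in (w & 0xFFFF, (w >> 16) & 0xFFFF):
            if half & 0x8000:            # sign-extend the 16-bit value
                half -= 1 << 16
            out.append(half / (1 << frac_bits))
    return out
```

The same scheme generalizes to wider DDR2 bursts: with a 64-bit bus, four 16-bit operands ride on each beat, which is how the prototype keeps its convolution primitives fed.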
In this paper, we propose a new versatile network, called a recursive dual-net (RDN), as a potential candidate for the interconnection network of next-generation supercomputers. The RDN is based on recursive dual-construction of a base network. A k-level recursive dual-construction for k > 0 creates a network containing (2m)^(2^k)/2 nodes with node-degree d + k, where m and d are the number of nodes and the node-degree of the base network, respectively. The RDN is node and edge symmetric if the base network is node and edge symmetric. The RDN can contain a huge number of nodes, each with small node-degree and short diameter. For example, we can construct a symmetric RDN connecting more than 3 million nodes with only 6 links per node and a diameter of 22. We investigate the topological properties of the RDN and compare them to those of other networks including the 3D torus, the WK-recursive network, the hypercube, the cube-connected cycles, and the dual-cube. We also establish efficient routing and broadcasting algorithms for the RDN.
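The node-count formula unrolls from the recurrence n_k = 2·n_{k-1}², since each dual-construction step pairs two copies of the previous network. The diameter recurrence D_k = 2·D_{k-1} + 2 below is our reading of the dual-construction; it reproduces the abstract's example exactly.

```python
def rdn_properties(m, d, diam, k):
    """Size, node-degree, and diameter of a k-level recursive dual-net
    over a base network with m nodes, degree d, and diameter `diam`.
    Node count follows n_k = 2*n_{k-1}^2, which closes to the
    abstract's (2m)^(2^k)/2; the diameter recurrence D_k = 2*D_{k-1}+2
    is an assumption consistent with the abstract's example.
    """
    n = m
    for _ in range(k):
        n = 2 * n * n            # two clusters of n nodes, fully paired
        diam = 2 * diam + 2      # traverse both clusters plus two cross-links
    return n, d + k, diam

# With the complete graph K4 as base (4 nodes, degree 3, diameter 1),
# three levels give the abstract's example: more than 3 million nodes,
# 6 links per node, diameter 22.
```

Checking: (2·4)^(2³)/2 = 8⁸/2 = 8,388,608 nodes, degree 3 + 3 = 6, and the diameter sequence 1 → 4 → 10 → 22 matches the quoted figures.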
ISBN: (Print) 9781424449231
In this work, parallel preconditioning methods based on "Hierarchical Interface Decomposition (HID)" and hybrid parallel programming models were applied to finite-element-based simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied for parallelism through OpenMP. The developed code has been tested on the "T2K Open Supercomputer (Todai Combined Cluster)" using up to 512 cores. Performance of the Hybrid 4x4 parallel programming model is competitive with that of flat MPI when appropriate command lines for NUMA control are used. Furthermore, reordering the mesh data for contiguous memory access with first-touch data placement provides an excellent improvement in the performance of Hybrid 8x2 and 16x1, especially if the problem size per core is relatively small. Thus, the hybrid parallel programming model can be a reasonable choice for large-scale computing with sparse linear solvers on multi-core/multi-socket architectures such as the "T2K Open Supercomputer".
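The CM-RCM idea can be sketched in simplified form: breadth-first level sets stand in for the RCM ordering, and levels are mapped cyclically onto a fixed number of colors, one parallel loop per color. This is a rough model of the reordering, not the paper's production implementation.

```python
from collections import deque

def cm_rcm_colors(adj, ncolors=4):
    """Simplified sketch of CM-RCM: compute breadth-first 'level sets'
    (a stand-in for the RCM level structure of the matrix graph) and
    map levels cyclically onto `ncolors` colors. Rows of one color are
    then processed together -- one OpenMP parallel loop per color in
    the paper's setting. `adj` maps each node to its neighbours and is
    assumed connected.
    """
    start = min(adj)                     # deterministic seed node
    level = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return {u: lv % ncolors for u, lv in level.items()}
```

Keeping the color count fixed (rather than one color per level) is what makes the loop lengths long enough to amortize OpenMP overhead, which is the point of the "cyclic" part.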
ISBN: (Print) 9781424438686
This paper presents a novel architecture of a vision chip for fast traffic lane detection (FTLD). The architecture consists of a 32x32 SIMD processing element (PE) array processor and a dual-core RISC processor. The PE array processor performs low-level pixel-parallel image processing at high speed and outputs image features for high-level image processing without an I/O bottleneck. The dual-core processor carries out the high-level image processing. A parallel fast lane detection algorithm for this architecture is developed. An FPGA system with a CMOS image sensor is used to implement the architecture. Experimental results show that the system can perform fast traffic lane detection at a 50 fps rate. It is much faster than previous works and is robust enough to operate under various light intensities. The novel vision-chip architecture is able to meet the demands of a real-time lane departure warning system.
ISBN: (Print) 9781424438686
A Super Hi-Vision (SHV) 4kx4k@60fps fractional motion estimation (FME) engine is proposed in this paper. First, mode reduction and edge detection techniques are adopted to filter out unpromising modes at the algorithm level. Second, two parallel improvement schemes, called 16-pel scale processing and MB-split assignment, are introduced at the hardware level, reducing the required clock frequency to only 217 MHz. Moreover, a sub-sampling technique is adopted during SATD (sum of absolute transformed differences) generation, which saves 75% of the hardware cost. Using TSMC 0.18 um technology under worst-case operating conditions (1.62 V, 125 degrees C), our FME engine achieves SHV 4kx4k@60fps real-time processing with 547.5k gates.
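The 75% saving follows directly from 2:1 sub-sampling in both directions: one pixel in four survives, so the SATD datapath shrinks accordingly. A reference-style sketch, assuming a standard 4x4 Hadamard SATD kernel and an every-other-row/column pattern (the paper's exact pattern may differ):

```python
def satd_4x4(block):
    """Sum of absolute transformed differences of a 4x4 residual block
    using the standard 4x4 Hadamard transform (H * block * H^T)."""
    H = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
    t = [[sum(H[i][k] * block[k][j] for k in range(4)) for j in range(4)]
         for i in range(4)]
    u = [[sum(t[i][k] * H[j][k] for k in range(4)) for j in range(4)]
         for i in range(4)]
    return sum(abs(u[i][j]) for i in range(4) for j in range(4))

def subsampled_satd(residual):
    """Sketch of the cost saving: keep every other row and column of
    an 8x8 residual (one pixel in four, hence ~75% less SATD work) and
    run SATD on the remaining 4x4 block."""
    sub = [[residual[2 * i][2 * j] for j in range(4)] for i in range(4)]
    return satd_4x4(sub)
```

In the hardware, the smaller transform means proportionally fewer adders and absolute-value units in the SATD tree, which is where the quoted 75% gate saving comes from.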