A versatile family of interconnection networks alternative to hypercubes, called Metacubes, has been proposed for building extremely large-scale multiprocessor systems with a small number of links per node. A Metacube MC(k, m) connects 2^(2^k·m + k) nodes with only k + m links per node, so it can be used to build parallel computing systems of very large scale. In this paper, we propose a new presentation of the Metacube for algorithm design. Based on this presentation, we give efficient algorithms for parallel prefix computation and parallel sorting on Metacubes. The prefix-computation algorithm runs in 2^k·m(k + 1) + k communication steps and 2^(k+1)·m + 2k computation steps on MC(k, m); the sorting algorithm runs in O((2^k·m + k)^2) computation steps and O((2^k·m(2k + 1) + k)^2) communication steps on MC(k, m).
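The Metacube algorithm itself is not given in the abstract, but the generic idea behind cube-based parallel prefix can be sketched: nodes pairwise exchange partial sums across one dimension per communication step. The following is a minimal sequential simulation of a hypercube-style scan, not the paper's Metacube algorithm, and its step counts do not match the bounds quoted above.

```python
def hypercube_prefix(values):
    """Inclusive prefix sums as computed by a d-dimensional hypercube scan.

    Each list index plays the role of one node; `values` must have a
    power-of-two length. One outer iteration = one communication step.
    """
    n = len(values)
    prefix = list(values)        # each node's running prefix
    total = list(values)         # sum over the node's current sub-cube
    d = n.bit_length() - 1
    for bit in range(d):         # one dimension per step
        for node in range(n):
            partner = node ^ (1 << bit)
            if partner > node:
                continue         # handle each disjoint pair once
            s_lo = total[partner]
            # both nodes now hold the sum of the merged sub-cube
            total[partner] = total[node] = s_lo + total[node]
            prefix[node] += s_lo # higher node absorbs the lower half's sum
    return prefix

print(hypercube_prefix([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

After d = log2(n) exchange rounds, every node holds the prefix of all lower-indexed nodes plus itself; the Metacube version applies the same pairing idea over the MC(k, m) topology.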
ISBN:
(Print) 9781424445523
We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing the sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words into every memory operation, and we leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at a rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
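The "meta-operator" described above is, functionally, a 2D convolution followed by sub-sampling and a non-linearity. A minimal pure-Python illustration of that operator chain follows; the kernel values, layer shapes, and choice of average-pooling with tanh are illustrative assumptions, not taken from the paper or its hardware.

```python
import math

def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (cross-correlation, as in most CNN layers)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0.0                     # one multiply-accumulate chain
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out

def subsample_tanh(fmap, stride=2):
    """Average-pool by `stride`, then squash with tanh (classic CNN style)."""
    return [[math.tanh(sum(fmap[y + dy][x + dx]
                           for dy in range(stride) for dx in range(stride)) / stride ** 2)
             for x in range(0, len(fmap[0]) - stride + 1, stride)]
            for y in range(0, len(fmap) - stride + 1, stride)]

image  = [[float((x + y) % 3) for x in range(6)] for y in range(6)]
kernel = [[0.25, 0.25], [0.25, 0.25]]     # illustrative smoothing kernel
fmap   = conv2d_valid(image, kernel)      # 5x5 feature map
pooled = subsample_tanh(fmap)             # 2x2 pooled, squashed output
print(len(fmap), len(pooled))             # 5 2
```

The coprocessor's advantage comes from executing the inner multiply-accumulate loops in parallel hardware and streaming `fmap` through the scratchpad instead of materializing it in host memory.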
ISBN:
(Print) 9781424449231
In this work, parallel preconditioning methods based on "Hierarchical Interface Decomposition (HID)" and hybrid parallel programming models were applied to finite-element-based simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied for parallelism through OpenMP. The developed code has been tested on the "T2K Open Supercomputer (Todai Combined Cluster)" using up to 512 cores. Performance of the Hybrid 4x4 parallel programming model is competitive with that of flat MPI when appropriate command lines for NUMA control are used. Furthermore, reordering the mesh data for contiguous memory access with first-touch data placement provides an excellent performance improvement for Hybrid 8x2 and 16x1, especially when the problem size per core is relatively small. Thus, the hybrid parallel programming model can be a reasonable choice for large-scale computing with sparse linear solvers on multi-core/multi-socket architectures such as the "T2K Open Supercomputer".
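The base of the CM-RCM scheme mentioned above is the classic reverse Cuthill-McKee ordering, which renumbers unknowns breadth-first from a low-degree seed and then reverses the order to reduce matrix bandwidth. A minimal sketch follows; the cyclic-multicoloring stage and first-touch placement are omitted, and the 5-node graph is an illustrative example, not a mesh from the paper.

```python
from collections import deque

def rcm(adj):
    """Return a reverse Cuthill-McKee permutation.

    `adj` maps each node 0..n-1 to its list of undirected neighbours.
    """
    n = len(adj)
    order, seen = [], [False] * n
    # start each component at a minimum-degree node (a common heuristic)
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # visit neighbours in increasing-degree order
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
    order.reverse()              # the "reverse" in RCM
    return order

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(rcm(adj))                  # [0, 2, 1, 3, 4]
```

In the CM-RCM variant, the resulting level sets are additionally colored cyclically so that rows of the same color carry no data dependencies and can be processed concurrently by OpenMP threads.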
ISBN:
(Print) 9781424438686
This paper presents a novel vision-chip architecture for fast traffic lane detection (FTLD). The architecture consists of a 32x32 SIMD processing element (PE) array processor and a dual-core RISC processor. The PE array processor performs low-level pixel-parallel image processing at high speed and outputs image features for high-level image processing without an I/O bottleneck, while the dual-core processor carries out the high-level image processing. A parallel fast lane detection algorithm is developed for this architecture, and an FPGA system with a CMOS image sensor is used to implement it. Experimental results show that the system can perform fast traffic lane detection at a 50 fps rate. It is much faster than previous works and is robust enough to operate under various light intensities. The novel vision-chip architecture is able to meet the demands of real-time lane departure warning systems.
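The kind of low-level pixel-parallel work a SIMD PE array does before lane detection can be illustrated with a single stencil: every "PE" applies the same horizontal-gradient kernel to its pixel, so bright vertical lane markings produce strong responses. This is a pure-Python simulation under the assumption of a Sobel-style gradient; the chip's actual feature-extraction pipeline is not described in the abstract.

```python
def sobel_x(img):
    """Horizontal Sobel gradient; each inner pixel is one PE's identical stencil."""
    h, w = len(img), len(img[0])
    k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):            # on the chip, all PEs run concurrently
        for x in range(1, w - 1):
            out[y][x] = sum(k[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                            for dy in range(3) for dx in range(3))
    return out

# a dark road with one bright vertical lane marking at column 3
img = [[0, 0, 0, 9, 0, 0] for _ in range(4)]
edges = sobel_x(img)
print(edges[1])   # strong +/- responses flank the marking
```

The RISC cores would then fit line models to such gradient features, which is the "high-level" half of the pipeline.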
An efficient GPU-based sorting algorithm is proposed in this paper, together with a merging method for graphics devices. The proposed sorting algorithm is optimized for modern GPU architectures and is capable of sorting elements represented by integers, floats, and structures, while the new merging method provides a way to merge two ordered lists efficiently on the GPU without using slow atomic functions or uncoalesced memory reads. Adaptive strategies are used for sorting disordered or nearly-sorted lists, and large or small lists. The current implementation is on NVIDIA CUDA with multi-GPU support, and is being migrated to the newly introduced Open Computing Language (OpenCL). Extensive experiments demonstrate that our algorithm performs better than previous GPU-based sorting algorithms and can support real-time applications.
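One well-known way to merge two sorted lists on a GPU without atomics is to pre-partition the output with binary searches (the "merge path" idea): each thread computes, independently, which disjoint slice of the output it owns, so no coordination is needed at write time. The sketch below simulates that sequentially; it illustrates the general technique, not necessarily the paper's exact method.

```python
def diagonal_split(a, b, k):
    """Number of elements taken from `a` among the first k merged outputs."""
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] < b[k - i - 1]:
            lo = i + 1
        else:
            hi = i
    return lo

def merge_partitioned(a, b, num_threads=4):
    """Merge two sorted lists; each 'thread' owns a disjoint output slice."""
    n = len(a) + len(b)
    out = [0] * n
    cuts = [t * n // num_threads for t in range(num_threads + 1)]
    for t in range(num_threads):              # each iteration models one thread
        k0, k1 = cuts[t], cuts[t + 1]
        i0, i1 = diagonal_split(a, b, k0), diagonal_split(a, b, k1)
        j0, j1 = k0 - i0, k1 - i1
        # disjoint output range: no atomics or inter-thread ordering needed
        out[k0:k1] = sorted(a[i0:i1] + b[j0:j1])
    return out

print(merge_partitioned([1, 3, 5, 7], [2, 4, 6, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Because every thread's input slices are contiguous, the real GPU version can also read them with coalesced memory accesses, which is the second property the abstract highlights.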
ISBN:
(Print) 9781424438686
A Super Hi-Vision (SHV) 4Kx4K@60fps fractional motion estimation (FME) engine is proposed in this paper. First, mode reduction and edge detection techniques are adopted to filter out unpromising modes at the algorithm level. Second, two parallel improvement schemes, called 16-pel scale processing and MB-split assignment, are introduced at the hardware level, reducing the required clock frequency to only 217 MHz. Moreover, a sub-sampling technique is adopted during SATD (sum of absolute transformed differences) generation, which saves 75% of the hardware cost. Using TSMC 0.18 um technology under worst-case operating conditions (1.62 V, 125 degrees C), our FME engine achieves SHV 4Kx4K@60fps real-time processing with 547.5k gates.
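SATD, the cost metric mentioned above, applies a Hadamard transform to the residual block and sums the absolute coefficients. The following is a minimal reference computation for one 4x4 block using the Hadamard transform common in H.264-style encoders; the engine's sub-sampled variant and any scaling factors are not reproduced here.

```python
# 4x4 Hadamard matrix (symmetric, entries +/-1, mutually orthogonal rows)
H = [[1,  1,  1,  1],
     [1,  1, -1, -1],
     [1, -1, -1,  1],
     [1, -1,  1, -1]]

def matmul4(x, y):
    """Plain 4x4 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(residual):
    """Sum of absolute transformed differences: |H * R * H^T| summed.

    H is symmetric, so H^T == H and two products suffice.
    """
    t = matmul4(matmul4(H, residual), H)
    return sum(abs(v) for row in t for v in row)

# a flat residual concentrates all energy in the DC coefficient
print(satd4x4([[1] * 4 for _ in range(4)]))  # 16
```

In hardware this transform is implemented with butterflies (adds and subtracts only), so sub-sampling the inputs directly removes a proportional share of the adder tree, consistent with the 75% cost saving claimed.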
This paper presents a novel parallel algorithm to synthesize textures in patches. It decomposes the synthesis process into two steps by a chessboard pattern, with the first step placing patches in the black grids, ...
ISBN:
(Print) 9783642038686
In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g., Cell/BE SPUs) or data-parallel accelerators (e.g., GPGPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computations. However, designing an execution model that unifies all computing units and the associated embedded memory remains a major challenge. We have thus designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of those scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements in execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine.
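The scheduling problem such a runtime faces can be boiled down to: tasks whose cost differs per processing unit must be dispatched to whichever unit finishes them earliest. The toy greedy policy below (an HEFT-like heuristic, not StarPU's actual implementation or API) shows why exploiting heterogeneity can beat any single-unit schedule: each kernel lands on the unit that suits it.

```python
def schedule(tasks, units):
    """Greedy earliest-finish dispatch.

    `tasks` is a list of dicts mapping unit name -> predicted cost;
    returns the (task, unit) plan and each unit's final busy time.
    """
    ready = {u: 0.0 for u in units}     # time at which each unit becomes free
    plan = []
    for t, costs in enumerate(tasks):
        # pick the unit with the earliest completion time for this task
        unit = min(units, key=lambda u: ready[u] + costs[u])
        ready[unit] += costs[unit]
        plan.append((t, unit))
    return plan, ready

tasks = [{'cpu': 4.0, 'gpu': 1.0},      # GPU-friendly kernels
         {'cpu': 4.0, 'gpu': 1.0},
         {'cpu': 2.0, 'gpu': 3.0},      # CPU-friendly kernels
         {'cpu': 2.0, 'gpu': 3.0}]
plan, ready = schedule(tasks, ['cpu', 'gpu'])
print(plan)   # [(0, 'gpu'), (1, 'gpu'), (2, 'cpu'), (3, 'cpu')]
```

Here the heterogeneous schedule finishes at time 4.0, whereas running everything on either unit alone would take 8.0 or 12.0; this mismatch between per-unit costs is precisely what can produce the superlinear speedups the abstract reports.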
ISBN:
(Print) 9781424438686
This paper presents the design and hardware implementation of a novel coarse-grain dynamically reconfigurable computing system, called DReAC-2. A complete DReAC-2 system integrates a Nios II processor, which manages the whole reconfigurable system, and a dynamically reconfigurable coprocessor, which comprises an 8x8 processing-node array designed for highly regular, computation-intensive tasks. A hardware prototype of DReAC-2 has been implemented on the Altera Stratix II EP2S180 development board. Depending on the nature of the task, the MIMD computing array can select either a parallel-pipelined pattern or an array-parallel pattern to obtain better performance. The experimental results show that DReAC-2 achieves 10x to 100x speedups over the Nios II processor, and 2x to 4x speedups with higher precision than some other reconfigurable processors.
ISBN:
(Print) 9783642030949
This paper proposes a parallel particle swarm optimization (PPSO) that divides the search space into sub-spaces and uses different swarms to optimize different parts of the space. In the PPSO framework, the search space is regarded as a solution vector and is divided into two sub-vectors. Two cooperative swarms work in parallel, and each swarm optimizes only one of the sub-vectors. An adaptive asynchronous migration strategy (AAMS) is designed for the swarms to communicate with each other. The PPSO benefits in two respects. First, the PPSO divides the search space, so each swarm can focus on optimizing a smaller-scale problem; this reduces the problem complexity and makes the algorithm promising for large-scale problems. Second, the AAMS adapts the migration to the search environment, resulting in a timely and efficient communication scheme. Experiments on benchmark functions have demonstrated the good performance of the PPSO with AAMS in both solution accuracy and convergence speed when compared with the traditional serial PSO (SPSO) and a PPSO with fixed migration frequency.
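The cooperative decomposition described above can be sketched in a few lines: the solution vector is split in two, each sub-swarm optimizes its half, and halves are evaluated by splicing them into a shared context vector. In this sketch the AAMS is replaced by a fixed exchange every generation, and the sphere benchmark, swarm sizes, and PSO coefficients are all illustrative assumptions, not the paper's settings.

```python
import random
random.seed(1)

def sphere(x):                          # illustrative separable benchmark
    return sum(v * v for v in x)

DIM, HALF, N, STEPS = 8, 4, 12, 300

def splice(context, part, half):
    """Evaluate a half-length sub-vector inside the shared context vector."""
    full = context[:]
    full[part * HALF:(part + 1) * HALF] = half
    return full

context = [random.uniform(-5, 5) for _ in range(DIM)]
swarms = []
for _ in range(2):                      # one swarm per sub-vector
    x = [[random.uniform(-5, 5) for _ in range(HALF)] for _ in range(N)]
    swarms.append({'x': x, 'v': [[0.0] * HALF for _ in range(N)],
                   'pb': [p[:] for p in x]})

for _ in range(STEPS):
    for part, s in enumerate(swarms):   # the two swarms would run in parallel
        for i in range(N):
            for d in range(HALF):
                r1, r2 = random.random(), random.random()
                gb = context[part * HALF + d]   # shared best for this half
                s['v'][i][d] = (0.72 * s['v'][i][d]
                                + 1.49 * r1 * (s['pb'][i][d] - s['x'][i][d])
                                + 1.49 * r2 * (gb - s['x'][i][d]))
                s['x'][i][d] += s['v'][i][d]
            if (sphere(splice(context, part, s['x'][i]))
                    < sphere(splice(context, part, s['pb'][i]))):
                s['pb'][i] = s['x'][i][:]
        # fixed-frequency "migration": publish the best half into the context
        best = min(s['pb'], key=lambda h: sphere(splice(context, part, h)))
        context[part * HALF:(part + 1) * HALF] = best

print(round(sphere(context), 6))
```

The AAMS replaces the unconditional per-generation exchange at the end of the loop with a trigger that adapts the migration timing to search progress, which is where the reported speed and accuracy gains come from.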