A versatile family of interconnection networks alternative to hypercubes, called Metacubes, has been proposed for building extremely large-scale multiprocessor systems with a small number of links per node. A Metacube MC(k, m) connects 2^(2^k·m + k) nodes with only k + m links per node, so it can be used to build parallel computing systems of very large scale. In this paper, we propose a new presentation of the Metacube for algorithm design. Based on this presentation, we give efficient algorithms for parallel prefix computation and parallel sorting on Metacubes. The prefix-computation algorithm runs in 2^k·m(k + 1) + k communication steps and 2^(k+1)·m + 2k computation steps on MC(k, m); the sorting algorithm runs in O((2^k·m + k)^2) computation steps and O((2^k·m(2k + 1) + k)^2) communication steps on MC(k, m).
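The Metacube algorithm itself is not given in the abstract, but the generic idea behind cube-based parallel prefix can be sketched: nodes pairwise exchange partial sums across one dimension per communication step. The following is a minimal sequential simulation of a hypercube-style scan, not the paper's Metacube algorithm, and its step counts do not match the bounds quoted above.

```python
def hypercube_prefix(values):
    """Inclusive prefix sums as computed by a d-dimensional hypercube scan.

    Each list index plays the role of one node; `values` must have a
    power-of-two length. One outer iteration = one communication step.
    """
    n = len(values)
    prefix = list(values)        # each node's running prefix
    total = list(values)         # sum over the node's current sub-cube
    d = n.bit_length() - 1
    for bit in range(d):         # one dimension per step
        for node in range(n):
            partner = node ^ (1 << bit)
            if partner > node:
                continue         # handle each disjoint pair once
            s_lo = total[partner]
            # both nodes now hold the sum of the merged sub-cube
            total[partner] = total[node] = s_lo + total[node]
            prefix[node] += s_lo # higher node absorbs the lower half's sum
    return prefix

print(hypercube_prefix([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

After d = log2(n) exchange rounds, every node holds the prefix of all lower-indexed nodes plus itself; the Metacube version applies the same pairing idea over the MC(k, m) topology.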
ISBN:
(Print) 9781424445523
We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing the sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words into every memory operation, and we leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at a rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
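The "meta-operator" described above is, functionally, a 2D convolution followed by sub-sampling and a non-linearity. A minimal pure-Python illustration of that operator chain follows; the kernel values, layer shapes, and choice of average-pooling with tanh are illustrative assumptions, not taken from the paper or its hardware.

```python
import math

def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (cross-correlation, as in most CNN layers)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0.0                     # one multiply-accumulate chain
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out

def subsample_tanh(fmap, stride=2):
    """Average-pool by `stride`, then squash with tanh (classic CNN style)."""
    return [[math.tanh(sum(fmap[y + dy][x + dx]
                           for dy in range(stride) for dx in range(stride)) / stride ** 2)
             for x in range(0, len(fmap[0]) - stride + 1, stride)]
            for y in range(0, len(fmap) - stride + 1, stride)]

image  = [[float((x + y) % 3) for x in range(6)] for y in range(6)]
kernel = [[0.25, 0.25], [0.25, 0.25]]     # illustrative smoothing kernel
fmap   = conv2d_valid(image, kernel)      # 5x5 feature map
pooled = subsample_tanh(fmap)             # 2x2 pooled, squashed output
print(len(fmap), len(pooled))             # 5 2
```

The coprocessor's advantage comes from executing the inner multiply-accumulate loops in parallel hardware and streaming `fmap` through the scratchpad instead of materializing it in host memory.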
ISBN:
(Print) 9781424449231
In this work, parallel preconditioning methods based on "Hierarchical Interface Decomposition (HID)" and hybrid parallel programming models were applied to finite-element-based simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied for parallelism through OpenMP. The developed code has been tested on the "T2K Open Supercomputer (Todai Combined Cluster)" using up to 512 cores. Performance of the Hybrid 4x4 parallel programming model is competitive with that of flat MPI when appropriate command lines for NUMA control are used. Furthermore, reordering the mesh data for contiguous memory access with first-touch data placement provides an excellent performance improvement for Hybrid 8x2 and 16x1, especially when the problem size per core is relatively small. Thus, the hybrid parallel programming model can be a reasonable choice for large-scale computing with sparse linear solvers on multi-core/multi-socket architectures such as the "T2K Open Supercomputer".
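The base of the CM-RCM scheme mentioned above is the classic reverse Cuthill-McKee ordering, which renumbers unknowns breadth-first from a low-degree seed and then reverses the order to reduce matrix bandwidth. A minimal sketch follows; the cyclic-multicoloring stage and first-touch placement are omitted, and the 5-node graph is an illustrative example, not a mesh from the paper.

```python
from collections import deque

def rcm(adj):
    """Return a reverse Cuthill-McKee permutation.

    `adj` maps each node 0..n-1 to its list of undirected neighbours.
    """
    n = len(adj)
    order, seen = [], [False] * n
    # start each component at a minimum-degree node (a common heuristic)
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # visit neighbours in increasing-degree order
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
    order.reverse()              # the "reverse" in RCM
    return order

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(rcm(adj))                  # [0, 2, 1, 3, 4]
```

In the CM-RCM variant, the resulting level sets are additionally colored cyclically so that rows of the same color carry no data dependencies and can be processed concurrently by OpenMP threads.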
ISBN:
(Print) 9781424438686
This paper presents a novel vision-chip architecture for fast traffic lane detection (FTLD). The architecture consists of a 32x32 SIMD processing element (PE) array processor and a dual-core RISC processor. The PE array processor performs low-level pixel-parallel image processing at high speed and outputs image features for high-level image processing without an I/O bottleneck, while the dual-core processor carries out the high-level image processing. A parallel fast lane detection algorithm is developed for this architecture, and an FPGA system with a CMOS image sensor is used to implement it. Experimental results show that the system can perform fast traffic lane detection at a 50 fps rate. It is much faster than previous works and is robust enough to operate under various light intensities. The novel vision-chip architecture is able to meet the demands of real-time lane departure warning systems.
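The kind of low-level pixel-parallel work a SIMD PE array does before lane detection can be illustrated with a single stencil: every "PE" applies the same horizontal-gradient kernel to its pixel, so bright vertical lane markings produce strong responses. This is a pure-Python simulation under the assumption of a Sobel-style gradient; the chip's actual feature-extraction pipeline is not described in the abstract.

```python
def sobel_x(img):
    """Horizontal Sobel gradient; each inner pixel is one PE's identical stencil."""
    h, w = len(img), len(img[0])
    k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):            # on the chip, all PEs run concurrently
        for x in range(1, w - 1):
            out[y][x] = sum(k[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                            for dy in range(3) for dx in range(3))
    return out

# a dark road with one bright vertical lane marking at column 3
img = [[0, 0, 0, 9, 0, 0] for _ in range(4)]
edges = sobel_x(img)
print(edges[1])   # strong +/- responses flank the marking
```

The RISC cores would then fit line models to such gradient features, which is the "high-level" half of the pipeline.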
An efficient GPU-based sorting algorithm is proposed in this paper, together with a merging method for graphics devices. The proposed sorting algorithm is optimized for modern GPU architectures and is capable of sorting elements represented by integers, floats, and structures, while the new merging method provides a way to merge two ordered lists efficiently on the GPU without using slow atomic functions or uncoalesced memory reads. Adaptive strategies are used for sorting disordered or nearly-sorted lists, and large or small lists. The current implementation is on NVIDIA CUDA with multi-GPU support, and is being migrated to the newly introduced Open Computing Language (OpenCL). Extensive experiments demonstrate that our algorithm performs better than previous GPU-based sorting algorithms and can support real-time applications.
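One well-known way to merge two sorted lists on a GPU without atomics is to pre-partition the output with binary searches (the "merge path" idea): each thread computes, independently, which disjoint slice of the output it owns, so no coordination is needed at write time. The sketch below simulates that sequentially; it illustrates the general technique, not necessarily the paper's exact method.

```python
def diagonal_split(a, b, k):
    """Number of elements taken from `a` among the first k merged outputs."""
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] < b[k - i - 1]:
            lo = i + 1
        else:
            hi = i
    return lo

def merge_partitioned(a, b, num_threads=4):
    """Merge two sorted lists; each 'thread' owns a disjoint output slice."""
    n = len(a) + len(b)
    out = [0] * n
    cuts = [t * n // num_threads for t in range(num_threads + 1)]
    for t in range(num_threads):              # each iteration models one thread
        k0, k1 = cuts[t], cuts[t + 1]
        i0, i1 = diagonal_split(a, b, k0), diagonal_split(a, b, k1)
        j0, j1 = k0 - i0, k1 - i1
        # disjoint output range: no atomics or inter-thread ordering needed
        out[k0:k1] = sorted(a[i0:i1] + b[j0:j1])
    return out

print(merge_partitioned([1, 3, 5, 7], [2, 4, 6, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Because every thread's input slices are contiguous, the real GPU version can also read them with coalesced memory accesses, which is the second property the abstract highlights.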
ISBN:
(Print) 9781424438686
A Super Hi-Vision (SHV) 4Kx4K@60fps fractional motion estimation (FME) engine is proposed in this paper. First, mode reduction and edge detection techniques are adopted to filter out unpromising modes at the algorithm level. Second, two parallel improvement schemes, called 16-pel scale processing and MB-split assignment, are introduced at the hardware level, reducing the required clock frequency to only 217 MHz. Moreover, a sub-sampling technique is adopted during SATD (sum of absolute transformed differences) generation, which saves 75% of the hardware cost. Using TSMC 0.18 um technology under worst-case operating conditions (1.62 V, 125 degrees C), our FME engine achieves SHV 4Kx4K@60fps real-time processing with 547.5k gates.
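SATD, the cost metric mentioned above, applies a Hadamard transform to the residual block and sums the absolute coefficients. The following is a minimal reference computation for one 4x4 block using the Hadamard transform common in H.264-style encoders; the engine's sub-sampled variant and any scaling factors are not reproduced here.

```python
# 4x4 Hadamard matrix (symmetric, entries +/-1, mutually orthogonal rows)
H = [[1,  1,  1,  1],
     [1,  1, -1, -1],
     [1, -1, -1,  1],
     [1, -1,  1, -1]]

def matmul4(x, y):
    """Plain 4x4 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(residual):
    """Sum of absolute transformed differences: |H * R * H^T| summed.

    H is symmetric, so H^T == H and two products suffice.
    """
    t = matmul4(matmul4(H, residual), H)
    return sum(abs(v) for row in t for v in row)

# a flat residual concentrates all energy in the DC coefficient
print(satd4x4([[1] * 4 for _ in range(4)]))  # 16
```

In hardware this transform is implemented with butterflies (adds and subtracts only), so sub-sampling the inputs directly removes a proportional share of the adder tree, consistent with the 75% cost saving claimed.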
This paper presents a novel parallel algorithm to synthesize textures in patches. It decomposes the synthesis process into two steps by a chessboard pattern, with the first step placing patches in the black grids, ...
ISBN:
(Print) 9783642038686
In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g., Cell/BE SPUs) or data-parallel accelerators (e.g., GPGPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computations. However, designing an execution model that unifies all computing units and the associated embedded memory remains a major challenge. We have thus designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of those scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements in execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine.
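The scheduling problem such a runtime faces can be boiled down to: tasks whose cost differs per processing unit must be dispatched to whichever unit finishes them earliest. The toy greedy policy below (an HEFT-like heuristic, not StarPU's actual implementation or API) shows why exploiting heterogeneity can beat any single-unit schedule: each kernel lands on the unit that suits it.

```python
def schedule(tasks, units):
    """Greedy earliest-finish dispatch.

    `tasks` is a list of dicts mapping unit name -> predicted cost;
    returns the (task, unit) plan and each unit's final busy time.
    """
    ready = {u: 0.0 for u in units}     # time at which each unit becomes free
    plan = []
    for t, costs in enumerate(tasks):
        # pick the unit with the earliest completion time for this task
        unit = min(units, key=lambda u: ready[u] + costs[u])
        ready[unit] += costs[unit]
        plan.append((t, unit))
    return plan, ready

tasks = [{'cpu': 4.0, 'gpu': 1.0},      # GPU-friendly kernels
         {'cpu': 4.0, 'gpu': 1.0},
         {'cpu': 2.0, 'gpu': 3.0},      # CPU-friendly kernels
         {'cpu': 2.0, 'gpu': 3.0}]
plan, ready = schedule(tasks, ['cpu', 'gpu'])
print(plan)   # [(0, 'gpu'), (1, 'gpu'), (2, 'cpu'), (3, 'cpu')]
```

Here the heterogeneous schedule finishes at time 4.0, whereas running everything on either unit alone would take 8.0 or 12.0; this mismatch between per-unit costs is precisely what can produce the superlinear speedups the abstract reports.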
ISBN:
(Print) 9781424438686
This paper presents the design and hardware implementation of a novel coarse-grain dynamically reconfigurable computing system, called DReAC-2. A complete DReAC-2 system integrates a Nios II processor, which manages the whole reconfigurable system, and a dynamically reconfigurable coprocessor, which comprises an 8x8 processing-node array designed for highly regular, computation-intensive tasks. A hardware prototype of DReAC-2 has been implemented on the Altera Stratix II EP2S180 development board. Depending on the nature of the task, the MIMD computing array can select either a parallel-pipelined pattern or an array-parallel pattern to obtain better performance. The experimental results show that DReAC-2 achieves 10x to 100x speedups over the Nios II processor, and 2x to 4x speedups with higher precision than some other reconfigurable processors.
ISBN:
(Print) 9783642030949
This paper proposes a parallel particle swarm optimization (PPSO) that divides the search space into sub-spaces and uses different swarms to optimize different parts of the space. In the PPSO framework, the search space is regarded as a solution vector and is divided into two sub-vectors. Two cooperative swarms work in parallel, and each swarm optimizes only one of the sub-vectors. An adaptive asynchronous migration strategy (AAMS) is designed for the swarms to communicate with each other. The PPSO benefits in two respects. First, the PPSO divides the search space, so each swarm can focus on optimizing a smaller-scale problem; this reduces the problem complexity and makes the algorithm promising for large-scale problems. Second, the AAMS adapts the migration to the search environment, resulting in a timely and efficient communication scheme. Experiments on benchmark functions have demonstrated the good performance of the PPSO with AAMS in both solution accuracy and convergence speed when compared with the traditional serial PSO (SPSO) and a PPSO with fixed migration frequency.
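The cooperative decomposition described above can be sketched in a few lines: the solution vector is split in two, each sub-swarm optimizes its half, and halves are evaluated by splicing them into a shared context vector. In this sketch the AAMS is replaced by a fixed exchange every generation, and the sphere benchmark, swarm sizes, and PSO coefficients are all illustrative assumptions, not the paper's settings.

```python
import random
random.seed(1)

def sphere(x):                          # illustrative separable benchmark
    return sum(v * v for v in x)

DIM, HALF, N, STEPS = 8, 4, 12, 300

def splice(context, part, half):
    """Evaluate a half-length sub-vector inside the shared context vector."""
    full = context[:]
    full[part * HALF:(part + 1) * HALF] = half
    return full

context = [random.uniform(-5, 5) for _ in range(DIM)]
swarms = []
for _ in range(2):                      # one swarm per sub-vector
    x = [[random.uniform(-5, 5) for _ in range(HALF)] for _ in range(N)]
    swarms.append({'x': x, 'v': [[0.0] * HALF for _ in range(N)],
                   'pb': [p[:] for p in x]})

for _ in range(STEPS):
    for part, s in enumerate(swarms):   # the two swarms would run in parallel
        for i in range(N):
            for d in range(HALF):
                r1, r2 = random.random(), random.random()
                gb = context[part * HALF + d]   # shared best for this half
                s['v'][i][d] = (0.72 * s['v'][i][d]
                                + 1.49 * r1 * (s['pb'][i][d] - s['x'][i][d])
                                + 1.49 * r2 * (gb - s['x'][i][d]))
                s['x'][i][d] += s['v'][i][d]
            if (sphere(splice(context, part, s['x'][i]))
                    < sphere(splice(context, part, s['pb'][i]))):
                s['pb'][i] = s['x'][i][:]
        # fixed-frequency "migration": publish the best half into the context
        best = min(s['pb'], key=lambda h: sphere(splice(context, part, h)))
        context[part * HALF:(part + 1) * HALF] = best

print(round(sphere(context), 6))
```

The AAMS replaces the unconditional per-generation exchange at the end of the loop with a trigger that adapts the migration timing to search progress, which is where the reported speed and accuracy gains come from.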