ISBN: (Print) 9780769536415
Any communication model can be characterized by its locality properties, and any topology has intrinsic, structural locality characteristics. Spatial locality of processing cores in a multi-core chip can be exploited to gain computational efficiency in a network on chip (NoC). In this paper we propose a new criterion for performance evaluation of NoC architectures, based on the concept of group locality in an interconnection network: the "lower layer complete connect". TriBA, a new idea in multi-core architectures and a direct interconnection network (DIN), is compared with a 2D mesh on a single-chip multi-core architecture. TriBA consists of a 2D grid of small, programmable processing units, each physically connected to its three neighbors, so that the advantageous features of group locality can be fully and efficiently exploited to get the maximum out of an on-chip interconnection of cores. Cores on the same chip are connected via a triplet-based hierarchical interconnection network (thIN), which has a simple topology and a computing-locality characteristic. We have modeled the execution of dense-matrix and sorting algorithms on an on-chip multi-core interconnection network. Our results show that the triplet-based interconnection architecture has strong spatial-locality characteristics in comparison to the conventional 2D mesh; its computational efficiency is remarkable when the number of processing cores increases substantially.
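The abstract's notion of "group locality" can be illustrated with a toy model: cores labelled by base-3 digit strings, where communication cost grows with the level of the smallest triplet group containing both endpoints. The labelling scheme and cost metric here are illustrative assumptions, not the exact thIN wiring.

```python
def group_distance(a, b):
    """Hierarchical distance between two cores in a level-k triplet
    hierarchy: cores are labelled by base-3 digit strings, and the cost
    of a message is taken as the level of the smallest triplet group
    that contains both cores (0 = same core, 1 = same lowest-level
    triplet). A toy model of 'group locality' only.
    """
    assert len(a) == len(b)
    for i, (da, db) in enumerate(zip(a, b)):
        if da != db:
            return len(a) - i   # first differing digit fixes the group level
    return 0

# Cores "000" and "001" share the lowest-level triplet, while "000"
# and "200" only meet at the top-level group of the hierarchy.
```

Under this model, algorithms that keep communication within low-level triplets (as in the paper's dense-matrix and sorting workloads) incur mostly distance-1 traffic, which is the intuition behind the 2D-mesh comparison.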
ISBN: (Print) 9781424445523
Multi-field packet classification is a critical function that enables network routers to support a variety of applications such as firewall processing, Quality of Service differentiation, traffic billing, and other value-added services. The explosive growth of Internet traffic requires that future packet classifiers be implemented in hardware. However, most existing packet classification algorithms need a large amount of memory, which inhibits efficient hardware implementation. This paper exploits modern FPGA technology and presents a partitioning-based parallel architecture for scalable and high-speed packet classification. We propose a coarse-grained independent-sets algorithm and combine it seamlessly with the cross-producting scheme. After partitioning the original rule set into several coarse-grained independent sets and applying the cross-producting scheme to the remaining rules, the memory requirement is dramatically reduced. Our FPGA implementation results show that our architecture can store 10K real-life rules in a single state-of-the-art FPGA while consuming a small amount of on-chip resources. Post place-and-route results show that the design sustains 90 Gbps throughput for minimum-size (40-byte) packets, which is more than twice the current backbone network link rate.
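The partitioning idea can be sketched in miniature: within an "independent set", the chosen field ranges are pairwise disjoint, so one range lookup per set suffices, and only the leftover rules need the more memory-hungry cross-producting stage. This one-field greedy sketch is an assumption about the flavor of the algorithm, not the paper's actual multi-field procedure.

```python
def coarse_grained_independent_sets(rules, max_sets=4):
    """Greedy sketch of partitioning classifier rules into independent
    sets: within a set, the field ranges are pairwise disjoint, so a
    single range search per set finds at most one match. Rules that fit
    no set are left over for cross-producting. `rules` is a list of
    (lo, hi) ranges on a single field -- a simplification of real
    5-tuple rules.
    """
    sets = [[] for _ in range(max_sets)]
    leftover = []
    for lo, hi in sorted(rules):
        for s in sets:
            # processed in sorted order, so disjointness only needs a
            # check against the last range placed in the set
            if not s or s[-1][1] < lo:
                s.append((lo, hi))
                break
        else:
            leftover.append((lo, hi))
    return sets, leftover
```

Keeping `max_sets` small bounds the number of parallel lookup pipelines, which matches the abstract's goal of modest on-chip resource use.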
ISBN: (Print) 9783642030949
In this paper, we propose a fast and flexible sorting algorithm with CUDA. The proposed algorithm is much more practical than previous GPU-based sorting algorithms, as it is able to handle the sorting of elements represented by integers, floats, and structures. Meanwhile, our algorithm is optimized for the modern GPU architecture to obtain high performance. We use different strategies for sorting disorderly lists and nearly-sorted lists to make the algorithm adaptive. Extensive experiments demonstrate that our algorithm has higher performance than previous GPU-based sorting algorithms and can support real-time applications.
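The adaptive idea can be modeled on the CPU: estimate how sorted the input already is, then pick a strategy that is cheap on nearly-sorted data. The 0.9 threshold and the two concrete strategies below are placeholders for the paper's GPU kernels.

```python
def adaptive_sort(xs, threshold=0.9):
    """Toy model of an adaptive sorting strategy: measure the fraction
    of adjacent pairs already in order, and use insertion sort (cheap
    on nearly-sorted input) above the threshold, falling back to a
    general-purpose sort otherwise. Stands in for the paper's choice
    between GPU kernels for nearly-sorted and disorderly lists.
    """
    if len(xs) < 2:
        return list(xs)
    ordered = sum(a <= b for a, b in zip(xs, xs[1:]))
    if ordered / (len(xs) - 1) >= threshold:
        out = list(xs)
        for i in range(1, len(out)):        # insertion sort: O(n + inversions)
            j, key = i, out[i]
            while j > 0 and out[j - 1] > key:
                out[j] = out[j - 1]
                j -= 1
            out[j] = key
        return out
    return sorted(xs)                        # disorderly: full sort
```

On a GPU the two branches would instead select between, say, a merge-based cleanup pass and a full radix-style sort, but the dispatch logic is the same shape.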
ISBN: (Print) 9780769536804
The need to exploit multi-core systems for parallel processing has revived the concept of dataflow. In particular, Dataflow Multithreading architectures have proven to be good candidates for these systems. In this work we propose an abstraction layer that enables compiling and running a program written for an abstract Dataflow Multithreading architecture on different implementations. More specifically, we present a set of compiler directives that provide the programmer with the means to express most types of dependencies between code segments. In addition, we present the corresponding toolchain that transforms this code into a form that can be compiled for different implementations of the model. As a case study for this work, we present the usage of the toolchain for the TFlux and DTA architectures.
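The execution model the directives target can be sketched as a dependency-counting scheduler: a code segment fires once every segment it depends on has completed. This is a generic dataflow-multithreading sketch, not the TFlux or DTA runtime itself.

```python
from collections import defaultdict, deque

def dataflow_run(segments, deps):
    """Minimal sketch of dataflow-multithreading scheduling: a code
    segment becomes ready once all its prerequisites have fired, which
    is the property directive-annotated threads express in models such
    as TFlux and DTA. `segments` maps name -> callable; `deps` maps
    name -> list of prerequisite names. Returns the firing order.
    """
    count = {s: len(deps.get(s, [])) for s in segments}
    consumers = defaultdict(list)
    for s, ds in deps.items():
        for d in ds:
            consumers[d].append(s)
    ready = deque(s for s, c in count.items() if c == 0)
    order = []
    while ready:
        s = ready.popleft()
        segments[s]()                 # a real runtime dispatches to a core here
        order.append(s)
        for c in consumers[s]:
            count[c] -= 1
            if count[c] == 0:
                ready.append(c)
    return order
```

A hardware or software implementation differs mainly in where the counters live (synchronization memory vs. plain variables), which is exactly the variability the proposed abstraction layer hides.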
An efficient GPU-based sorting algorithm is proposed in this paper, together with a merging method on graphics devices. The proposed sorting algorithm is optimized for the modern GPU architecture and is capable of sorting elements represented by integers, floats, and structures, while the new merging method gives a way to merge two ordered lists efficiently on the GPU without using slow atomic functions or uncoalesced memory reads. Adaptive strategies are used for sorting disorderly or nearly-sorted lists, and large or small lists. The current implementation is on NVIDIA CUDA with multi-GPU support, and is being migrated to the newly born Open Computing Language (OpenCL). Extensive experiments demonstrate that our algorithm has better performance than previous GPU-based sorting algorithms and can support real-time applications.
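One common way to merge two sorted lists without atomics, likely similar in spirit to the paper's method, is to pre-compute split points by binary search so that each thread block merges an independent chunk. A sequential sketch of that decomposition:

```python
import bisect

def partitioned_merge(a, b, parts=4):
    """Merge two sorted lists in independent chunks: split `a` evenly,
    then binary-search the matching split points in `b`, so each chunk
    of the output can be produced without coordination -- the kind of
    decomposition that lets GPU thread blocks merge without atomic
    operations. Sketch only; the paper's exact scheme may differ.
    """
    if not a:
        return list(b)
    bounds_a = [i * len(a) // parts for i in range(parts + 1)]
    bounds_b = ([0]
                + [bisect.bisect_left(b, a[i]) for i in bounds_a[1:-1]]
                + [len(b)])
    out = []
    for p in range(parts):
        # each chunk is disjoint in value range, so chunks concatenate
        # into a fully sorted result
        chunk = a[bounds_a[p]:bounds_a[p + 1]] + b[bounds_b[p]:bounds_b[p + 1]]
        out.extend(sorted(chunk))
    return out
```

On a GPU, each `p` becomes one thread block writing to a precomputed output offset, so all global-memory writes are contiguous (coalesced) and no two blocks touch the same output range.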
ISBN: (Print) 9781424445523
We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing the sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words into every memory operation, and we leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at the rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
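The word-packing trick can be illustrated concretely: convert values to 16-bit fixed point and carry two per 32-bit memory word, doubling the operands per transfer. The Q8.8 format below is an assumption; the abstract only says "low precision".

```python
def pack_words(values, frac_bits=8):
    """Sketch of the bandwidth trick from the abstract: quantize each
    value to signed 16-bit fixed point (Q8.8 assumed) and pack two per
    32-bit memory word, so every memory transfer carries twice as many
    operands.
    """
    fixed = [int(round(v * (1 << frac_bits))) for v in values]
    packed = []
    for i in range(0, len(fixed), 2):
        lo = fixed[i] & 0xFFFF
        hi = (fixed[i + 1] & 0xFFFF) if i + 1 < len(fixed) else 0
        packed.append((hi << 16) | lo)
    return packed

def unpack_words(packed, frac_bits=8):
    """Inverse of pack_words: split each 32-bit word into two 16-bit
    halves, sign-extend, and rescale to floats."""
    out = []
    for w in packed:
        for half in (w & 0xFFFF, (w >> 16) & 0xFFFF):
            if half & 0x8000:            # sign-extend the 16-bit value
                half -= 1 << 16
            out.append(half / (1 << frac_bits))
    return out
```

The same scheme generalizes to wider DDR2 bursts: with a 64-bit bus, four 16-bit operands ride on each beat, which is how the prototype keeps its convolution primitives fed.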
In this paper, we propose a new versatile network, called a recursive dual-net (RDN), as a potential candidate for the interconnection network of next-generation supercomputers. The RDN is based on recursive dual-construction of a base network. A k-level recursive dual-construction for k > 0 creates a network containing (2m)^(2^k)/2 nodes with node-degree d + k, where m and d are the number of nodes and the node-degree of the base network, respectively. The RDN is node and edge symmetric if the base network is node and edge symmetric. The RDN can contain a huge number of nodes, each with small node-degree and short diameter. For example, we can construct a symmetric RDN connecting more than 3 million nodes with only 6 links per node and a diameter of 22. We investigate the topological properties of the RDN and compare them to those of other networks including the 3D torus, the WK-recursive network, the hypercube, the cube-connected cycles, and the dual-cube. We also establish efficient routing and broadcasting algorithms for the RDN.
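The node-count formula unrolls from the recurrence n_k = 2·n_{k-1}², since each dual-construction step pairs two copies of the previous network. The diameter recurrence D_k = 2·D_{k-1} + 2 below is our reading of the dual-construction; it reproduces the abstract's example exactly.

```python
def rdn_properties(m, d, diam, k):
    """Size, node-degree, and diameter of a k-level recursive dual-net
    over a base network with m nodes, degree d, and diameter `diam`.
    Node count follows n_k = 2*n_{k-1}^2, which closes to the
    abstract's (2m)^(2^k)/2; the diameter recurrence D_k = 2*D_{k-1}+2
    is an assumption consistent with the abstract's example.
    """
    n = m
    for _ in range(k):
        n = 2 * n * n            # two clusters of n nodes, fully paired
        diam = 2 * diam + 2      # traverse both clusters plus two cross-links
    return n, d + k, diam

# With the complete graph K4 as base (4 nodes, degree 3, diameter 1),
# three levels give the abstract's example: more than 3 million nodes,
# 6 links per node, diameter 22.
```

Checking: (2·4)^(2³)/2 = 8⁸/2 = 8,388,608 nodes, degree 3 + 3 = 6, and the diameter sequence 1 → 4 → 10 → 22 matches the quoted figures.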
ISBN: (Print) 9781424449231
In this work, parallel preconditioning methods based on "Hierarchical Interface Decomposition (HID)" and hybrid parallel programming models were applied to finite-element-based simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied for parallelism through OpenMP. The developed code has been tested on the "T2K Open Supercomputer (Todai Combined Cluster)" using up to 512 cores. Performance of the Hybrid 4x4 parallel programming model is competitive with that of flat MPI when appropriate command lines for NUMA control are used. Furthermore, reordering the mesh data for contiguous memory access with first-touch data placement provides an excellent improvement in the performance of Hybrid 8x2 and 16x1, especially if the problem size per core is relatively small. Thus, the hybrid parallel programming model can be a reasonable choice for large-scale computing with sparse linear solvers on multi-core/multi-socket architectures such as the "T2K Open Supercomputer".
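The CM-RCM idea can be sketched in simplified form: breadth-first level sets stand in for the RCM ordering, and levels are mapped cyclically onto a fixed number of colors, one parallel loop per color. This is a rough model of the reordering, not the paper's production implementation.

```python
from collections import deque

def cm_rcm_colors(adj, ncolors=4):
    """Simplified sketch of CM-RCM: compute breadth-first 'level sets'
    (a stand-in for the RCM level structure of the matrix graph) and
    map levels cyclically onto `ncolors` colors. Rows of one color are
    then processed together -- one OpenMP parallel loop per color in
    the paper's setting. `adj` maps each node to its neighbours and is
    assumed connected.
    """
    start = min(adj)                     # deterministic seed node
    level = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return {u: lv % ncolors for u, lv in level.items()}
```

Keeping the color count fixed (rather than one color per level) is what makes the loop lengths long enough to amortize OpenMP overhead, which is the point of the "cyclic" part.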
ISBN: (Print) 9781424438686
This paper presents a novel architecture of a vision chip for fast traffic lane detection (FTLD). The architecture consists of a 32x32 SIMD processing element (PE) array processor and a dual-core RISC processor. The PE array processor performs low-level pixel-parallel image processing at high speed and outputs image features for high-level image processing without an I/O bottleneck. The dual-core processor carries out the high-level image processing. A parallel fast lane detection algorithm for this architecture is developed. An FPGA system with a CMOS image sensor is used to implement the architecture. Experimental results show that the system can perform fast traffic lane detection at a 50 fps rate. It is much faster than previous works and is robust enough to operate under various light intensities. The novel vision-chip architecture is able to meet the demands of a real-time lane departure warning system.
ISBN: (Print) 9781424438686
A Super Hi-Vision (SHV) 4kx4k@60fps fractional motion estimation (FME) engine is proposed in this paper. First, mode reduction and edge detection techniques are adopted to filter out unpromising modes at the algorithm level. Second, two parallel improvement schemes, called 16-pel scale processing and MB-split assignment, are introduced at the hardware level, reducing the required clock frequency to only 217 MHz. Moreover, a sub-sampling technique is adopted during SATD (sum of absolute transformed differences) generation, which saves 75% of the hardware cost. Using TSMC 0.18 um technology under worst-case operating conditions (1.62 V, 125 degrees C), our FME engine achieves SHV 4kx4k@60fps real-time processing with 547.5k gates.
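The 75% saving follows directly from 2:1 sub-sampling in both directions: one pixel in four survives, so the SATD datapath shrinks accordingly. A reference-style sketch, assuming a standard 4x4 Hadamard SATD kernel and an every-other-row/column pattern (the paper's exact pattern may differ):

```python
def satd_4x4(block):
    """Sum of absolute transformed differences of a 4x4 residual block
    using the standard 4x4 Hadamard transform (H * block * H^T)."""
    H = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
    t = [[sum(H[i][k] * block[k][j] for k in range(4)) for j in range(4)]
         for i in range(4)]
    u = [[sum(t[i][k] * H[j][k] for k in range(4)) for j in range(4)]
         for i in range(4)]
    return sum(abs(u[i][j]) for i in range(4) for j in range(4))

def subsampled_satd(residual):
    """Sketch of the cost saving: keep every other row and column of
    an 8x8 residual (one pixel in four, hence ~75% less SATD work) and
    run SATD on the remaining 4x4 block."""
    sub = [[residual[2 * i][2 * j] for j in range(4)] for i in range(4)]
    return satd_4x4(sub)
```

In the hardware, the smaller transform means proportionally fewer adders and absolute-value units in the SATD tree, which is where the quoted 75% gate saving comes from.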