ISBN (print): 0818656026
Segmentation and other image processing operations rely on convolution calculations with heavy computational and memory access demands. This paper presents an analysis of a texture segmentation application containing a 96x96 convolution. Sequential execution required several hours on single-processor systems, with over 99% of the time spent performing the large convolution; 70% to 75% of execution time is attributable to cache misses within the convolution. We implemented the same application on CM-5, iPSC/860, and PVM distributed-memory multicomputers, tailoring the parallel algorithms to each machine's architecture. Parallelization significantly reduced execution time, taking 49 seconds on a 512-node CM-5 and 6.5 minutes on a 32-node iPSC/860.
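To illustrate why a 96x96 kernel dominates the run time, a direct 2D "valid" convolution costs kh*kw multiply-accumulates per output pixel (9216 for 96x96). The following sketch is a generic reference implementation, not the paper's parallel code; it uses the unflipped (cross-correlation) form common in image processing.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Direct 2D 'valid' convolution (cross-correlation form, no kernel
    flip). Every output pixel costs kh*kw multiply-accumulates, so a
    96x96 kernel performs 9216 MACs per pixel -- the cost that made
    sequential execution take hours in the paper's application."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each window also touches kh*kw memory locations, which is
            # why cache misses dominated 70-75% of the sequential time
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out
```

Parallelizing this kernel amounts to partitioning the output rows (or tiles) across nodes, which is what the CM-5 and iPSC/860 implementations exploit.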
Due to advances in fiber-optics and VLSI technology, interconnection networks which allow multiple simultaneous broadcasts are becoming feasible. This paper presents the multiprocessor architecture of the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) and examines the performance of representative algorithms for matrix operations, merging, and sorting using the message-passing and distributed-shared-memory paradigms. It shows that simple enhancements to the network interface and the cache and directory controllers can result in communication time of O(1) for the matrix-vector multiplication algorithm using DSM. The SOME-Bus is a low-latency, high-bandwidth, fiber-optic interconnection network which directly links arbitrary pairs of processor nodes without contention and can efficiently interconnect over 100 nodes. It contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidth that scales directly with the number of nodes in the system. Each of P nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or due to contention for shared switching logic. The entire P-receiver array can be integrated on a single chip at a comparatively minor cost, resulting in O(P) complexity. The SOME-Bus has much more functionality than a crossbar by supporting multiple simultaneous broadcasts of messages, allowing cache consistency protocols to complete much faster. (C) 2003 Elsevier B.V. All rights reserved.
ISBN (print): 9783642131189
In this paper, we propose an efficient algorithm for parallel prefix computation in the recursive dual-net, a newly proposed network. The recursive dual-net RDN^k(B) for k > 0 has (2*n_0)^(2^k)/2 nodes and d_0 + k links per node, where n_0 and d_0 are the number of nodes and the node-degree of the base network B, respectively. Assuming that each node holds one data item, the communication and computation time complexities of the algorithm for parallel prefix computation in RDN^k(B), k > 0, are 2^(k+1) - 2 + 2^k * T_comm(0) and 2^(k+1) - 2 + 2^k * T_comp(0), respectively, where T_comm(0) and T_comp(0) are the communication and computation time complexities of the algorithm for parallel prefix computation in the base network B, respectively.
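The abstract concerns a network-specific prefix algorithm; as background, the generic parallel-prefix (scan) pattern it builds on can be sketched as a Hillis-Steele style inclusive scan, in which round d combines each element with the one 2^d positions earlier. This is a sequential simulation of the generic pattern, not the RDN-specific algorithm of the paper.

```python
def inclusive_scan(data, op=lambda a, b: a + b):
    """Hillis-Steele style inclusive scan: ceil(log2(n)) rounds; in
    round with stride d, every position i >= d combines with position
    i - d. On a parallel machine each round is one step; here the
    rounds are simulated sequentially."""
    x = list(data)
    n = len(x)
    d = 1
    while d < n:
        # all positions update simultaneously in a real parallel step,
        # so we build the next array from the previous one
        x = [op(x[i - d], x[i]) if i >= d else x[i] for i in range(n)]
        d *= 2
    return x
```

With an associative operator `op`, element i of the result is the prefix combination of the first i+1 inputs, e.g. running sums for addition.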
ISBN (print): 9781509028603
Convolutional Neural Network (CNN) is the state-of-the-art deep learning approach employed in various applications due to its remarkable performance. Convolutions in CNNs generally dominate the overall computational complexity and thus consume the major computational power in real implementations. In this paper, efficient hardware architectures incorporating the parallel fast finite impulse response (FIR) algorithm (FFA) for CNN convolution implementations are discussed. The theoretical derivation of 3- and 5-parallel FFAs is presented, and the corresponding 3- and 5-parallel fast convolution units (FCUs) are proposed for the most commonly used 3x3 and 5x5 convolutional kernels in CNNs, respectively. Compared to conventional CNN convolution architectures, the proposed FCUs significantly reduce the number of multiplications used in convolutions. Additionally, the FCUs minimize the number of reads from the feature-map memory. Furthermore, a reconfigurable FCU architecture which suits the convolutions of both 3x3 and 5x5 kernels is proposed. Based on this, an efficient top-level architecture for processing a complete convolutional layer in a CNN is developed. To quantify the benefits of the proposed FCUs, the design of an FCU is coded in RTL and synthesized with TSMC 90nm CMOS technology. The implementation results demonstrate that 30% and 36% of the computational energy can be saved compared to conventional solutions with 3x3 and 5x5 kernels in CNNs, respectively.
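The multiplication savings of parallel FFAs come from a polyphase decomposition: a 2-parallel FFA, for instance, computes two outputs per step with three half-length subfilters (H0*X0, H1*X1, (H0+H1)*(X0+X1)) instead of four, about a 25% reduction. The sketch below illustrates that decomposition in software; it is not the paper's 3-/5-parallel FCU hardware, only the underlying FFA identity.

```python
import numpy as np

def fir_2parallel_ffa(h, x):
    """2-parallel fast FIR algorithm (FFA) sketch. Split filter h and
    input x into even/odd phases and compute three half-length
    subfilter convolutions instead of four:
        Y_even = H0*X0 + delay(H1*X1)
        Y_odd  = (H0+H1)*(X0+X1) - H0*X0 - H1*X1
    which cuts multiplications by roughly 25% versus direct FIR."""
    h = np.asarray(h, dtype=float)
    x = np.asarray(x, dtype=float)
    assert len(h) % 2 == 0 and len(x) % 2 == 0
    h0, h1 = h[0::2], h[1::2]
    x0, x1 = x[0::2], x[1::2]
    A = np.convolve(h0, x0)            # H0 * X0
    B = np.convolve(h1, x1)            # H1 * X1
    C = np.convolve(h0 + h1, x0 + x1)  # (H0 + H1) * (X0 + X1)
    n = len(h) + len(x) - 1
    y = np.zeros(n)
    y[0::2][:len(A)] += A              # even outputs
    y[2::2][:len(B)] += B              # H1*X1 delayed by z^-2
    y[1::2][:len(C)] = C - A - B       # odd outputs
    return y
```

The result matches `np.convolve(h, x)` exactly; the hardware FCUs apply the same idea at the 3- and 5-parallel level to 2D kernels.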
ISBN (print): 0818608781
The complexity of efficiently programming massively parallel machines is illustrated by presenting a number of new algorithms. These algorithms deal with computational geometry, data histogramming, list manipulation, and other problems or operations that arise in computer vision tasks. All of the algorithms presented use a divide-and-conquer strategy, and they all use routines that solve a problem of size M using a machine of size N, where M < N. In some of the algorithms, the extra processors are used to perform more than M calculations in parallel. In other algorithms, the extra processors are used to improve the speed of interprocessor communication.
ISBN (print): 9783540695004
Due to their high-level nature, parallel functional languages provide some advantages for the programmer. Unfortunately, the functional programming community has not paid much attention to some important practical problems, like debugging parallel programs. In this paper we introduce the first debugger that works with any parallel extension of the functional language Haskell, the de facto standard in the (lazy evaluation) functional programming community. The debugger is implemented as an independent library; thus, it can be used with any Haskell compiler. Moreover, the debugger can be used to analyze how much speculative work has been done in any program.
ISBN (print): 9781450365109
Emerging many-core CPU architectures with high degrees of single-instruction, multiple-data (SIMD) parallelism promise to enable increasingly ambitious simulations based on partial differential equations (PDEs) via extreme-scale computing. However, such architectures present several challenges to their efficient use. Here, we explore the efficient implementation of sparse matrix-vector (SpMV) multiplication, a critical kernel for the workhorse iterative linear solvers used in most PDE-based simulations, on recent CPU architectures from Intel, including the second-generation Knights Landing Intel Xeon Phi, which features many CPU cores, wide SIMD lanes, and on-package high-bandwidth memory. Traditional SpMV algorithms use the compressed sparse row (CSR) storage format, which is a hindrance to exploiting wide SIMD lanes. We study alternative matrix formats and present an efficient optimized SpMV kernel, based on a sliced ELLPACK representation, which we have implemented in the PETSc library. In addition, we demonstrate the benefit of using this representation to accelerate preconditioned iterative solvers in realistic PDE-based simulations in parallel.
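The sliced ELLPACK idea can be sketched in a few lines: rows are grouped into short slices, and each slice is padded only to its own longest row, so SIMD lanes can walk equal-length columns with little wasted padding. This NumPy sketch is a simplified illustration of the format, not the vectorized PETSc kernel itself.

```python
import numpy as np

def to_sell(A, C=2):
    """Pack dense matrix A into sliced-ELLPACK slices of C rows each.
    Each slice stores (vals, cols) padded to that slice's longest row;
    padded entries use column 0 with value 0, so they contribute
    nothing to the product."""
    slices = []
    for s in range(0, A.shape[0], C):
        block = A[s:s + C]
        nz = [np.nonzero(row)[0] for row in block]
        w = max((len(c) for c in nz), default=0)  # slice-local width
        cols = np.zeros((len(block), w), dtype=int)
        vals = np.zeros((len(block), w))
        for i, c in enumerate(nz):
            cols[i, :len(c)] = c
            vals[i, :len(c)] = block[i, c]
        slices.append((vals, cols))
    return slices

def sell_spmv(slices, x):
    """SpMV over the sliced-ELLPACK slices: within a slice, all rows
    advance through their padded columns in lockstep (the access
    pattern that maps onto wide SIMD lanes)."""
    parts = [(vals * x[cols]).sum(axis=1) for vals, cols in slices]
    return np.concatenate(parts)
```

Compared with CSR, where row lengths vary arbitrarily, each slice here presents a rectangular block whose columns vectorize cleanly; the slice-local padding keeps the overhead far below plain (unsliced) ELLPACK.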
ISBN (print): 0769522297
Irregular and dynamic memory reference patterns can cause performance variations for low-level algorithms in general and for parallel algorithms in particular. We present an adaptive algorithm selection framework which can collect and interpret the inputs of a particular instance of a parallel algorithm and select the best-performing one from an existing library. In this paper we present the dynamic selection of parallel reduction algorithms. First, we introduce a set of high-level parameters that can characterize different parallel reduction algorithms. Then we describe an off-line, systematic process to generate predictive models which can be used for run-time algorithm selection. Our experiments show that our framework: (a) selects the most appropriate algorithms in 85% of the cases studied, (b) overall delivers 98% of the optimal performance, (c) adaptively selects the best algorithms for dynamic phases of a running program (resulting in performance improvements otherwise not possible), and (d) adapts to the underlying machine architecture (tested on IBM Regatta and HP V-Class systems).
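The skeleton of such a framework, a library of interchangeable reduction algorithms plus a selection step, can be sketched as below. Note that the paper builds off-line predictive models from input characteristics; this sketch substitutes a much simpler on-line probe (time each candidate on a prefix, then run the winner) purely to illustrate the library-plus-selector structure.

```python
import time

def reduce_loop(xs):
    """Straight-line sequential reduction."""
    total = 0.0
    for v in xs:
        total += v
    return total

def reduce_tree(xs):
    """Pairwise tree reduction, mirroring a parallel evaluation order."""
    xs = list(xs)
    if not xs:
        return 0.0
    while len(xs) > 1:
        if len(xs) % 2:
            xs.append(0.0)  # pad so pairs divide evenly
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

LIBRARY = [reduce_loop, reduce_tree]

def timed(f, data):
    t0 = time.perf_counter()
    f(data)
    return time.perf_counter() - t0

def select_and_run(xs, sample=1000):
    """Probe each library algorithm on a prefix of the input and run
    the fastest one on the full input (best of 3 probe timings)."""
    probe = xs[:sample]
    best = min(LIBRARY, key=lambda f: min(timed(f, probe) for _ in range(3)))
    return best(xs)
```

A model-based selector would replace the probing in `select_and_run` with a prediction from the high-level input parameters the paper introduces.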
ISBN (print): 9781479982523
Recently, the OpenCL hardware-software co-design methodology has gained traction in realizing effective parallel architecture designs on heterogeneous FPGA platforms. In fact, the portability of OpenCL on hardware-ready platforms such as GPUs or multicore CPUs enables ease of design verification, especially for parallel algorithms, before implementing them using cumbersome HDL-based RTL design. In this paper we employ the OpenCL programming platform based on the Altera SDK for OpenCL (AOCL) to implement a Sobel filter algorithm as an image processing test case on a Cyclone V FPGA board. Using the portability of this platform, the performance of the kernel code is benchmarked against that of GPU and multicore CPU implementations for different image and kernel sizes. Different optimization strategies are also applied for each platform. We found that increasing the Sobel filter kernel size from 3x3 to 5x5 results in only an 11.3% increase in computation time on the FPGA, while the increase was much more significant on the CPU and GPU, at 23.6% and 85.7%, respectively.
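For reference, the Sobel test case computes a gradient magnitude from two fixed 3x3 kernels. The sketch below is a plain NumPy reference of the standard operator, not the paper's OpenCL kernel; border pixels are simply left at zero.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T  # vertical-gradient kernel

def sobel_magnitude(img):
    """Gradient magnitude |G| = sqrt(Gx^2 + Gy^2) using the 3x3 Sobel
    kernels. Each interior pixel reads a 3x3 window, which is why
    moving to 5x5 nearly triples the per-pixel work on a CPU/GPU."""
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            win = img[i-1:i+2, j-1:j+2]
            gx = np.sum(win * SOBEL_X)
            gy = np.sum(win * SOBEL_Y)
            out[i, j] = np.hypot(gx, gy)
    return out
```

On an FPGA, the same 3x3 window becomes a shift-register line buffer fed one pixel per cycle, which is why enlarging the kernel costs comparatively little there.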
ISBN (print): 0819415472
The Accurate Automation Corporation (AAC) neural network processor (NNP) module is a fully programmable multiple instruction, multiple data (MIMD) parallel processor optimized for the implementation of neural networks. The AAC NNP design fully exploits the intrinsic sparseness of neural network topologies. Moreover, by using a MIMD parallel processing architecture one can update multiple neurons in parallel with efficiency approaching 100 percent as the size of the network increases. Each AAC NNP module has 8K neurons and 32K interconnections and is capable of 140,000,000 connections per second, with an eight-processor array capable of over one billion connections per second.