ISBN: 9781424445523 (print)
This paper presents a parallelized architecture of multiple classifiers for face detection based on the Viola and Jones object detection method. This method uses the AdaBoost algorithm, which identifies a sequence of Haar classifiers that indicate the presence of a face. We describe the hardware design techniques used to accelerate the face detection system, including image scaling, integral image generation, pipelined processing of classifiers, and parallel processing of multiple classifiers. We also discuss how the parallelized architecture scales across configurable devices with varying resources. We implement the proposed architecture in Verilog HDL on a Xilinx Virtex-5 FPGA and show that the parallelized architecture of multiple classifiers achieves a 3.3x performance gain over a single-classifier architecture and an 84x performance gain over an equivalent software solution.
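The integral image step named in the abstract above can be illustrated in software. The following is a minimal Python sketch, not the paper's Verilog design: the function names and the two-rectangle feature layout are assumptions chosen for illustration. It shows why the integral image matters: once it is built, the sum over any rectangle costs four lookups, so a Haar-like feature costs a handful of additions regardless of its size.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column.

    ii[y][x] holds the sum of all pixels above and to the left of (x, y),
    exclusive, so rectangle sums need no boundary checks.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_haar(ii, x, y, w, h):
    """Hypothetical edge feature: left half minus right half (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

A cascade stage would compare a weighted sum of such feature values against a trained threshold; the thresholds and weights come from AdaBoost training and are not shown here.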
ISBN: 9783642030949 (print)
In the sequential model of programming, the instructions in a program are executed one after another. Existing programming languages are mainly designed for this model. As the programming paradigm shifts from sequential to distributed computing, existing sequential programming languages show their limitations. Nevertheless, sequential languages are the languages with which most programmers are most familiar. One motivation of this research is to implement a framework that supports building distributed applications in sequential programming languages such as C/C++, COBOL, and Java. In this paper, we present an implementation of a framework for open distributed programming. Allowing programmers to write distributed programs in their preferred sequential programming languages distinguishes this approach from existing programming paradigms.
ISBN: 9781424445523 (print)
This paper presents an FPGA-based parallel hardware architecture for real-time face detection. An image pyramid with twenty depth levels is generated from the input image. For these scaled-down images, a local binary pattern transform and feature evaluation are performed in parallel using the proposed block RAM-based window processing architecture. By sharing the feature look-up tables between two corresponding scaled-down images, we reduce the use of routing resources by half. For prototyping and evaluation purposes, the hardware architecture was integrated into a Virtex-5 FPGA. The experimental results show a speed of around 300 frames per second when processing standard VGA (640x480x8) images. In addition, the throughput of the implementation can be adjusted in proportion to the frame rate of the camera by synchronizing each individual module with the pixel sampling clock.
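For readers unfamiliar with the local binary pattern (LBP) transform mentioned above, here is a minimal Python sketch of the basic 3x3 variant. The abstract does not say which LBP variant or bit ordering the hardware uses, so the neighbor order and the `>=` comparison below are assumptions for illustration.

```python
# Neighbor offsets (dy, dx), clockwise from the top-left pixel.
# This bit order is an assumption; the abstract does not specify one.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, y, x):
    """8-bit LBP code of the pixel at (y, x), which must not lie on the border."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_transform(img):
    """LBP codes for all interior pixels; the output is 2 smaller in each dimension."""
    h, w = len(img), len(img[0])
    return [[lbp_code(img, y, x) for x in range(1, w - 1)]
            for y in range(1, h - 1)]
```

Each code depends only on a 3x3 window, which is why the transform suits a streaming window architecture: every output pixel can be computed independently once three image rows are buffered.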
ISBN: 9781424449231 (print)
This paper studies the loose integration of application accelerators, consisting of an array of tightly coupled lightweight reconfigurable processors, into a system-on-a-chip. To explore a multitude of design variations, a C++ simulation model of the accelerator has been integrated with a system-on-a-chip environment consisting of a general-purpose processor, a DMA controller, an interrupt controller, and a memory module. Depending on the application, different kinds of I/O buffers are designed around the processor array, and the effects of the buffer size on the overall execution time are evaluated. The evaluations are based on new mathematical estimation models derived from the system and application constraints. The estimations are validated against experimental results with an error of less than 1%. Exploring several design points shows that our architecture, combined with suitable buffer sizes, can improve the system execution time by one to two orders of magnitude for the selected algorithms.
ISBN: 9781424445523 (print)
We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), an important class of machine learning algorithms. The coprocessor's functional units, consisting of parallel 2D convolution primitives and programmable units that perform sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words into every memory operation. We also leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions that transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1GB. The coprocessor prototype can process 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, which is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
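The idea of packing several low-precision values into each memory word can be sketched in a few lines of Python. The abstract does not give the data width or the memory word width, so the 8-bit values and 32-bit words below are assumptions, as are the function names.

```python
def pack_words(values, bits=8, word_bits=32):
    """Pack signed `bits`-wide values into `word_bits`-wide memory words.

    Lane 0 occupies the least significant bits of each word
    (an assumed layout, chosen for illustration).
    """
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    words = []
    for i in range(0, len(values), per_word):
        word = 0
        for lane, value in enumerate(values[i:i + per_word]):
            word |= (value & mask) << (lane * bits)  # two's complement truncation
        words.append(word)
    return words


def unpack_words(words, count, bits=8, word_bits=32):
    """Recover `count` signed values from words produced by pack_words."""
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    out = []
    for word in words:
        for lane in range(per_word):
            raw = (word >> (lane * bits)) & mask
            out.append(raw - (1 << bits) if raw & sign_bit else raw)
    return out[:count]
```

With these assumed widths, one memory operation moves four operands instead of one, so the effective bandwidth seen by the convolution units is four times the raw word rate.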
ISBN: 9783642030949 (print)
Hardware/Software (HW/SW) partitioning can generally be solved approximately with optimization algorithms. Based on the characteristics of both HW/SW partitioning and the Particle Swarm Optimization (PSO) algorithm, this paper proposes a novel parallel HW/SW partitioning method. A model of parallel HW/SW partitioning based on the PSO algorithm is established after analyzing the particular features of HW/SW partitioning. A hybrid strategy of PSO and Tabu Search (TS) is proposed, which uses the intrinsic parallelism of PSO and the memory function of TS to speed up and improve the performance of PSO. To address the problem of premature convergence, the reproduction and crossover operations of the genetic algorithm (GA) are also introduced into the PSO procedure. Experimental results indicate that the parallel PSO algorithm can efficiently reduce the running time even for large task graphs.
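To make the PSO formulation concrete, the sketch below applies a plain binary PSO to a toy partitioning problem. It is not the paper's method: it omits the Tabu Search hybrid, the GA reproduction and crossover operators, and the parallel execution. The cost model (sum of per-task execution times under a hardware area budget), the coefficients 0.7 and 1.5, and all names are assumptions for illustration.

```python
import math
import random


def cost(partition, sw_time, hw_time, hw_area, area_budget):
    """Total execution time; infinite if the hardware area budget is exceeded.

    partition[i] == 1 places task i in hardware, 0 in software.
    """
    area = sum(a for bit, a in zip(partition, hw_area) if bit)
    if area > area_budget:
        return math.inf
    return sum(h if bit else s
               for bit, s, h in zip(partition, sw_time, hw_time))


def binary_pso(sw_time, hw_time, hw_area, area_budget,
               particles=20, iterations=100, seed=0):
    """Plain binary PSO with a sigmoid position update (illustrative only)."""
    rng = random.Random(seed)
    n = len(sw_time)

    def evaluate(p):
        return cost(p, sw_time, hw_time, hw_area, area_budget)

    # Particle 0 is the all-software partition, which is always feasible.
    swarm = [[0] * n] + [[rng.randint(0, 1) for _ in range(n)]
                         for _ in range(particles - 1)]
    velocity = [[0.0] * n for _ in range(particles)]
    personal = [p[:] for p in swarm]
    personal_cost = [evaluate(p) for p in swarm]
    lead = min(range(particles), key=personal_cost.__getitem__)
    best, best_cost = personal[lead][:], personal_cost[lead]

    for _ in range(iterations):
        for i, p in enumerate(swarm):
            for d in range(n):
                v = (0.7 * velocity[i][d]
                     + 1.5 * rng.random() * (personal[i][d] - p[d])
                     + 1.5 * rng.random() * (best[d] - p[d]))
                v = max(-4.0, min(4.0, v))  # clamp to keep the sigmoid stable
                velocity[i][d] = v
                p[d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            c = evaluate(p)
            if c < personal_cost[i]:
                personal[i], personal_cost[i] = p[:], c
                if c < best_cost:
                    best, best_cost = p[:], c
    return best, best_cost
```

Because each particle is evaluated independently within an iteration, the inner evaluation loop is the natural place to parallelize, which is the intrinsic parallelism the abstract refers to.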
ISBN: 9781424436927 (print)
The Fast Fourier Transform (FFT) is widely used in OFDM. To increase the speed of data processing, we design a new system composed of two DSP subsystems and an x86 PC subsystem. In the experiment, the two DSP subsystems process the primary data in parallel. The experimental results show that the system handles FFT workloads well and that its capacity can be further improved.
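The abstract does not say how the FFT work is divided between the two DSP subsystems. One plausible division, shown here purely as an assumption, is the first radix-2 decimation-in-time split: one worker transforms the even-indexed samples, the other the odd-indexed samples, and the host combines the two half-size results with one butterfly pass. In this Python sketch two threads stand in for the DSP subsystems.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def combine(even, odd):
    """Butterfly pass merging two half-size FFTs into one full-size FFT."""
    half = len(even)
    n = 2 * half
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    return combine(fft(x[0::2]), fft(x[1::2]))


def fft_two_workers(x):
    """Split the top-level FFT across two workers (stand-ins for two DSPs)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    with ThreadPoolExecutor(max_workers=2) as pool:
        even = pool.submit(fft, x[0::2])
        odd = pool.submit(fft, x[1::2])
        return combine(even.result(), odd.result())
```

In this arrangement each worker handles half the samples, and only the final combine step, which is linear in the input length, runs on the host.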
ISBN: 9783642030949 (print)
Parallel sorting algorithms in hypercubes have been studied extensively. One practical parallel sorting algorithm is bitonic sort, which sorts $N = 2^n$ numbers in $O(n^2)$ time on an n-cube. The metacube, a versatile family of interconnection networks proposed as an alternative to the hypercube, was designed for building extremely large-scale multiprocessor systems with a small number of links per node. A metacube $MC(k, m)$ connects $2^{2^k m + k}$ nodes with only $k + m$ links per node. In this paper, we present an efficient sorting algorithm for metacube multiprocessors. The proposed algorithm is based on Batcher's bitonic sorting algorithm. To perform parallel sorting efficiently in a metacube, we give a new presentation of the metacube such that the communications required by the algorithm can be done efficiently with gather and scatter operations. The parallel bitonic sort algorithm implemented in metacubes with the new presentation runs in $O((2^k m + k)^2)$ computation steps and $O((2^k m (2k + 1) + k)^2)$ communication steps.
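The hypercube form of Batcher's bitonic sort, on which the paper builds, can be simulated sequentially. In the Python sketch below, each list index plays the role of a node holding one key, and `i ^ j` is the neighbor of node `i` across one hypercube dimension. The metacube mapping with gather and scatter operations is not reproduced; the function name and the returned stage count are additions for illustration.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network.

    Returns (sorted_list, stages), where `stages` counts the compare-exchange
    stages. On a real hypercube, all pairs within a stage run in parallel.
    """
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = 0
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance to the partner: one dimension per stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            stages += 1
            j //= 2
        k *= 2
    return a, stages
```

For $N = 2^n$ keys the network has $n(n+1)/2$ stages, which is the source of the $O(n^2)$ bound quoted for the n-cube.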
The multicore revolution is underway. Classical algorithms must be revisited in order to take the hierarchical memory layout into account. In this paper, we aim at minimizing the number of cache misses paid during the...
ISBN: 9783642030949 (print)
This paper proposes a new parallel Montgomery binary exponentiation algorithm. The algorithm is based on the Montgomery modular reduction technique, the binary method, the common-multiplicand-multiplication (CMM) algorithm, and the canonical-signed-digit (CSD) recoding technique. The CMM algorithm computes the part common to two modular multiplications once rather than twice, which improves the efficiency of the binary exponentiation algorithm by decreasing the number of modular multiplications. Furthermore, with the proposed parallel CMM-CSD Montgomery binary exponentiation algorithm, the total number of single-precision multiplications is reduced by about 66.7% compared with the original Montgomery algorithm and by about 30% compared with Ha and Moon's improved Montgomery algorithm.
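Two of the building blocks named above, Montgomery multiplication with the binary method and CSD recoding, can be sketched in Python. This is the textbook baseline only: the CMM optimization and the parallel scheduling proposed in the paper are not shown, and the function names are assumptions. `pow(n, -1, r)` requires Python 3.8 or later.

```python
def montgomery_setup(n):
    """Return (k, r, n_prime) with r = 2**k > n and n * n_prime == -1 (mod r)."""
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return k, r, n_prime


def mont_mul(a, b, n, k, n_prime):
    """Montgomery product a * b * r**-1 mod n, for 0 <= a, b < n."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k          # exact division by r
    return u - n if u >= n else u


def mont_pow(base, exponent, n):
    """Left-to-right binary exponentiation in the Montgomery domain."""
    k, r, n_prime = montgomery_setup(n)
    x = (base % n) * r % n        # base in Montgomery form
    acc = r % n                   # 1 in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)  # convert back


def csd_digits(e):
    """Canonical signed-digit recoding of e, least significant digit first.

    Digits are in {-1, 0, 1} and no two adjacent digits are nonzero.
    """
    digits = []
    while e > 0:
        if e & 1:
            d = 2 - (e % 4)       # 1 if e % 4 == 1, -1 if e % 4 == 3
            e -= d
        else:
            d = 0
        digits.append(d)
        e >>= 1
    return digits
```

CSD recoding lowers the number of nonzero digits in the exponent, and each nonzero digit costs one multiplication in the binary method. How the paper handles the negative digits inside the exponentiation loop is not stated in the abstract and is not modeled here.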