ISBN: 9783642046964 (print)
The rapid increase in pixel density and frame rates of modern imaging sensors is accelerating the demand for fine-grained and embedded parallelization strategies to achieve real-time implementations for video analysis. The IBM Cell Broadband Engine (BE) processor has an appealing multi-core chip architecture with multiple programming models suitable for accelerating multimedia and vector processing applications. This paper describes two parallel algorithms for blob extraction in video sequences: binary morphological operations and connected components labeling (CCL), both optimized for the Cell-BE processor. Novel parallelization and explicit instruction-level optimization techniques are described for fully exploiting the computational capacity of the Synergistic Processing Elements (SPEs) on the Cell processor. Experimental results show significant speedups, ranging from a factor of nearly 300 for binary morphology to a factor of 8 for CCL, in comparison to equivalent sequential implementations applied to High Definition (HD) video.
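For orientation, the two blob-extraction steps named in this abstract can be sketched as a plain sequential baseline. This is a minimal illustration only, not the Cell-optimized method of the paper: the 3x3 square structuring element, the 4-connectivity, the flood-fill labeling strategy, and the function names `erode3x3` and `label_components` are all assumptions made for the example.

```python
from collections import deque


def erode3x3(img):
    """Binary erosion with a 3x3 square structuring element.

    Border pixels are set to 0 (assumed convention for this sketch).
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # A pixel survives only if its whole 3x3 neighborhood is set.
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1)
                                for dx in (-1, 0, 1)))
    return out


def label_components(img):
    """Label 4-connected foreground blobs by breadth-first flood fill.

    Returns (label image, number of blobs); label 0 means background.
    """
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

A sequential version like this is the kind of reference against which parallel speedups are usually measured; the per-pixel neighborhood test in `erode3x3` is independent across pixels, which is what makes morphology much easier to vectorize than labeling.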
In this paper, we contribute two reconfigurable integer motion estimation (IME) architectures, RSADT and RPPSAD, based on an adaptive algorithm. First, based on pixel difference analysis, the spatial redundancy is further exploited and three subsampling patterns are selected adaptively. Second, in order to keep full data reuse, we propose an architecture-level data organization for the RSADT architecture. For RPPSAD, we apply pixel classification and memory organization to keep full data reuse. An interactive data loading scheme is proposed to reduce power dissipation. Experiments show that, with some extra hardware, RSADT achieves an average 65.86% reduction in processing time, while RPPSAD reduces power dissipation by 25.4% to 39.8% when processing typical HDTV720p sequences.
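The idea of trading matching accuracy for work through pixel subsampling can be illustrated with a small software sketch of sum-of-absolute-differences (SAD) block matching. The abstract does not specify the three subsampling patterns or the adaptive selection rule, so the uniform `step` parameter below is a hypothetical stand-in, as are the function names `sad` and `full_search`.

```python
def sad(cur, ref, bx, by, dx, dy, size, step=1):
    """SAD between a block in `cur` and a displaced block in `ref`.

    `step` > 1 visits only every step-th pixel in each direction
    (a hypothetical uniform subsampling pattern for this sketch).
    """
    total = 0
    for y in range(0, size, step):
        for x in range(0, size, step):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, size, search, step=1):
    """Exhaustive integer search in a +/- `search` window.

    Returns (best SAD, dx, dy); candidates outside the frame are skipped.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if not (0 <= by + dy and by + dy + size <= h
                    and 0 <= bx + dx and bx + dx + size <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size, step)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

With `step=2` each candidate costs a quarter of the pixel comparisons of the full pattern, which is the kind of saving an adaptive scheme would try to use whenever the block content allows it.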
This paper describes the implementation of a real-time pedestrian detector on a single instruction, multiple data (SIMD), fixed-point digital signal processor (DSP). We reformulate the Histogram of Oriented Gradients algorithm for calculation with a relatively simple instruction set architecture (ISA) and partition the image for parallel processing. Results obtained using an ISA simulator indicate a maximum frame rate above 40 fps for 1 MPixel images, with a detection accuracy comparable to a double-precision floating-point reference implementation.
LILY is a high-performance VLIW DSP processor for multimedia applications, developed by Tsinghua University. The processor classifies the instructions and determines whether they should be issued in parallel according to their order. With this scheme, the LILY processor saves one bit of operation code at the cost of inserting very few no-operation (NOP) instructions. In addition, a corresponding assembler must be designed to accommodate this parallelism scheme, which helps LILY realize the method efficiently. The evaluation results show that the processor is well suited to high-performance applications, with high code density and small program code size.
High-performance and flexible configurable extract instructions targeted at stream cipher processing are proposed by analyzing the structures and operating characteristics of more than forty public stream cipher algorithms in this *** extract instructions are designed to support four different data widths, and ten parallel extract modes are exploited by instruction-level parallelism based on VLIW system *** more, the corresponding reconfigurable hardware circuit is *** configuring the hardware circuit, extraction with different data widths and different parallel modes can be performed efficiently, so the circuit can be used as an important acceleration unit in special-purpose processing for stream ciphers.
It is hard for a user unfamiliar with parallel programming to solve models by numerical analysis methods in distributed environments, which normally involves parallel state space generation (SSG). To lower this threshold, an automatic parallelization approach based on the MapReduce framework is presented in this paper. It has been implemented in a small-scale distributed environment, and its correctness and feasibility have been verified by experiments carried out on a series of models of varying scale.
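One way to picture state space generation in a MapReduce style is as repeated rounds in which a map step expands every frontier state into its successors and a reduce step merges duplicates and discards states already seen. The sketch below runs in a single process and only mimics that structure; it is not the paper's implementation, and `generate_state_space` and the `successors` callback are hypothetical names introduced for the example.

```python
def generate_state_space(initial, successors):
    """Enumerate all states reachable from `initial`.

    `successors(state)` must return an iterable of hashable next states.
    Each loop iteration corresponds to one map/reduce round.
    """
    seen = {initial}
    frontier = [initial]
    while frontier:
        # Map phase: every frontier state independently emits its successors.
        emitted = [nxt for state in frontier for nxt in successors(state)]
        # Reduce phase: merge duplicates and drop states found in earlier rounds.
        new_states = set(emitted) - seen
        seen |= new_states
        frontier = list(new_states)
    return seen
```

Because the map phase touches each frontier state independently, it is the part that a framework could distribute across workers, while the duplicate elimination in the reduce phase is where the workers' results have to be combined.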
Multi-processor systems on chip (MPSOC) are emerging as solutions for high-performance embedded systems. Although important work has been achieved in the design and implementation of such systems, the issue of parallel software design has not yet been properly evaluated for these targets. We present in this work automatic parallelization experiment results on a 16-PE NoC-based MPSOC which we designed and implemented on a single FPGA chip. All reported results come from actual execution and show that speed-up becomes limited beyond 8 processors in this external-memory-constrained environment.
As more computing cores are integrated onto a single chip, the effect of network communication latency is becoming more and more significant on multi-core networks-on-chip (NoCs). For data-parallel applications, we study the model of parallel speedup by including network communication latency in Amdahl's law. The speedup analysis considers the effect of network topology, network size, traffic model and computation/communication ratio. We also study the speedup efficiency. In our multi-core NoC platform, a real data-parallel application, i.e. matrix multiplication, is used to validate the analysis. Our theoretical analysis and the application results show that the speedup improvement is nonlinear and the speedup efficiency decreases as the system size is scaled up. Such analysis can be used to guide architects and programmers to improve parallel processing efficiency by reducing network latency with optimized network design and increasing the computation proportion in the program.
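A rough numerical sketch can show how a communication term changes Amdahl's law. The abstract does not give the paper's model, so the form below is an assumption: speedup is taken as 1 / ((1 - p) + p / n + c(n)), where p is the parallel fraction and c(n) is communication overhead as a fraction of sequential run time. The mesh cost function uses the mean Manhattan distance between two random nodes of a k x k mesh, 2(k^2 - 1) / (3k), scaled by a hypothetical `per_hop` constant.

```python
import math


def speedup(n, parallel_fraction, comm_cost=lambda n: 0.0):
    """Amdahl-style speedup on n cores with an added communication term.

    comm_cost(n) is overhead expressed as a fraction of sequential time
    (assumed model form, not taken from the paper).
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n + comm_cost(n))


def mesh_comm(n, per_hop=0.002):
    """Hypothetical overhead proportional to the mean hop count of a square mesh."""
    k = math.sqrt(n)
    mean_hops = 2.0 * (k * k - 1.0) / (3.0 * k)
    return per_hop * mean_hops


def efficiency(n, parallel_fraction, comm_cost=lambda n: 0.0):
    """Speedup divided by core count."""
    return speedup(n, parallel_fraction, comm_cost) / n


if __name__ == "__main__":
    for cores in (1, 4, 16, 64):
        print(cores,
              round(speedup(cores, 0.95, mesh_comm), 2),
              round(efficiency(cores, 0.95, mesh_comm), 3))
```

Under these assumptions the communication term grows with the mesh side length while the parallel term shrinks, so speedup flattens and efficiency falls as cores are added, which matches the qualitative trend the abstract reports.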
ISBN: 9781424445523 (print)
This paper presents an FPGA-based parallel hardware architecture for real-time face detection. An image pyramid with twenty depth levels is generated from the input image. For these scaled-down images, a local binary pattern transform and feature evaluation are performed in parallel using the proposed block RAM-based window processing architecture. By sharing the feature look-up tables between two corresponding scaled-down images, we can reduce the use of routing resources by half. For prototyping and evaluation purposes, the hardware architecture was integrated into a Virtex-5 FPGA. The experimental results show a speed of around 300 frames per second for processing standard VGA (640×480×8) images. In addition, the throughput of the implementation can be adjusted in proportion to the frame rate of the camera by synchronizing each individual module with the pixel sampling clock.
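The local binary pattern transform mentioned in this abstract can be sketched in software in its basic 3x3 form. The paper's exact variant, neighbor ordering, and comparison rule are not stated in the abstract, so the clockwise ordering from the top-left neighbor and the `>=` comparison below are assumptions, and `lbp3x3` is a hypothetical name.

```python
def lbp3x3(img):
    """Basic 3x3 local binary pattern transform.

    Each interior pixel gets an 8-bit code: bit i is set when the i-th
    neighbor (clockwise from the top-left, an assumed ordering) is >= the center.
    The output is 2 rows and 2 columns smaller than the input.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = len(img), len(img[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = img[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if img[y + dy][x + dx] >= center:
                    code |= 1 << bit
            out[y - 1][x - 1] = code
    return out
```

Each output code depends only on a 3x3 window, so a hardware design can compute it as pixels stream past once three image rows are buffered, which is consistent with the window-processing approach the abstract describes.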
ISBN: 9781424445523 (print)
High computational effort in modern signal and image processing applications often demands special-purpose accelerators in a system on chip (SoC). New high-level synthesis methodologies enable the automated design of such programmable or non-programmable accelerators. Loop tiling is a widely used transformation in these methodologies for dimensioning accelerators so that the inherent massive parallelism of the considered algorithms matches the available functional units and processor elements. Innately, the applications are data-flow dominant and have almost no control flow, but applying tiling techniques has the disadvantage of a more complex control and communication flow. In this paper, we present a methodology for the automatic generation of the control engines of such accelerators. The controller orchestrates the data transfer and computation. The effect of tiling on the area, latency, and power overhead of the controller is studied in detail. It is shown that the controller has a substantial overhead of up to 50% for different tiling and throughput parameters. The energy-delay product is also used as a metric for identifying optimal accelerator designs.
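Loop tiling and the extra control it introduces can be seen in a small software analogue. The sketch below tiles a matrix multiplication; it is an illustration of the transformation in general, not the paper's synthesis flow, and `matmul_tiled` with its `tile` parameter is a hypothetical example.

```python
def matmul_tiled(a, b, tile):
    """Multiply matrices a (n x m) and b (m x p) using square loop tiles.

    The three outer loops step over tiles; the three inner loops do the
    original computation restricted to one tile.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):          # tile loops: added control flow
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # min() handles partial tiles at the matrix borders.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

The result is the same for any tile size, but the untiled three-deep loop nest becomes six loops with boundary checks. In an accelerator, sequencing those tile loops and the associated data transfers is the job of the controller whose overhead the abstract quantifies.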