ISBN: 9781538682371 (print)
The proceedings contain 21 papers. The topics discussed include: a new direct connected component labeling and analysis algorithms for GPUs; accelerating the inference phase in ternary convolutional neural networks using configurable processors; real-time analysis of living biological cell activity; POLYBiNN: a scalable and efficient combinatorial inference engine for neural networks on FPGA; low power image processing applications on FPGAs using dynamic voltage scaling and partial reconfiguration; efficient architecture for implementation of Hermite interpolation on FPGA; efficient task-based code generation for SDF graph execution on multicore processors; and design flow for portable dataflow programming of heterogeneous platforms.
ISBN: 9781728140742 (print)
The proceedings contain 12 papers. The topics discussed include: using time-of-flight sensors for people counting applications; CNN hardware acceleration on a low-power and low-cost APSoC; POLYCiNN: multiclass binary inference engine using convolutional decision forests; mapping and frequency joint optimization for energy efficient execution of multiple applications on multicore systems; FPGA-based acceleration of expectation maximization algorithm using high-level synthesis; SparseCCL: connected components labeling and analysis for sparse images; distilling the knowledge in CNN for WCE screening tool; real-time implementation of adaptive correlation filter tracking for 4K video stream in Zynq UltraScale+ MPSoC; and speeding-up CNN inference through dimensionality reduction.
ISBN: 9781538682371 (print)
The processing platforms of contemporary mobile devices are commonly built around System-on-Chip (SoC) solutions that contain general purpose processor cores (GPPs), digital signal processors, accelerator circuits, and possibly a graphics processing unit (GPU) as processing resources. Software design for such SoCs can be very time-consuming, as the various processing resource types (e.g. GPPs and GPUs) conventionally require different programming languages. For example, GPUs are programmed via CUDA or OpenCL, whereas GPPs are commonly programmed in C++. As a consequence, code originally written for one processing resource cannot necessarily be executed on a different processing resource type. This paper presents a novel design flow that addresses this code portability challenge. At a high level, the application is described using a dataflow graph, whereas the detailed functionality of application components is written in Halide, a performance-portable language that provides code generation for OpenCL, CUDA, HVX DSP, ARM and x86 targets. The proposed design flow is built around PRUNE, a recent dataflow programming framework. The functionality of the design flow is demonstrated with three case study applications, and the measurements show an average speedup of 9.3x over single-core C code when the proposed design flow is used.
ISBN: 9781538682371 (print)
In this paper, the hardware implementation of selected contextual image pre-processing modules for a 3840x2160 @ 60 fps video stream in a Zynq UltraScale+ MPSoC is discussed. The following operations are considered: simple averaging (box filter), Gaussian filter, edge detection using the Sobel and Canny methods, median filter, and morphological erosion and dilation. The scheme for implementing contextual operations on a video stream in the 2 and 4 pixels per clock formats is described, along with the challenges of pipelined processing of such data. The logic resource usage and energy efficiency of modules described in the Verilog hardware description language are also compared with those of modules created using High Level Synthesis tools (Vivado HLS, SDSoC and the xfOpenCV library). All designed modules support real-time processing of a 4K @ 60 fps video stream.
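As an illustration of what a contextual (neighbourhood-based) operation computes, the following is a minimal software reference sketch of the 3x3 Sobel operator in Python. It is not the paper's Verilog or HLS design; the function name `sobel_magnitude`, the list-of-rows image format, the |Gx| + |Gy| magnitude approximation, and the choice to leave border pixels at zero are all assumptions made for this example.

```python
def sobel_magnitude(image):
    """Approximate gradient magnitude |Gx| + |Gy| for a grayscale image.

    `image` is a list of equally long rows of numbers (hypothetical format).
    Border pixels have no full 3x3 context and are left at 0.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column, weights 1-2-1.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row, weights 1-2-1.
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


# A vertical step edge between columns 1 and 2.
step_edge = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
edges = sobel_magnitude(step_edge)
```

Each output pixel depends on a 3x3 window spanning three image rows, which is why a streaming hardware version of such a filter typically needs line buffers, and why processing 2 or 4 pixels per clock complicates the window logic.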
ISBN: 9781538682371 (print)
In this study, we investigate the parallelization of a key feature extraction method, the spectral correlation density (SCD) function, which is used in signal classification systems, particularly under low signal-to-noise ratio conditions, for classifying numerous signals. To reduce the computational complexity of the SCD function, we introduce a method called Quarter SCD (QSCD) that extracts the features of a given signal by processing only a quarter of the input signal data. We then parallelize the QSCD on a general purpose graphics processing unit (GPU) through architecture-specific optimization strategies. We present experimental evaluations to identify the parallelization configuration that maximizes how efficiently the program uses the threading power of the GPU architecture. We show that algorithmic and architecture-specific optimization strategies improve throughput from the 120 signals/second of the state-of-the-art GPU-based Full SCD to 2719 signals/second.
ISBN: 9781538682371 (print)
Until recent years, labeling algorithms for GPUs have been iterative. This was a major problem because the computation time depended on the content of the image: the number of iterations needed for label propagation to stabilize could be very high. In recent years, new direct labeling algorithms have been proposed. They add some extra tests to avoid memory accesses and serialization due to atomic instructions. This article presents two new algorithms, one for connected component labeling (CCL) and one for connected component analysis (CCA). These algorithms use a new data structure combined with low-level intrinsics to leverage the architecture. The connected component analysis algorithm can efficiently compute features like bounding rectangles or statistical moments. A benchmark on a Jetson TX2 shows that the labeling algorithm is 1.8 to 2.7 times faster than the state of the art and can reach a processing rate of 200 fps at a resolution of 2048x2048.
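To make the distinction between labeling (CCL) and analysis (CCA) concrete, here is a minimal sequential sketch in Python. It uses the classic two-pass union-find approach with 4-connectivity, which is a swapped-in textbook technique and not the GPU algorithm or data structure from the article; the function name `label_components` and the list-of-rows binary image format are assumptions for this example.

```python
def label_components(image):
    """Two-pass, 4-connected labeling of a binary image (list of rows).

    Returns (labels, boxes): `labels` has 0 for background and a positive
    label per component; `boxes` maps each label to its bounding rectangle
    (x_min, y_min, x_max, y_max), a simple example of a CCA feature.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # union-find forest; index 0 is reserved for background

    def find(label):
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # First pass: assign provisional labels and record equivalences.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up and left:
                labels[y][x] = min(up, left)
                union(up, left)
            elif up or left:
                labels[y][x] = up or left
            else:
                parent.append(len(parent))  # new label is its own root
                labels[y][x] = len(parent) - 1

    # Second pass: resolve equivalences and accumulate bounding boxes.
    boxes = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = root
                x0, y0, x1, y1 = boxes.get(root, (x, y, x, y))
                boxes[root] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return labels, boxes


binary_image = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
labels, boxes = label_components(binary_image)
```

Unlike iterative label propagation, this direct approach visits each pixel a fixed number of times regardless of image content; the equivalence merging in `union` is the step that parallel versions must coordinate across threads.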
The TULIPP project aims to facilitate the development of embedded image processing systems with real-time and low-power constraints. In this paper, several adaptive dynamic runtime techniques for reconfigurable SoCs a...
ISBN: 9781538682371 (print)
Interpolation is widely used in a number of applications. For instance, it is used to generate geometric models and path trajectories, and in various signal processing applications including data compression. Traditionally, interpolation is implemented in software on a General Purpose Processor (GPP) or Digital Signal Processor (DSP). Field Programmable Gate Arrays (FPGAs), however, offer a more customizable and faster solution. This paper presents a hardware accelerator architecture for Hermite interpolation on FPGA. Cubic Hermite interpolation, specifically the Catmull-Rom spline, is preferred over other techniques due to its ease of computation and local control, which make it an ideal method for implementing a very low-complexity architecture. A data flow approach has been used as it provides high throughput for intensive data processing. Additionally, the efficiency of computation has been improved by implementing a pipelined architecture. The proposed architecture has been implemented on a Xilinx Zynq-7000 AP SoC XC7Z020-CLG484 using the Xilinx Vivado Design Suite. Results obtained from the fixed-point FPGA implementation were found to have high fidelity with floating-point MATLAB simulations. A Root Mean Squared Error (RMSE) of 1.704 × 10^-4 and a correlation of 99.84% were obtained.
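The ease of computation and local control mentioned above can be seen in a short floating-point sketch of uniform Catmull-Rom interpolation, written in Python. This is a software illustration only, not the paper's fixed-point FPGA architecture; the function name `catmull_rom` and the use of the standard uniform formulation (tangents as half the difference of neighbouring points) are assumptions for this example.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Interpolate between p1 and p2 at parameter t in [0, 1].

    Uses the cubic Hermite form with uniform Catmull-Rom tangents, so the
    result depends only on the four nearest control points (local control).
    """
    # Tangents at p1 and p2 from the neighbouring control points.
    m1 = 0.5 * (p2 - p0)
    m2 = 0.5 * (p3 - p1)
    t2 = t * t
    t3 = t2 * t
    # Cubic Hermite basis functions.
    h00 = 2 * t3 - 3 * t2 + 1
    h10 = t3 - 2 * t2 + t
    h01 = -2 * t3 + 3 * t2
    h11 = t3 - t2
    return h00 * p1 + h10 * m1 + h01 * p2 + h11 * m2


# Samples of a straight line: the spline should reproduce it exactly.
midpoint = catmull_rom(0.0, 1.0, 2.0, 3.0, 0.5)
```

The curve passes through `p1` at `t = 0` and `p2` at `t = 1`, and each segment needs only a handful of multiplications and additions, which is what makes the method attractive for a pipelined hardware datapath.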
ISBN: 9781538682371 (print)
Advanced Process Controls (APCs) are already widely and deeply established, especially in industrial plants with high but complex optimization potential, such as chemical batch processes. The best-known representative is the Model Predictive Controller (MPC). Unlike traditional controllers, an MPC uses an explicit process model to predict the future response of the system, given the control input and the past states. To find the optimal control input, this prediction is used to find a sufficient solution to a dynamic optimization problem. Because complex algorithms are needed to find a global minimum of the constrained quadratic problem using online optimization with real-time capabilities, the underlying hardware must deliver suitable performance. FPGA implementations are especially interesting due to the application-specific data flow parallelism of MPC tasks. In this research, we focus on the algorithm development and reconfigurable hardware implementation of a generic MPC using High-Level Synthesis (HLS).
ISBN: 9781538682371 (print)
Fast and robust lane detection algorithms are a fundamental technology for the development of advanced driver assistance systems (ADAS). Many projects in science and industry use these kinds of algorithms. Unfortunately, algorithm implementations mainly focus on standard PC-based hardware. Whether and how the processing can be realized on embedded devices in real time is often not considered. Therefore, in this paper we present an extended evaluation of different optical lane detection algorithms regarding both functional quality and execution time on embedded devices. We compared five different lane detection algorithms for curved roads in combination with four different feature extraction filters. The functional evaluation uses the F-measure metric, while the execution time is measured directly on embedded hardware. Furthermore, the algorithms were optimized to allow real-time processing. Our results show that lane detection on images with a resolution of 1242 x 375 pixels can be done at up to 54 frames per second (fps) on an embedded ARM Cortex-A53 processor running at 1200 MHz.
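The F-measure used for the functional evaluation combines precision and recall into one score. The sketch below shows the standard definition in Python; how the paper counts true positives, false positives and false negatives for lane pixels is not stated in the abstract, so the inputs and the function name `f_measure` are assumptions for this example.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """Standard F-beta score from detection counts (beta=1 gives F1).

    Returns 0.0 when precision and recall are both zero or undefined.
    """
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
score = f_measure(true_pos=8, false_pos=2, false_neg=2)
```

With equal precision and recall, as in the hypothetical counts above, the F1 score equals that shared value (0.8 here).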