Reconfigurable computing devices can increase the performance of compute-intensive algorithms by implementing application-specific co-processor architectures. The power cost for this performance gain is often an order of magnitude less than that of modern CPUs and GPUs. However, exploiting the potential of reconfigurable devices such as Field-Programmable Gate Arrays (FPGAs) is typically a complex and tedious hardware engineering task. Recently the major FPGA vendors (Altera and Xilinx) have released their own high-level design tools, which have great potential for rapid development of FPGA-based custom accelerators. In this paper, we evaluate Altera's OpenCL Software Development Kit and Xilinx's Vivado High-Level Synthesis tool. These tools are compared for their performance, logic utilisation, and ease of development for the test case of a tri-diagonal linear system solver.
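For readers unfamiliar with the test case, a tri-diagonal system can be solved serially with the Thomas algorithm. The abstract does not say which solver the authors implemented, so the following is only a minimal reference sketch; the function name `solve_tridiagonal` and its argument layout are assumptions made for illustration, and no pivoting is performed.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tri-diagonal system with the Thomas algorithm.

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    diag[i] is the main-diagonal entry of row i,
    upper[i] is the super-diagonal entry of row i (upper[n-1] is unused).
    Assumption: the matrix is diagonally dominant, so no pivoting is needed.
    """
    n = len(diag)
    c_prime = [0.0] * n  # modified super-diagonal
    d_prime = [0.0] * n  # modified right-hand side

    # Forward sweep: eliminate the sub-diagonal.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        if i < n - 1:
            c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 3x3 system [[2,-1,0],[-1,2,-1],[0,-1,2]] x = [0,0,4]
    print(solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                            [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0]))
```

The forward sweep carries a dependency from row to row, which is one reason this kind of solver is an interesting benchmark for parallel hardware tools.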
ISBN: 9781479983544 (print)
In this paper, we propose a fast parallelized implementation of face recognition based on local binary patterns (LBP) using the Open Computing Language (OpenCL), a novel open standard for heterogeneous computing. LBP and its modifications CLBP (Circle Local Binary Patterns) and ULB (Uniform Local Binary Patterns) have been implemented on a CPU and a GPU using OpenCL. This paper also addresses several optimization and parallelization problems related to the algorithms, such as LBP feature extraction and Chi-dist computation, in order to maximize the use of the resources available on the GPU. The optimizations are based on the OpenCL memory and execution models. Experimental results on an AMD GPU show that the parallel GPU implementation is about 50 times faster than its CPU counterpart.
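As a rough illustration of the two steps named in the abstract, the sketch below computes basic 3x3 LBP codes, builds a 256-bin histogram, and compares two histograms. It assumes that "Chi-dist" means the chi-square histogram distance; the function names, the neighbour ordering, and the `>=` threshold convention are illustrative choices, not details taken from the paper, and the CLBP and ULB variants are not covered.

```python
# Eight neighbours of a pixel, clockwise from the top-left corner.
# The ordering is an assumption; any fixed ordering gives a valid LBP code.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """Return the 8-bit LBP code of the interior pixel img[r][c]."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


def chi_square_distance(h1, h2):
    """Chi-square distance between two histograms; empty bins are skipped."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total


if __name__ == "__main__":
    ramp = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
    print(lbp_code(ramp, 1, 1))
    print(chi_square_distance(lbp_histogram(ramp), lbp_histogram(flat)))
```

Each pixel's code depends only on its own 3x3 neighbourhood, so the per-pixel loop maps naturally onto one OpenCL work-item per pixel.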
ISBN: 9781467385770 (print)
Modern computing platforms offer increasing levels of parallelism for fast execution of different signal processing tasks. In this paper, we develop and elaborate on a digital front-end concept for an IEEE 802.11ac receiver with 80 MHz bandwidth where parallel processing is adopted in multiple ways. First, the inherent structure of the 802.11ac waveform is exploited: through time-domain digital filtering and decimation, the waveform is divided into two parallel 40 MHz signals that can be processed further in parallel using smaller FFTs and, e.g., legacy 802.11n digital receiver chains. This filtering task is very challenging, as the latency and the cyclic prefix budget of the receiver cannot be compromised, and because the number of unused subcarriers in the middle of the 80 MHz signal is only three, which necessitates a very narrow transition bandwidth in the deployed filters. Both linear and circular filtering based multirate channelization architectures are developed and reported, together with the corresponding filter coefficient optimization. Full radio link performance simulations with commonly adopted indoor WiFi channel profiles are also provided, verifying that the channelization does not degrade the overall link performance. Then, both C and OpenCL software implementations of the processing are developed and simulated for comparison purposes on an Intel CPU, to demonstrate that the parallelism provided by OpenCL results in a substantially faster realization. Furthermore, we provide complete software implementation results in terms of time, number of clock cycles, power, and energy consumption on the ARM Mali GPU with half-precision floating-point arithmetic, along with the ARM Cortex A7 CPU.
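The idea of splitting a wideband signal into two half-band signals by filtering and decimation can be sketched generically as follows. This is not the authors' channelizer: it uses a short Hamming-windowed sinc low-pass filter, whose transition band is far too wide for the three-subcarrier gap described above, and plain linear filtering. The function names, the tap count, and the normalized frequencies (sample rate = 1) are all assumptions made for illustration.

```python
import cmath
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass FIR; cutoff is normalized to the sample rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for k in range(num_taps):
        t = k - mid
        if t == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def extract_half_band(samples, centre, taps):
    """Shift `centre` down to 0, low-pass filter, and keep every second output."""
    mixed = [s * cmath.exp(-2j * math.pi * centre * n)
             for n, s in enumerate(samples)]
    out = []
    for n in range(0, len(mixed), 2):  # decimate by 2: skip odd outputs
        acc = 0j
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * mixed[n - k]
        out.append(acc)
    return out


def energy(samples):
    return sum(abs(s) ** 2 for s in samples)


if __name__ == "__main__":
    # A complex tone at +0.3 of the sample rate lies in the upper half-band.
    tone = [cmath.exp(2j * math.pi * 0.3 * n) for n in range(256)]
    taps = lowpass_taps(31, 0.25)
    upper = extract_half_band(tone, +0.25, taps)
    lower = extract_half_band(tone, -0.25, taps)
    print(energy(upper) > energy(lower))
```

The two calls to `extract_half_band` are independent of each other, and each output sample within a branch depends only on a window of inputs, which is the kind of structure a parallel implementation can exploit.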