This paper shows how the performance of singular value decomposition (SVD) is enhanced through the exploitation of ILP, TLP, and DLP on Intel multi-core processors using superscalar execution, multi-threaded computation, and streaming SIMD extensions, respectively. To facilitate the exploitation of TLP on multiple execution cores, the well-known cyclic one-sided Jacobi algorithm is restructured to work in parallel. On two dual-core Intel Xeon processors with hyper-threading technology running at 3.0 GHz, our results show that the multi-threaded implementation of one-sided Jacobi SVD is about four times faster than the single-threaded superscalar implementation. Furthermore, the multi-threaded SIMD implementation speeds up the execution of single-threaded one-sided Jacobi by a factor of 10, which is close to the ideal speedup. For a reasonably large matrix that fits in the L2 cache, our results show that a performance of 11 GFLOPS (double precision) is achieved on the target system through the exploitation of ILP, TLP, and DLP as well as the memory hierarchy.
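The idea of restructuring cyclic one-sided Jacobi so that it can run in parallel can be illustrated with a minimal sketch. The abstract does not say which pair ordering the authors use; the round-robin (tournament) ordering below is an assumption chosen for illustration, and the function names `round_robin_rounds` and `jacobi_singular_values` are hypothetical. Within each round the column pairs are disjoint, so their rotations are independent and could be handed to separate threads; this sketch runs them sequentially in pure Python and returns the singular values only.

```python
import math


def round_robin_rounds(n):
    """Yield rounds of disjoint column pairs that together cover all pairs.

    Assumption: a round-robin (circle method) ordering; the paper's actual
    ordering is not given in the abstract.
    """
    cols = list(range(n)) + ([None] if n % 2 else [])  # pad odd n with a bye
    m = len(cols)
    for _ in range(m - 1):
        pairs = []
        for k in range(m // 2):
            a, b = cols[k], cols[m - 1 - k]
            if a is not None and b is not None:
                pairs.append((min(a, b), max(a, b)))
        yield pairs
        # Keep the first entry fixed and rotate the remaining ones.
        cols = [cols[0]] + [cols[-1]] + cols[1:-1]


def jacobi_singular_values(a, tol=1e-12, max_sweeps=50):
    """Singular values of a (list of rows) via one-sided Jacobi rotations."""
    rows, n = len(a), len(a[0])
    u = [[a[r][c] for r in range(rows)] for c in range(n)]  # column storage
    for _ in range(max_sweeps):
        rotated = False
        for pairs in round_robin_rounds(n):
            # Pairs in one round touch disjoint columns, so each rotation
            # in this inner loop is independent of the others.
            for i, j in pairs:
                alpha = sum(x * x for x in u[i])
                beta = sum(y * y for y in u[j])
                gamma = sum(x * y for x, y in zip(u[i], u[j]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns already orthogonal enough
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Rotate the two columns so that they become orthogonal.
                u[i], u[j] = (
                    [c * x - s * y for x, y in zip(u[i], u[j])],
                    [s * x + c * y for x, y in zip(u[i], u[j])],
                )
        if not rotated:
            break  # a full sweep without rotations means convergence
    return sorted((math.sqrt(sum(x * x for x in col)) for col in u), reverse=True)
```

Each rotation is dominated by three dot products and two column updates over contiguous data, which is the kind of inner loop that SIMD instructions can vectorize; that mapping is an observation about the sketch, not a description of the authors' code.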
ISBN: 9780769536057 (print)
The IBM Cell Broadband Engine (BE) is a multi-core processor with a PowerPC host processor (PPE) and 8 synergistic processor elements (SPEs). The Cell BE architecture is designed to improve upon conventional processors in terms of memory latency, bandwidth, and compute power. In this paper, we discuss the parallelization, implementation, and performance of a video surveillance application on the IBM Cell BE. We report the video surveillance application's performance measured on a computer with one Cell processor and with varying numbers of SPEs enabled. These results were compared to the results obtained on the Cell's single PPE with all 8 SPEs disabled. The results indicate that our video surveillance application performs approximately 16 times faster on the Cell BE than on modern RISC processors by processing input data from five separate surveillance video streams in parallel.
We present the development of a novel high-performance face detection system using a neural network-based classification algorithm and an efficient parallelization with OpenMP. We discuss the design of the system in detail along with an experimental assessment. Our parallelization strategy starts with one level of threads and moves to the exploitation of nested parallel regions in order to further improve the image-processing capability by up to 19%. The presented system is able to process images in real time (38 images/sec) by sustaining almost linear speedups on a system with a quad-core processor and a particular OpenMP runtime library. Copyright (C) 2009 John Wiley & Sons, Ltd.
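The move from one level of threads to nested parallel regions can be pictured with a small structural sketch. It is written in Python with `concurrent.futures` rather than OpenMP, and everything in it is an assumption made for illustration: the abstract does not say what work each level covers, so the outer level here is taken to be images and the inner level sliding windows within an image, and `classify_window` is a hypothetical threshold test standing in for the neural-network classifier.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_window(window):
    """Hypothetical stand-in for the classifier: mean intensity above 0.5."""
    return sum(window) / len(window) > 0.5


def detect_in_image(image, window=4, inner_workers=2):
    """Inner level: classify every sliding window of one (1-D) image."""
    windows = [image[k:k + window] for k in range(len(image) - window + 1)]
    # A separate inner pool per image mirrors a nested parallel region.
    with ThreadPoolExecutor(max_workers=inner_workers) as inner:
        hits = list(inner.map(classify_window, windows))
    return [k for k, hit in enumerate(hits) if hit]


def detect_in_batch(images, outer_workers=2):
    """Outer level: process several images concurrently."""
    with ThreadPoolExecutor(max_workers=outer_workers) as outer:
        return list(outer.map(detect_in_image, images))
```

Python threads share the global interpreter lock, so this sketch shows only the two-level structure and is not expected to reproduce the reported speedups.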