One of the major problems with GPU on-chip shared memory is bank conflicts. Our analysis shows that the throughput of the GPU processor core is often constrained neither by shared-memory bandwidth nor by shared-memory latency (as long as the latency stays constant), but rather by the varying latencies caused by memory bank conflicts. These varying latencies lead to conflicts at the writeback stage of the in-order pipeline and cause pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that the proposed Elastic Pipeline, together with the co-designed bank-conflict-aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for representative benchmarks, at trivial hardware overhead.
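To make the "varying latencies" concrete, the following minimal sketch models how many serialized passes one warp's shared-memory access needs under different access strides. It is an illustration only, not the paper's simulator: the constants (32 banks, 4-byte words, 32 threads per warp) are assumptions typical of NVIDIA GPUs and are not stated in the abstract, and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed hardware parameters (not taken from the abstract).
NUM_BANKS = 32   # number of shared-memory banks
WORD_BYTES = 4   # bank width in bytes
WARP_SIZE = 32   # threads issuing the access together


def conflict_degree(byte_addresses):
    """Return how many serialized passes the access needs.

    Threads that hit the same bank but different words must be
    serialized; threads reading the same word are served together.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided_access(stride_words):
    """Byte addresses touched by a warp where thread t reads word t * stride."""
    return [t * stride_words * WORD_BYTES for t in range(WARP_SIZE)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(stride, conflict_degree(strided_access(stride)))
```

Under these assumptions the same load instruction takes 1, 2, 32, or 1 passes depending only on the stride, which is the kind of data-dependent latency an in-order pipeline with a fixed writeback slot cannot plan for.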
Graphics processing units (GPUs) have been used to speed up many conventional data mining algorithms. DBSCAN, a popular clustering algorithm that is often used in practice, has been extended to execute on a GPU. However, existing GPU-based DBSCAN extensions still have two impediments: the distances from all objects must be computed repeatedly to find neighbor objects, and the objects and intermediate clustering results are stored in the costly off-chip memory of the GPU. This paper proposes CudaSCAN, a novel algorithm that improves the efficiency of DBSCAN by making better use of the GPU. CudaSCAN consists of three phases: (1) partitioning the entire dataset into sub-regions whose size is an integer multiple of the on-chip shared memory size of the GPU; (2) local clustering within sub-regions in parallel; and (3) merging the local clustering results. CudaSCAN allows an overlap between sub-regions to ensure independent, parallel local clustering in each sub-region, which in turn enables objects and/or intermediate results to be stored in on-chip shared memory, whose access cost is a few hundred times lower than that of off-chip global memory. This independence also enables the merging phase to be parallelized. The paper proves the correctness of CudaSCAN, and according to our extensive experiments, CudaSCAN outperforms CUDA-DClust, a previous GPU-based DBSCAN extension, by up to 163.6 times. (C) 2014 Elsevier Inc. All rights reserved.
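The overlapping partition in phase (1) can be sketched in one dimension as follows. This is a simplified illustration, not CudaSCAN itself: the abstract does not give the overlap width or the partitioning rule, so extending each sub-region by the DBSCAN radius `eps` on both sides is an assumption, as are the function name and the equal-width split.

```python
def partition_with_overlap(points, lo, hi, num_regions, eps):
    """Split 1-D points in [lo, hi] into equal-width sub-regions.

    Each sub-region is widened by eps on both sides (an assumed
    overlap rule), so every point that lies in the core interval
    has all of its eps-neighbors available inside the same region
    and can be processed without looking at other regions.
    """
    width = (hi - lo) / num_regions
    regions = []
    for i in range(num_regions):
        start = lo + i * width - eps
        end = lo + (i + 1) * width + eps
        regions.append([p for p in points if start <= p <= end])
    return regions


if __name__ == "__main__":
    pts = [0.0, 0.9, 1.1, 2.0]
    for region in partition_with_overlap(pts, 0.0, 2.0, 2, 0.2):
        print(region)
```

Points near a boundary (0.9 and 1.1 here) appear in both sub-regions. That duplication is what lets local clustering run independently, and it gives a merging phase shared points through which local clusters can be linked.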