We present an algorithm for general sparse matrix-matrix multiplication (SpGEMM) on many-core architectures, such as GPUs. SpGEMM is implemented by iterative row merging, similar to merge sort, except that elements with duplicate column indices are aggregated on the fly. The main kernel merges small numbers of sparse rows at once using subwarps of threads to realize an early compression effect, which reduces the overhead of global memory accesses. The performance is compared with a parallel CPU implementation as well as with three GPU-based implementations. Measurements performed for computing the matrix square of 21 sparse matrices show that the proposed method consistently outperforms the other methods. Analysis shows that this performance is achieved by exploiting the compression effect and the GPU caching architecture. Improved performance was also found for computing the Galerkin products required by algebraic multigrid solvers. The performance was particularly good for seven-point stencil matrices arising in the context of diffuse optical imaging, and the improvement allows image reconstruction to be performed at higher resolution using the same computational resources.
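The core row-merging idea can be illustrated with a short serial sketch, assuming each sparse row is stored as (column, value) pairs sorted by column index; the function name merge_rows is illustrative, and the subwarp-parallel execution of the actual kernel is not modeled here.

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

// A sparse row as (column index, value) pairs, sorted by column index.
using SparseRow = std::vector<std::pair<int, double>>;

// Merge two sorted sparse rows, summing entries that share a column index.
// This is the operation the paper's kernel applies to small groups of rows
// at once using subwarps of threads.
SparseRow merge_rows(const SparseRow& a, const SparseRow& b) {
    SparseRow out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first) {
            out.push_back(a[i++]);
        } else if (b[j].first < a[i].first) {
            out.push_back(b[j++]);
        } else {  // duplicate column index: aggregate on the fly
            out.push_back({a[i].first, a[i].second + b[j].second});
            ++i; ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}

int main() {
    // Two scaled rows of B contributing to one output row of C = A * B.
    SparseRow r1 = {{0, 1.0}, {3, 2.0}, {7, 1.5}};
    SparseRow r2 = {{3, 4.0}, {5, 1.0}};
    for (const auto& [col, val] : merge_rows(r1, r2))
        std::printf("col %d: %g\n", col, val);
    return 0;
}
```

Repeatedly merging the scaled rows of B selected by one row of A yields the corresponding row of the product, with duplicates compressed early in the process.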
In this paper, we present a new technique for displaying High Dynamic Range (HDR) images on Low Dynamic Range (LDR) displays efficiently on the GPU. The described process has three stages. First, the input image is segmented into luminance zones. Second, the tone-mapping operator (TMO) that performs best in each zone is automatically selected. Finally, the resulting tone mapping (TM) outputs for each zone are merged, generating the final LDR output image. To establish which TMO performs best in each luminance zone, we conducted a preliminary psychophysical experiment using a set of HDR images and six different TMOs. We validated our composite technique on several (new) HDR images and conducted a further psychophysical experiment, using an HDR display as the reference, which establishes the advantages of our hybrid three-stage approach over a traditional individual TMO. Finally, we present a GPU version, which is perceptually equivalent to the standard version but with much improved computational performance. (C) 2016 Published by Elsevier B.V.
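A minimal sketch of the zone-based selection step, assuming a single luminance channel, a hypothetical two-zone split, and placeholder operators (a log curve and the Reinhard global curve); the actual TMO assignment comes from the paper's psychophysical experiment, and the blending of per-zone outputs at zone boundaries is omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative per-zone operators; the paper selects real TMOs per zone
// based on a psychophysical experiment, so these are placeholders.
double tmo_log(double L)      { return std::log(1.0 + L) / std::log(1.0 + 100.0); }
double tmo_reinhard(double L) { return L / (1.0 + L); }

// Map an HDR luminance channel to LDR by segmenting into luminance zones
// and applying the operator chosen for each zone.
std::vector<double> zone_tone_map(const std::vector<double>& lum) {
    std::vector<double> out(lum.size());
    for (std::size_t i = 0; i < lum.size(); ++i) {
        double L = lum[i];
        // Hypothetical two-zone split at L = 1.0 (dark zone vs bright zone).
        double v = (L < 1.0) ? tmo_log(L) : tmo_reinhard(L);
        out[i] = std::clamp(v, 0.0, 1.0);  // LDR range [0, 1]
    }
    return out;
}
```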
This work presents a method to automatically detect and remove shadows in urban aerial images and its application in an aerospace remote monitoring system requiring near real-time processing. Our detection method generates shadow masks and is accelerated by GPU programming. To obtain the shadow masks, we convert images from RGB to the CIELCh color model, compute a modified Specthem ratio, and apply multilevel thresholding. Morphological operations are used to reduce noise in the shadow masks. The shadow masks are then used to remove shadows from the original images using the illumination ratio of the shadow/non-shadow regions. We obtain a shadow detection accuracy of around 93% and shadow removal results comparable to the state of the art while maintaining execution time under real-time constraints. (C) 2017 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
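The removal step can be sketched as follows, assuming the detection stage (CIELCh conversion, modified Specthem ratio, multilevel thresholding, and morphology) has already produced a boolean shadow mask; the single-channel intensity representation and the function name are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Relight shadowed pixels using the ratio of mean non-shadow to mean shadow
// intensity. This sketches the removal step only; the paper's detection
// stage is assumed to have produced `mask` already (true = shadow).
std::vector<double> remove_shadows(const std::vector<double>& intensity,
                                   const std::vector<bool>& mask) {
    double sum_s = 0.0, sum_n = 0.0;
    std::size_t cnt_s = 0, cnt_n = 0;
    for (std::size_t i = 0; i < intensity.size(); ++i) {
        if (mask[i]) { sum_s += intensity[i]; ++cnt_s; }
        else         { sum_n += intensity[i]; ++cnt_n; }
    }
    if (cnt_s == 0 || cnt_n == 0) return intensity;    // nothing to do
    double ratio = (sum_n / cnt_n) / (sum_s / cnt_s);  // illumination ratio

    std::vector<double> out(intensity);
    for (std::size_t i = 0; i < out.size(); ++i)
        if (mask[i]) out[i] *= ratio;                  // brighten shadow pixels
    return out;
}
```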
Graphs are common data structures for many applications, and efficient graph processing is a must for application performance. Recently, the graphics processing unit (GPU) has been adopted to accelerate various graph processing algorithms such as BFS and shortest paths. However, it is difficult to write correct and efficient GPU programs, and even more so for graph processing due to the irregularities of graph structures. To simplify graph processing on GPUs, we propose a programming framework called Medusa that enables developers to leverage the capabilities of GPUs by writing sequential C/C++ code. Medusa offers a small set of user-defined APIs and embraces a runtime system that automatically executes those APIs in parallel on the GPU. We develop a series of graph-centric optimizations based on the architectural features of GPUs for efficiency. Additionally, Medusa is extended to execute on multiple GPUs within a machine. Our experiments show that 1) Medusa greatly simplifies the implementation of GPGPU programs for graph processing, requiring far fewer lines of developer-written source code, and 2) the optimization techniques significantly improve the performance of the runtime system, making it comparable with or better than manually tuned GPU graph operations.
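A toy CPU analogue of the programming model described above: the developer supplies sequential per-vertex code and a tiny runtime applies it to every vertex. The names (Graph, for_each_vertex) and the BFS example are illustrative and do not reflect Medusa's actual API, which launches such user-defined operators as GPU kernels.

```cpp
#include <cstdio>
#include <vector>

// Per-vertex state and adjacency; the "runtime" applies a user function to
// every vertex, which is the part a GPU framework would parallelize.
struct Graph {
    std::vector<std::vector<int>> adj;   // adjacency lists
    std::vector<int> level;              // per-vertex state (e.g. BFS level)
};

template <typename F>
void for_each_vertex(Graph& g, F fn) {
    for (int v = 0; v < static_cast<int>(g.adj.size()); ++v) fn(g, v);
}

int main() {
    Graph g{{{1, 2}, {2}, {0, 3}, {}}, {0, -1, -1, -1}};  // vertex 0 is the source
    // BFS relaxation sweeps written as sequential user code.
    for (int iter = 0; iter < 3; ++iter) {
        for_each_vertex(g, [iter](Graph& gr, int v) {
            if (gr.level[v] != iter) return;
            for (int u : gr.adj[v])
                if (gr.level[u] < 0) gr.level[u] = iter + 1;
        });
    }
    for (std::size_t v = 0; v < g.level.size(); ++v)
        std::printf("vertex %zu: level %d\n", v, g.level[v]);
    return 0;
}
```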
The computing capabilities of current multi-core and many-core architectures have been used in crowd simulations both for enhancing crowd rendering and for simulating continuum crowds. However, improving the scalability of crowd simulation systems by exploiting the inherent parallelism of these architectures is still an open issue. In this paper, we propose different parallelization strategies for the collision check procedure that takes place in agent-based simulations. These strategies are designed to exploit the parallelism of both multi-core and many-core architectures such as graphics processing units (GPUs). For the many-core implementations, we analyse the bottlenecks of a previous GPU version of the collision check algorithm and propose a new GPU version that removes the detected bottlenecks. In order to fairly compare the GPU with the multi-core implementations, we propose a parallel CPU version that uses read-copy update (RCU), a new synchronization method that significantly improves performance. We perform a comparison study of these different implementations. On the one hand, the comparison study provides the first performance evaluation of RCU in a real user-space application with complex data structures. On the other hand, the comparison shows that the GPU greatly accelerates the collision test with respect to any other implementation optimized for multi-core CPUs. In addition, we analyse the efficiency of the different implementations taking into account the theoretical performance and power consumption of each platform. The evaluation results show that the GPU-based implementation consumes less energy and provides a minimum speedup of 45x with respect to any of the CPU-based implementations. Since interactivity is a hard constraint in crowd simulations, this acceleration of the collision check process represents a significant improvement in the overall system throughput and response time. Therefore, the simulations are significantly accelerated.
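A serial sketch of the collision check being parallelized, assuming circular agents and an all-pairs test with a coarse distance reject; the paper's implementations distribute this work across CPU threads (with RCU-protected data structures) or GPU threads, which is not shown here.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Agent { double x, y, radius; };

// Serial all-pairs collision check with a cheap bounding-box reject.
// In the parallel versions, each agent (or spatial region) is assigned to a
// different CPU or GPU thread; `reject_dist` is an illustrative cutoff.
std::vector<std::pair<int, int>> collision_check(const std::vector<Agent>& agents,
                                                 double reject_dist = 2.0) {
    std::vector<std::pair<int, int>> hits;
    for (std::size_t i = 0; i < agents.size(); ++i) {
        for (std::size_t j = i + 1; j < agents.size(); ++j) {
            if (std::abs(agents[i].x - agents[j].x) > reject_dist ||
                std::abs(agents[i].y - agents[j].y) > reject_dist) continue;
            double dx = agents[i].x - agents[j].x;
            double dy = agents[i].y - agents[j].y;
            double r  = agents[i].radius + agents[j].radius;
            if (dx * dx + dy * dy < r * r)
                hits.push_back({static_cast<int>(i), static_cast<int>(j)});
        }
    }
    return hits;
}
```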
Graph algorithms are challenging to implement due to their varying topology and irregular access patterns. Real-world graphs are dynamic in nature and routinely undergo edge and vertex additions as well as deletions. Typical examples of dynamic graphs are social networks, collaboration networks, and road networks. Applying static algorithms repeatedly to dynamic graphs is inefficient. Further, due to the rapid growth of unstructured and semi-structured data, graph algorithms demand efficient parallel processing. Unfortunately, little is known about how to efficiently process dynamic graphs on massively parallel architectures such as GPUs. Existing approaches to representing and processing dynamic graphs are either not general or inefficient. In this work, we propose a graph library for dynamic graph algorithms built over a GPU-tailored graph representation that exploits the warp-cooperative work-sharing execution model. The library, named Meerkat, builds upon a recently proposed dynamic graph representation on GPUs. This representation uses a hashtable-based mechanism to store a vertex's neighborhood. Meerkat also enables fast iteration through a group of vertices, a pattern that is common and crucial for achieving performance in graph applications. Our framework supports dynamic edge additions and edge deletions, along with their batched versions. Based on the efficient iterative patterns encoded in Meerkat, we implement dynamic versions of popular graph algorithms such as breadth-first search, single-source shortest paths, triangle counting, PageRank, and weakly connected components. We evaluated our implementations against those in other publicly available dynamic graph data structures and frameworks: GPMA, Hornet, and faimGraph. Using a variety of real-world graphs, we observe that Meerkat significantly improves the efficiency of the underlying dynamic graph algorithms, outperforming these frameworks.
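A CPU sketch of the representation style the abstract describes, assuming hash-based per-vertex neighborhoods with batched edge insertions and deletions; Meerkat's actual GPU data structure and its warp-cooperative iteration are not modeled.

```cpp
#include <cstdio>
#include <unordered_set>
#include <utility>
#include <vector>

// Each vertex keeps its neighborhood in a hash-based set, so single and
// batched edge updates are cheap and neighborhoods can be iterated directly.
struct DynGraph {
    std::vector<std::unordered_set<int>> nbrs;

    explicit DynGraph(int n) : nbrs(n) {}

    void add_edges(const std::vector<std::pair<int, int>>& batch) {
        for (auto [u, v] : batch) { nbrs[u].insert(v); nbrs[v].insert(u); }
    }
    void delete_edges(const std::vector<std::pair<int, int>>& batch) {
        for (auto [u, v] : batch) { nbrs[u].erase(v); nbrs[v].erase(u); }
    }
    // Neighborhood intersection, the kernel of triangle counting.
    int common_neighbors(int u, int v) const {
        int c = 0;
        for (int w : nbrs[u]) if (nbrs[v].count(w)) ++c;
        return c;
    }
};

int main() {
    DynGraph g(4);
    g.add_edges({{0, 1}, {1, 2}, {0, 2}, {2, 3}});
    std::printf("common neighbors of (0,1): %d\n", g.common_neighbors(0, 1));  // 1
    g.delete_edges({{0, 2}});
    std::printf("after deletion: %d\n", g.common_neighbors(0, 1));             // 0
    return 0;
}
```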
Use of high dynamic range (HDR) images and video in image processing and computer graphics applications is rapidly gaining popularity. However, creating and displaying high-resolution HDR content on CPUs is a time-consuming task. Although some previous work focused on real-time tone mapping, the implementation of a full HDR imaging (HDRI) pipeline on the GPU has not been detailed. In this article we aim to fill this gap by providing a detailed description of how the HDRI pipeline, from HDR image assembly to tone mapping, can be implemented exclusively on the GPU. We also explain the trade-offs that need to be made to improve efficiency and show timing comparisons of CPU versus GPU implementations of the HDRI pipeline.
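A sketch of the two endpoints of that pipeline, assuming a linear camera response, a simple hat weighting function, and a global Reinhard-style tone curve as placeholders for the operators discussed in the article.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hat weight that favors well-exposed pixel values (z in [0, 1]).
double weight(double z) { return 1.0 - std::fabs(2.0 * z - 1.0); }

// HDR assembly: combine samples of one pixel taken at different exposure
// times into a single radiance estimate (linear response assumed).
double assemble_hdr(const std::vector<double>& z, const std::vector<double>& t) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        double w = weight(z[i]);
        num += w * (z[i] / t[i]);   // radiance estimate from exposure i
        den += w;
    }
    return den > 0.0 ? num / den : 0.0;
}

// Tone mapping: simple global curve taking radiance to the displayable [0, 1] range.
double tone_map(double L) { return std::clamp(L / (1.0 + L), 0.0, 1.0); }
```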
In this work, we describe a new algorithm for rendering polygons defined by cubic Bezier curve segments on current GPUs. Unlike other approaches, our algorithm has a simple preprocessing step that does not require computing tessellations, and it can be implemented on the GPU as a geometry shader. The polygon is decomposed into a set of simplices that are individually rasterized into the stencil buffer to recreate the shape, which is finally rendered into the frame buffer. Each simplex is rasterized using a fragment shader that evaluates the implicit equation of the Bezier curve to discard the pixels that fall outside it. The proposed method is simple, fast, robust, and general, as it can handle curved polygons with holes, multiple components, or self-intersections. (C) 2008 Elsevier Ltd. All rights reserved.
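The fragment-side curve test can be sketched for the simpler quadratic case (the paper handles cubic segments, whose implicit form is more involved): the vertices of a curve triangle carry the attributes (0,0), (0.5,0), (1,1), and the interpolated (u, v) is plugged into u^2 - v to decide on which side of the curve a pixel lies. The struct and function names are illustrative.

```cpp
#include <array>

struct Vec2 { double x, y; };

// Interpolate per-vertex (u, v) attributes with barycentric weights
// (w0, w1, w2), mimicking what the rasterizer does for each fragment.
Vec2 interpolate_uv(const std::array<Vec2, 3>& uv, double w0, double w1, double w2) {
    return {w0 * uv[0].x + w1 * uv[1].x + w2 * uv[2].x,
            w0 * uv[0].y + w1 * uv[1].y + w2 * uv[2].y};
}

// Fragment test: keep the pixel only if it lies on the filled side of the
// curve, i.e. the implicit equation u^2 - v is non-positive; otherwise the
// fragment would be discarded before the stencil update.
bool inside_curve(const Vec2& uv) { return uv.x * uv.x - uv.y <= 0.0; }
```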
Real-time vehicle detection is one of the challenging problems for automotive and autonomous driving applications. Object detection using the Deformable Parts Model (DPM) has proved to be a promising approach providing high detection accuracy. However, the baseline DPM scheme spends 98% of its execution time in loop processing, highlighting its high computational cost for real-time applications. In this paper, we propose a real-time vehicle detection scheme for a low-powered embedded Graphics Processing Unit (GPU). The proposed scheme is based on the DPM approach and uses CUDA programming with different parallelization and loop unrolling schemes to reduce the computational cost of DPM. Three loop unrolling schemes, i.e. loosely unrolled, tightly unrolled, and hybrid unrolled, are proposed and implemented on two different datasets. Finally, we provide an optimal solution for vehicle detection with minimum execution time without any impact on vehicle detection accuracy. We achieve a speedup of 3x to 5x compared to a state-of-the-art GPU implementation and 30x compared to the baseline CPU implementation of DPM on a low-powered automotive-grade embedded computing platform featuring a Tegra K1 System on Chip (SoC), thus benefiting from the improved efficiency of parallel computation with CUDA.
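The effect of loop unrolling can be illustrated in plain C++ on a single filter-response dot product; the unroll factor of 4 and the function names are placeholders for the paper's loosely/tightly/hybrid unrolled CUDA loops.

```cpp
#include <cstddef>

// Baseline: straightforward loop over the filter window.
double response_rolled(const float* window, const float* filt, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += window[i] * filt[i];
    return acc;
}

// Manually unrolled by 4: fewer loop-control operations and more independent
// accumulators, the same kind of transformation applied to the DPM hot loops.
double response_unrolled4(const float* window, const float* filt, std::size_t n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += window[i]     * filt[i];
        a1 += window[i + 1] * filt[i + 1];
        a2 += window[i + 2] * filt[i + 2];
        a3 += window[i + 3] * filt[i + 3];
    }
    for (; i < n; ++i) a0 += window[i] * filt[i];   // remainder iterations
    return a0 + a1 + a2 + a3;
}
```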
Natural user interfaces (NUIs) provide human-computer interaction (HCI) with natural and intuitive operation interfaces, such as human gestures and voice. We have developed a real-time NUI engine architecture that uses a web camera as a means of implementing NUI applications. The system captures video via the web camera and performs real-time image processing using graphics processing unit (GPU) programming. This paper describes the architecture of the engine and its real-virtual environment interaction methods, such as foreground segmentation and hand gesture recognition. These methods are implemented using GPU programming in order to achieve real-time image processing for HCI. To verify the efficacy of the proposed NUI engine, we used it, together with the DirectX SDK, to develop several mixed reality games and touch-less operation applications. Our results confirm that the methods implemented by the engine operate in real time and that the interactive operations are intuitive.
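A sketch of the kind of per-pixel work such an engine offloads to the GPU: foreground segmentation by differencing the current frame against an adaptive background model. The threshold, learning rate, and single-channel representation are illustrative; the real engine performs this per pixel in GPU code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mark a pixel as foreground if it differs enough from the background model,
// and slowly adapt the background where nothing moved.
std::vector<bool> segment_foreground(const std::vector<double>& frame,
                                     std::vector<double>& background,
                                     double threshold = 0.1,
                                     double alpha = 0.05) {
    std::vector<bool> fg(frame.size());
    for (std::size_t i = 0; i < frame.size(); ++i) {
        fg[i] = std::fabs(frame[i] - background[i]) > threshold;
        if (!fg[i])
            background[i] = (1.0 - alpha) * background[i] + alpha * frame[i];
    }
    return fg;
}
```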