ISBN: 9781479955480 (print)
GPUs can accelerate the edge-scan performance of graph processing applications; however, the capacity of GPU device memory limits the size of the graphs that can be processed, and efficient techniques for handling GPU memory overflows, including overflow detection and performance analysis in large-scale systems, have not been well investigated. To address this problem, we propose a MapReduce-based out-of-core GPU memory management technique for processing large-scale graph applications on heterogeneous GPU-based supercomputers. Our technique automatically handles GPU memory overflows by dynamically dividing graph data into multiple chunks, and it overlaps CPU-GPU data transfer with computation on GPUs as much as possible. Our experimental results on TSUBAME2.5 using 1024 nodes (12288 CPU cores, 3072 GPUs) show that our GPU-based implementation runs 2.10x faster than on CPUs when the graph data does not fit on GPUs. We also study the performance characteristics of our out-of-core GPU memory management technique, including application performance and the power efficiency of scale-up and scale-out approaches.
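The chunking and overlap idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-chunk edge budget, and the use of a background thread in place of an asynchronous host-to-device copy are all assumptions made for illustration, and the "kernel" is a plain out-degree count.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(edges, max_edges_per_chunk):
    """Divide the edge list into chunks that each fit an assumed device budget."""
    return [edges[i:i + max_edges_per_chunk]
            for i in range(0, len(edges), max_edges_per_chunk)]


def transfer_to_device(chunk):
    """Hypothetical stand-in for a host-to-device copy; it only copies the list."""
    return list(chunk)


def scan_edges(device_chunk):
    """Hypothetical stand-in for a GPU kernel: out-degree of each source vertex."""
    return Counter(src for src, _dst in device_chunk)


def out_of_core_degree_count(edges, max_edges_per_chunk):
    """Process all chunks, copying chunk i+1 while chunk i is being scanned."""
    chunks = split_into_chunks(edges, max_edges_per_chunk)
    degrees = Counter()
    if not chunks:
        return degrees
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer_to_device, chunks[0])
        for next_chunk in chunks[1:] + [None]:
            current = pending.result()          # wait for the copy in flight
            if next_chunk is not None:          # start the next copy early
                pending = copier.submit(transfer_to_device, next_chunk)
            degrees.update(scan_edges(current))  # compute overlaps the copy
    return degrees


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 3)]
degrees = out_of_core_degree_count(edges, max_edges_per_chunk=2)
```

In a real GPU setting the background thread would typically be replaced by asynchronous copies on a separate stream, with two device buffers so that one can be filled while the other is read.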
ISBN: 9781424450756 (print)
Volume rendering techniques have been widely used for high-quality visualization of 3D data sets, especially in biomedical image processing. However, when rendering very large (out-of-core) volume data sets, conventional in-core volume rendering algorithms cannot run efficiently because the entire input data cannot fit in the internal memory of a computer. To solve this problem, this paper proposes an efficient out-of-core volume rendering method based on volume ray casting and GPU acceleration, together with a new out-of-core framework for visualizing large volume data sets. The new framework gives transparent and efficient access to the volume data set cached on the hard disk, while the new volume rendering method minimizes the number of times volume data is reloaded from the hard disk into internal memory and performs comparatively fast, high-quality volume rendering. The experimental results indicate that the new method and framework are effective and efficient for the visualization of out-of-core medical data sets.
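One common way to limit reloads from disk is to keep recently used sub-volumes ("bricks") in a bounded in-memory cache. The sketch below assumes a least-recently-used policy and a hypothetical `load_brick` callback; the abstract does not say which caching policy or data layout the paper uses, so this only illustrates the general idea.

```python
from collections import OrderedDict


class BrickCache:
    """Bounded in-memory cache of volume bricks with LRU eviction (assumed policy)."""

    def __init__(self, capacity, load_brick):
        self.capacity = capacity        # maximum number of bricks kept in memory
        self.load_brick = load_brick    # hypothetical callback that reads from disk
        self.bricks = OrderedDict()     # brick_id -> data, oldest entry first
        self.disk_loads = 0             # how many times the disk was touched

    def get(self, brick_id):
        if brick_id in self.bricks:
            self.bricks.move_to_end(brick_id)   # mark as most recently used
            return self.bricks[brick_id]
        data = self.load_brick(brick_id)        # cache miss: read from disk
        self.disk_loads += 1
        self.bricks[brick_id] = data
        if len(self.bricks) > self.capacity:
            self.bricks.popitem(last=False)     # evict the least recently used
        return data


cache = BrickCache(capacity=2, load_brick=lambda brick_id: f"brick{brick_id}")
for brick_id in [0, 1, 0, 2, 0, 1]:
    cache.get(brick_id)
```

Counting `disk_loads` for a given ray traversal order gives a simple way to compare brick sizes or traversal strategies before touching real data.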
ISBN: 9781581137095 (print)
Polygonal models acquired with emerging 3D scanning technology or from large-scale CAD applications easily reach sizes of several gigabytes and do not fit in the address space of common 32-bit desktop PCs. In this paper we propose an out-of-core mesh compression technique that converts such gigantic meshes into a streamable, highly compressed representation. During decompression only a small portion of the mesh needs to be kept in memory at any time. As full connectivity information is available along the decompression boundaries, this provides seamless mesh access for incremental in-core processing on gigantic meshes. Decompression speeds are CPU-limited and exceed one million vertices and two million triangles per second on a 1.8 GHz Athlon processor. A novel external memory data structure provides our compression engine with transparent access to arbitrarily large meshes. This out-of-core mesh was designed to accommodate the access pattern of our region-growing based compressor, which, in return, performs mesh queries as seldom and as locally as possible by remembering previous queries as long as needed and by adapting its traversal slightly. The achieved compression rates are state-of-the-art.
ISBN: 9781467390057 (print)
We propose an out-of-core sorting acceleration technique, called xtr2sort, that deals with multi-level memory hierarchies of device memory (GPU), host memory (CPU), and semi-external non-volatile memory (Flash NVM). It leverages the high computational performance and memory bandwidth of GPUs while offloading bandwidth-oblivious operations onto semi-external memory, significantly increasing the memory capacity available for the sort data well beyond that of the GPU as well as of the CPU. xtr2sort splits the input records into several chunks that fit in GPU device memory and overlaps (1) I/O operations between semi-external and host memory, (2) data transfers between host and device memory, and (3) sorting on the GPU device in an asynchronous manner to hide latency. Experimental results show that xtr2sort can sort record sets up to 256 times larger than is possible with in-core GPU sorting and 16 times larger than is possible with in-core CPU sorting. xtr2sort is also 4.39 times faster than out-of-core CPU sorting using 72 threads on 204.8 giga records of int32_t, even though the input records could not fit in the host memory, let alone the GPU device memory. These results indicate that I/O chunking and latency hiding/overlapping maintain sorting performance, despite slow Flash NVM performance, by utilizing GPUs along with good algorithms. Such an approach is viable for accelerating future computing systems with deep memory hierarchies.
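The general shape of chunked out-of-core sorting can be sketched as a textbook external merge sort: sort each chunk that fits in fast memory, then merge the sorted runs. This is not the xtr2sort algorithm, whose details the abstract does not give; the function names and the k-way merge step are assumptions, and in-memory lists stand in for the GPU, host, and Flash tiers.

```python
import heapq


def sort_in_chunks(records, chunk_size):
    """Sort each chunk independently; chunk_size models the fast-memory capacity."""
    runs = []
    for start in range(0, len(records), chunk_size):
        runs.append(sorted(records[start:start + chunk_size]))
    return runs


def merge_runs(runs):
    """k-way merge of sorted runs; heapq.merge reads each run lazily."""
    return list(heapq.merge(*runs))


def out_of_core_sort(records, chunk_size):
    return merge_runs(sort_in_chunks(records, chunk_size))


data = [5, 3, 9, 1, 7, 2, 8]
runs = sort_in_chunks(data, chunk_size=3)
result = out_of_core_sort(data, chunk_size=3)
```

Because `heapq.merge` consumes its inputs as iterators, the runs could equally be streamed from files, which is what keeps the peak memory of the merge phase small.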
This paper describes a general framework for out-of-core rendering and management of massive terrain surfaces. The two key components of this framework are: view-dependent refinement of the terrain mesh and a simple scheme for organizing the terrain data to improve coherence and reduce the number of paging events from external storage to main memory. Similar to several previously proposed methods for view-dependent refinement, we recursively subdivide a triangle mesh defined over regularly gridded data using longest-edge bisection. As part of this single, per-frame refinement pass, we perform triangle stripping, view frustum culling, and smooth blending of geometry using geomorphing. Meanwhile, our refinement framework supports a large class of error metrics, is highly competitive in terms of rendering performance, and is surprisingly simple to implement. Independent of our refinement algorithm, we also describe several data layout techniques for providing coherent access to the terrain data. By reordering the data in a manner that is more consistent with our recursive access pattern, we show that visualization of gigabyte-size data sets can be realized even on low-end, commodity PCs without the need for complicated and explicit data paging techniques. Rather, by virtue of dramatic improvements in multilevel cache coherence, we rely on the built-in paging mechanisms of the operating system to perform this task. The end result is a straightforward, simple-to-implement, pointerless indexing scheme that dramatically improves the data locality and paging performance over conventional matrix-based layouts.
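To illustrate why a pointerless, locality-preserving index can beat a matrix-based layout, the sketch below compares row-major indexing with Z-order (Morton) indexing. Z-order is used here only as a well-known example of such a layout; the abstract does not specify the paper's actual indexing scheme, and the function names are hypothetical.

```python
def row_major_index(x, y, width):
    """Conventional matrix layout: vertical neighbours are `width` elements apart."""
    return y * width + x


def interleave_bits(x, y, bits=16):
    """Z-order (Morton) index: interleave the bits of x (even) and y (odd)."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (2 * b)
        index |= ((y >> b) & 1) << (2 * b + 1)
    return index


# The four samples of the aligned 2x2 block at (2, 2) in a 1024-wide grid.
block = [(x, y) for y in (2, 3) for x in (2, 3)]
row_major = sorted(row_major_index(x, y, 1024) for x, y in block)
z_order = sorted(interleave_bits(x, y) for x, y in block)
```

In the row-major layout the two rows of the block land 1024 elements apart, while in Z-order the four samples occupy consecutive indices, so they are far more likely to share a cache line or a page.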
Very large triangle meshes, i.e., meshes composed of millions of faces, are becoming common in many applications. Obviously, processing, rendering, transmission, and archiving of these meshes are not simple tasks. Mesh simplification and LOD management are rather mature technologies that, in many cases, can efficiently manage complex data. However, only a few available systems can manage meshes of huge size: RAM size is often a severe bottleneck. In this paper, we present a data structure called Octree-based External Memory Mesh (OEMM). It supports external memory management of complex meshes, dynamically loading only the selected sections into main memory and preserving data consistency during local updates. The functionalities implemented on this data structure (simplification, detail preservation, mesh editing, visualization, and inspection) can be applied to huge triangle meshes on low-cost PC platforms. The time overhead due to the external memory management is affordable. Results of testing our system on complex meshes are presented.
We present a new external memory multiresolution surface representation for massive polygonal meshes. Previous methods for building such data structures have relied on resampled surface data or employed memory intensive construction algorithms that do not scale well. Our proposed representation combines efficient access to sampled surface data with access to the original surface. The construction algorithm for the surface representation exhibits memory requirements that are insensitive to the size of the input mesh, allowing it to process meshes containing hundreds of millions of polygons. The multiresolution nature of the surface representation has allowed us to develop efficient algorithms for view-dependent rendering, approximate collision detection, and adaptive simplification of massive meshes. The empirical performance of these algorithms demonstrates that the underlying data structure is a powerful and flexible tool for operating on massive geometric data.
We recently introduced an efficient multiresolution structure for distributing and rendering very large point sampled models on consumer graphics platforms [1]. The structure is based on a hierarchy of precomputed object-space point clouds that are combined coarse-to-fine at rendering time to locally adapt sample densities according to the projected size in the image. The progressive block-based refinement nature of the rendering traversal exploits on-board caching and object-based rendering APIs, hides out-of-core data access latency through speculative prefetching, and lends itself well to incorporating backface, view frustum, and occlusion culling, as well as compression and view-dependent progressive transmission. The resulting system allows rendering of complex out-of-core models at high frame rates (over 60 M rendered points/second), supports network streaming, and is fundamentally simple to implement. We demonstrate the efficiency of the approach on a number of very large models, stored on local disks or accessed through a consumer-level broadband network, including a massive 234 M sample isosurface generated by a compressible turbulence simulation and a 167 M sample model of Michelangelo's St. Matthew. Many of the details of our framework were presented in a previous study. Here we provide a more thorough exposition as well as significant new material, including the presentation of a higher quality bottom-up construction method and additional qualitative and quantitative results. (C) 2004 Elsevier Ltd. All rights reserved.
We provide experimental evidence that current desktop computers feature enough computational power to solve large-scale dense linear algebra problems. While the high computational cost of the numerical methods for solving these problems can be tackled by the multiple cores of current processors, we propose to use the disk to store the large data structures associated with these applications. Our results also show that the limited amount of RAM and the comparatively slow disk of the system pose no problem for the solution of very large dense linear systems and linear least-squares problems. Thus, current desktop computers are revealed as an appealing, cost-effective platform for research groups that have to deal with large dense linear algebra problems but have no direct access to large computing facilities.
We present Grouper: an all-in-one compact file format, random-access data structure, and streamable representation for large triangle meshes. Similarly to the recently published SQuad representation, Grouper represents the geometry and connectivity of a mesh by grouping vertices and triangles into fixed-size records, most of which store two adjacent triangles and a shared vertex. Unlike SQuad, however, Grouper interleaves geometry with connectivity and uses a new connectivity representation to ensure that vertices and triangles can be stored in a coherent order that enables memory-efficient sequential stream processing. We present a linear-time construction algorithm that allows streaming out Grouper meshes using a small memory footprint while preserving the initial ordering of vertices. As a part of this construction, we show how the problem of assigning vertices and triangles to groups reduces to a well-known NP-hard optimization problem, and present a simple yet effective heuristic solution that performs well in practice. Our array-based Grouper representation also doubles as a triangle mesh data structure that allows direct access to vertices and triangles. Storing only about two integer references per triangle (i.e., fewer than the three vertex references stored with each triangle in a conventional indexed mesh format), Grouper answers both incidence and adjacency queries in amortized constant time. Our compact representation enables data-parallel processing on multicore computers, instant partitioning and fast transmission for distributed processing, as well as efficient out-of-core access. We demonstrate the versatility and performance benefits of Grouper using a suite of example meshes and processing kernels.