ISBN (print): 9781728147345
We propose a new framework for deploying Reverse Time Migration (RTM) simulations on distributed-memory systems equipped with multiple GPUs. Our software infrastructure engine, TB-RTM, relies on the StarPU dynamic runtime system to orchestrate the asynchronous scheduling of RTM computational tasks on the underlying resources. Besides dealing with challenging hardware heterogeneity, TB-RTM supports tasks with different workload characteristics, which stress disparate components of the hardware system. RTM is challenging in that it operates intensively at both ends of the memory hierarchy: compute kernels run at the highest level of the memory system, possibly in GPU main memory, while I/O kernels save solution data to fast storage. We consider how to span the wide performance gap between these two extremes, i.e., GPU memory and fast storage, on which large-scale RTM simulations routinely execute. To maximize hardware occupancy while maintaining high memory bandwidth throughout the memory subsystem, our framework leverages the new out-of-core (OOC) feature of StarPU to prefetch solution data in and out, not only of GPU/CPU main memory but also of the fast storage system. The OOC technique opens opportunities for overlapping expensive data movement with computation. The TB-RTM framework addresses this challenging heterogeneity problem with a systematic approach that is oblivious to the targeted hardware architectures. The resulting RTM framework can effectively be deployed on massively parallel GPU-based systems, delivering performance scalability up to 500 GPUs.
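The key mechanism here is keeping compute tasks busy while solution snapshots move between fast storage and device memory. The sketch below is not TB-RTM or the StarPU API; it is a plain-Python illustration of the prefetch-and-overlap idea, with assumed file names and a placeholder imaging kernel standing in for the GPU tasks.

```python
# Minimal sketch (not TB-RTM/StarPU): overlap asynchronous prefetch of
# checkpointed wavefield snapshots from storage with compute tasks.
import concurrent.futures as cf
import numpy as np
import tempfile, os

def load_snapshot(path):
    # I/O task: read a previously checkpointed wavefield from fast storage.
    return np.load(path)

def correlate(src_wavefield, rcv_wavefield):
    # Compute task: placeholder imaging condition (stand-in for the GPU kernel).
    return src_wavefield * rcv_wavefield

tmpdir = tempfile.mkdtemp()
paths = []
for step in range(8):                                   # fake checkpoints on "fast storage"
    p = os.path.join(tmpdir, f"snap_{step}.npy")
    np.save(p, np.random.rand(256, 256))
    paths.append(p)

image = np.zeros((256, 256))
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    pending = pool.submit(load_snapshot, paths[0])      # prefetch the first snapshot
    for step in range(len(paths)):
        snap = pending.result()                         # waits only if I/O lags behind
        if step + 1 < len(paths):
            pending = pool.submit(load_snapshot, paths[step + 1])  # prefetch the next one
        image += correlate(snap, np.random.rand(256, 256))         # compute overlaps the I/O
print("accumulated image checksum:", image.sum())
```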
ISBN (print): 9781450323789
In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance including the cost of I/O and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical sort benchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the I/O latency. This latency hiding enables us to achieve comparable execution times, including the additional temporary I/O required, between a large sort problem (5TB) run as a single, in-RAM sort and our out-of-core approach using 1/10th the amount of RAM. In our largest run, sorting 100TB of records using 1792 hosts, we achieved an end-to-end throughput of 1.24TB/min using our general-purpose sorter, improving on the current Daytona record holder by 65%.
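A minimal sketch of the out-of-core sort structure assumed above: records that exceed RAM are sorted as fixed-size runs, spilled to storage, and then k-way merged. The asynchronous transfer machinery that hides sorting behind I/O is omitted for brevity; the run size, file layout, and data types are placeholders, not the paper's implementation.

```python
# External merge sort sketch: sort RAM-sized runs, spill them, then stream a
# k-way merge so only one record per run needs to be resident at a time.
import heapq, os, tempfile
import numpy as np

def external_sort(values, run_size, workdir):
    # Phase 1: sort runs that fit in memory and spill them to storage.
    run_paths = []
    for start in range(0, len(values), run_size):
        run = np.sort(values[start:start + run_size])
        path = os.path.join(workdir, f"run_{len(run_paths)}.npy")
        np.save(path, run)
        run_paths.append(path)
    # Phase 2: k-way merge of the sorted runs (streamed, heap-based).
    iterators = [iter(np.load(p)) for p in run_paths]
    return np.fromiter(heapq.merge(*iterators), dtype=values.dtype)

workdir = tempfile.mkdtemp()
data = np.random.randint(0, 1 << 30, size=1_000_000)
result = external_sort(data, run_size=100_000, workdir=workdir)
print("sorted:", bool(np.all(np.diff(result) >= 0)))
```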
ISBN (print): 9781605581064
Improving the quality of tetrahedral meshes is an important operation in many scientific computing applications. Meshes with badly shaped elements impact both the accuracy and convergence of scientific applications. State-of-the-art mesh improvement techniques rely on sophisticated numerical optimization methods such as feasible Newton or conjugate gradient. Unfortunately, these methods cannot be practically applied to very large meshes due to their global nature. Our contribution in this paper is to describe a streaming framework for tetrahedral mesh optimization. This framework enables the optimization of meshes an order of magnitude larger than previously feasible, effectively optimizing meshes too large to fit in memory. Our results show that streaming is typically faster than global optimization and results in comparable mesh quality. This leads us to conclude that streaming extends mesh optimization to a new class of mesh sizes without compromising the quality of the optimized mesh.
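The essence of the streaming approach is that only a bounded window of the mesh is ever resident while a local improvement pass runs on it. The sketch below uses a simple Laplacian-style smoothing pass and synthetic vertex data as stand-ins for the paper's numerical optimization; it illustrates the windowed processing pattern, not the authors' framework.

```python
# Streaming sketch: elements arrive in stream order, only a fixed-size window
# is resident, and a local improvement pass runs before the window is evicted.
import numpy as np

def smooth_block(block, passes):
    # Local "optimization" within the resident window only (Laplacian stand-in).
    for _ in range(passes):
        block[1:-1] = 0.5 * block[1:-1] + 0.25 * (block[:-2] + block[2:])
    return block

def stream_optimize(vertex_stream, window=1024, passes=3):
    """Yield optimized vertex blocks while holding at most `window` vertices."""
    buffer = []
    for v in vertex_stream:
        buffer.append(v)
        if len(buffer) == window:
            yield smooth_block(np.array(buffer), passes)
            buffer = []                         # evict the finalized window
    if buffer:
        yield smooth_block(np.array(buffer), passes)

# Usage: stream 100,000 synthetic 3-D vertex positions without holding them all.
stream = (np.random.rand(3) for _ in range(100_000))
total = sum(len(block) for block in stream_optimize(stream))
print("vertices processed:", total)
```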
ISBN (print): 9781479952151
Streamline tracing is an important tool used in many scientific domains for visualizing and analyzing flow fields. In this work, we examine a shared-memory multi-threaded approach to streamline tracing that targets emerging data-intensive architectures. We take an in-depth look at data management strategies for streamline tracing in terms of issues such as memory latency, bandwidth, and capacity limitations that are applicable to future HPC platforms. We present two data management strategies for streamline tracing and evaluate their effectiveness for data-intensive architectures with locally attached Flash. We provide a comprehensive evaluation of both strategies by examining the strong and weak scaling implications of a variety of parameters. We also characterize the relationship between I/O concurrency and I/O efficiency to guide the selection of strategy based on use case. From our experiments, we find that using kernel-managed memory mapping for out-of-core streamline tracing can outperform an optimized user-managed cache.
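As a rough illustration of the kernel-managed strategy, the sketch below stores a synthetic vector field in a file and traces a streamline through a memory map, so the operating system pages in only the blocks that are actually touched. The field dimensions, the Euler integration, and the seed point are assumptions, not the paper's setup.

```python
# Out-of-core streamline tracing sketch: the vector field lives on disk and is
# accessed through a memory map; residency is managed by the kernel.
import numpy as np
import tempfile, os

# Build a synthetic 2-D rotational vector field on disk.
path = os.path.join(tempfile.mkdtemp(), "field.dat")
ny, nx = 512, 512
field = np.memmap(path, dtype=np.float32, mode="w+", shape=(ny, nx, 2))
ys, xs = np.mgrid[0:ny, 0:nx]
field[..., 0] = -(ys - ny / 2)          # u component
field[..., 1] = (xs - nx / 2)           # v component
field.flush()

# Re-open read-only: the OS, not the application, decides what is resident.
field = np.memmap(path, dtype=np.float32, mode="r", shape=(ny, nx, 2))

def trace(seed, steps=500, h=1e-3):
    """Euler integration; each lookup touches only the pages it needs."""
    p = np.array(seed, dtype=np.float64)
    line = [p.copy()]
    for _ in range(steps):
        i = int(np.clip(p[1], 0, ny - 1))
        j = int(np.clip(p[0], 0, nx - 1))
        v = field[i, j]                  # out-of-core access via the memory map
        p += h * v.astype(np.float64)
        line.append(p.copy())
    return np.array(line)

print("streamline points:", trace((300.0, 200.0)).shape)
```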
The real-time display of huge geometry and imagery databases involves view-dependent approximations, typically through the use of precomputed hierarchies that are selectively refined at runtime. A classic motivating problem is terrain visualization, in which planetary databases involving billions of elevation and color values are displayed on PC graphics hardware at high frame rates. This paper introduces a new diamond data structure for the basic selective-refinement processing, which is a streamlined method of representing the well-known hierarchies of right triangles that have enjoyed much success in real-time, view-dependent terrain display. Regular-grid tiles are proposed as the payload data per diamond for both geometry and texture. The use of 4-8 grid refinement and coarsening schemes allows level-of-detail transitions that are twice as gradual as traditional quadtree-based hierarchies, as well as very high-quality low-pass filtering compared to subsampling-based hierarchies. An out-of-core storage organization is introduced based on Sierpinski indices per diamond, along with a tile preprocessing framework based on fine-to-coarse, same-level, and coarse-to-fine gathering operations. To attain optimal frame-to-frame coherence and processing-order priorities, dual split and merge queues are developed similar to the Real-time Optimally Adapting Meshes (ROAM) algorithm, as well as an adaptation of the ROAM frustum culling technique. Example applications of lake detection and procedural terrain generation demonstrate the flexibility of the tile processing framework.
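The hierarchies of right triangles underlying the diamond structure are generated by longest-edge bisection, sketched below on placeholder geometry; this illustrates the refinement rule only, not the paper's diamond data structure, Sierpinski indexing, or split/merge queues.

```python
# Longest-edge bisection sketch: each split halves a right triangle across its
# hypotenuse, so two refinement steps quadruple the triangle count (4-8 style).
import numpy as np

def bisect(tri):
    """Split a right triangle (apex, left, right) across its longest edge."""
    apex, left, right = tri
    mid = 0.5 * (left + right)            # midpoint of the hypotenuse
    return (mid, apex, left), (mid, right, apex)

def refine(tris, levels):
    for _ in range(levels):
        tris = [child for t in tris for child in bisect(t)]
    return tris

root = (np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0]))
leaves = refine([root], levels=4)
print("triangles after 4 levels:", len(leaves))   # 2**4 = 16
```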
This dissertation addresses several performance optimization issues in the context of the Tensor Contraction Engine (TCE), a domain-specific compiler to synthesize parallel, out-of-core programs for a class of scientific computations encountered in computational chemistry and physics. The domain of our focus is electronic structure calculations, where many computationally intensive components are expressible as a set of tensor contractions. These scientific applications are extremely compute-intensive and consume significant computer resources at national supercomputer centers. The manual development of high-performance parallel programs for them is usually very tedious and time consuming. The TCE system is targeted at reducing the burden on application scientists, by having them specify computations in a high-level form, from which efficient parallel programs are automatically synthesized.

The goal of this research is to develop an optimization framework to derive high-performance implementations for a set of given tensor contractions. In particular, the issues investigated include: (1) Development of an efficient in-memory parallel algorithm for a tensor contraction: A tensor contraction is essentially a generalized matrix multiplication involving multi-dimensional arrays. A novel parallel tensor contraction algorithm is developed by extending Cannon's memory-efficient parallel matrix multiplication algorithm. (2) Design of a performance-model driven framework for a parallel out-of-core tensor contraction: For a parallel out-of-core tensor contraction, besides the in-core parallel algorithm used, several other factors can affect the overall performance, such as the nested-loop structure (permutation), tile size selection, disk I/O placement and the data partitioning pattern. The best choice here depends on the characteristics of the target machine and the input data. We develop performance models for different parallel out-of-core alternatives and use p
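A minimal sketch of the out-of-core tiling pattern discussed in item (2), using a plain matrix product C[i,j] = sum_k A[i,k]·B[k,j] as the contraction: the operand tiles live on disk and only the active tiles are resident. Tile size, loop order, and file layout are placeholders; choosing them well is exactly what the dissertation's performance models are meant to do. This is not TCE-generated code.

```python
# Out-of-core tiled contraction sketch: stage operands as tiles on disk, then
# contract tile by tile with only three tiles resident at a time.
import numpy as np
import tempfile, os

N, T = 512, 128                                  # problem size and tile size
work = tempfile.mkdtemp()

def tile_path(name, bi, bj):
    return os.path.join(work, f"{name}_{bi}_{bj}.npy")

A = np.random.rand(N, N); B = np.random.rand(N, N)
for bi in range(N // T):
    for bj in range(N // T):
        np.save(tile_path("A", bi, bj), A[bi*T:(bi+1)*T, bj*T:(bj+1)*T])
        np.save(tile_path("B", bi, bj), B[bi*T:(bi+1)*T, bj*T:(bj+1)*T])

C = np.zeros((N, N))
for bi in range(N // T):
    for bj in range(N // T):
        acc = np.zeros((T, T))
        for bk in range(N // T):
            a = np.load(tile_path("A", bi, bk))   # tile read (disk I/O)
            b = np.load(tile_path("B", bk, bj))
            acc += a @ b                          # in-core contraction kernel
        C[bi*T:(bi+1)*T, bj*T:(bj+1)*T] = acc

print("max error vs. in-core result:", np.abs(C - A @ B).max())
```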
We describe an efficient technique for out-of-core construction and accurate view-dependent visualization of very large surface models. The method uses a regular conformal hierarchy of tetrahedra to spatially partition the model. Each tetrahedral cell contains a precomputed simplified version of the original model, represented using cache-coherent indexed strips for fast rendering. The representation is constructed during a fine-to-coarse simplification of the surface contained in diamonds (sets of tetrahedral cells sharing their longest edge). The construction preprocess operates out-of-core and parallelizes nicely. Appropriate boundary constraints are introduced in the simplification to ensure that all conforming selective subdivisions of the tetrahedron hierarchy lead to correctly matching surface patches. For each frame at runtime, the hierarchy is traversed coarse-to-fine to select diamonds of the appropriate resolution given the view parameters. The resulting system can interactively render high-quality views of out-of-core models of hundreds of millions of triangles at over 40 Hz (or 70M triangles/s) on current commodity graphics platforms.
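The per-frame selection can be pictured as a coarse-to-fine descent that stops once a cell's projected error is small enough for the current viewpoint. The sketch below uses a placeholder hierarchy and a simple error-over-distance test; it is not the paper's tetrahedral hierarchy or its exact refinement criterion.

```python
# Coarse-to-fine view-dependent selection sketch over a synthetic hierarchy.
import numpy as np

class Cell:
    def __init__(self, center, size, error, children=()):
        self.center, self.size, self.error, self.children = center, size, error, children

def select(cell, eye, tolerance, selected):
    distance = max(np.linalg.norm(cell.center - eye), 1e-6)
    if cell.children and cell.error / distance > tolerance:
        for child in cell.children:          # refine: descend to finer cells
            select(child, eye, tolerance, selected)
    else:
        selected.append(cell)                # render this cell's precomputed patch
    return selected

def build(center, size, depth):
    if depth == 0:
        return Cell(center, size, error=0.0)
    offsets = size / 4 * np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]])
    kids = tuple(build(center + o, size / 2, depth - 1) for o in offsets)
    return Cell(center, size, error=size, children=kids)

root = build(np.zeros(3), size=100.0, depth=5)
chosen = select(root, eye=np.array([120.0, 0.0, 0.0]), tolerance=0.05, selected=[])
print("cells selected for this view:", len(chosen))
```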
Polygonal models acquired with emerging 3D scanning technology or from large-scale CAD applications easily reach sizes of several gigabytes and do not fit in the address space of common 32-bit desktop PCs. In this paper we propose an out-of-core mesh compression technique that converts such gigantic meshes into a streamable, highly compressed representation. During decompression only a small portion of the mesh needs to be kept in memory at any time. As full connectivity information is available along the decompression boundaries, this provides seamless mesh access for incremental in-core processing on gigantic meshes. Decompression speeds are CPU-limited and exceed one million vertices and two million triangles per second on a 1.8 GHz Athlon processor. A novel external-memory data structure provides our compression engine with transparent access to arbitrarily large meshes. This out-of-core mesh was designed to accommodate the access pattern of our region-growing based compressor, which, in return, performs mesh queries as seldom and as locally as possible by remembering previous queries as long as needed and by adapting its traversal slightly. The achieved compression rates are state-of-the-art.
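The incremental in-core processing enabled by such a streamable representation can be sketched as follows: triangles arrive in stream order, each vertex is kept only while triangles still reference it, and it is evicted once finalized, so only the decompression boundary stays resident. The mesh, reference counts, and eviction rule below are synthetic illustrations, not the paper's codec.

```python
# Streaming-mesh processing sketch: bounded memory via reference-counted
# vertices that are evicted when their last incident triangle has been seen.
import numpy as np

def streaming_process(triangles, use_count):
    resident = {}                       # vertex id -> position (the "boundary")
    remaining = dict(use_count)         # incident triangles not yet seen
    peak = 0
    for tri in triangles:
        for v in tri:
            if v not in resident:       # vertex enters the decompression boundary
                resident[v] = np.random.rand(3)
            remaining[v] -= 1
        # ... process the triangle here with all three vertices in memory ...
        for v in tri:
            if remaining[v] == 0:       # finalized: evict from the boundary
                del resident[v]
        peak = max(peak, len(resident))
    return peak

# Synthetic "gigantic" mesh: a long triangle strip over 100,000 vertices.
n = 100_000
tris = [(i, i + 1, i + 2) for i in range(n - 2)]
counts = {}
for t in tris:
    for v in t:
        counts[v] = counts.get(v, 0) + 1
print("peak resident vertices:", streaming_process(tris, counts))
```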
ISBN (print): 9781479983407
Volume rendering techniques have been widely used for high-quality visualization of 3D datasets, especially 3D images. However, when rendering very large (out-of-core) datasets, traditional in-core volume rendering algorithms do not work, because the entire input cannot fit in a computer's main memory, and their simple out-of-core versions perform poorly because of the overhead of slow external-memory accesses. To solve this problem, we propose a semi-adaptive partitioning strategy and an efficient out-of-core volume rendering method based on it. With this partitioning strategy, the out-of-core dataset is divided into small sub-blocks of different sizes, which are organized by a BSP tree. Each sub-block can be loaded into the fast texture memory of the graphics hardware and rendered by a 3D-texture-based volume rendering algorithm. The final result is then obtained by compositing the projection images of all sub-blocks from back to front after traversing the BSP tree according to the viewpoint position. The experimental results indicate that the new method is effective and efficient for volume visualization of out-of-core 3D images.
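The traversal-and-compositing step can be sketched as a back-to-front walk of the BSP tree with respect to the viewpoint, blending each sub-block's projection with the "over" operator. The tree, the per-block images, and the split planes below are placeholders, not the paper's renderer or its 3D-texture path.

```python
# Back-to-front BSP compositing sketch with placeholder sub-block images.
import numpy as np

class Node:
    def __init__(self, axis=None, plane=None, back=None, front=None, image=None):
        self.axis, self.plane, self.back, self.front, self.image = axis, plane, back, front, image

def back_to_front(node, eye, out):
    if node.image is not None:                    # leaf: one renderable sub-block
        color, alpha = node.image
        out[:] = color * alpha + (1.0 - alpha) * out     # "over" compositing
        return
    # Visit the child on the far side of the splitting plane first.
    near, far = (node.front, node.back) if eye[node.axis] >= node.plane else (node.back, node.front)
    back_to_front(far, eye, out)
    back_to_front(near, eye, out)

def leaf():
    # Placeholder "projection image" of a sub-block: (RGB, alpha).
    return Node(image=(np.random.rand(64, 64, 3), np.random.rand(64, 64, 1) * 0.5))

# Two-level BSP tree over a toy volume split at x=0.5 and y=0.5.
tree = Node(axis=0, plane=0.5,
            back=Node(axis=1, plane=0.5, back=leaf(), front=leaf()),
            front=Node(axis=1, plane=0.5, back=leaf(), front=leaf()))
framebuffer = np.zeros((64, 64, 3))
back_to_front(tree, eye=np.array([2.0, 0.3, 1.0]), out=framebuffer)
print("framebuffer mean:", framebuffer.mean())
```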
ISBN (print): 9781581136456
We present a method for end-to-end out-of-core simplification and view-dependent visualization of large surfaces. The method consists of three phases: (1) memory-insensitive simplification; (2) memory-insensitive construction of a multiresolution hierarchy; and (3) run-time, output-sensitive, view-dependent rendering and navigation of the mesh. The first two off-line phases are performed entirely on disk and use only a small, constant amount of memory, whereas the run-time system pages in only the rendered parts of the mesh in a cache-coherent manner. As a result, we are able to process and visualize arbitrarily large meshes given a sufficient amount of disk space; a constant multiple of the size of the input mesh.

Similar to recent work on out-of-core simplification, our memory-insensitive method uses vertex clustering on a rectilinear octree grid to coarsen and create a hierarchy for the mesh, and a quadric error metric to choose vertex positions at all levels of resolution. We show how the quadric information can be used to concisely represent vertex position, surface normal, error, and curvature information for anisotropic view-dependent coarsening and silhouette preservation.

The run-time component of our system uses asynchronous rendering and view-dependent refinement driven by screen-space error and visibility. The system exploits frame-to-frame coherence and has been designed to allow preemptive refinement at the granularity of individual vertices to support refinement on a time budget.

Our results indicate a significant improvement in processing speed over previous methods for out-of-core multiresolution surface construction. Meanwhile, all phases of the method are disk and memory efficient, and are fairly straightforward to implement.
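The clustering-plus-quadrics construction can be sketched as binning vertices on a uniform grid, accumulating a plane quadric per occupied cell, and placing each cell's representative at the quadric minimizer (falling back to the cell centroid). The synthetic mesh, the regularization, and the single-level grid below are assumptions; the paper builds a full octree hierarchy out of core.

```python
# Vertex clustering with plane quadrics (single grid level, in-core sketch).
import numpy as np

def plane_quadric(p0, p1, p2):
    n = np.cross(p1 - p0, p2 - p0)
    norm = np.linalg.norm(n)
    if norm < 1e-12:
        return np.zeros((3, 3)), np.zeros(3)
    n /= norm
    d = -n @ p0
    return np.outer(n, n), d * n              # (A, b) of the quadric n n^T, d n

def cluster_simplify(vertices, triangles, cell_size):
    keys = tuple(map(tuple, np.floor(vertices / cell_size).astype(int)))
    A, b, centroid, count = {}, {}, {}, {}
    for tri in triangles:
        qa, qb = plane_quadric(*vertices[list(tri)])
        for v in tri:
            k = keys[v]
            A[k] = A.get(k, np.zeros((3, 3))) + qa
            b[k] = b.get(k, np.zeros(3)) + qb
            centroid[k] = centroid.get(k, np.zeros(3)) + vertices[v]
            count[k] = count.get(k, 0) + 1
    reps = {}
    for k in A:
        fallback = centroid[k] / count[k]
        rep = np.linalg.solve(A[k] + 1e-6 * np.eye(3), -b[k])   # quadric minimizer
        reps[k] = fallback if np.linalg.norm(rep - fallback) > cell_size else rep
    return reps

verts = np.random.rand(10_000, 3)
tris = np.random.randint(0, len(verts), size=(20_000, 3))
print("representative vertices:", len(cluster_simplify(verts, tris, cell_size=0.1)))
```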