ISBN (print): 9781479952151
Streamline tracing is an important tool used in many scientific domains for visualizing and analyzing flow fields. In this work, we examine a shared-memory multi-threaded approach to streamline tracing that targets emerging data-intensive architectures. We take an in-depth look at data management strategies for streamline tracing in terms of issues such as memory latency, bandwidth, and capacity limitations that apply to future HPC platforms. We present two data management strategies for streamline tracing and evaluate their effectiveness on data-intensive architectures with locally attached flash storage. We provide a comprehensive evaluation of both strategies by examining the strong- and weak-scaling implications of a variety of parameters. We also characterize the relationship between I/O concurrency and I/O efficiency to guide the selection of a strategy based on the use case. From our experiments, we find that using a kernel-managed memory map for out-of-core streamline tracing can outperform an optimized user-managed cache.
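The kernel-managed memory-map strategy the abstract favors can be sketched in a few lines: the vector field stays on flash, the tracer simply indexes a memory-mapped array, and the OS page cache decides what is resident. A minimal Python/NumPy illustration, assuming a hypothetical raw float32 field file and nearest-grid-point sampling (the paper's actual data layout and interpolation scheme are not specified here):

    import numpy as np

    NZ, NY, NX = 512, 512, 512                    # hypothetical grid size
    field = np.memmap("field.raw", dtype=np.float32,
                      mode="r", shape=(NZ, NY, NX, 3))

    def trace(seed, steps=1000, dt=0.1):
        """Forward-Euler integration; page faults pull blocks in on demand."""
        path = [np.asarray(seed, dtype=np.float64)]
        for _ in range(steps):
            i, j, k = (int(c) for c in path[-1])  # floor to a grid sample
            if not (0 <= i < NZ and 0 <= j < NY and 0 <= k < NX):
                break                             # streamline left the domain
            v = field[i, j, k]                    # the OS pages in this block
            path.append(path[-1] + dt * v)
        return np.array(path)

    streamline = trace(seed=(256.0, 256.0, 256.0))

A user-managed cache would replace the direct `field[i, j, k]` access with explicit block lookup, load, and eviction logic; the paper's finding is that the kernel's paging can beat that hand-tuned machinery.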
ISBN (print): 9781450323789
In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance, including the cost of I/O, and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical sort benchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the I/O latency. This latency hiding enables us to achieve comparable execution times, including the additional temporary I/O required, between a large sort problem (5TB) run as a single, in-RAM sort and our out-of-core approach using 1/10th the amount of RAM. In our largest run, sorting 100TB of records using 1792 hosts, we achieved an end-to-end throughput of 1.24TB/min using our general-purpose sorter, improving on the current Daytona record holder by 65%.
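The latency-hiding formulation can be illustrated with a toy run-generation loop: while one chunk is sorted in RAM, the next chunk's read and the previous run's write proceed on background threads. This is only a sketch of the overlap idea, not the paper's sorter; the merge of the sorted runs is omitted, and the record type, chunk size, and file names are invented:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 1 << 24                      # records per sorted run (hypothetical)

    def read_chunk(src, i):
        return np.fromfile(src, dtype=np.uint64,
                           count=CHUNK, offset=i * CHUNK * 8)

    def write_run(arr, i):
        arr.tofile(f"run-{i:04d}.bin")

    def make_runs(src, nchunks):
        with ThreadPoolExecutor(max_workers=2) as pool:
            nxt = pool.submit(read_chunk, src, 0)      # prefetch first chunk
            pending_write = None
            for i in range(nchunks):
                chunk = nxt.result()
                if i + 1 < nchunks:                    # overlap the next read
                    nxt = pool.submit(read_chunk, src, i + 1)
                chunk.sort()                           # compute while I/O runs
                if pending_write is not None:
                    pending_write.result()             # one write in flight
                pending_write = pool.submit(write_run, chunk, i)
            if pending_write is not None:
                pending_write.result()

With the read, sort, and write stages overlapped this way, the wall-clock time per run approaches the maximum of the three stage times rather than their sum, which is the effect the abstract reports.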
We provide experimental evidence that current desktop computers feature enough computational power to solve large-scale dense linear algebra problems. While the high computational cost of the numerical methods for solving these problems can be tackled by the multiple cores of current processors, we propose to use the disk to store the large data structures associated with these applications. Our results also show that the limited amount of RAM and the comparatively slow disk of the system pose no problem for the solution of very large dense linear systems and linear least-squares problems. Thus, current desktop computers are revealed as an appealing, cost-effective platform for research groups that have to deal with large dense linear algebra problems but have no direct access to large computing facilities.
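The basic pattern behind such out-of-core dense computations is to stream panels of a disk-resident matrix through RAM one at a time. A minimal sketch, assuming a hypothetical file A.bin and invented sizes (a real solver would tile the factorization itself, not just a multiply):

    import numpy as np

    m, k, n, PANEL = 100_000, 10_000, 256, 1_000   # hypothetical sizes
    A = np.memmap("A.bin", dtype=np.float64, mode="r", shape=(m, k))
    B = np.random.rand(k, n)                       # in-RAM operand
    C = np.zeros((m, n))

    for r0 in range(0, m, PANEL):
        r1 = min(r0 + PANEL, m)
        C[r0:r1] = A[r0:r1] @ B                    # one panel resident at a time

Here A occupies 8 GB on disk while only one 80 MB panel is ever in memory, which is why a desktop with modest RAM can still process matrices of this scale, provided disk traffic is amortized over enough arithmetic per panel.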
We address the problem of efficient out-of-core code generation for a special class of imperfectly nested loops encoding tensor contractions arising in quantum chemistry computations. These loops operate on arrays too large to fit in physical memory. The problem involves determining optimal tiling of loops and placement of disk I/O statements. This entails a search in an explosively large parameter space. We formulate the problem as a nonlinear optimization problem and use a discrete constraint solver to generate optimized out-of-core code. The solution generated using the discrete constraint solver consistently outperforms other approaches by up to a factor of four. Measurements on sequential and parallel versions of the generated code demonstrate the effectiveness of the approach. (c) 2005 Published by Elsevier Inc.
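The search space the constraint solver explores can be made concrete with a toy tiled contraction: the tile sizes and the loop level at which each disk read is placed are the free parameters. The tile size and the load/store helpers below are hypothetical stand-ins, and the dimensions are assumed to be multiples of the tile edge:

    import numpy as np

    T = 256   # tile edge; one of the tunables the solver searches over

    def contract(load_A, load_B, store_C, NI, NJ, NK):
        """C[i,j] = sum_k A[i,k] * B[k,j], tiled out-of-core."""
        for i in range(0, NI, T):
            for j in range(0, NJ, T):
                c = np.zeros((T, T))
                for k in range(0, NK, T):
                    a = load_A(i, k)      # disk read placed inside the k-loop
                    b = load_B(k, j)      # (an alternative placement would
                    c += a @ b            #  hoist the A read above the j-loop)
                store_C(i, j, c)          # each C tile is written exactly once

Each legal combination of loop permutation, tile sizes, and read/write placement gives a different disk traffic volume under the memory-capacity constraint; the paper's contribution is finding a near-optimal point in that explosively large space with a discrete constraint solver rather than by enumeration.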
ISBN (print): 9781424450756
Volume rendering techniques have been used widely for high-quality visualization of 3D data sets, especially in the field of biomedical image processing. However, when rendering very large (out-of-core) volume data sets, conventional in-core volume rendering algorithms cannot run efficiently because the entire input cannot fit in a computer's internal memory. To solve this problem, this paper proposes an efficient out-of-core volume rendering method based on volume ray casting and GPU acceleration, together with a new out-of-core framework for visualizing large volume data sets. The new framework gives transparent and efficient access to the volume data cached on the hard disk, while the new volume rendering method minimizes the number of times volume data is reloaded from the hard disk into internal memory and performs comparatively fast, high-quality volume rendering. The experimental results indicate that the new method and framework are effective and efficient for visualizing out-of-core medical data sets.
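The "transparent access" layer such a framework provides typically amounts to a brick cache: rays ask for voxels, bricks are faulted in from disk on demand, and least-recently-used bricks are evicted to bound memory, which is what minimizes reloads. A generic sketch, not the paper's code (brick size, file layout, and cache capacity are invented, and the GPU path is not modeled):

    import numpy as np
    from collections import OrderedDict

    BRICK = 64                                   # brick edge in voxels

    class BrickCache:
        def __init__(self, path, shape, capacity=256):
            self.vol = np.memmap(path, dtype=np.uint8, mode="r", shape=shape)
            self.cache = OrderedDict()           # insertion order = LRU order
            self.capacity = capacity

        def voxel(self, z, y, x):
            key = (z // BRICK, y // BRICK, x // BRICK)
            if key not in self.cache:
                bz, by, bx = (c * BRICK for c in key)
                self.cache[key] = np.array(      # copy the brick into RAM
                    self.vol[bz:bz+BRICK, by:by+BRICK, bx:bx+BRICK])
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)   # evict least recently used
            else:
                self.cache.move_to_end(key)          # mark as recently used
            b = self.cache[key]
            return b[z % BRICK, y % BRICK, x % BRICK]

A ray caster built on top of this sees an ordinary voxel lookup; because rays through neighboring pixels touch the same bricks, the cache converts spatial coherence directly into fewer disk reloads.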
We compare, in the same framework, out-of-core implementations of the Cholesky factorization algorithm. The candidate implementations are the classical blocked left-looking variant and a more recent recursive formulation. Both have been implemented for real positive-definite matrices: the former in the Parallel Out-of-Core Linear Algebra Package (POOCLAPACK) library and the latter in the Scalable Out-of-core Linear Algebra computations (SOLAR) library. We perform a theoretical analysis of the amount of input/output (I/O) required by each variant. We consider two alternatives for the left-looking algorithm: the one-tile and two-tile approaches. We show that when main memory is restricted, the one-tile approach yields less I/O volume. We then show that the left-looking implementation requires less I/O volume than the recursive variant. We have implemented all variants for complex matrices, and we report on numerical experiments.
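For reference, the blocked left-looking variant under comparison looks as follows when written in-RAM; an out-of-core version replaces each tile access with a disk read or write, and the one-tile versus two-tile distinction concerns how many update tiles are kept resident at once. A generic sketch, not the POOCLAPACK code:

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def left_looking_cholesky(A, T=256):
        """In-place blocked left-looking Cholesky; returns the lower factor L."""
        n = A.shape[0]
        for j in range(0, n, T):
            je = min(j + T, n)
            # "look left": apply updates from all previously factored panels
            for k in range(0, j, T):
                ke = min(k + T, n)
                A[j:n, j:je] -= A[j:n, k:ke] @ A[j:je, k:ke].T
            # factor the diagonal tile, then solve for the tiles below it
            A[j:je, j:je] = cholesky(A[j:je, j:je], lower=True)
            if je < n:
                A[je:n, j:je] = solve_triangular(
                    A[j:je, j:je], A[je:n, j:je].T, lower=True).T
        return np.tril(A)

The I/O analysis in the paper counts how often the tiles `A[j:n, k:ke]` in the update loop must be re-read from disk for each block column, which is exactly where the one-tile scheme economizes when memory is tight.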
We report on a multiresolution rendering system driving light field displays based on a specially arranged array of projectors and a holographic screen. The system gives multiple freely moving naked-eye viewers the illusion of seeing and manipulating 3D objects with continuous, viewer-independent parallax. Multiresolution techniques that take into account the displayed light field geometry are employed to dynamically adapt model resolution to display capabilities and timing constraints. The approach is demonstrated at two different scales: a desktop PC driving a 7.4 Mbeam TV-size display, and a cluster-parallel solution driving a large (1.6 x 0.9 m) 35 Mbeam display that supports a room-size working space. In both cases, massive meshes of tens of millions of triangles are manipulated at interactive rates. (C) 2008 Elsevier Ltd. All rights reserved.
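The resolution-adaptation step can be caricatured as a budget test: pick the finest level of detail whose triangle count fits both what the display's beam count can resolve and what the frame-time budget allows. All constants and the finest-first level list below are invented for illustration and are not the paper's parameters:

    def pick_level(level_tris, beams, tris_per_beam=0.5,
                   frame_ms=40.0, tris_per_ms=250_000):
        """Return the index of the finest level whose size fits both budgets."""
        budget = min(beams * tris_per_beam,      # what the display can resolve
                     frame_ms * tris_per_ms)     # what the frame time allows
        for idx, tris in enumerate(level_tris):  # levels ordered finest-first
            if tris <= budget:
                return idx
        return len(level_tris) - 1               # fall back to the coarsest

    # e.g. a 35 Mbeam display, with levels from 50M down to 1M triangles:
    level = pick_level([50_000_000, 12_000_000, 3_000_000, 1_000_000],
                       beams=35_000_000)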
ISBN (print): 9781605581064
Improving the quality of tetrahedral meshes is an important operation in many scientific computing applications. Meshes with badly shaped elements impact both the accuracy and convergence of scientific applications. State-of-the-art mesh improvement techniques rely on sophisticated numerical optimization methods such as feasible Newton or conjugate gradient. Unfortunately, these methods cannot be practically applied to very large meshes due to their global nature. Our contribution in this paper is to describe a streaming framework for tetrahedral mesh optimization. This framework enables the optimization of meshes an order of magnitude larger than previously feasible, effectively optimizing meshes too large to fit in memory. Our results show that streaming is typically faster than global optimization and results in comparable mesh quality. This leads us to conclude that streaming extends mesh optimization to a new class of mesh sizes without compromising the quality of the optimized mesh.
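The streaming framework's structure can be sketched as a sliding window: cells arrive in an I/O-coherent order, a resident window is optimized locally and flushed, and the whole mesh is never in memory at once. The optimizer callback and window size below are hypothetical placeholders, not the paper's implementation:

    def stream_optimize(read_cells, write_cells, optimize, window=100_000):
        """Slide a fixed-size window of cells through core, optimizing locally."""
        buf = []
        for cell in read_cells():            # single pass over the mesh file
            buf.append(cell)
            if len(buf) == window:
                half = window // 2
                # optimize the older half against the newer half as context,
                # then flush it; the newer half stays resident for overlap
                write_cells(optimize(buf[:half], context=buf[half:]))
                buf = buf[half:]
        write_cells(optimize(buf, context=[]))   # flush the final window

Because each numerical optimization call only ever sees one window plus its overlap, peak memory is bounded by the window size regardless of total mesh size, which is what lifts the order-of-magnitude scale limit the abstract mentions.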
This dissertation addresses several performance optimization issues in the context of the Tensor Contraction Engine (TCE), a domain-specific compiler to synthesize parallel, out-of-core programs for a class of scientific computations encountered in computational chemistry and physics. The domain of our focus is electronic structure calculations, where many computationally intensive components are expressible as a set of tensor contractions. These scientific applications are extremely compute-intensive and consume significant computer resources at national supercomputer centers. The manual development of high-performance parallel programs for them is usually very tedious and time-consuming. The TCE system is targeted at reducing the burden on application scientists, by having them specify computations in a high-level form, from which efficient parallel programs are automatically synthesized.

The goal of this research is to develop an optimization framework to derive high-performance implementations for a set of given tensor contractions. In particular, the issues investigated include: (1) Development of an efficient in-memory parallel algorithm for a tensor contraction: a tensor contraction is essentially a generalized matrix multiplication involving multi-dimensional arrays. A novel parallel tensor contraction algorithm is developed by extending Cannon's memory-efficient parallel matrix multiplication algorithm. (2) Design of a performance-model-driven framework for a parallel out-of-core tensor contraction: for a parallel out-of-core tensor contraction, besides the in-core parallel algorithm used, several other factors can affect the overall performance, such as the nested-loop structure (permutation), tile size selection, disk I/O placement, and the data partitioning pattern. The best choice depends on the characteristics of the target machine and the input data. We develop performance models for different parallel out-of-core alternatives and use p…
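Point (1) rests on the observation that a tensor contraction is a generalized matrix multiplication, which is easy to verify concretely: any contraction can be turned into a single GEMM by grouping its indices. A small NumPy check with arbitrary extents:

    import numpy as np

    a, b, c, i, j, k = 4, 5, 6, 7, 8, 9            # arbitrary index extents
    A = np.random.rand(a, c, i, k)
    B = np.random.rand(c, b, k, j)

    # the contraction C[a,b,i,j] = sum over c,k of A[a,c,i,k] * B[c,b,k,j]
    ref = np.einsum("acik,cbkj->abij", A, B)

    # group (a,i) as GEMM rows, (c,k) as the summed dimension, (b,j) as columns
    A2 = A.transpose(0, 2, 1, 3).reshape(a * i, c * k)
    B2 = B.transpose(0, 2, 1, 3).reshape(c * k, b * j)
    C = (A2 @ B2).reshape(a, i, b, j).transpose(0, 2, 1, 3)

    assert np.allclose(ref, C)

This reduction is what allows Cannon's algorithm, defined for two-dimensional matrix multiplication, to be extended to multi-dimensional tensor contractions.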
Unstructured tetrahedral meshes are commonly used in scientific computing to represent scalar, vector, and tensor fields in three dimensions. Visualization of these meshes can be difficult to perform interactively due to their size and complexity. By reducing the size of the data, we can accomplish real-time visualization necessary for scientific analysis. We propose a two-step approach for streaming simplification of large tetrahedral meshes. Our algorithm arranges the data on disk in a streaming, I/O-efficient format that allows coherent access to the tetrahedral cells. A quadric-based simplification is sequentially performed on small portions of the mesh in-core. Our output is a coherent streaming mesh which facilitates future processing. Our technique is fast, produces high quality approximations, and operates out-of-core to process meshes too large for main memory.
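The quadric-based step is the classic quadric error metric in the Garland-Heckbert style: each face contributes a plane quadric, per-vertex quadrics are summed, and an edge collapse is charged the quadric error at the placed vertex. A surface-mesh sketch for clarity, not the paper's code (for tetrahedral meshes the quadrics are built from cell boundary planes instead):

    import numpy as np

    def plane_quadric(p0, p1, p2):
        """4x4 fundamental quadric of the plane through a triangle's vertices."""
        n = np.cross(p1 - p0, p2 - p0)
        n = n / np.linalg.norm(n)
        p = np.append(n, -np.dot(n, p0))      # plane coefficients [a, b, c, d]
        return np.outer(p, p)

    def collapse_cost(Q, v):
        """Summed squared distance of position v to all planes folded into Q."""
        h = np.append(v, 1.0)                 # homogeneous position
        return h @ Q @ h

    # cost of collapsing edge (u, v) to its midpoint, given per-vertex
    # quadrics Qu and Qv accumulated from incident faces:
    #   cost = collapse_cost(Qu + Qv, (u + v) / 2)

Because a quadric is just a 4x4 matrix per vertex and quadrics add under collapse, the metric needs only local state, which is what makes it compatible with the streaming, windowed processing the abstract describes.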