With the advancement of satellite communication technology, the maritime Internet of Things (IoT) has made significant progress. As a result, vast amounts of Automatic Identification System (AIS) data from global vessels are transmitted to various maritime stakeholders through maritime IoT systems. AIS data contains a large amount of dynamic and static information that requires effective and intuitive visualization for comprehensive analysis. However, current visualization models suffer from two major deficiencies: they do not consider interactions between distant pixels, and they are inefficient. To address these issues, we developed a large-scale vessel trajectory visualization algorithm, called the Non-local Kernel Density Estimation (NLKDE) algorithm, which incorporates a non-local convolution process. It accurately calculates the density distribution of vessel trajectories by considering correlations between distant pixels. Additionally, we implemented the NLKDE algorithm within a Graphics Processing Unit (GPU) framework to enable parallel computing and improve operational efficiency. Comprehensive experiments on multiple vessel trajectory datasets show that the NLKDE algorithm excels at vessel trajectory density visualization, and the GPU-accelerated framework shortens execution time enough to achieve real-time results. From both theoretical and practical perspectives, GPU-accelerated NLKDE provides technical support for real-time monitoring of vessel dynamics in complex waters and contributes to the construction of maritime intelligent transportation systems. The code for this paper can be accessed at: https://***/maohliang/GPU-NLKDE.
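As context for how a non-local density estimate differs from a purely local one, the sketch below renders a density map in which every trajectory point contributes to every output pixel through a Gaussian kernel, rather than only to a small neighborhood. It is a minimal NumPy illustration of the general idea, not the authors' NLKDE or its GPU implementation; the grid size, bandwidth, and function names are assumptions.

```python
import numpy as np

def nonlocal_density_map(points, grid_w=256, grid_h=256, bandwidth=8.0):
    """Density map where every trajectory point influences every pixel via a
    Gaussian kernel with no cutoff radius (illustrative sketch only)."""
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]              # pixel centers
    density = np.zeros((grid_h, grid_w), dtype=np.float64)
    for px, py in points:                              # points in pixel coordinates
        # Squared distance from this trajectory point to every pixel.
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        # Long-range (non-local) Gaussian contribution.
        density += np.exp(-d2 / (2.0 * bandwidth ** 2))
    return density / (2.0 * np.pi * bandwidth ** 2 * max(len(points), 1))

# Toy usage: a short synthetic trajectory.
traj = [(40.0, 50.0), (60.0, 80.0), (90.0, 120.0)]
dmap = nonlocal_density_map(traj)
print(dmap.shape, dmap.max())
```

In practice the per-point loop would be replaced by a batched GPU kernel, which is where the parallel speedup described in the abstract comes from.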
Scene representation networks (SRNs) have recently been proposed for compression and visualization of scientific data. However, state-of-the-art SRNs do not adapt the allocation of available network parameters to the complex features found in scientific data, leading to a loss in reconstruction quality. We address this shortcoming with an adaptively placed multi-grid SRN (APMGSRN) and propose a domain decomposition training and inference technique for accelerated parallel training on multi-GPU systems. We also release an open-source neural volume rendering application that allows plug-and-play rendering with any PyTorch-based SRN. Our proposed APMGSRN architecture uses multiple spatially adaptive feature grids that learn where to be placed within the domain, dynamically allocating more neural network resources where the error in the volume is high. This improves the state-of-the-art reconstruction accuracy of SRNs for scientific data without requiring the expensive octree refinement, pruning, and traversal of previous adaptive models. In our domain decomposition approach for representing large-scale data, we train a set of APMGSRNs in parallel on separate bricks of the volume, reducing training time while avoiding the out-of-core overhead otherwise necessary for volumes too large to fit in GPU memory. After training, the lightweight SRNs are used for real-time neural volume rendering in our open-source renderer, where arbitrary view angles and transfer functions can be explored.
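To make the domain decomposition idea concrete, the sketch below fits one small coordinate-based MLP per brick of a volume, each trained independently on samples drawn from its own sub-domain. It is a simplified, single-process PyTorch illustration of brick-wise training, not the APMGSRN architecture or the paper's multi-GPU pipeline; the network size, brick layout, and training loop are assumptions.

```python
import torch
import torch.nn as nn

def make_srn(hidden=64):
    # Tiny coordinate-based MLP: (x, y, z) in [-1, 1]^3 -> scalar value.
    return nn.Sequential(
        nn.Linear(3, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def train_brick(volume_fn, brick_min, brick_max, steps=200, batch=4096):
    """Fit one SRN to the sub-domain [brick_min, brick_max] of the volume.
    volume_fn maps an (N, 3) coordinate tensor to (N, 1) scalar values."""
    model = make_srn()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    lo = torch.tensor(brick_min, dtype=torch.float32)
    hi = torch.tensor(brick_max, dtype=torch.float32)
    for _ in range(steps):
        coords = lo + (hi - lo) * torch.rand(batch, 3)   # samples inside the brick
        target = volume_fn(coords)
        loss = nn.functional.mse_loss(model(coords), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Toy analytic "volume" and a 2-brick decomposition; in practice each brick
# would be trained by its own process or GPU.
toy_volume = lambda c: torch.sin(4.0 * c).prod(dim=1, keepdim=True)
bricks = [((-1, -1, -1), (0, 1, 1)), ((0, -1, -1), (1, 1, 1))]
models = [train_brick(toy_volume, lo, hi) for lo, hi in bricks]
```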
The large-scale motions in 3D turbulent channel flows, known as Turbulent Superstructures (TSS), play an essential role in the dynamics of small-scale structures within the turbulent boundary layer. However, as of today, there is no common agreement on the spatial and temporal relationships between these multiscale structures. We propose a novel space-time visualization technique for analyzing the temporal evolution of these multiscale structures in their spatial context, and thus for shedding further light on the conceptually different explanations of their dynamics. Since the temporal dynamics of TSS are believed to influence the structures in the turbulent boundary layer, we propose a combination of a 2D space-time velocity plot with an orthogonal 2D plot of projected 3D flow structures, which can interactively span the time and space axes. Besides flow structures indicating the fluid motion, we propose showing the variations in derived fields as an additional source of explanation. The relationships between structures at different spatial and temporal scales can be resolved more effectively by using various filtering operations and image registration algorithms. To reduce the information loss due to the non-injective nature of projection, spatial information is encoded into transparency or color. Since the proposed visualization places heavy demands on computational resources and memory bandwidth to stream unsteady flow fields and instantly compute derived 3D flow structures, the implementation exploits data compression, parallel computation capabilities, and high memory bandwidth on recent GPUs via CUDA.
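As a small illustration of the 2D space-time view described above, the sketch below reduces a time series of 3D streamwise-velocity snapshots to an x-t diagram by averaging over the wall-normal and spanwise directions. It is a plain NumPy sketch of the general construction, not the paper's GPU/CUDA pipeline; the array layout, synthetic data, and averaging choice are assumptions.

```python
import numpy as np

def space_time_plot(u, wall_normal_axis=1, spanwise_axis=2):
    """Build a 2D space-time (x-t) map from snapshots u[t, y, z, x] of the
    streamwise velocity by averaging out the y and z directions."""
    return u.mean(axis=(wall_normal_axis, spanwise_axis))  # shape (T, X)

# Synthetic stand-in for channel-flow data: a slowly convecting large-scale
# streamwise pattern plus small-scale noise.
T, Y, Z, X = 64, 16, 16, 128
t = np.arange(T)[:, None, None, None]
x = np.arange(X)[None, None, None, :]
u = np.sin(2 * np.pi * (x - 0.5 * t) / X) + 0.1 * np.random.randn(T, Y, Z, X)

xt = space_time_plot(u)
print(xt.shape)  # (64, 128): rows are time, columns are streamwise position
```

The inclined bands that appear in such an x-t map are the footprint of convecting large-scale structures, which is what the combined space-time view is meant to relate to the projected 3D structures.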
Diagnosing the cluster-based performance of large-scale deep neural network (DNN) models during training is essential for improving training efficiency and reducing resource consumption. However, it remains challenging due to the incomprehensibility of the parallelization strategy and the sheer volume of complex data generated during training. Prior works visually analyze performance profiles and timeline traces to identify anomalies from the perspective of individual devices in the cluster, which is not well suited to studying the root cause of anomalies. In this article, we present a visual analytics approach that empowers analysts to visually explore the parallel training process of a DNN model and interactively diagnose the root cause of a performance issue. A set of design requirements is gathered through discussions with domain experts. We propose an enhanced execution flow of model operators for illustrating parallelization strategies within the computational graph layout. We design and implement an enhanced Marey's graph representation, which introduces the concept of time-span and a banded visual metaphor to convey training dynamics and help experts identify inefficient training processes. We also propose a visual aggregation technique to improve visualization efficiency. We evaluate our approach through case studies, a user study, and expert interviews on two large-scale models run in a cluster, namely the PanGu-alpha 13B model (40 layers) and the ResNet model (50 layers).
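For readers unfamiliar with Marey-style charts, the sketch below plots, for a handful of operators, the time at which each device reaches that operator; each operator becomes a line across the device axis, so a spread-out line hints at a straggler. This is a generic matplotlib toy, not the enhanced Marey's graph, time-span concept, or banded metaphor from the paper; the operator names and timestamps are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical per-device start times (ms) for the same operators.
devices = ["gpu0", "gpu1", "gpu2", "gpu3"]
op_starts = {
    "embedding": [0, 2, 3, 5],
    "attention": [10, 12, 18, 20],   # the spread suggests a lagging device
    "mlp":       [22, 24, 30, 33],
}

fig, ax = plt.subplots(figsize=(6, 3))
for name, starts in op_starts.items():
    ax.plot(starts, range(len(devices)), marker="o", label=name)
ax.set_yticks(range(len(devices)), devices)
ax.set_xlabel("time (ms)")
ax.legend()
fig.savefig("marey_sketch.png", dpi=150)
```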
Voxel-based segmentation volumes often store a large number of labels and voxels, and the resulting amount of data can make storage, transfer, and interactive visualization difficult. We present a lossless compression technique that addresses these challenges. It processes individual small bricks of a segmentation volume and compactly encodes the labelled regions and their boundaries by an iterative refinement scheme. The result for each brick is a list of labels and a sequence of operations to reconstruct the brick, which is further compressed using rANS entropy coding. As the relative frequencies of operations are very similar across bricks, the entropy coding can use global frequency tables for an entire data set, which enables efficient and effective parallel (de)compression. Our technique achieves high throughput (up to gigabytes per second for both compression and decompression) and strong compression ratios of about 1% to 3% of the original data set size while remaining applicable to GPU-based rendering. We evaluate our method on various data sets from different fields and demonstrate GPU-based volume visualization with on-the-fly decompression, level-of-detail rendering (with optional on-demand streaming of detail coefficients to the GPU), and a caching strategy for decompressed bricks for further performance improvement.
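To illustrate why a single global frequency table can work across bricks, the sketch below splits a labelled volume into bricks, derives a toy per-brick "operation" stream, aggregates the symbol frequencies over all bricks, and estimates the entropy-coded size. The RUN/NEW stream is a made-up stand-in for the paper's reconstruction operations, and the entropy estimate is only a rough proxy for what an rANS coder using these frequencies would achieve; brick size and volume contents are assumptions.

```python
import numpy as np
from collections import Counter

def brick_iter(volume, brick=8):
    """Yield non-overlapping brick views of a 3D labelled volume."""
    zs, ys, xs = volume.shape
    for z in range(0, zs, brick):
        for y in range(0, ys, brick):
            for x in range(0, xs, brick):
                yield volume[z:z+brick, y:y+brick, x:x+brick]

def toy_op_stream(b):
    """Stand-in operation stream: whether each voxel repeats its predecessor
    in scan order (RUN) or introduces a label (NEW)."""
    flat = b.ravel()
    return ["RUN" if i > 0 and flat[i] == flat[i-1] else "NEW"
            for i in range(flat.size)]

# Synthetic segmentation volume with a few large labelled regions.
vol = (np.random.rand(32, 32, 32) * 3).astype(np.uint8)
vol[:16] = 0  # make one half a single region so RUN ops dominate

global_freq = Counter()
for b in brick_iter(vol):
    global_freq.update(toy_op_stream(b))   # one shared table for all bricks

total = sum(global_freq.values())
probs = np.array([c / total for c in global_freq.values()])
print(global_freq, f"{-(probs * np.log2(probs)).sum():.2f} bits/op")
```

Because the table is shared, every brick can be encoded and decoded independently and in parallel, which is the property the abstract highlights.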
Contour trees describe the topology of level sets in scalar fields and are widely used in topological data analysis and visualization. A main challenge of utilizing contour trees for large-scale scientific data is their computation at scale using high-performance computing. To address this challenge, recent work has introduced distributed hierarchical contour trees for distributed computation and storage of contour trees. However, effective use of these distributed structures in analysis and visualization requires subsequent computation of geometric properties and branch decomposition to support contour extraction and exploration. In this work, we introduce distributed algorithms for augmentation, hypersweeps, and branch decomposition that enable parallel computation of geometric properties, and support the use of distributed contour trees as query structures for scientific exploration. We evaluate the parallel performance of these algorithms and apply them to identify and extract important contours for scientific visualization.
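To give a flavor of what a sweep over a contour tree computes, the sketch below accumulates a simple geometric property (a per-node vertex count) from the leaves toward the root of a small merge tree. It is a serial toy version of the idea behind hypersweep-style property computation, not the distributed augmentation, hypersweep, or branch decomposition algorithms described above; the tree encoding and the chosen property are assumptions.

```python
from collections import defaultdict, deque

def sweep_subtree_counts(parent, local_count):
    """Accumulate per-node counts toward the root of a tree given as a
    child -> parent map; returns the total count in the subtree below each node."""
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)
    total = dict(local_count)
    # Order nodes so every child is processed before its parent (reverse BFS).
    order, queue = [], deque([root])
    while queue:
        n = queue.popleft()
        order.append(n)
        queue.extend(children[n])
    for n in reversed(order):
        for c in children[n]:
            total[n] += total[c]
    return total

# Toy merge tree: leaves a, b join at s, which joins leaf c at the root r.
parent = {"a": "s", "b": "s", "s": "r", "c": "r", "r": None}
local = {"a": 5, "b": 3, "s": 2, "c": 7, "r": 1}
print(sweep_subtree_counts(parent, local))
# {'a': 5, 'b': 3, 's': 10, 'c': 7, 'r': 18}
```

Properties accumulated this way (counts, volumes, persistence-related measures) are what make the tree usable as a query structure for selecting and extracting important contours.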
ISBN: (Print) 9798331516932; 9798331516925
We propose and discuss a paradigm that allows for expressing data-parallel rendering with the classically non-parallel ANARI API. We propose this as a new standard for data-parallel rendering, describe two different implementations of this paradigm, and use multiple sample integrations into existing applications to show how easy it is to adopt, and what can be gained from doing so.
Understanding the behavior of software in execution is a key step in identifying and fixing performance issues. This is especially important in high performance computing contexts, where even minor performance tweaks can translate into large savings in computational resource use. To aid performance analysis, developers may collect an execution trace, a chronological log of program activity during execution. As traces represent the full history, developers can discover a wide array of possibly previously unknown performance issues, making them an important artifact for exploratory performance analysis. However, interactive trace visualization is difficult due to data size and the complexity of interpreting the data. Traces record nanosecond-level events across many parallel processes, so the collected data is often large and difficult to explore. The rise of asynchronous task-parallel programming paradigms further complicates the relation between events and their probable causes. To address these challenges, we conduct a continuing design study in collaboration with high performance computing researchers. We develop diverse and hierarchical ways to navigate and represent execution trace data in support of their trace analysis tasks. Through an iterative design process, we developed Traveler, an integrated visualization platform for task-parallel traces. Traveler provides multiple linked interfaces to help navigate trace data from multiple contexts. We evaluate the utility of Traveler through user feedback and a case study, finding that integrating multiple modes of navigation in our design supported performance analysis tasks and led to the discovery of previously unknown behavior in a distributed array library.
ISBN: (Print) 9798331516932; 9798331516925
This paper describes the adaptation of a well-scaling parallel algorithm for computing Morse-Smale segmentations based on path compression to a distributed computational setting. Additionally, we extend the algorithm to efficiently compute connected components in distributed structured and unstructured grids, based either on the connectivity of the underlying mesh or on a feature mask. Our implementation is seamlessly integrated with the distributed extension of the Topology ToolKit (TTK), ensuring robust performance and scalability. To demonstrate the practicality and efficiency of our algorithms, we conducted a series of scaling experiments on large-scale datasets with sizes of up to 4096³ vertices, on up to 64 nodes and 768 cores.
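For readers unfamiliar with path compression, the sketch below uses a path-compressing union-find to label connected components of a feature mask on a small structured grid, which is the serial core of the kind of connected-component computation extended here to distributed grids. The 6-connectivity choice, grid setup, and helper names are assumptions for illustration, not the distributed TTK implementation.

```python
import numpy as np

def find(parent, i):
    # Path compression: point every visited node directly at the root.
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:
        parent[i], i = root, parent[i]
    return root

def connected_components(mask):
    """Label 6-connected components of a boolean mask on a structured 3D grid."""
    idx = np.arange(mask.size).reshape(mask.shape)
    parent = np.arange(mask.size)
    for axis in range(3):
        a = idx.take(range(0, mask.shape[axis] - 1), axis=axis)
        b = idx.take(range(1, mask.shape[axis]), axis=axis)
        for i, j in zip(a.ravel(), b.ravel()):
            if mask.flat[i] and mask.flat[j]:          # both voxels in the feature
                parent[find(parent, i)] = find(parent, j)
    labels = np.full(mask.size, -1)
    for i in range(mask.size):
        if mask.flat[i]:
            labels[i] = find(parent, i)
    return labels.reshape(mask.shape)

# Two separate blobs in a small grid should receive two distinct labels.
m = np.zeros((4, 4, 4), dtype=bool)
m[0, 0, 0:2] = True
m[3, 3, 3] = True
print(np.unique(connected_components(m)))  # -1 (background) plus two root ids
```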
The promotion of large-scale applications of reinforcement learning (RL) requires efficient training computation. While existing parallel RL frameworks encompass a variety of RL algorithms and parallelization techniques, their excessively burdensome communication frameworks prevent them from reaching the hardware's throughput limit and full training performance on a single desktop. In this article, we propose Spreeze, a lightweight parallel framework for RL that efficiently utilizes the hardware resources of a single desktop to approach the throughput limit. We asynchronously parallelize the experience sampling, network update, performance evaluation, and visualization operations, and employ multiple efficient data transmission techniques to transfer various types of data between processes. The framework can automatically adjust the parallelization hyperparameters based on the computing ability of the hardware device in order to perform efficient large-batch updates. Based on the characteristics of the "Actor-Critic" family of RL algorithms, our framework uses dual GPUs to update the actor and critic networks independently, further improving throughput. Simulation results show that our framework can achieve up to 15,000 Hz experience sampling and a 370,000 Hz network update frame rate using only a personal desktop computer, an order of magnitude higher than other mainstream parallel RL frameworks, resulting in a 73% reduction in training time. Our work on fully utilizing the hardware resources of a single desktop computer is fundamental to enabling efficient large-scale distributed RL training.
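The sketch below shows the general shape of the asynchronous split between experience sampling and network updates, using Python multiprocessing and a shared queue. It is a generic illustration of that producer/consumer pattern under assumed process roles, queue size, and batch size, not the Spreeze framework or its specific data-transmission techniques; the gradient step is stubbed out.

```python
import multiprocessing as mp
import random

def sampler(queue, n_transitions=200):
    """Experience-sampling process: push (state, action, reward) tuples."""
    for _ in range(n_transitions):
        transition = (random.random(), random.randint(0, 3), random.random())
        queue.put(transition)
    queue.put(None)  # sentinel: sampling finished

def learner(queue, batch_size=32):
    """Update process: drain the queue and perform a (stubbed) batch update."""
    batch, updates = [], 0
    while True:
        item = queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            updates += 1          # placeholder for a gradient step
            batch.clear()
    print(f"performed {updates} batch updates")

if __name__ == "__main__":
    q = mp.Queue(maxsize=1024)
    procs = [mp.Process(target=sampler, args=(q,)),
             mp.Process(target=learner, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Because sampling and updating run in separate processes, neither blocks the other except through the bounded queue, which is the basic mechanism a framework like the one described above builds on.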