High-performance computing (HPC) has revolutionized our ability to perform detailed simulations of complex real-world processes. A prominent contemporary example is from aerospace propulsion, where HPC is used for rotating detonation rocket engine (RDRE) simulations in support of the design of next-generation rocket engines; however, these simulations take millions of core hours even on powerful supercomputers, which makes them impractical for engineering tasks like design exploration and risk assessment. Data-driven reduced-order models (ROMs) aim to address this limitation by constructing computationally cheap yet sufficiently accurate approximations that serve as surrogates for the high-fidelity model. This paper contributes a distributed-memory algorithm that achieves fast and scalable construction of predictive physics-based ROMs trained from sparse datasets of extremely large state dimension. The algorithm learns structured physics-based ROMs that approximate the dynamical systems underlying those datasets. This enables model reduction for problems at a scale and complexity that exceed the capabilities of standard, serial approaches. We demonstrate our algorithm's scalability using up to 2,048 cores on the Frontera supercomputer at the Texas Advanced Computing Center. We focus on a real-world three-dimensional RDRE for which one millisecond of simulated physical time requires one million core hours on a supercomputer. Using a training dataset of 2,536 snapshots, each of state dimension 76 million, our distributed algorithm enables the construction of a predictive data-driven reduced model in just 13 seconds on 2,048 cores on Frontera.
ISBN: (print) 9783031562075; 9783031562082
The paper is devoted to an analysis and comparison of the development of new high-performance computers and the development of new, more reliable versions of the Danish Eulerian Model for computer studies of the transport of air pollutants over Europe and surrounding areas, of some economic and agricultural problems, of regional and global climate change, etc.
ISBN: (print) 9798350380415; 9798350380408
Sparse tensor computation plays a crucial role in modern deep learning workloads, and its high computational cost leads to a strong demand for high-performance operators. However, developing high-performance sparse operators is exceptionally challenging and tedious. Existing vendor operator libraries fail to keep pace with the evolving trends in new algorithms. Sparse tensor compilers simplify the development and optimization of operators, but existing work either requires significant engineering effort for tuning or suffers from limitations in search space and search strategies, which creates unavoidable cost and efficiency issues. In this paper, we propose AutoSparse, a source-to-source auto-tuning framework that targets the sparse format and schedule of sparse tensor programs. First, AutoSparse designs a sparse tensor DSL based on a dynamic computational graph at the front-end, and proposes a scheme for extracting the computational pattern of a sparse tensor program and automatically generating the design space based on it. Second, AutoSparse's back-end designs an adaptive exploration strategy based on reinforcement learning and a heuristic algorithm to find the optimal format and schedule configuration in a large-scale design space. Compared to prior work, developers using AutoSparse do not need to specify a tuning design space that relies on any compilation or hardware knowledge. We use the SuiteSparse dataset to compare with four state-of-the-art baselines, namely, the high-performance operator library MKL, the manually optimized scheme ASpT, and the auto-tuning frameworks TVM-S and WACO. The results demonstrate that AutoSparse achieves average speedups of 1.92-2.48x, 1.19-6.34x, and 1.47-2.23x for the SpMV, SpMM, and SDDMM operators, respectively. We will open-source AutoSparse at https://***/Qu-Xiangjun/AutoSparse.
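For readers unfamiliar with the operators named above, the following is a minimal sketch of SpMV (sparse matrix times dense vector) over the CSR format, one of the sparse formats a tuner could choose between. It is an illustration only and not AutoSparse's DSL or generated code; the function name `spmv_csr` and the example matrix are assumptions made for this sketch.

```python
def spmv_csr(indptr, indices, data, x):
    """Compute y = A @ x where A is stored in CSR form.

    indptr[r]..indptr[r+1] delimits the nonzeros of row r,
    indices holds their column ids and data their values.
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


# Hypothetical 3x3 example matrix:
# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
indptr = [0, 2, 3, 5]
indices = [0, 2, 2, 0, 1]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(indptr, indices, data, x)
print(y)  # [7.0, 9.0, 14.0]
```

The loop order and the storage format are exactly the kind of choices (schedule and format) that determine performance for a given sparsity pattern, which is why they are natural tuning targets.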
ISBN: (print) 9798350380415; 9798350380408
High-density solid-state drives (SSDs), such as triple-level cell (TLC) or quad-level cell (QLC) flash, are adopted in parity-based RAID systems to achieve high reliability with low redundancy. However, parity writes cause high write wear, which is unfriendly to such high-density SSDs with low write endurance. Conversely, high-performance SSDs, such as ZNAND and XL-Flash, have high write endurance, but their high cost per bit hinders their deployment in RAID. This paper proposes a novel hybrid RAID structure, RAID45, to reduce parity writes for high-density SSDs. Specifically, RAID45 uses a high-performance SSD to store the parity of write-intensive stripes, absorbing as much of the wear of parity writes on high-density SSDs as possible. Experimental results on a real platform show that RAID45 achieves encouraging parity write reduction on the high-density SSDs.
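To make concrete why every data write in a parity-based RAID also costs a parity write, here is a minimal sketch of XOR parity and the standard read-modify-write parity update. It illustrates generic single-parity RAID only; it does not model RAID45's placement policy, and the helper names `xor_blocks` and `small_write` are assumptions made for this sketch.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def small_write(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Parity after overwriting one data block: P' = P xor D_old xor D_new."""
    return xor_blocks(old_parity, old_data, new_data)


# A stripe of three data blocks plus one parity block.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(*stripe)

# Overwriting a single data block forces one extra write to the parity block.
new_block = b"\xaa\x55"
parity = small_write(stripe[1], new_block, parity)
stripe[1] = new_block

# The incrementally updated parity matches a full recomputation ...
assert parity == xor_blocks(*stripe)
# ... and any single lost block can be rebuilt from the survivors.
recovered = xor_blocks(stripe[0], stripe[2], parity)
assert recovered == new_block
```

Because the parity block is rewritten on every update to its stripe, it wears faster than the data blocks, which is the wear that the abstract proposes to redirect to a higher-endurance device.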
ISBN: (print) 9798350380415; 9798350380408
The General-purpose Graphics Processing Unit (GPGPU) has become the most popular platform for accelerating modern applications such as Large Language Models and Generative AI, while the lack of advanced open-source hardware microarchitectures restricts high-performance GPGPU research. In this work, we propose Ventus, a high-performance open-source GPGPU based on RISC-V with Vector Extension (RVV). Customized instructions and a holistic software toolchain are implemented to achieve high performance. Ventus is successfully deployed on an FPGA platform consisting of 4 Xilinx VU19P devices, scaling up to 16 Streaming Multiprocessors (SMs) with 256 warps. Results imply that Ventus possesses critical features of commercial GPGPUs and has achieved an average reduction of 83.9% in instruction count and 87.4% in CPI over the state-of-the-art open-source implementation. Ventus can be found on GitHub (https://***/THU-DSP-LAB/ventus-gpgpu).
ISBN: (print) 9798350353006
We have recently seen tremendous progress in photo-real human modeling and rendering. Yet, efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. In this paper, we present HiFi4G, an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. Our core intuition is to marry the 3D Gaussian representation with non-rigid tracking, achieving a compact and compression-friendly representation. We first propose a dual-graph mechanism to obtain motion priors, with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating. We also present a companion compression scheme with residual compensation for immersive experiences on various platforms. It achieves a substantial compression rate of approximately 25 times, with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of our approach, which significantly outperforms existing approaches in terms of optimization speed, rendering quality, and storage overhead. Project page: https://***/HiFi4G/.
We present a randomized differential testing approach to test OpenMP implementations. In contrast to previous work that manually creates dozens of verification and validation tests, our approach is able to randomly ge...
ISBN: (print) 9783031533815; 9783031533822
This article describes a program, run from an operator's console or mobile phone, that simultaneously assesses the reaction of each participant in the experiment and of the group as a whole to visual triggers that can vary in two ways: the digit value (from 0 to 9) and the digit color. Once the experiment is completed, individual and group performance scores of the complex sensorimotor reaction of the participants are displayed on the monitor screen, and the corresponding system database is generated for further processing and analysis of the results. Running the experiment with a program to test the sensorimotor reaction of a computer operator ensured high reliability in selecting computer operators and increased the technological capacity of the assessment by determining the efficiency of the complex sensorimotor reaction of a human operator rather than the time. This article provides the results of assessing the effectiveness of the complex sensorimotor reaction of a computer operator under group and individual conditions and presents their comparative values.
ISBN: (print) 9798350380415; 9798350380408
Memory-mapped IO offers several advantages over explicit read/write IO. It requires no system call, incurs minimal overhead in case of cache hits, and avoids extra data copies between user and kernel space. However, we still identify inefficiencies in current memory-mapped IO designs when they meet fast storage devices: i) the heavy IO stack in the page fault handler, ii) the suboptimal prefetching design, and iii) the inefficient eviction policy. To address these limitations, we present SuperMap, an alternative design for memory-mapped IO in Linux, which specifically brings high performance and flexibility for fast devices. First, SuperMap designs a lightweight and asynchronous IO stack that accesses the device directly, reducing software overhead significantly. Second, SuperMap introduces a fine-grained and application-customized prefetcher framework based on eBPF, further improving performance. Third, SuperMap proposes a hotness-aware eviction policy with hardware assistance, trying to keep frequently accessed data in memory. Through evaluations using benchmarks and real-world applications, we demonstrate that SuperMap outperforms the state-of-the-art memory-mapped IO design (FastMap) by up to 67%.
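The contrast between explicit read/write IO and memory-mapped IO drawn above can be sketched with Python's standard `mmap` module. This is only an API-level illustration on a temporary file, not SuperMap or FastMap; the helper names are assumptions made for this sketch, and in Python the slice still copies bytes out of the mapping, so it does not reproduce the zero-copy benefit itself.

```python
import mmap
import os
import tempfile


def read_with_syscall(path, offset, size):
    """Explicit IO: each access goes through seek/read system calls."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def read_with_mmap(path, offset, size):
    """Memory-mapped IO: the file is accessed like an in-memory buffer.

    The first touch of a page triggers a page fault that loads it;
    later accesses to a cached page need no system call.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return bytes(mapped[offset:offset + size])


# Create a small scratch file to read from.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    path = tmp.name

try:
    via_read = read_with_syscall(path, 4096, 8)
    via_mmap = read_with_mmap(path, 4096, 8)
finally:
    os.remove(path)

print(via_read, via_mmap)  # b'67890123' b'67890123'
```

The page-fault path, prefetching, and eviction that the abstract targets all sit behind the `mapped[...]` access and are handled by the kernel, which is why they are invisible at this level.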
ISBN: (print) 9783031497360; 9783031497377
The SYDPACC framework for the COQ proof assistant is based on a transformational approach to developing verified, efficient, and scalable parallel functional programs from specifications. These specifications are written as inefficient sequential programs (potentially with a high computational complexity). We obtain efficient parallel programs implemented using algorithmic skeletons, which are higher-order functions implemented in parallel on distributed data structures. The output programs are constructed step by step by applying transformation theorems. Leveraging COQ type classes, the application of transformation theorems is partly automated. The current version of the framework is presented and illustrated by the development of a parallel program for the maximum segment sum problem. This program is evaluated experimentally on a parallel machine.
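The step from an inefficient specification to a skeleton-based program can be sketched for the maximum segment sum problem as follows. This is plain Python, not the Coq development, and it uses the well-known textbook formulation in which each element is lifted to a 4-tuple and tuples are merged with an associative operator; whether the framework derives exactly this form is not stated above. The function names and the convention that the empty segment (sum 0) is allowed are assumptions made for this sketch.

```python
from functools import reduce


def mss_spec(xs):
    """Specification: try every contiguous segment (cubic time)."""
    n = len(xs)
    return max(sum(xs[i:j]) for i in range(n + 1) for j in range(i, n + 1))


def lift(x):
    """Map step: (best segment, best prefix, best suffix, total) of [x]."""
    best = max(x, 0)
    return (best, best, best, x)


def combine(a, b):
    """Associative merge of the summaries of two adjacent chunks."""
    seg_a, pre_a, suf_a, tot_a = a
    seg_b, pre_b, suf_b, tot_b = b
    return (
        max(seg_a, seg_b, suf_a + pre_b),  # best segment may straddle the cut
        max(pre_a, tot_a + pre_b),
        max(suf_b, suf_a + tot_b),
        tot_a + tot_b,
    )


IDENTITY = (0, 0, 0, 0)  # summary of the empty list


def mss_chunked(xs, chunks):
    """Reduce each chunk independently, then merge the partial summaries.

    The per-chunk reductions are independent, so a parallel map/reduce
    skeleton could run them on different processors.
    """
    size = max(1, -(-len(xs) // chunks))  # ceiling division
    partials = [
        reduce(combine, map(lift, xs[i:i + size]), IDENTITY)
        for i in range(0, len(xs), size)
    ]
    return reduce(combine, partials, IDENTITY)[0]


xs = [2, -4, 2, -1, 6, -3]
print(mss_spec(xs), mss_chunked(xs, 3))  # 7 7
```

Associativity of `combine` is what makes the chunk boundaries irrelevant to the result; in a verified setting that property is the kind of obligation a transformation theorem would require to be proved.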