Search Results - Inner Mongolia University Library

Distributed computing for physics-based data-driven reduced modeling at scale: Application to a rotating detonation rocket engine

Computer Physics Communications, 2025, Vol. 313

Authors: Farcas, Ionut-Gabriel; Gundevia, Rayomand P.; Munipalli, Ramakanth; Willcox, Karen E. Affiliations: Univ Texas Austin, Oden Inst Computat Engn & Sci, Austin, TX 78712 USA; Virginia Tech, Dept Math, Blacksburg, VA USA; Amentum, Edwards AFB, CA USA; Air Force Res Lab, Edwards AFB, CA USA

High-performance computing (HPC) has revolutionized our ability to perform detailed simulations of complex real-world processes. A prominent contemporary example is from aerospace propulsion, where HPC is used for rotating detonation rocket engine (RDRE) simulations in support of the design of next-generation rocket engines; however, these simulations take millions of core hours even on powerful supercomputers, which makes them impractical for engineering tasks like design exploration and risk assessment. Data-driven reduced-order models (ROMs) aim to address this limitation by constructing computationally cheap yet sufficiently accurate approximations that serve as surrogates for the high-fidelity model. This paper contributes a distributed memory algorithm that achieves fast and scalable construction of predictive physics-based ROMs trained from sparse datasets of extremely large state dimension. The algorithm learns structured physics-based ROMs that approximate the dynamical systems underlying those datasets. This enables model reduction for problems at a scale and complexity that exceeds the capabilities of standard, serial approaches. We demonstrate our algorithm's scalability using up to 2,048 cores on the Frontera supercomputer at the Texas Advanced Computing Center. We focus on a real-world three-dimensional RDRE for which one millisecond of simulated physical time requires one million core hours on a supercomputer. Using a training dataset of 2,536 snapshots, each of state dimension 76 million, our distributed algorithm enables the construction of a predictive data-driven reduced model in just 13 seconds on 2,048 cores on Frontera.

Keywords: High-performance computing; Data-driven modeling; Scientific machine learning; Large-scale simulations; Rocket combustion
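
The abstract does not describe the internals of the distributed algorithm. As a hedged illustration of why snapshot-based reduction can be distributed at all, the sketch below shows a generic method-of-snapshots step that is assumed here, not taken from the paper. The rows of a snapshot matrix (state dimension × number of snapshots) are split across simulated ranks, and each rank forms the small Gram matrix of its own rows. Summing those local matrices, which is the job an MPI allreduce would do on a real machine, gives the global Gram matrix, and its eigendecomposition yields a POD basis. The functions `local_gram`, `sum_grams`, and `split_rows` are hypothetical names. For scale, assuming double precision, 2,536 snapshots of dimension 76 million amount to about 1.9 × 10^11 values (roughly 1.5 TB), while the 2,536 × 2,536 Gram matrix is only about 51 MB.

```python
from typing import List

Matrix = List[List[float]]


def local_gram(rows: Matrix, k: int) -> Matrix:
    """Return Q_i^T Q_i (k x k) for a block of rows Q_i of the snapshot matrix."""
    gram = [[0.0] * k for _ in range(k)]
    for row in rows:
        for a in range(k):
            for b in range(k):
                gram[a][b] += row[a] * row[b]
    return gram


def sum_grams(grams: List[Matrix], k: int) -> Matrix:
    """Elementwise sum of per-rank Gram matrices (stand-in for an MPI allreduce)."""
    total = [[0.0] * k for _ in range(k)]
    for g in grams:
        for a in range(k):
            for b in range(k):
                total[a][b] += g[a][b]
    return total


def split_rows(q: Matrix, num_ranks: int) -> List[Matrix]:
    """Assign contiguous row blocks of q to num_ranks hypothetical ranks."""
    n = len(q)
    return [q[r * n // num_ranks:(r + 1) * n // num_ranks] for r in range(num_ranks)]


# Toy snapshot matrix: 6 state entries (rows) x 3 snapshots (columns).
q = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 1.0],
     [2.0, 0.0, 1.0],
     [1.0, 1.0, 1.0],
     [0.0, 3.0, 2.0],
     [1.0, 0.0, 4.0]]
k = 3
distributed = sum_grams([local_gram(block, k) for block in split_rows(q, 3)], k)
print(distributed == local_gram(q, k))  # True
print(distributed)  # [[7.0, 3.0, 7.0], [3.0, 15.0, 8.0], [7.0, 8.0, 23.0]]
```

Only k × k matrices are communicated, so the communication volume depends on the number of snapshots, not on the state dimension. The toy values are small integers, which is why the distributed and serial sums match exactly in floating point.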

Development of New High Performance Computer Architectures and Improvements in Danish Eulerian Model for Long Range Transport of Air Pollutants

14th International Conference on Large-Scale Scientific Computations (LSSC)

Authors: Georgiev, Krassimir; Zlatev, Zahari; Lirkov, Ivan. Affiliations: Bulgarian Acad Sci, Inst Informat & Commun Technol, Sofia, Bulgaria; Aarhus Univ, Dept Environm Sci, Roskilde, Denmark

ISBN: (print) 9783031562075; 9783031562082

The paper analyzes and compares the development of new high-performance computers with the improvement and development of new, more reliable versions of the Danish Eulerian Model, which is used for computer studies of the transport of air pollutants over Europe and surrounding areas, as well as for studying some economic and agricultural problems, regional and global climate change, etc.

Keywords: High-performance computer architectures; Mathematical and computer modelling in environmental studies; Speed-up and efficiency of parallel algorithms; Partial and ordinary differential equations

AutoSparse: A Source-to-Source Format and Schedule Auto-Tuning Framework for Sparse Tensor Program

42nd International Conference on Computer Design

Authors: Qu, Xiangjun; Gong, Lei; Lou, Wenqi; Cheng, Qianyu; Chen, Xianglan; Wang, Chao; Zhou, Xuehai. Affiliation: Univ Sci & Technol China, Dept Comp Sci & Technol, Hefei, Peoples R China

ISBN: (print) 9798350380415; 9798350380408

Sparse tensor computation plays a crucial role in modern deep learning workloads, and its high computational cost creates a strong demand for high-performance operators. However, developing high-performance sparse operators is exceptionally challenging and tedious, and existing vendor operator libraries fail to keep pace with evolving algorithms. Sparse tensor compilers simplify the development and optimization of operators, but existing work either requires significant engineering effort for tuning or is limited in its search space and search strategy, which leads to unavoidable cost and efficiency issues. In this paper, we propose AutoSparse, a source-to-source auto-tuning framework that targets the sparse format and schedule of sparse tensor programs. First, AutoSparse's front end provides a sparse tensor DSL based on a dynamic computational graph and, building on it, a scheme for extracting the computational patterns of sparse tensor programs and automatically generating the design space. Second, AutoSparse's back end uses an adaptive exploration strategy based on reinforcement learning and heuristic algorithms to find the optimal format and schedule configuration in a large-scale design space. Compared to prior work, AutoSparse does not require developers to specify a tuning design space that relies on compilation or hardware knowledge. We use the SuiteSparse dataset to compare against four state-of-the-art baselines: the high-performance operator library MKL, the manual optimization scheme ASpT, and the auto-tuning frameworks TVM-S and WACO. The results demonstrate that AutoSparse achieves average speedups of 1.92-2.48x, 1.19-6.34x, and 1.47-2.23x for the SpMV, SpMM, and SDDMM operators, respectively. We will open-source AutoSparse at https://***/Qu-Xiangjun/AutoSparse.

Keywords: Sparse Computation; Sparse Tensor Compiler; Code Generation and Optimizations; Auto-Tuning
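
SpMV is one of the three operators evaluated. To make "format" and "schedule" concrete, the sketch below stores a matrix in compressed sparse row (CSR) format, which is one common format among the many a tuner could pick. It then multiplies the matrix by a dense vector using a simple row-by-row loop order. This is a hand-written illustration, not code generated by AutoSparse, and `csr_from_dense` and `spmv_csr` are hypothetical names.

```python
from typing import List, Tuple

CSR = Tuple[List[int], List[int], List[float]]


def csr_from_dense(dense: List[List[float]]) -> CSR:
    """Compress a dense matrix into CSR arrays (row_ptr, col_idx, values)."""
    row_ptr, col_idx, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                values.append(float(v))
        row_ptr.append(len(values))  # row i occupies values[row_ptr[i]:row_ptr[i+1]]
    return row_ptr, col_idx, values


def spmv_csr(csr: CSR, x: List[float]) -> List[float]:
    """Compute y = A x, visiting only the stored nonzeros of A (row-by-row schedule)."""
    row_ptr, col_idx, values = csr
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[p] * x[col_idx[p]]
    return y


a = [[4, 0, 0, 1],
     [0, 0, 2, 0],
     [0, 3, 0, 0]]
csr = csr_from_dense(a)
print(csr)  # ([0, 2, 3, 4], [0, 3, 2, 1], [4.0, 1.0, 2.0, 3.0])
print(spmv_csr(csr, [1.0, 2.0, 3.0, 4.0]))  # [8.0, 6.0, 6.0]
```

A tuner in the spirit of AutoSparse would vary choices like these and keep whichever configuration runs fastest for a given matrix. For example, it might switch to a column-oriented or blocked format, or reorder and tile the loops.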

RAID45: Hybrid Parity-based RAID for Reducing Parity Write Wear on High-Density SSDs

42nd International Conference on Computer Design

Authors: Liu, Jialin; Liang, Yujiong; Song, Yunpeng; Shi, Liang. Affiliations: East China Normal Univ, Mol Engn Res Ctr Software Hardware Codesign Techn, Shanghai, Peoples R China; East China Normal Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China; Huazhong Univ Sci & Technol, Wuhan Natl Lab Optoelect, Wuhan, Peoples R China

ISBN: (print) 9798350380415; 9798350380408

High-density solid-state drives (SSDs), such as triple-level cell (TLC) or quad-level cell (QLC) flash, are adopted in parity-based RAID systems to achieve high reliability with low redundancy. However, parity writes cause heavy write wear, which is unfriendly to such high-density SSDs with their low write endurance. Conversely, high-performance SSDs, such as ZNAND and XL-Flash, have high write endurance, but their high cost per bit hinders their deployment in RAID. This paper proposes a novel hybrid RAID structure, RAID45, to reduce parity writes on high-density SSDs. Specifically, RAID45 uses a high-performance SSD to store the parity of write-intensive stripes, absorbing as much of the parity-write wear on the high-density SSDs as possible. Experimental results on a real platform show that RAID45 achieves encouraging parity write reduction on the high-density SSDs.

Keywords: Flash-based SSDs
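
The wear problem follows from how parity is maintained. In a parity-based array, parity is commonly the bytewise XOR of a stripe's data blocks. As a result, a small write to a single data block also forces a parity write, via P_new = P_old XOR D_old XOR D_new. The sketch below uses hypothetical helper names and a toy three-block stripe. It checks that identity and counts one parity write per data write. A write-intensive stripe therefore concentrates writes on its parity block, and this is the wear that RAID45 moves onto a high-endurance SSD. The abstract does not give RAID45's exact layout, so nothing here models it.

```python
from typing import List


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)


def stripe_parity(data_blocks: List[bytes]) -> bytes:
    """Full-stripe parity: XOR of every data block in the stripe."""
    return xor_blocks(*data_blocks)


def small_write_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write update: P_new = P_old ^ D_old ^ D_new."""
    return xor_blocks(old_parity, old_data, new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = stripe_parity(stripe)

parity_writes = 0
for new_block in (b"\xaa\xbb", b"\x00\x00", b"\x12\x34"):  # three updates to block 1
    parity = small_write_parity(parity, stripe[1], new_block)
    stripe[1] = new_block
    parity_writes += 1  # each data write also rewrites the parity block

print(parity == stripe_parity(stripe))  # True
print(parity_writes)  # 3
```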

Ventus: A High-performance Open-source GPGPU Based on RISC-V and Its Vector Extension

42nd International Conference on Computer Design

Authors: Li, Jingzhou; Yang, Kexiang; Jin, Chufeng; Liu, Xudong; Yang, Zexia; Yu, Fangfei; Shi, Yujie; Ma, Mingyuan; Kong, Li; Zhou, Jing; Wu, Hualin; He, Hu. Affiliations: Tsinghua Univ, Sch Integrated Circuits, Beijing, Peoples R China; Tsinghua Univ, Int Innovat Ctr, Shanghai, Peoples R China; Terapines, Guangzhou, Peoples R China

ISBN: (print) 9798350380415; 9798350380408

The general-purpose graphics processing unit (GPGPU) has become the most popular platform for accelerating modern applications such as large language models and generative AI, while the lack of advanced open-source hardware microarchitectures restricts high-performance GPGPU research. In this work, we propose Ventus, a high-performance open-source GPGPU based on RISC-V with the Vector Extension (RVV). Customized instructions and a holistic software toolchain are implemented to achieve high performance. Ventus is successfully deployed on an FPGA platform consisting of four Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) with 256 warps. Results indicate that Ventus possesses the critical features of commercial GPGPUs and achieves an average reduction of 83.9% in instruction count and 87.4% in CPI over the state-of-the-art open-source implementation. Ventus can be found on GitHub (https://***/THU-DSP-LAB/ventus-gpgpu).

Keywords: GPGPU; Open-source Design; RISC-V; Vector

HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Jiang, Yuheng; Shen, Zhehao; Wang, Penghao; Su, Zhuo; Hong, Yu; Zhang, Yingliang; Yu, Jingyi; Xu, Lan. Affiliations: ShanghaiTech Univ, Shanghai, Peoples R China; NeuDim, Shanghai, Peoples R China; ByteDance, Beijing, Peoples R China; DGene, Baton Rouge, LA USA

ISBN: (print) 9798350353006

We have recently seen tremendous progress in photo-real human modeling and rendering. Yet, efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. In this paper, we present HiFi4G, an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. Our core intuition is to marry the 3D Gaussian representation with non-rigid tracking, achieving a compact and compression-friendly representation. We first propose a dual-graph mechanism to obtain motion priors, with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating. We also present a companion compression scheme with residual compensation for immersive experiences on various platforms. This scheme achieves a substantial compression rate of approximately 25 times, with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of our approach, which significantly outperforms existing approaches in terms of optimization speed, rendering quality, and storage overhead. Project page: https://***/HiFi4G/.

Keywords: 3D from Multi-view and Sensors; Compact Gaussian Splatting; Human Performance Capture; Volume Rendering

Testing the Unknown: A Framework for OpenMP Testing via Random Program Generation

2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024

Authors: Laguna, Ignacio; Chapman, Patrick; Parasyris, Konstantinos; Georgakoudis, Giorgis; Rubio-Gonzalez, Cindy. Affiliations: Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, United States; University of California, Davis, Department of Computer Science, United States

ISBN: (print) 9798350355543

We present a randomized differential testing approach to test OpenMP implementations. In contrast to previous work that manually creates dozens of verification and validation tests, our approach is able to randomly generate thousands of tests, exposing OpenMP implementations to a wide range of program behaviors. We represent the space of possible random OpenMP tests using a grammar and implement our method as an extension of the Varity program generator. By generating 1,800 OpenMP tests, we find various performance anomalies and correctness issues when we apply them to three OpenMP implementations: GCC, Clang, and Intel. We also present several case studies that analyze the anomalies and give more details about the classes of tests that our approach creates. © 2024 IEEE.

Keywords: Differential testing; OpenMP; Random program generation; Software testing
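
The core loop of differential testing is simple: run one generated test through several implementations and flag any that disagree. In the paper, the implementations are GCC, Clang, and Intel compiling tests produced by an extension of Varity. The hypothetical Python harness below instead uses three summation routines as stand-ins, with a majority vote as the oracle (an assumption for illustration). The disagreement it exposes is genuine floating-point behavior. OpenMP does not fix the order in which a `reduction` combines partial results, so implementations may legitimately produce different floating-point answers, and a differential tester must triage each disagreement as a bug or an expected difference.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence


def sum_forward(xs: Sequence[float]) -> float:
    """Left-to-right accumulation, like a serial loop."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_reverse(xs: Sequence[float]) -> float:
    """Right-to-left accumulation: a different, equally legal combination order."""
    total = 0.0
    for x in reversed(xs):
        total += x
    return total


def differential_test(test_input, impls: Dict[str, Callable]) -> List[str]:
    """Run every implementation on the same input; return those that disagree with the majority."""
    results = {name: impl(test_input) for name, impl in impls.items()}
    # Ties in most_common are broken by first-seen order.
    majority, _ = Counter(results.values()).most_common(1)[0]
    return sorted(name for name, r in results.items() if r != majority)


# Hypothetical stand-ins for three OpenMP implementations running one generated test.
impls = {"forward": sum_forward, "reverse": sum_reverse, "fsum": math.fsum}
print(differential_test([1.0, 1e100, -1e100], impls))  # ['forward']
print(differential_test([0.5, 0.25, 0.125], impls))  # []
```

Comparing floats with `==` is deliberate here, because the point is to surface any bit-level difference. A real harness would likely add tolerances or classify results before reporting them.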

A Model of Two-Parameter Competitive Assessment of the Effectiveness of a Complex Sensorimotor Reaction of a Computer Operator

26th International Conference on Interactive Collaborative Learning (ICL) - Towards a Hybrid, Flexible and Socially Engaged Higher Education / 52nd IGIP International Conference on Engineering Pedagogy

Authors: Kovalenko, Olena; Bondarenko, Tetiana; Kupriyanov, Oleksandr; Yahupov, Vasyl; Cardoso, Luis. Affiliations: Ukrainian Engn Pedag Acad, Kharkiv, Ukraine; Natl Def Univ Ukraine, Kyiv, Ukraine; Polytech Inst Portalegre, Portalegre, Portugal

ISBN: (print) 9783031533815; 9783031533822

This article describes a program, run from an operator's console or a mobile phone, that simultaneously assesses the reaction of each participant in the experiment, and of the group as a whole, to visual triggers that vary in two parameters: the digit value (from 0 to 9) and the digit color. Once the experiment is completed, individual and group performance scores for the participants' complex sensorimotor reaction are displayed on the monitor screen, and a corresponding system database is generated for further processing and analysis of the results. Using the program to test the sensorimotor reaction of computer operators ensured high reliability in selecting operators and increased the technological capacity of the assessment by measuring the efficiency of the operator's complex sensorimotor reaction rather than reaction time alone. This article presents the results of assessing the effectiveness of the complex sensorimotor reaction of a computer operator under group and individual conditions and compares the two.

Keywords: Man-machine system; Sensorimotor Reaction of a Computer Operator

SuperMap: High-Performance and Flexible Memory-Mapped IO for Fast Storage Device

42nd International Conference on Computer Design

Authors: Jia, Wenqing; Jiang, Dejun; Xiong, Jin. Affiliations: Chinese Acad Sci, Inst Comp Technol, SKLP, Beijing, Peoples R China; Univ Chinese Acad Sci, Beijing, Peoples R China

ISBN: (print) 9798350380415; 9798350380408

Memory-mapped IO offers several advantages over explicit read/write IO: it requires no system call, incurs minimal overhead on cache hits, and avoids extra data copies between user and kernel space. However, we still identify inefficiencies in current memory-mapped IO designs when they are used with fast storage devices: i) the heavy IO stack in the page fault handler, ii) suboptimal prefetching, and iii) an inefficient eviction policy. To address these limitations, we present SuperMap, an alternative memory-mapped IO design for Linux that brings high performance and flexibility to fast devices. First, SuperMap provides a lightweight, asynchronous IO stack that accesses the device directly, significantly reducing software overhead. Second, SuperMap introduces a fine-grained, application-customized prefetcher framework based on eBPF, further improving performance. Third, SuperMap adopts a hardware-assisted, hotness-aware eviction policy that tries to keep frequently accessed data in memory. Evaluations with benchmarks and real-world applications show that SuperMap outperforms the state-of-the-art memory-mapped IO design (FastMap) by up to 67%.

Keywords: eBPF; Memory-mapped IO; Prefetching; SSD
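
To ground the terms in the abstract, the sketch below contrasts explicit read IO with memory-mapped IO using Python's standard `mmap` module. It only illustrates the baseline mechanism that SuperMap improves on (stock Linux mmap). It does not model SuperMap's direct device access, eBPF prefetcher, or eviction policy, and the helper names are hypothetical.

```python
import mmap
import os
import tempfile


def read_explicit(path: str, offset: int, length: int) -> bytes:
    """Explicit IO: seek + read, which issue system calls that copy into a user buffer."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_mapped(path: str, offset: int, length: int) -> bytes:
    """Memory-mapped IO: map the file, then index it like memory.

    Touching a page that is not resident triggers a page fault serviced by the
    kernel; later accesses to resident pages need no system call. (Slicing here
    copies into a Python bytes object; C code would dereference the mapping.)
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(range(256)) * 64)  # 16 KiB of sample data
    print(read_explicit(path, 4100, 4) == read_mapped(path, 4100, 4))  # True
    print(read_mapped(path, 4100, 4))  # b'\x04\x05\x06\x07'
finally:
    os.remove(path)
```

On stock Linux, applications can only steer prefetching coarsely, for example with `madvise` hints such as `MADV_SEQUENTIAL` or `MADV_RANDOM`. SuperMap's eBPF framework is described as a finer-grained, application-customized alternative.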

Verified High Performance Computing: The SyDPaCC Approach

16th International Conference on Verification and Evaluation of Computer and Communication Systems (VECoS)

Authors: Loulergue, Frederic; Ed-Dbali, Ali. Affiliation: Univ Orleans, INSA CVL, LIFO EA 4022, Orleans, France

ISBN: (print) 9783031497360; 9783031497377

The SyDPaCC framework for the Coq proof assistant is based on a transformational approach to developing verified, efficient, scalable parallel functional programs from specifications. These specifications are written as inefficient sequential programs, potentially with high computational complexity. We obtain efficient parallel programs implemented using algorithmic skeletons, which are higher-order functions implemented in parallel on distributed data structures. The output programs are constructed step by step by applying transformation theorems. Leveraging Coq type classes, the application of transformation theorems is partly automated. The current version of the framework is presented and illustrated through the development of a parallel program for the maximum segment sum problem. This program is evaluated experimentally on a parallel machine.

Keywords: Program transformation; Scalable parallel computing; Functional programming; Interactive theorem proving
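
The abstract's running example can be made concrete outside Coq. The Python sketch below pairs an inefficient specification of the maximum segment sum (try every segment) with the classic parallel formulation. In that formulation, each chunk is summarized by a four-tuple, and an associative `combine` merges summaries, so the whole computation becomes a map followed by a reduce. This is the kind of pattern an algorithmic skeleton parallelizes. The derivation shown is the textbook one, assumed here for illustration; SyDPaCC's verified development and skeletons may differ, nothing here is machine-checked, and all names are hypothetical. The sketch also assumes that the empty segment is allowed, so the result is never negative.

```python
from functools import reduce
from typing import List, Tuple

# Summary of a list: (mss, mps, mts, total) = best segment sum, best prefix sum,
# best suffix sum, and total. Empty segments are allowed, so the first three are >= 0.
Summary = Tuple[int, int, int, int]
IDENTITY: Summary = (0, 0, 0, 0)


def mss_spec(xs: List[int]) -> int:
    """Specification: try every segment xs[i:j] (cubic time, obviously correct)."""
    n = len(xs)
    return max(sum(xs[i:j]) for i in range(n + 1) for j in range(i, n + 1))


def single(x: int) -> Summary:
    """Summary of the one-element list [x]."""
    m = max(x, 0)
    return (m, m, m, x)


def combine(left: Summary, right: Summary) -> Summary:
    """Associative operator: summary of left ++ right from the two summaries."""
    m1, p1, s1, t1 = left
    m2, p2, s2, t2 = right
    return (max(m1, m2, s1 + p2),  # best segment may straddle the boundary
            max(p1, t1 + p2),      # best prefix may extend into the right part
            max(s2, s1 + t2),      # best suffix may extend into the left part
            t1 + t2)


def mss_parallel(xs: List[int], num_chunks: int) -> int:
    """Map-reduce form: summarize chunks independently, then reduce with combine."""
    n = len(xs)
    chunks = [xs[r * n // num_chunks:(r + 1) * n // num_chunks] for r in range(num_chunks)]
    # Each chunk summary could be computed on a different processor.
    partials = [reduce(combine, map(single, c), IDENTITY) for c in chunks]
    return reduce(combine, partials, IDENTITY)[0]


xs = [3, -4, 5, -1, 2, -6, 4]
print(mss_spec(xs), mss_parallel(xs, 3))  # 6 6
```

Because `combine` is associative with identity `(0, 0, 0, 0)`, the chunk summaries can be computed on separate processors and merged in any grouping. Checking `mss_parallel` against `mss_spec` for several chunk counts is a cheap sanity test of that claim.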
