ISBN: (Print) 9798350364613; 9798350364606
The large amount of floating-point data generated by scientific applications makes data compression essential for I/O performance and efficient storage. However, floating-point data is difficult to compress losslessly, and most compression algorithms are only effective on some files. In this paper, we study the benefit of compressing each file with a potentially different algorithm. For this purpose, we created AdaptiveFC, which is based on a tool that can chain data transformations together to generate millions of compression algorithms. AdaptiveFC uses a genetic algorithm to quickly identify an effective compressor in this vast search space for a given file. A comparison of AdaptiveFC to 15 leading lossless CPU compressors on 77 files from 6 datasets in the SDRBench suite shows that per-file compression yields higher compression ratios on average than any individual algorithm.
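As an illustration of the search described above, the following sketch evolves a small population of compression pipelines for a single file. The stage names (delta, zlib, bz2) and the genetic-algorithm parameters are stand-ins, since the abstract does not specify AdaptiveFC's actual transformations or chaining tool; this is a minimal sketch, not the paper's implementation.

# Toy genetic search over chained byte-level transformations for one file.
# Stages below are illustrative stand-ins, not AdaptiveFC's real components.
import bz2, random, zlib

def delta(data: bytes) -> bytes:
    # byte-wise delta (first byte kept as-is); illustrative preprocessing stage
    return bytes(((b - (data[i - 1] if i else 0)) % 256) for i, b in enumerate(data))

STAGES = {"delta": delta, "zlib": zlib.compress, "bz2": bz2.compress, "none": lambda d: d}

def ratio(pipeline, data):
    out = data
    for name in pipeline:          # apply the chained transformations in order
        out = STAGES[name](out)
    return len(data) / max(len(out), 1)   # compression ratio: higher is better

def evolve(data, generations=20, pop_size=16, length=3):
    pop = [[random.choice(list(STAGES)) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: ratio(p, data), reverse=True)
        parents = pop[: pop_size // 2]           # keep the fittest half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)    # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:            # mutation
                child[random.randrange(length)] = random.choice(list(STAGES))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda p: ratio(p, data))

# usage (hypothetical file name): best = evolve(open("field.f32", "rb").read())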
ISBN: (Print) 9798350364613; 9798350364606
Checkpoint/Restart (C/R) is a widely used fault tolerance mechanism in converged systems of cloud, edge, and HPC. However, users often rely on their experience to determine which variables to checkpoint, as there is currently no benchmark that can provide a reference. This can result in checkpointing redundant or even incorrect variables. To address this issue, we propose a benchmark suite of 20 representative HPC applications that includes manually identified critical variables for checkpointing, together with a method for identifying those critical variables. Our method analyzes data dependencies between variables to identify critical variables analytically. We verify the correctness of the identified variables through an ablation study with the widely used C/R library FTI. With our benchmark suite and data dependency analysis, HPC practitioners now have a reference for identifying checkpointing variables and better knowledge of what kind of variables to checkpoint.
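The sketch below illustrates, in a much simplified form, the kind of data-dependency reasoning described above: a variable is treated as checkpoint-critical if an iteration reads it before redefining it, so its value must survive across iterations and thus across a restart. The statement format and the example loop body are hypothetical; the paper's actual analysis operates on real HPC applications.

# Minimal loop-carried-dependence check (simplified stand-in for the paper's analysis).
def critical_variables(loop_body):
    """loop_body: ordered list of (written_vars, read_vars) tuples for one iteration.
    A variable is checkpoint-critical here if some statement reads it before any
    statement of the same iteration has (re)defined it."""
    critical = set()
    defined_this_iter = set()
    for writes, reads in loop_body:
        # reads not yet defined in this iteration carry a value from a previous one
        critical |= (set(reads) - defined_this_iter)
        defined_this_iter |= set(writes)
    return critical

# usage (hypothetical stencil-style loop): t and step depend on their previous values
body = [({"tmp"}, {"t", "dx"}), ({"t"}, {"tmp"}), ({"step"}, {"step"})]
print(critical_variables(body))   # {'t', 'dx', 'step'}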
ISBN: (Print) 9781665497473
The ubiquity of multicore processors, cloud computing, and hardware accelerators has elevated parallel and distributed computing (PDC) topics into fundamental building blocks of the undergraduate CS curriculum. Therefore, it is increasingly important for students to learn a common core of introductory PDC topics and develop parallel thinking skills early in their CS studies. We present the curricular design, pedagogy, and goals of an introductory-level course on computer systems that introduces parallel computing to students who have only a CS1 background. Our course focuses on three curricular goals that serve to integrate the ACM/IEEE TCPP guidelines throughout: a vertical slice through the computer of how it runs a program; evaluating system costs associated with running a program; and taking advantage of the power of parallel computing. We elaborate on the goals and details of our course's key modules, and we discuss our pedagogical approach, which includes active-learning techniques. We find that the PDC foundation gained through early exposure in this course helps students gain confidence in their ability to expand and apply their understanding of PDC concepts throughout their CS education.
Balancing robustness and computational efficiency in machine learning models is challenging, especially in settings with limited resources like mobile and IoT devices. This study introduces Adaptive and Localized Adve...
ISBN: (Print) 9798350364613; 9798350364606
Understanding the performance behavior of parallel applications is important in many ways, but doing so is not easy. Most open source analysis tools are written for the command line. We are building on these proven tools to provide an interactive performance analysis experience within Jupyter Notebooks when developing parallel code with MPI, OpenMP, or both. Our solution makes it possible to measure the execution time, perform profiling and tracing, and visualize the results within the notebooks. For ease of use, it provides both a graphical JupyterLab extension and a C++ API. The JupyterLab extension shows a dialog where the user can select the type of analysis and its parameters. Internally, this tool uses Score-P, Scalasca, and Cube to generate profiling and tracing data. This tight integration gives students easy access to profiling tools and helps them better understand concepts such as benchmarking, scalability, and performance bottlenecks. In addition to the technical development, the article presents hands-on exercises from our well-established parallel programming course. We conclude with a qualitative and quantitative evaluation with 19 students, which shows a positive effect of the tools on the students' perceived competence.
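The following sketch shows the benchmarking and scalability concepts the course targets, reduced to wall-clock timing of an MPI run at several process counts. The launcher name `mpiexec` and the binary `./heat` are assumptions for illustration; the tooling described above relies on Score-P, Scalasca, and Cube rather than hand-rolled timing.

# Notebook-style strong-scaling benchmark of an (assumed) MPI binary.
import subprocess, time

def run_once(nprocs: int, binary: str = "./heat") -> float:
    start = time.perf_counter()
    subprocess.run(["mpiexec", "-n", str(nprocs), binary], check=True)
    return time.perf_counter() - start

def strong_scaling(proc_counts=(1, 2, 4, 8)):
    baseline = run_once(proc_counts[0])
    for p in proc_counts:
        t = run_once(p)
        # speedup relative to the smallest run; efficiency = speedup / ranks
        print(f"{p:2d} ranks: {t:7.3f} s  speedup {baseline / t:5.2f}  "
              f"efficiency {baseline / (t * p):4.2f}")

# In a Jupyter cell one would call strong_scaling() and plot the printed values.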
ISBN: (Print) 9781665481069
More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ designed to take advantage of the GPU's power. At present, cuSZ's compression performance has been optimized significantly, while its decompression still suffers from considerably lower performance because of its sophisticated lossless compression step, a customized Huffman decoding. In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving the overall decompression performance in turn. To this end, we first investigate two state-of-the-art GPU Huffman decoders in depth. Then, we propose a deep architectural optimization for both algorithms. Specifically, we take full advantage of CUDA GPU architectures by using shared memory in the decoding/writing phases, tuning the amount of shared memory to use online, improving memory access patterns, and reducing warp divergence. Finally, we evaluate our optimized decoders on an Nvidia V100 GPU using eight representative scientific datasets. Our new decoding solution obtains an average speedup of 3.64x over cuSZ's Huffman decoder and improves its overall decompression performance by 2.43x on average.
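For context, the sketch below is a plain serial Huffman encode/decode pair. It shows the bit-serial dependency that makes Huffman decoding the hard-to-parallelize step targeted above: the start of each codeword depends on where the previous one ended. It does not reproduce the paper's GPU-architectural optimizations; the input data is an arbitrary example.

# Serial reference Huffman coder (illustrative baseline, not the cuSZ implementation).
import heapq
from collections import Counter

def build_codes(data: bytes) -> dict:
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

def encode(data, codes):
    return "".join(codes[b] for b in data)

def decode(bits, codes):
    rev, cur, out = {v: k for k, v in codes.items()}, "", bytearray()
    for bit in bits:              # inherently sequential: codeword boundaries are unknown
        cur += bit
        if cur in rev:
            out.append(rev[cur])
            cur = ""
    return bytes(out)

data = b"quantized residuals compress well"
codes = build_codes(data)
assert decode(encode(data, codes), codes) == data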
ISBN: (Print) 9798350364613; 9798350364606
While parallel programming, particularly on graphics processing units (GPUs), and numerical optimization hold immense potential to tackle real-world computational challenges across disciplines, their inherent complexity and technical demands often act as daunting barriers to entry. This, unfortunately, limits accessibility and diversity within these crucial areas of computer science. To combat this challenge and ignite excitement among undergraduate learners, we developed an application-driven course, harnessing robotics as a lens to demystify the intricacies of these topics and make them tangible and engaging. Our course's prerequisites are limited to the required undergraduate introductory core curriculum, opening doors for a wider range of students. Our course also features a large final-project component to connect theoretical learning to applied practice. In our first offering of the course, we attracted 27 students without prior experience in these topics and found that an overwhelming majority of the students felt that they learned both technical and soft skills and felt prepared for future study in these fields.
ISBN: (Print) 9781665497473
Due to the short decoherence time of qubits available in the NISQ era, it is essential to pack (minimize the size and/or the depth of) a logical quantum circuit as efficiently as possible given a sparsely coupled physical architecture. In this work we introduce a locality-aware qubit routing algorithm based on a graph theoretic framework. Our algorithm is designed for the grid and certain "grid-like" architectures. We experimentally show the competitiveness of our algorithm by comparing it against the approximate token swapping algorithm, which is used as a primitive in many state-of-the-art quantum transpilers. Our algorithm produces circuits of comparable depth (better on random permutations) while being an order of magnitude faster than a typical implementation of the approximate token swapping algorithm.
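The sketch below routes a permutation on a 1-D path of physical qubits using only adjacent SWAPs (odd-even transposition), the simplest instance of the token-swapping/qubit-routing problem mentioned above. The paper's locality-aware algorithm for grid-like architectures is not reproduced here; this is only a minimal reference for what "routing by SWAPs" means.

# Route a permutation on a path of qubits with adjacent SWAPs (odd-even transposition).
def route_on_path(tokens):
    """tokens[i] = physical qubit index where the token currently at position i must end up.
    Returns the list of adjacent SWAPs (i, i+1) applied, in order."""
    tokens = list(tokens)
    swaps, n = [], len(tokens)
    for round_ in range(n):
        start = round_ % 2                      # alternate even/odd phases
        for i in range(start, n - 1, 2):
            if tokens[i] > tokens[i + 1]:       # transpose out-of-order neighbours
                tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
                swaps.append((i, i + 1))
    return swaps

# usage: route the permutation [2, 0, 3, 1]; at most n swap phases are needed on a path
print(route_on_path([2, 0, 3, 1]))   # [(0, 1), (2, 3), (1, 2)]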
Graphic processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel op...
ISBN: (Digital) 9781665488020
ISBN: (Print) 9781665488020
The hash table finds numerous applications in many different domains, but its potential for non-coalesced memory accesses and execution divergence characteristics impose optimization challenges on GPUs. We propose a novel hash table design, referred to as Cuckoo Node Hashing, which aims to better exploit the massive data parallelism offered by GPUs. At the core of its design, we leverage Cuckoo Hashing, a well-known hash table design scheme, in a closed-address manner, which, to our knowledge, is the first attempt on GPUs. We also propose an architecture-aware warp-cooperative reordering algorithm that improves the memory performance, reduces the thread divergence of Cuckoo Node Hashing, and efficiently increases the likelihood of coalesced memory accesses in hash table operations. Our experiments show that Cuckoo Node Hashing outperforms and scales better than existing state-of-the-art GPU hash table designs such as DACHash and Slab Hash, with a peak performance of 5.03 billion queries/second in static searching and 434 billion insertions/second in static building.
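For reference, the sketch below is a host-side implementation of classic cuckoo hashing with two hash functions and displacement, the primitive the design above builds on. The closed-address, warp-cooperative GPU variant itself is not reproduced, and the table parameters and key derivation are arbitrary choices for illustration.

# Classic two-table cuckoo hashing with displacement (CPU reference sketch).
class CuckooHash:
    def __init__(self, capacity=1024, max_displacements=64):
        self.cap, self.max_disp = capacity, max_displacements
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, key, which):
        return hash((key, which)) % self.cap    # two independent hash functions

    def insert(self, key, value):
        # sketch assumes distinct keys; duplicate inserts are not deduplicated
        item, which = (key, value), 0
        for _ in range(self.max_disp):
            idx = self._slot(item[0], which)
            if self.tables[which][idx] is None:
                self.tables[which][idx] = item
                return True
            item, self.tables[which][idx] = self.tables[which][idx], item  # displace occupant
            which ^= 1                           # displaced item retries in the other table
        return False                             # chain too long: a real table would rebuild/grow

    def lookup(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._slot(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

table = CuckooHash()
table.insert("qubit", 7)
assert table.lookup("qubit") == 7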