In service-oriented high-performance computing (HPC) clusters, end users have various Quality of Service (QoS) requirements. Most of the existing research work focuses on quantitative QoS requirements, such as deadlin...
ISBN: (Print) 9798350387117; 9798350387124
Benefiting from the cutting-edge supercomputers that support extremely large-scale scientific simulations, climate research has advanced significantly over the past decades. However, new critical challenges have arisen regarding efficiently storing and transferring large-scale climate data among distributed repositories and databases for post hoc analysis. In this paper, we develop CliZ, an efficient online error-controlled lossy compression method with optimized data prediction and encoding methods for climate datasets across various climate models. On the one hand, we explore how to take advantage of particular properties of climate datasets (such as mask-map information, dimension permutation/fusion, and data periodicity patterns) to improve data prediction accuracy. On the other hand, CliZ features a novel multi-Huffman encoding method, which significantly improves encoding efficiency and thus compression ratios. We evaluated CliZ against many other state-of-the-art error-controlled lossy compressors (including SZ3, ZFP, SPERR, and QoZ) on multiple real-world climate datasets from different models. Experiments show that CliZ outperforms the second-best compressor (SZ3, SPERR, or QoZ1.1) on climate datasets by 20%-200% in compression ratio. CliZ can significantly reduce the data transfer cost between two remote Globus endpoints, by 32%-38%.
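The intuition behind a multi-Huffman scheme can be illustrated with a toy experiment: when different regions of a dataset have different symbol statistics, one Huffman table per region yields shorter codes than a single global table. The sketch below is a simplification under assumed inputs (two hand-picked regions, textbook Huffman coding); it is not CliZ's actual encoder.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table {symbol: bitstring} for a sequence."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreaker, partial code table).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encoded_bits(symbols, table):
    """Total encoded length in bits for a sequence under a code table."""
    return sum(len(table[s]) for s in symbols)

# Two "regions" with very different symbol statistics, as one might see
# across climate-variable slices.
region_a = [0] * 90 + [1] * 10
region_b = [7] * 90 + [8] * 10
stream = region_a + region_b

global_cost = encoded_bits(stream, huffman_code(stream))
multi_cost = (encoded_bits(region_a, huffman_code(region_a))
              + encoded_bits(region_b, huffman_code(region_b)))
# Per-region tables track local statistics, so here they beat the global table.
```

With these inputs, each per-region table needs only one bit per symbol, while the global table must spread four symbols over longer codes.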
ISBN: (Print) 9798350337662
Scheduling task graphs with communication delays is a widely studied NP-hard problem. Many heuristics have been proposed, but there is no constant-factor approximation algorithm for this classic model. In this paper, we focus on the scheduling of the important class of fork-join task graphs (describing many types of common computations) on homogeneous processors. For this sub-case, we propose a guaranteed algorithm with a (1 + m/(m-1)) approximation factor, where m is the number of processors. The algorithm is not only the first constant approximation for an important sub-domain of the classic scheduling problem, it is also a practical algorithm that can obtain shorter makespans than known heuristics. To demonstrate this, we propose adaptations of known scheduling heuristics for the specific fork-join structure. In an extensive evaluation, we implemented these algorithms and scheduled many fork-join graphs with up to thousands of tasks and various computation time distributions on up to hundreds of processors. Comparing the obtained results demonstrates the competitive nature of the proposed approximation algorithm.
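The middle layer of a fork-join graph is a set of independent tasks, so classic list scheduling gives a feel for the setting when communication delays are ignored. Below is a standard LPT (longest-processing-time-first) heuristic together with the trivial makespan lower bound; this is background illustration only, not the approximation algorithm proposed in the paper.

```python
import heapq

def lpt_schedule(times, m):
    """Schedule independent tasks (the middle layer of a fork-join graph)
    on m identical processors, assigning the longest remaining task to the
    least-loaded processor. Returns the resulting makespan."""
    loads = [0.0] * m                 # current finish time of each processor
    heapq.heapify(loads)
    for t in sorted(times, reverse=True):
        load = heapq.heappop(loads)   # least-loaded processor
        heapq.heappush(loads, load + t)
    return max(loads)

times = [7, 7, 6, 6, 5, 4, 4, 3]      # example task computation times
m = 3
makespan = lpt_schedule(times, m)
# No schedule can beat the longest task or the perfectly balanced average.
lower_bound = max(max(times), sum(times) / m)
```

For this instance LPT reaches a makespan of 15 against a lower bound of 14, showing how close simple list scheduling can get even before communication delays enter the model.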
ISBN: (Print) 9798350337662
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states of the model among multiple GPUs. Consequently, this requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Therefore, integrating GPU-based compression into MPI libraries has proven efficient for achieving faster training times. In this paper, we propose an optimized Ring algorithm for the Allgather and Reduce-Scatter collectives that encompasses an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves benefits of up to 83.6% and 30.3% compared to the baseline and existing point-to-point-based compression, respectively, in a state-of-the-art MPI library on modern GPU clusters. Reduce-Scatter achieves 88.1% and 40.6% compared to the baseline and point-to-point compression, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline, and up to 12.5% faster than the existing point-to-point-based compression, while maintaining similar accuracy.
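The structure of a compression-aware ring Allgather can be sketched in plain Python: each of n simulated ranks forwards one chunk per step to its ring neighbor, compressing on send and decompressing on receive, so after n-1 steps every rank holds every shard. Here zlib/pickle stand in for GPU-based compression; this is an assumed simulation of the collective's data movement, not the MPI library's implementation.

```python
import pickle
import zlib

def ring_allgather(shards):
    """Simulated ring Allgather over n 'ranks'. Each step, rank r forwards
    chunk (r - s) mod n to rank r + 1, compressed on the wire."""
    n = len(shards)
    result = [[None] * n for _ in range(n)]
    for r in range(n):
        result[r][r] = shards[r]          # each rank starts with its own shard
    for s in range(n - 1):
        # All ranks compress and "send" simultaneously in this step.
        wire = [zlib.compress(pickle.dumps(result[r][(r - s) % n]))
                for r in range(n)]
        for r in range(n):
            src = (r - 1) % n             # left neighbor on the ring
            idx = (src - s) % n           # which chunk the neighbor sent
            result[r][idx] = pickle.loads(zlib.decompress(wire[src]))
    return result

shards = [[i] * 4 for i in range(4)]      # one shard per simulated rank
gathered = ring_allgather(shards)         # every rank ends with all shards
```

Compressing per forwarded chunk is what makes the scheme "collective-level": the compression happens inside the ring steps rather than once per point-to-point message pair.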
ISBN: (Print) 9781665497473
Coarse-Grained Reconfigurable Architectures (CGRAs) emerged about 30 years ago. The very first CGRAs were programmed manually; fortunately, compilation approaches soon appeared to automate the mapping process. Numerous surveys on these architectures exist, and other surveys also gather the tools and methods, but none of them focuses on the mapping process alone. This paper focuses solely on automated methods and techniques for mapping applications onto CGRAs and covers the last two decades of research. It aims to provide the terminology, the problem formulation, and a classification of existing methods, and it ends with research challenges and trends for the future.
The intricate properties and relevance of graph data make it difficult to collect graph statistics privately via differential privacy (DP). Traditional centralized or local DP on graph data face challenges like third...
Emerging technologies, such as cloud computing and artificial intelligence, have raised significant concerns about data security and privacy. Homomorphic encryption (HE) is a promising invention, which enables computation...
ISBN: (Print) 9798350364613; 9798350364606
Data stores utilized in modern data-intensive applications are expected to demonstrate rapid read and write capabilities and robust fault tolerance. A Byzantine fault-tolerant database (BFT database) can execute transactions concurrently and tolerate arbitrary (Byzantine) faults. We consider cryptographic and communication processing to be performance bottlenecks in the transaction processing of BFT databases. This paper presents a transaction reconstruction method that reconstructs a single transaction from multiple transactions to streamline cryptographic and communication processes. We evaluated the proposed method against Basil (a state-of-the-art BFT database) in experiments. In an environment where nodes are geographically centralized, the proposed method demonstrates up to approximately 2.5 times higher throughput and reduces latency by up to about 30% compared to vanilla Basil. In an environment where nodes are geographically distributed, the proposed method demonstrates up to approximately 50 times higher throughput and reduces latency by up to about 75% compared to vanilla Basil.
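A back-of-the-envelope model shows why reconstructing many small transactions into one helps: per-transaction overheads (signature verification, consensus round trips) are paid once instead of k times, while the per-operation work stays the same. Every constant below is a hypothetical placeholder, not a measurement of Basil or of the proposed method.

```python
def txn_cost(n_txns, ops_per_txn, sig_cost=1.0, rtt_cost=2.0, op_cost=0.01):
    """Toy cost model: each committed transaction pays a fixed cryptographic
    cost and a fixed communication cost, plus a small per-operation cost.
    All constants are hypothetical, chosen only to illustrate the trade-off."""
    return n_txns * (sig_cost + rtt_cost + ops_per_txn * op_cost)

separate = txn_cost(100, 2)        # 100 small transactions, 2 ops each
reconstructed = txn_cost(1, 200)   # one reconstructed transaction, same 200 ops
# The fixed crypto/communication overhead is amortized across all operations.
```

Under this model the reconstructed transaction costs a fraction of the separate ones; the real system must additionally preserve transactional semantics when merging, which the paper's method addresses.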
ISBN: (Print) 9798350395679; 9798350395662
Streaming applications are expected to process an ever-increasing amount of data with high throughput and stringent latency requirements. Flooding these applications with incoming data may overload the stream processing engine, leading to a system with unstable queues and infinitely growing latencies. Existing stream processing systems deal with such overload scenarios reactively, either through back-pressure or load-shedding mechanisms. These mechanisms, however, have considerable drawbacks: they consume additional system resources, incur non-negligible performance overheads, and may compromise the quality of application-level results. To address this gap, we propose a strategy based on reinforcement learning to throttle the input rate of data sources in streaming applications. The proposed strategy mitigates overload scenarios by addressing the source of the problem, allowing resources to be better utilized by application and system components and avoiding the performance overhead of system-level reactive mechanisms. Through experiments with two different applications, we demonstrate that our approach reduces end-to-end latencies by up to 82% and increases throughput by up to 10% compared to the back-pressure mechanisms implemented in state-of-the-art stream processing engines.
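The control problem can be illustrated with a deterministic stand-in for the learning agent: score each candidate input rate over a fixed window using a throughput reward penalized by queue length (a latency proxy), then keep the best, much as a bandit-style agent would during its exploration phase. Service capacity, window size, candidate rates, and the reward weighting are all assumed values, not the paper's setup.

```python
def evaluate_rate(rate, capacity=100.0, window=50):
    """Score a candidate source input rate against a queue drained at
    `capacity` items per step. Reward per step = delivered throughput minus
    a queue-length penalty (standing in for latency); returns the average."""
    queue, total = 0.0, 0.0
    for _ in range(window):
        queue = max(0.0, queue + rate - capacity)   # queue dynamics
        total += min(rate, capacity) - 0.5 * queue  # throughput - latency proxy
    return total / window

candidate_rates = [60.0, 90.0, 120.0, 150.0]
best = max(candidate_rates, key=evaluate_rate)      # highest average reward
```

Rates above capacity build an ever-growing queue, so their reward collapses, while rates below capacity leave throughput on the table; the scoring therefore settles on the highest sustainable rate, which is exactly the trade-off the reinforcement-learning throttler navigates online.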