ISBN (print): 9798350387117; 9798350387124
Benefiting from the cutting-edge supercomputers that support extremely large-scale scientific simulations, climate research has advanced significantly over the past decades. However, new critical challenges have arisen regarding efficiently storing and transferring large-scale climate data among distributed repositories and databases for post hoc analysis. In this paper, we develop CliZ, an efficient online error-controlled lossy compression method with optimized data prediction and encoding for climate datasets across various climate models. On the one hand, we explore how to take advantage of particular properties of climate datasets (such as mask-map information, dimension permutation/fusion, and data periodicity patterns) to improve data prediction accuracy. On the other hand, CliZ features a novel multi-Huffman encoding method that significantly improves encoding efficiency and therefore compression ratios. We evaluated CliZ against many other state-of-the-art error-controlled lossy compressors (including SZ3, ZFP, SPERR, and QoZ) on multiple real-world climate datasets produced by different models. Experiments show that CliZ outperforms the second-best compressor (SZ3, SPERR, or QoZ1.1) on climate datasets by 20%-200% in compression ratio, and reduces the data transfer cost between two remote Globus endpoints by 32%-38%.
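The abstract does not spell out CliZ's prediction or encoding internals, so the following is only a minimal sketch of the general idea: a previous-value predictor with error-bounded quantization, followed by per-block ("multi") Huffman tables. The predictor, block size, and helper names are assumptions for illustration, not CliZ's actual design.

```python
import heapq
from collections import Counter

def quantize(data, err_bound):
    """Error-bounded quantization with a simple previous-value predictor
    (a stand-in for CliZ's optimized prediction stage, not its real predictor)."""
    codes, prev = [], 0.0
    for x in data:
        q = round((x - prev) / (2 * err_bound))   # quantization bin index
        codes.append(q)
        prev += q * 2 * err_bound                 # value the decompressor will reconstruct
    return codes

def huffman_code_lengths(symbols):
    """Code lengths of a Huffman code built for one symbol stream."""
    heap = [[weight, [sym, 0]] for sym, weight in Counter(symbols).items()]
    heapq.heapify(heap)
    if len(heap) == 1:                            # degenerate one-symbol stream
        return {heap[0][1][0]: 1}
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:] + hi[1:]:
            pair[1] += 1                          # symbol moves one level deeper in the tree
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: length for sym, length in heap[0][1:]}

def multi_huffman_tables(codes, block_size=4096):
    """'Multi-Huffman' idea: one table per block, so locally skewed symbol
    distributions are coded more tightly than a single global table allows."""
    return [huffman_code_lengths(codes[i:i + block_size])
            for i in range(0, len(codes), block_size)]
```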
ISBN (print): 9798400701559
Log-based anomaly detection has been extensively studied to help detect complex runtime anomalies in production systems. However, existing techniques exhibit several common issues. First, they rely heavily on expert-labeled logs to discern anomalous behavior patterns, yet manually labeling enough log data to effectively train deep neural networks is prohibitively time-consuming. Second, they rely on numeric model predictions over numeric vector inputs, which makes model decisions largely uninterpretable by humans and rules out targeted error correction. In recent years, we have witnessed groundbreaking advancements in large language models (LLMs) such as ChatGPT. These models have proven their ability to retain context and formulate insightful responses over entire conversations. They also support few-shot and in-context learning with reasoning ability. In light of these abilities, it is natural to explore their applicability to understanding log content and classifying anomalies in parallel file system logs.
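As a rough illustration of the few-shot, in-context setup described above, the sketch below builds a prompt from a handful of labeled log lines and delegates the actual call to an injected client. The example log lines and the `query_llm` callable are hypothetical and not taken from the paper.

```python
# Minimal few-shot prompt sketch for log anomaly classification.
# `query_llm` is a placeholder for whatever chat-completion client is used;
# the labeled examples below are invented for illustration.

FEW_SHOT_EXAMPLES = [
    ("ost_write completed in 3ms for obj 0x1a2f", "normal"),
    ("LustreError: 138-a: lock callback timer expired", "anomalous"),
]

def build_prompt(log_line: str) -> str:
    parts = ["Classify each parallel-file-system log line as 'normal' or "
             "'anomalous' and briefly explain why.\n"]
    for text, label in FEW_SHOT_EXAMPLES:
        parts.append(f"Log: {text}\nLabel: {label}\n")
    parts.append(f"Log: {log_line}\nLabel:")
    return "\n".join(parts)

def classify(log_line: str, query_llm) -> str:
    # query_llm(prompt) -> completion string; injected so no specific API is assumed
    return query_llm(build_prompt(log_line)).strip().lower()
```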
ISBN (print): 9798350326598; 9798350326581
Trusted execution environments (TEEs) promise strong security guarantees with hardware extensions for security-sensitive tasks. Due to their numerous benefits, TEEs have gained widespread adoption and extended from CPU-only TEEs to FPGA and GPU TEE systems. However, existing TEE systems exhibit inadequate and inefficient support for an emerging (and significant) processing unit, the NPU. For instance, commercial TEE systems resort to coarse-grained and static protection approaches for NPUs, resulting in notable performance degradation (10%-20%), limited (or no) multitasking capabilities, and suboptimal resource utilization. In this paper, we present a secure NPU architecture, sNPU, which aims to mitigate vulnerabilities inherent in the design of NPU architectures. First, sNPU proposes NPU Guarder to enhance the NPU's access control. Second, sNPU identifies new attack surfaces that leverage in-NPU structures such as the scratchpad and NoC, and designs NPU Isolator to guarantee isolation of the scratchpad and NoC routing. Third, our system introduces a trusted software module called NPU Monitor to minimize the software TCB. Our prototype, evaluated on FPGA, demonstrates that sNPU eliminates the runtime cost of security checking (from up to 20% down to 0%) while incurring less than 1% resource overhead.
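The checks described above are implemented in hardware; purely as an illustration of the access-control idea behind NPU Guarder, the toy model below tracks which task owns each scratchpad region and validates every access against that table. The region/task granularity and all names are assumptions, not sNPU's actual design.

```python
# Toy software model of a Guarder-style check: each enclave/task owns a set of
# scratchpad regions, and every access is validated against that ownership table.

from dataclasses import dataclass, field

@dataclass
class GuarderModel:
    owner_of_region: dict = field(default_factory=dict)   # region_id -> task_id

    def grant(self, task_id: int, region_id: int) -> None:
        self.owner_of_region[region_id] = task_id

    def check_access(self, task_id: int, region_id: int) -> bool:
        # In hardware this would be a lookup on the access path;
        # here it is just a dictionary check.
        return self.owner_of_region.get(region_id) == task_id

guard = GuarderModel()
guard.grant(task_id=1, region_id=0)
assert guard.check_access(1, 0)        # owner may access its region
assert not guard.check_access(2, 0)    # other tasks are rejected
```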
Lattice-based Post-Quantum Cryptography (PQC) can effectively resist the quantum threat to blockchain's underlying cryptographic algorithms. Blockchain node decryption is one of the most commonly used cryptographi...
ISBN (print): 9798400701559
Stateful serverless applications need to persist their state and data. The existing approach is to store the data in general-purpose storage systems. However, these systems are not designed to meet the demands of serverless applications in terms of consistency, fault tolerance, and performance. We present FlexLog, a storage system, specifically a distributed shared log, distinctively designed to meet the requirements of stateful serverless computing while mitigating the relevant system bottlenecks. FlexLog's data layer leverages state-of-the-art persistent memory (PM) to offer low-latency I/O and improve performance. To match this performance, FlexLog's ordering layer employs a scalable design, namely a tree-structured set of sequencer nodes. Importantly, this design gives serverless applications the flexibility to implement different consistency guarantees and to seamlessly support multi-tenancy configurations. We implement FlexLog from the ground up on a real hardware testbed, and we also prove the correctness of our protocols. In particular, we evaluate FlexLog on a cluster of 6 machines with 800 GB of Intel Optane DC PM over a 10 Gbps interconnect. Our evaluation shows that FlexLog scales to millions of operations per second while maintaining minimal latency. Our comparison with the state-of-the-art shared log for serverless, Boki, shows that we achieve 10x better throughput in the storage layer and 2x-4x lower latency in the ordering layer, while also providing the flexibility to support different consistency properties and multi-tenancy.
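One simplified way to picture the tree-structured ordering layer is the toy model below, where leaf sequencers batch appends from their shards and a root assigns contiguous global sequence numbers. The batching, naming, and merge policy are assumptions for illustration and not FlexLog's actual protocol.

```python
# Toy sketch of a two-level sequencer tree: leaves collect appends,
# the root assigns the global order in rounds.

class LeafSequencer:
    def __init__(self):
        self.pending = []

    def append(self, record: bytes) -> None:
        self.pending.append(record)

    def drain(self):
        batch, self.pending = self.pending, []
        return batch

class RootSequencer:
    def __init__(self, leaves):
        self.leaves = leaves
        self.next_seq = 0

    def order_round(self):
        """Collect one batch per leaf and assign contiguous global sequence numbers."""
        ordered = []
        for leaf in self.leaves:
            for record in leaf.drain():
                ordered.append((self.next_seq, record))
                self.next_seq += 1
        return ordered

leaves = [LeafSequencer() for _ in range(3)]
root = RootSequencer(leaves)
leaves[0].append(b"put k1=v1")
leaves[2].append(b"put k2=v2")
print(root.order_round())   # [(0, b'put k1=v1'), (1, b'put k2=v2')]
```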
ISBN (print): 9781450386104
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSAs) are used to train increasingly complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominant factor that constrains overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgently needed, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples that are reused across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways from our production infrastructure characterization: identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and distilling lessons learned in optimizing DSI infrastructure.
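To make the "read and heavily filter" behavior concrete, here is a generic sketch of an online preprocessing step that keeps only the features a training job consumes and applies a cheap per-feature transform. The feature names and transform are invented for illustration and do not reflect Meta's DPP service.

```python
# Generic online preprocessing: stream samples, drop unused features,
# transform what remains, and skip samples that are fully filtered out.

def preprocess(sample_stream, wanted_features, transform):
    for sample in sample_stream:                    # dicts read from the data warehouse
        filtered = {k: v for k, v in sample.items() if k in wanted_features}
        if filtered:                                # heavy filtering can drop whole samples
            yield {k: transform(v) for k, v in filtered.items()}

samples = [{"clicks": 3, "dwell_ms": 1200, "unused_feat": 7},
           {"unused_feat": 1}]
for out in preprocess(samples, wanted_features={"clicks", "dwell_ms"},
                      transform=float):
    print(out)    # {'clicks': 3.0, 'dwell_ms': 1200.0}
```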
ISBN (print): 9781450383868
This article describes the keynote speech on INODE presented at the Fourth International Workshop on Systems and Network Telemetry and Analytics (SNTA), which is co-located with the International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC) on June 21 in Stockholm, Sweden.
Sustainable stream processing algorithms have gained popularity in recent years. Flow control is a way of searching and modifying real-time data streams. Missing values are ubiquitous in real-world data streams, makin...
ISBN (digital): 9798331509712
ISBN (print): 9798331509729
Dynamic graphs are widely used in real-world applications and exhibit structural skew, which leads to skewed updates. Existing systems that support transactional updates often fail to fully account for this phenomenon and struggle to balance update and query performance. Inspired by the concepts of hierarchical thinking and uneven rebalancing on the PMA (Packed Memory Array), we propose DAUSK, a high-performance in-memory structure for dynamic graph storage that supports efficient graph analytics and rapid transactional updates. DAUSK addresses the skewed characteristics of graphs by employing two types of data structures: an unrolled skip list for frequently updated high-degree vertices, allowing asymptotically faster searches and updates, and a compact array-like structure called UPMA for medium- and low-degree vertices, enabling efficient sequential scans. At the same time, DAUSK uses a vertex-centric strategy to partition UPMA and applies uneven gap allocation based on vertex degrees. Furthermore, considering the characteristics of transactions on graphs, DAUSK uses the well-established 2PL concurrency protocol to support millions of transactional updates per second. Experimental results demonstrate that DAUSK is up to 10.17×, 3.29×, and 1.52× faster at ingesting graph updates than three state-of-the-art transactional graph systems, i.e., LiveGraph, Teseo, and Sortledton. For graph analytics, DAUSK achieves comparable, if not superior, performance to these three systems.
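The degree-based split is the core structural idea, so the sketch below dispatches each vertex's adjacency between a scan-friendly sorted array (standing in for a UPMA segment) and a search-friendly structure for high-degree vertices (a plain set standing in for the unrolled skip list). The threshold and both stand-in structures are simplifications, not DAUSK's implementation.

```python
import bisect

HIGH_DEGREE_THRESHOLD = 64          # assumed cutoff, purely for illustration

class VertexAdjacency:
    """Neighbors of one vertex, stored differently depending on its degree."""
    def __init__(self):
        self.low = []               # sorted array: stand-in for a gapped UPMA segment
        self.high = None            # stand-in for the unrolled skip list

    def add(self, v: int) -> None:
        if self.high is not None:
            self.high.add(v)
            return
        i = bisect.bisect_left(self.low, v)
        if i == len(self.low) or self.low[i] != v:
            self.low.insert(i, v)
        if len(self.low) > HIGH_DEGREE_THRESHOLD:
            # vertex became high-degree: migrate to the search-friendly structure
            self.high, self.low = set(self.low), []

    def contains(self, v: int) -> bool:
        if self.high is not None:
            return v in self.high
        i = bisect.bisect_left(self.low, v)
        return i < len(self.low) and self.low[i] == v

    def scan(self):
        # cheap sequential scan is what the array form is optimized for
        return iter(self.low) if self.high is None else iter(sorted(self.high))
```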