ISBN (Print): 9783031453281
The proceedings contain 38 papers. The special focus in this conference is on Automated Technology for Verification and Analysis. The topics include: Model Checking Strategies from Synthesis over Finite Traces; Reactive Synthesis of Smart Contract Control Flows; Synthesis of Distributed Protocols by Enumeration Modulo Isomorphisms; Controller Synthesis for Reactive Systems with Communication Delay by Formula Translation; Statistical Approach to Efficient and Deterministic Schedule Synthesis for Cyber-Physical Systems; Compositional High-Quality Synthesis; Learning Provably Stabilizing Neural Controllers for Discrete-Time Stochastic Systems; An Automata-Theoretic Approach to Synthesizing Binarized Neural Networks; Syntactic vs Semantic Linear Abstraction and Refinement of Neural Networks; Learning Nonlinear Hybrid Automata from Input–Output Time-Series Data; Using Counterexamples to Improve Robustness Verification in Neural Networks; A Novel Family of Finite Automata for Recognizing and Learning ω-Regular Languages; On the Containment Problem for Deterministic Multicounter Machine Models; Parallel and Incremental Verification of Hybrid Automata with Ray and Verse; An Automata-Theoretic Characterization of Weighted First-Order Logic; Graph-Based Reductions for Parametric and Weighted MDPs; Scenario Approach for Parametric Markov Models; Fast Verified SCCs for Probabilistic Model Checking.
ISBN (Print): 9798350376388
The exponential growth of training datasets and large language model (LLM) sizes significantly outpaces the incremental increase in GPU memory capacity. Thousands of GPUs are needed to handle state-of-the-art models, which requires building an expensive AI GPU cluster that is out of reach for most researchers. This not only makes training more costly but also increases its environmental impact. To improve the efficiency and scalability of existing infrastructure for increasingly demanding training tasks, Microsoft released DeepSpeed, an open-source optimization library for PyTorch that can be integrated into an existing training flow with minimal code changes. This paper presents a comprehensive third-party evaluation of DeepSpeed for training a GPT-2-like LLM on mainstream GPU clusters that are more accessible to everyone. The evaluation includes memory usage analysis and bandwidth characterization, in addition to the achieved model size and the attained compute throughput, to help compare horizontal and vertical scaling. First, we examine DeepSpeed ZeRO in single- and dual-node training against two popular distributed training libraries: PyTorch Distributed Data-Parallel (DDP) with data parallelism, and Megatron-LM with data and model parallelism. While DDP achieves higher throughput due to less communication, the model size is limited to a single GPU's memory capacity. In single-node training, Megatron-LM can fit a 4x larger model than DDP, while ZeRO can handle a model 0.8x-1.2x the size of Megatron-LM's. Both Megatron-LM and ZeRO are reasonably competitive in terms of throughput. However, in dual-node training, Megatron-LM sees a significant drop in throughput due to excessive inter-node communication, achieving only 25%-30% of the throughput offered by ZeRO. Secondly, we evaluate ZeRO-Offload to consolidate multi-node training into a single node. With CPU offloading, ZeRO-Offloa...
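As a rough illustration of the "minimal code changes" integration the abstract refers to, the sketch below wraps a toy PyTorch model with DeepSpeed; the stand-in model and the ds_config.json file (which would select a ZeRO stage and batch sizes) are assumptions, not the paper's actual setup.

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # toy stand-in for a GPT-2-like model

# deepspeed.initialize returns an engine that manages ZeRO partitioning,
# gradient reduction, and the optimizer step behind a familiar API.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # hypothetical config selecting a ZeRO stage
)

for _ in range(10):
    x = torch.randn(8, 1024, device=engine.device)
    loss = engine(x).pow(2).mean()  # toy loss for illustration
    engine.backward(loss)           # replaces loss.backward()
    engine.step()                   # replaces optimizer.step()
```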
ISBN (Digital): 9781665462723
ISBN (Print): 9781665462723
The performance and energy costs of coordinating and performing data movement have led to proposals adding compute units and/or specialized access units to the memory hierarchy. However, current on-chip offload models are restricted to fixed compute and access pattern types, which limits software-driven optimizations and the applicability of such an offload interface to heterogeneous accelerator resources. This paper presents a computation offload interface for multi-core systems augmented with distributed on-chip accelerators. With energy efficiency as the primary goal, we define mechanisms to identify offload partitioning, create a low-overhead execution model to sequence these fine-grained operations, and evaluate a set of workloads to identify the complexity needed to achieve distributed near-data execution. We demonstrate that our model and interface, combining features of dataflow in parallel with near-data processing engines, can be profitably applied to memory hierarchies augmented with either specialized compute substrates or lightweight near-memory cores. We differentiate the benefits stemming from each of: elevating data access semantics, near-data computation, inter-accelerator coordination, and compute/access logic specialization. Experimental results indicate a geometric mean (energy efficiency improvement; speedup; data movement reduction) of (3.3; 1.59; 2.4)x, (2.46; 1.43; 3.5)x, and (1.46; 1.65; 1.48)x compared to an out-of-order processor, a monolithic accelerator with centralized accesses, and a monolithic accelerator with decentralized accesses, respectively. Evaluating both lightweight-core and CGRA fabric implementations highlights model flexibility and quantifies the benefits of compute specialization for energy efficiency and speedup at 1.23x and 1.43x, respectively.
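As a purely conceptual illustration of separating access patterns from compute in an offload descriptor (the paper's interface is a hardware-level mechanism; every name below is hypothetical), the access descriptor names the data to touch and the compute descriptor names the operation to run near it:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AccessOp:
    base: int    # starting index in "memory"
    stride: int  # distance between consecutive elements
    count: int   # number of elements to touch

@dataclass
class ComputeOp:
    fn: Callable[[int], int]  # operation bound to a near-data unit

def offload(memory: List[int], access: AccessOp, compute: ComputeOp) -> List[int]:
    """Sequence access + compute near the data instead of streaming it to the core."""
    last = access.base + access.stride * access.count
    return [compute.fn(memory[i]) for i in range(access.base, last, access.stride)]

# Example: scale every fourth element without moving the array to the host core.
result = offload(list(range(64)), AccessOp(0, 4, 16), ComputeOp(lambda v: v * 2))
```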
ISBN (Print): 9781450391993
Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important for maintaining mission-critical performance in a large-scale, multi-user, parallel storage system. Center-wide I/O traces can provide both high-level information and fine-grained activities per application or per user running in the system. Studying such large-scale traces can provide helpful insights into the system: they can be used to develop methods for making predictive decisions, adjusting scheduling policies, or informing the design of next-generation systems. However, sharing real-world I/O traces to expedite such research efforts raises two concerns: i) sharing the traces is expensive due to their large size, and ii) privacy. We address these issues by building an end-to-end machine learning (ML) workflow that can generate I/O traces for large-scale HPC applications. We leverage ML-based feature selection and generative models for I/O trace generation. The generative models are trained on I/O traces collected by the Darshan I/O characterization tool over a period of one year. We present a two-step generation process consisting of two deep-learning models, called the feature generator and the trace generator. The combination of the two-step generative models provides robustness by reducing the bias of the model and accounting for the stochastic nature of the I/O traces across different runs of an application. We evaluate the performance of the generative models and show that the two-step model can generate time-series I/O traces with less than 20% root mean square error.
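A minimal sketch of the two-step idea, under assumed shapes and architectures (the paper's actual feature and trace generators are not specified here): a feature generator samples application-level features from noise, and a trace generator conditions on them to emit a time series.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Step 1: sample application-level I/O features from noise."""
    def __init__(self, noise_dim=16, feat_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))
    def forward(self, z):
        return self.net(z)

class TraceGenerator(nn.Module):
    """Step 2: expand the features into a time-series I/O trace."""
    def __init__(self, feat_dim=8, steps=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, steps))
    def forward(self, feats):
        return self.net(feats)

feat_gen, trace_gen = FeatureGenerator(), TraceGenerator()
z = torch.randn(4, 16)           # noise for 4 synthetic application runs
traces = trace_gen(feat_gen(z))  # (4, 128): one generated trace per run
```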
ISBN (Print): 9781450392495
Remote Direct Memory Access (RDMA) hardware has bridged the gap between network and main-memory speed, and thus invalidated the common assumption that the network is the bottleneck in distributed data processing systems. However, high-speed networks do not provide "plug-and-play" performance (e.g., using IP-over-InfiniBand) and require a careful co-design of system and application logic. As a result, system designers need to rethink the architecture of their data management systems to benefit from RDMA acceleration. In this paper, we focus on the acceleration of stream processing engines (SPEs), which is challenged by real-time constraints and state consistency guarantees. To this end, we propose Slash, a novel stream processing engine that uses high-speed networks and RDMA to efficiently execute distributed streaming computations. Slash embraces a processing model suited for RDMA acceleration and scales out by omitting the expensive data re-partitioning that scale-out SPEs demand. While scale-out SPEs rely on data re-partitioning to execute a query over many nodes, Slash uses RDMA to share mutable state among nodes. Overall, Slash achieves a throughput improvement of up to two orders of magnitude over existing systems deployed on an InfiniBand network. Furthermore, it is up to a factor of 22 faster than a self-developed solution that relies on RDMA-based data re-partitioning to scale out query processing.
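A toy contrast (purely illustrative, not Slash's implementation) between the two scale-out strategies the abstract compares: with shared mutable state, a record can be processed on any node without first being shuffled to the key's owner.

```python
state: dict = {}  # stands in for node-local vs RDMA-reachable shared state

def process_with_repartition(records, num_nodes):
    # Classic scale-out SPE: each key is shuffled to its owning node first,
    # then updated locally; the hash-routing models the network shuffle.
    for key, val in records:
        owner = hash(key) % num_nodes
        state[(owner, key)] = state.get((owner, key), 0) + val

def process_with_shared_state(records):
    # Slash-style: any node updates shared mutable state in place
    # (over RDMA in the real system), skipping the shuffle entirely.
    for key, val in records:
        state[key] = state.get(key, 0) + val
```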
The proteomics data analysis pipeline based on the shotgun method requires efficient data processing methods. The parallel algorithm of mass spectrometry database search faces the problems of rapidly expanding databas...
ISBN (Digital): 9781665451574
ISBN (Print): 9781665451574
This paper presents a dense linear algebra library for distributed memory systems called OMPC PLASMA. It leverages the OpenMP Cluster (OMPC) programming model to enable the execution of the PLASMA library using task parallelism on a distributed cluster architecture. The OpenMP Cluster model is used to define task regions that are then distributed across the cluster nodes by the OMPC runtime, which automatically manages task scheduling, communication between nodes, and fault tolerance. The OMPC PLASMA library modifies various PLASMA functions to distribute the matrix across the nodes and perform the computation using each node's threads. Experimental results show that OMPC PLASMA achieves speedups of 4.00x with 4 worker nodes, 7.00x with 8 worker nodes, and 12.00x with 16 worker nodes over the original single-node implementation. A 3.00x speedup is achieved when comparing OMPC PLASMA to ScaLAPACK with 4 worker nodes and a 90k x 90k matrix.
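A quick back-of-the-envelope check (not from the paper) of the parallel efficiency these reported speedups imply:

```python
# Reported speedups over the single-node PLASMA run, and the parallel
# efficiency (speedup / nodes) they imply.
for nodes, speedup in [(4, 4.00), (8, 7.00), (16, 12.00)]:
    print(f"{nodes:2d} worker nodes: {speedup:5.2f}x -> {speedup / nodes:.0%} efficiency")
# 4 nodes: 100%, 8 nodes: 88%, 16 nodes: 75% -- efficiency degrades
# gradually as communication and scheduling overheads grow.
```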
ISBN (Digital): 9798350364606
ISBN (Print): 9798350364613
This experimental work examines data movement in molecular dynamics (MD) workflows, comparing the Dynamic and Asynchronous Data Streamliner (DYAD) middleware with traditional, industry-standard I/O systems such as XFS and Lustre. DYAD moves MD simulation frames to analytics processes, providing enhanced flexibility and efficiency for dynamic data transfers and in situ analytics. At the same time, traditional I/O storage systems provide durability and scalability for high-performance computing (HPC) systems. The study integrates MD workflows with common simulation codes, facilitating immediate capture and transfer of MD frames to a staging area. It explores various molecular models, from simple to complex, assessing data management performance and scalability. Different producer-consumer pairs, molecular models, and data transaction frequencies enable testing across small- to large-scale HPC scenarios, from single-node configurations to large, distributed environments. The findings reveal that adaptive mechanisms for minimizing synchronization, direct network communication between producer and consumer processes, and optimizations of both data movement and synchronization are crucial for performance and scalability in MD workflows.
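As a rough sketch of the producer-consumer staging pattern described above (not DYAD's actual API; all names here are hypothetical), a bounded in-memory queue can stand in for the staging area between the simulation and the analytics process:

```python
import queue
import threading

staging = queue.Queue(maxsize=64)  # bounded staging area between the two sides

def producer(n_frames):
    """Simulation side: push each new MD frame as soon as it is produced."""
    for i in range(n_frames):
        staging.put({"step": i, "coords": [0.0, 0.0, 0.0]})  # stand-in frame
    staging.put(None)  # end-of-stream marker

def consumer():
    """Analytics side: consume frames in situ, never touching the file system."""
    while (frame := staging.get()) is not None:
        pass  # per-frame analysis would run here

t = threading.Thread(target=producer, args=(100,))
t.start()
consumer()
t.join()
```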
ISBN (Digital): 9798350349658
ISBN (Print): 9798350349665
The InterPlanetary File System (IPFS) is on its way to becoming the backbone of the next generation of the web. However, it suffers from several performance bottlenecks, particularly on the content retrieval path, which are often difficult to debug. This is because content retrieval involves multiple peers on the decentralized network, and the issue could lie anywhere in the network. Traditional debugging tools are insufficient to help web developers who face the challenge of slow-loading websites and a detrimental user experience. This limits the adoption and future scalability of IPFS. In this paper, we aim to gain valuable insights into how content retrieval requests propagate within the IPFS network and to identify potential performance bottlenecks that could lead to opportunities for improvement. We propose a custom tracing framework that generates and manages traces for crucial events that take place on each peer during content retrieval. The framework leverages event semantics to build a timeline of each protocol involved in the retrieval, helping developers pinpoint problems. Additionally, it is resilient to malicious behaviors of the peers in the decentralized environment. We have implemented this framework on top of an existing IPFS implementation written in Java called Nabu. Our evaluation shows that the framework can identify network delays and issues with each peer involved in content retrieval requests at very low overhead.
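To illustrate the general shape of event-based retrieval tracing (the event names and fields below are assumptions, not Nabu's actual schema), a tracer can record timestamped events per peer and protocol and sort them into per-protocol timelines:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    peer: str
    protocol: str  # e.g. "dht" or "bitswap" in IPFS terms
    name: str      # e.g. "lookup_started", "block_received"
    ts: float = field(default_factory=time.monotonic)

class Tracer:
    def __init__(self):
        self.events = []

    def record(self, peer, protocol, name):
        self.events.append(TraceEvent(peer, protocol, name))

    def timeline(self, protocol):
        """Time-ordered view of one protocol, for pinpointing slow hops."""
        return sorted((e for e in self.events if e.protocol == protocol),
                      key=lambda e: e.ts)

tracer = Tracer()
tracer.record("peerA", "dht", "lookup_started")
tracer.record("peerA", "bitswap", "want_sent")
tracer.record("peerB", "bitswap", "block_received")
print([e.name for e in tracer.timeline("bitswap")])
```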
ISBN (Digital): 9798350369199
ISBN (Print): 9798350369205
With the emergence of social networks, online platforms dedicated to different use cases, and sensor networks, large-scale graph community detection has become a steady field of research with real-world applications. Community detection algorithms have numerous practical applications, particularly due to their scalability with data size. Nonetheless, a notable drawback of community detection algorithms is their computational intensity [2], resulting in decreasing performance as data size increases. For this reason, new frameworks must be developed that employ distributed systems such as Apache Hadoop and Apache Spark, which can seamlessly handle large-scale graphs. In this paper, we propose a novel framework for community detection algorithms, namely K-Cliques, Louvain, and Fast Greedy, developed using Apache Spark GraphFrames. We test their performance and scalability on two real-world datasets. The experimental results prove the feasibility of developing graph mining algorithms using Apache Spark GraphFrames.
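As a minimal sketch of the framework style involved (the paper's K-Cliques, Louvain, and Fast Greedy are custom implementations; here GraphFrames' built-in Label Propagation stands in, and the package coordinates are an assumption):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = (SparkSession.builder
         .appName("community-detection")
         # GraphFrames ships as an external Spark package; the exact
         # coordinates depend on the Spark version and are assumed here.
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.3-spark3.5-s_2.12")
         .getOrCreate())

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)
communities = g.labelPropagation(maxIter=5)  # assigns a community "label" per vertex
communities.show()
```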