While considerable research has been directed at automatic parallelization for shared-memory platforms, little progress has been made in automatic parallelization schemes for distributed-memory systems. We introduce a...
ISBN:
(Print) 9798400706981
Confidential computing on GPUs, like the NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, reducing throughput by up to 52.8% when serving OPT-30B and up to 88.2% when serving OPT-66B. To address this challenge, we introduce PipeLLM, a user-transparent runtime system. PipeLLM removes the overhead by overlapping encryption and GPU computation through pipelining, an idea inspired by CPU instruction pipelining, thereby effectively concealing the latency increase caused by encryption. The primary technical challenge is that, unlike CPUs, the encryption module lacks prior knowledge of the specific data needing encryption until it is requested by the GPUs. To this end, we propose speculative pipelined encryption, which predicts the data requiring encryption by analyzing the serving patterns of LLMs. Further, we develop an efficient, low-cost pipeline relinquishing approach for instances of incorrect predictions. Our experiments show that, compared with vanilla systems without confidential computing (e.g., vLLM, PEFT, and FlexGen), PipeLLM incurs modest overhead (<19.6% in throughput) across various LLM sizes, from 13B to 175B. PipeLLM's source code is available at https://***/SJTU-IPADS/PipeLLM.
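The overlap idea described in this abstract can be sketched as a two-stage producer-consumer pipeline: while the consumer computes on chunk i, the producer is already encrypting chunk i + 1. This is a minimal illustration, not PipeLLM's implementation; speculative prediction and pipeline relinquishing are omitted, and the stage functions and names are hypothetical.

```python
import threading
from queue import Queue

def pipelined_serve(chunks, encrypt, compute):
    """Overlap two pipeline stages: `encrypt` runs in a background
    thread and feeds a small hand-off buffer, so `compute` on chunk i
    proceeds while chunk i + 1 is still being encrypted."""
    handoff = Queue(maxsize=1)  # one-slot buffer between the stages

    def producer():
        for chunk in chunks:
            handoff.put(encrypt(chunk))
        handoff.put(None)  # sentinel: stream finished

    t = threading.Thread(target=producer)
    t.start()
    results = []
    while (item := handoff.get()) is not None:
        results.append(compute(item))
    t.join()
    return results
```

With stage latencies that are comparable, the end-to-end time approaches the slower stage rather than the sum of both, which is the effect the abstract attributes to pipelining.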
Deep neural networks (DNNs) require distributed training strategies to deal with large data sizes. TensorFlow is one of the most widely used frameworks that support distributed training. Among the TensorFlow training ...
ISBN:
(Print) 9783031396977; 9783031396984
We study the performance and scalability of the adaptive geometric multigrid (GMG) method with the recently developed restricted additive Vanka (RAV) smoother for the finite element solution of large-scale Stokes problems on distributed-memory clusters. A comparison of the RAV smoother and the classical multiplicative and additive Vanka smoothers is presented. We present three cache policies for the smoother operators that balance cached and on-the-fly computation, and discuss their memory footprint and computational cost. It is shown that the restricted additive smoother with the most efficient cache policy has the smallest memory footprint and is computationally cheaper than the other smoothers, and can therefore be used for large-scale problems even when the available main memory is constrained. We discuss the parallelization aspects of the smoother operators and show that the RAV operator can be replicated exactly in parallel with a very small communication overhead. We present strong and weak scaling of the GMG solver for 2D and 3D examples with up to roughly 540 million degrees of freedom on up to 2048 MPI processes. The GMG solver with the restricted additive smoother achieves rapid convergence rates and scales well in both the strong and weak scaling studies, making it an attractive choice for the solution of large-scale Stokes problems on HPC systems.
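The restricted additive idea can be sketched in a toy 1D setting. This is an illustrative analogue, not the paper's Stokes smoother: here each degree of freedom owns a small overlapping patch, the local patch system is solved exactly, and only the correction at the patch centre is kept, so overlapping corrections are never double-counted. The patch layout and all names are assumptions.

```python
import numpy as np

def rav_smoother_step(A, b, x):
    """One restricted-additive sweep for a 1D problem.

    Each dof i owns the patch {i-1, i, i+1} (clipped at the
    boundary). The local patch problem is solved exactly, but only
    the correction at the patch centre i is retained (the
    'restricted' part), so corrections from overlapping patches are
    not summed and no damping parameter is needed.
    """
    n = len(b)
    r = b - A @ x                       # current residual
    dx = np.zeros(n)
    for i in range(n):
        idx = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
        Ap = A[np.ix_(idx, idx)]        # local patch matrix
        local = np.linalg.solve(Ap, r[idx])
        dx[i] = local[idx.index(i)]     # keep the centre value only
    return x + dx
```

In a real GMG setting this sweep would be applied per level as the smoother; the cache policies discussed in the abstract would decide whether the local factorizations of `Ap` are stored or recomputed on the fly.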
This research presents a quantitative comparison of the computational performance of several parallel programming approaches. Comparison is performed for the Computational Dynamics ...
ISBN:
(Print) 9798350329223
Modern enterprises are facing a massive threat from Advanced Persistent Threats (APTs), which have risen to become one of the most dangerous challenges in recent years. Since system logs capture the complex causal dependencies between system entities, they have become the primary data source for countering APTs. However, as modern computer systems grow more complex, system logs can pile up in large quantities. Moreover, APTs are sophisticated, persistent cyber attacks that can remain hidden in the target for a long time while constantly stealing private data, so system logs need to be collected and stored for a long duration to enable a complete analysis of APTs. Such a vast amount of log data is challenging for enterprises to store and manage. There are two mainstream solutions for reducing storage overhead. Data compression methods provide an intuitive idea; however, they are designed for general text and lack optimization for system logs. The other solution is log reduction, which removes redundant system events recorded in system logs according to predefined rules. Unfortunately, such rules are tailored to specific kinds of redundant information, resulting in limited applicability. Since these two solutions reduce storage overhead from two distinct perspectives, they are complementary: data compression shrinks log data in its binary form, while log reduction starts from the semantic information of system logs and removes redundant information. Combining both methods maximizes storage efficiency. In this paper, we propose a distributed storage system based on a hybrid compression scheme. To address the above deficiencies, we first identify and merge redundant system events by analyzing and tracing the information flow rather than relying on rules. Then, we apply log parsing to preprocess log entries for further storage efficiency. In addition, we design a distributed architecture to optimize compression and eliminate repeated
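The complementarity of log reduction and data compression can be sketched as follows. This is a simplified stand-in: the paper merges redundant events by information-flow analysis, whereas this sketch merges only consecutive exact repeats of the same dependency, and all field names are hypothetical.

```python
import json
import zlib

def reduce_then_compress(events):
    """Two complementary steps: (1) log reduction - drop consecutive
    events that repeat the same (subject, op, object) dependency,
    acting on the logs' semantics; (2) data compression - shrink the
    surviving entries in their binary form with zlib."""
    reduced, prev = [], None
    for event in events:
        key = (event["subject"], event["op"], event["object"])
        if key != prev:
            reduced.append(event)
        prev = key
    return zlib.compress(json.dumps(reduced).encode())

def restore(blob):
    """Decompress back to the reduced event list."""
    return json.loads(zlib.decompress(blob))
```

Because reduction and compression attack redundancy at different layers (semantic versus binary), applying both yields a smaller archive than either step alone on repetitive audit logs.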
ISBN:
(Print) 9783031304415
The proceedings contain 77 papers. The special focus in this conference is on Parallel Processing and Applied Mathematics. The topics include: Neural Nets with a Newton Conjugate Gradient Method on Multiple GPUs; Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications; Cost and Performance Analysis of MPI-Based SaaS on the Private Cloud Infrastructure; Building a Fine-Grained Analytical Performance Model for Complex Scientific Simulations; Evaluation of Machine Learning Techniques for Predicting Run Times of Scientific Workflow Jobs; Smart Clustering of HPC Applications Using Similar Job Detection Methods; Distributed Work Stealing in a Task-Based Dataflow Runtime; Task Scheduler for Heterogeneous Data Centres Based on Deep Reinforcement Learning; Shisha: Online Scheduling of CNN Pipelines on Heterogeneous Architectures; General Framework for Deriving Reproducible Krylov Subspace Algorithms: BiCGStab Case; Proactive Task Offloading for Load Balancing in Iterative Applications; Language Agnostic Approach for Unification of Implementation Variants for Different Computing Devices; High Performance Dataframes from Parallel Processing Patterns; Global Access to Legacy Data-Sets in Multi-cloud Applications with Onedata; MD-Bench: A Generic Proxy-App Toolbox for State-of-the-Art Molecular Dynamics Algorithms; Breaking Down the Parallel Performance of GROMACS, a High-Performance Molecular Dynamics Software; GPU-Based Molecular Dynamics of Turbulent Liquid Flows with OpenMM; A Novel Parallel Approach for Modeling the Dynamics of Aerodynamically Interacting Particles in Turbulent Flows; Reliable Energy Measurement on Heterogeneous Systems-on-Chip Based Environments; Distributed Objective Function Evaluation for Optimization of Radiation Therapy Treatment Plans; A Generalized Parallel Prefix Sums Algorithm for Arbitrary Size Arrays; GPU4SNN: GPU-Based Acceleration for Spiking Neural Network Simulations; Ant System Inspired Heuristic Optimization of UAVs Depl
ISBN:
(Print) 9783031304446
The proceedings contain 77 papers. The special focus in this conference is on Parallel Processing and Applied Mathematics. The topics include: Neural Nets with a Newton Conjugate Gradient Method on Multiple GPUs; Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications; Cost and Performance Analysis of MPI-Based SaaS on the Private Cloud Infrastructure; Building a Fine-Grained Analytical Performance Model for Complex Scientific Simulations; Evaluation of Machine Learning Techniques for Predicting Run Times of Scientific Workflow Jobs; Smart Clustering of HPC Applications Using Similar Job Detection Methods; Distributed Work Stealing in a Task-Based Dataflow Runtime; Task Scheduler for Heterogeneous Data Centres Based on Deep Reinforcement Learning; Shisha: Online Scheduling of CNN Pipelines on Heterogeneous Architectures; General Framework for Deriving Reproducible Krylov Subspace Algorithms: BiCGStab Case; Proactive Task Offloading for Load Balancing in Iterative Applications; Language Agnostic Approach for Unification of Implementation Variants for Different Computing Devices; High Performance Dataframes from Parallel Processing Patterns; Global Access to Legacy Data-Sets in Multi-cloud Applications with Onedata; MD-Bench: A Generic Proxy-App Toolbox for State-of-the-Art Molecular Dynamics Algorithms; Breaking Down the Parallel Performance of GROMACS, a High-Performance Molecular Dynamics Software; GPU-Based Molecular Dynamics of Turbulent Liquid Flows with OpenMM; A Novel Parallel Approach for Modeling the Dynamics of Aerodynamically Interacting Particles in Turbulent Flows; Reliable Energy Measurement on Heterogeneous Systems-on-Chip Based Environments; Distributed Objective Function Evaluation for Optimization of Radiation Therapy Treatment Plans; A Generalized Parallel Prefix Sums Algorithm for Arbitrary Size Arrays; GPU4SNN: GPU-Based Acceleration for Spiking Neural Network Simulations; Ant System Inspired Heuristic Optimization of UAVs Depl
ISBN:
(Digital) 9781665479271
ISBN:
(Print) 9781665479271
In this paper, we propose a new self-reconfiguration scheme for modular robots based on a meta-module design that allows the modules to form a 3D porous structure. The porous structure enables a parallel flow of modules inside it without blocking. The meta-module can also be used to fill its internal volume with an additional number of modules, allowing the structure to be compressible and expandable. Hence, it has the potential to improve the self-reconfiguration process. We first present the meta-module model and the porous structure built using it. Then, we describe an algorithm to self-reconfigure the structure from an initial shape to a given goal shape. We evaluated the algorithm in simulation on structures composed of up to 2,700 modules. We studied the performance in terms of parallelism, showed that the number of communications is proportional to the number of motions, and found that the execution time varies linearly with the diameter of the configuration.
Considering the high penetration of distributed generators, the distribution system has been extensively studied in recent years. However, traditional reliability evaluation algorithms require huge computational reso...