ISBN (print): 9798350311990
Applications that fuse machine learning and simulation can benefit from the use of multiple computing resources, with, for example, simulation codes running on highly parallel supercomputers and AI training and inference tasks on specialized accelerators. Here, we present our experiences deploying two AI-guided simulation workflows across such heterogeneous systems. A unique aspect of our approach is our use of cloud-hosted management services to manage challenging aspects of cross-resource authentication and authorization, function-as-a-service (FaaS) function invocation, and data transfer. We show that these methods can achieve performance parity with systems that rely on direct connection between resources. We achieve parity by integrating the FaaS system and data transfer capabilities with a system that passes data by reference among managers and workers, and by using a user-configurable steering algorithm to hide data transfer latencies. We anticipate that this ease of use can enable routine use of heterogeneous resources in computational science.
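The pass-by-reference idea can be illustrated with a minimal sketch: managers and workers exchange a lightweight handle (endpoint plus path) instead of the payload, and a worker resolves the handle only when it needs the bytes, so transfers can overlap with other queued work. The names below (DataRef, resolve, train_step) are hypothetical illustrations, not the paper's actual API, and the local file copy stands in for a real cross-site transfer.

```python
# Hypothetical sketch of passing data by reference between a manager and
# FaaS-style workers; all names are illustrative, not the paper's API.
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor
import pathlib, shutil, tempfile

@dataclass(frozen=True)
class DataRef:
    """Lightweight handle passed through task queues instead of the payload itself."""
    endpoint: str   # identifier of the storage system holding the data
    path: str       # location of the payload on that endpoint

def resolve(ref: DataRef, staging_dir: pathlib.Path) -> pathlib.Path:
    """Fetch the bytes behind a reference only when a worker actually needs them."""
    dst = staging_dir / pathlib.Path(ref.path).name
    shutil.copy(ref.path, dst)              # stand-in for a real cross-site transfer
    return dst

def train_step(ref: DataRef, staging_dir: pathlib.Path) -> str:
    local = resolve(ref, staging_dir)       # transfer overlaps with other queued work
    return f"trained on {local.name}"

if __name__ == "__main__":
    src_dir = pathlib.Path(tempfile.mkdtemp())
    staging = pathlib.Path(tempfile.mkdtemp())
    refs = []
    for i in range(4):
        p = src_dir / f"sim_{i}.npz"
        p.write_bytes(b"fake simulation output")
        refs.append(DataRef("cluster-a", str(p)))
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(lambda r: train_step(r, staging), refs)))
```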
ISBN (print): 9798350337662
Multi-tenancy in public clouds may lead to colocation interference on shared resources, which can result in performance degradation of cloud applications. Cloud providers want to know when such events happen and how serious the degradation is, so that they can perform interference-aware migrations and alleviate the problem. However, virtual machines (VMs) in Infrastructure-as-a-Service public clouds are black boxes to providers, and application-level performance information cannot be acquired. This makes performance monitoring intensely challenging, as cloud providers can only rely on low-level metrics such as CPU usage and hardware counters. We propose a novel machine learning framework, Alioth, to monitor the performance degradation of cloud applications. To feed the data-hungry models, we first develop interference generators and conduct comprehensive co-location experiments on a testbed to build the Alioth dataset, which reflects the complexity and dynamicity of real-world scenarios. We then construct Alioth by (1) augmenting features via recovering low-level metrics under no interference using denoising auto-encoders, (2) devising a transfer learning model based on a domain adaptation neural network so that models generalize to test cases unseen in offline training, and (3) developing a SHAP explainer to automate feature selection and enhance model interpretability. Experiments show that Alioth achieves an average mean absolute error of 5.29% offline and 10.8% when testing on applications unseen in the training stage, outperforming the baseline methods. Alioth is also robust in signaling quality-of-service violations under dynamicity. Finally, we demonstrate a possible application of Alioth's interpretability, providing insights that benefit the decision-making of cloud operators. The dataset and code of Alioth have been released on GitHub.
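Step (1), feature augmentation with a denoising auto-encoder, can be sketched as follows. This is a minimal PyTorch illustration under assumed layer sizes and training settings, not Alioth's published configuration: the model learns to map metrics observed under interference back to their interference-free values, and the reconstruction is concatenated with the raw metrics as augmented features for the downstream regressor.

```python
# Illustrative denoising auto-encoder for low-level metrics (assumptions:
# 32 input features, one hidden layer); not Alioth's actual architecture.
import torch
import torch.nn as nn

class MetricDAE(nn.Module):
    def __init__(self, n_features: int = 32, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(clean: torch.Tensor, interfered: torch.Tensor,
              epochs: int = 200, lr: float = 1e-3) -> MetricDAE:
    """Learn to reconstruct interference-free metrics from interfered ones."""
    model = MetricDAE(clean.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(interfered), clean)
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    clean = torch.rand(1024, 32)
    noisy = clean + 0.1 * torch.randn_like(clean)      # synthetic colocation noise
    dae = train_dae(clean, noisy)
    augmented = torch.cat([noisy, dae(noisy)], dim=1)  # features for the predictor
```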
ISBN (print): 9798350337662
Transformers have become keystone models in natural language processing over the past decade. They have achieved great popularity in deep learning applications, but the increasing sizes of the parameter spaces required by transformer models generate a commensurate need to accelerate performance. Natural language processing problems are also routinely faced with variable-length sequences, as word counts commonly vary among sentences. Existing deep learning frameworks pad variable-length sequences to a maximal length, which adds significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer optimized for variable-length inputs. We propose a padding-free algorithm that liberates the entire transformer from redundant computations on zero-padded tokens. In addition to algorithmic-level optimization, we provide architecture-aware optimizations for transformer functional modules, especially the performance-critical Multi-Head Attention (MHA) algorithm. Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs validate that our fused MHA outperforms PyTorch by 6.13x. The end-to-end performance of ByteTransformer for a forward BERT transformer surpasses state-of-the-art transformer frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer, Microsoft DeepSpeed-Inference and NVIDIA FasterTransformer, by 87%, 131%, 138%, 74% and 55%, respectively. We also demonstrate the general applicability of our optimization methods to other BERT-like models, including ALBERT, DistilBERT, and DeBERTa.
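The bookkeeping behind padding removal can be shown with a small sketch: valid tokens are gathered into a packed buffer using the sequence lengths, so subsequent layers never touch padded positions, and a prefix sum of lengths records where each sequence starts. This NumPy version only illustrates the idea; ByteTransformer itself implements it with fused CUDA kernels.

```python
# Simplified illustration of padding removal for variable-length batches;
# not ByteTransformer's CUDA implementation, only the index bookkeeping.
import numpy as np

def pack_sequences(padded: np.ndarray, lengths: np.ndarray):
    """padded: [batch, max_len, hidden]; lengths: [batch].
    Returns packed tokens [sum(lengths), hidden] and per-sequence offsets,
    so downstream layers skip computation on zero-padded positions."""
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    mask = np.arange(padded.shape[1])[None, :] < lengths[:, None]
    packed = padded[mask]                 # gathers only the valid tokens
    return packed, offsets

def unpack_sequences(packed, offsets, max_len, hidden):
    """Scatter packed tokens back to the padded layout for the final output."""
    batch = len(offsets) - 1
    out = np.zeros((batch, max_len, hidden), dtype=packed.dtype)
    for b in range(batch):
        seq = packed[offsets[b]:offsets[b + 1]]
        out[b, :len(seq)] = seq
    return out

if __name__ == "__main__":
    lengths = np.array([3, 5, 2])
    padded = np.random.rand(3, 5, 8)
    packed, offsets = pack_sequences(padded, lengths)
    restored = unpack_sequences(packed, offsets, 5, 8)
```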
ISBN (print): 9798350348750; 9798350348743
This paper experimentally investigates the performance of a proof-of-concept self-powered Li-Fi system based on solar cells for future Internet of Things (IoT) applications. The proposed system, which consists of a multiple-input multiple-output (MIMO) Li-Fi transceiver, can simultaneously provide low-bandwidth connectivity and wireless energy harvesting. Different MIMO Li-Fi configurations with solar-cell receivers connected in series and in parallel are tested experimentally to evaluate their bandwidth and harvested power. Experimental results show that the 4x4 series combination of solar cells achieves the highest bandwidth, B = 71.97 kHz, due to better accumulation of signal-to-noise ratio (SNR). The larger 4x4 configuration connected in series also harvests more electrical power (80 mW) than the parallel combination (65 mW). This harvested power could be stepped up and stored. Furthermore, for the communication performance, on-off keying (OOK) non-return-to-zero (NRZ) modulation is implemented and tested. The results show that a SISO system achieves a data rate of 50 kb/s at BER = 5x10^-3, while a 4x4 MIMO system in series doubles the data rate to 100 kb/s at BER = 2.8x10^-3 thanks to higher SNR and improved bandwidth. The results are further supported by the received-signal eye diagrams and histograms.
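The way such BER figures are obtained can be illustrated with a toy OOK-NRZ link: random bits are mapped to on/off levels, additive Gaussian noise models the receiver, and a mid-level hard decision recovers the bits. The noise level below is arbitrary and not calibrated to the paper's solar-cell receiver.

```python
# Toy OOK-NRZ link with additive Gaussian noise and threshold detection;
# parameters are illustrative, not the paper's measured channel.
import numpy as np

def ook_nrz_ber(n_bits: int = 100_000, noise_std: float = 0.25, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, n_bits)
    tx = bits.astype(float)                  # NRZ: '1' -> 1.0, '0' -> 0.0 for a full bit period
    rx = tx + rng.normal(0.0, noise_std, n_bits)
    detected = (rx > 0.5).astype(int)        # mid-level hard decision
    return float(np.mean(detected != bits))

if __name__ == "__main__":
    print(f"Estimated BER: {ook_nrz_ber():.2e}")
```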
ISBN (print): 9781665497473
Coarse-Grained Reconfigurable Architectures (CGRAs) emerged about 30 years ago. The very first CGRAs were programmed manually; fortunately, compilation approaches rapidly appeared to automate the mapping process. Numerous surveys on these architectures exist, and others also gather the associated tools and methods, but none of them focuses on the mapping process alone. This paper focuses solely on automated methods and techniques for mapping applications onto CGRAs and covers the last two decades of research. It aims to provide the terminology, the problem formulation, and a classification of existing methods. The paper ends with research challenges and trends for the future.
ISBN (print): 9781665497473
Cumulative performance profiling is a fast and lightweight method for gaining summary information about where and how communication time in parallel MPI applications is spent. MPI provides mechanisms for implementing such profilers so that they can be used transparently with applications. Existing profilers typically profile on a per-process basis and record the frequency, total time, and volume of MPI operations per process. This can lead to grossly misleading cumulative information for applications that use MPI features to partition processes into different communicators. We present a novel MPI profiler, mpisee, for communicator-centric profiling that separates and records collective and point-to-point communication information per communicator in the application. We discuss the implementation of mpisee, which makes significant use of the MPI attribute mechanism. We evaluate our tool by measuring its overhead and profiling a number of standard applications. Our measurements with thirteen MPI applications show that the overhead of mpisee is less than 3%. Moreover, using mpisee, we investigate in detail two particular MPI applications, SPLATT and GROMACS, to obtain information on the various MPI operations for the different communicators of these applications. Such information is not available from other state-of-the-art profilers. We use the communicator-centric information to improve the performance of SPLATT, resulting in a significant runtime decrease when run with 1024 processes.
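The idea of keeping separate statistics per communicator can be sketched in mpi4py, even though mpisee itself is a C library built on the PMPI interface and MPI attributes. In the illustrative wrapper below (CommProfile and its methods are hypothetical names), each wrapped communicator accumulates its own call counts and times, so collectives on a sub-communicator are not lumped together with MPI_COMM_WORLD traffic.

```python
# Communicator-centric accounting sketch in mpi4py; an analog of the concept,
# not mpisee's actual PMPI-based implementation.
from collections import defaultdict
from mpi4py import MPI
import time

class CommProfile:
    """Wraps a communicator and accumulates per-communicator statistics."""
    def __init__(self, comm: MPI.Comm, name: str):
        self.comm, self.name = comm, name
        self.stats = defaultdict(lambda: [0, 0.0])   # op -> [calls, seconds]

    def allreduce(self, sendobj, op=MPI.SUM):
        t0 = time.perf_counter()
        result = self.comm.allreduce(sendobj, op=op)
        rec = self.stats["allreduce"]
        rec[0] += 1
        rec[1] += time.perf_counter() - t0
        return result

    def report(self):
        for op, (calls, secs) in self.stats.items():
            print(f"[{self.name}] {op}: {calls} calls, {secs:.6f} s")

if __name__ == "__main__":
    world = CommProfile(MPI.COMM_WORLD, "world")
    row = CommProfile(MPI.COMM_WORLD.Split(color=MPI.COMM_WORLD.rank % 2), "row")
    world.allreduce(1)      # accounted to MPI_COMM_WORLD
    row.allreduce(1)        # accounted to the split sub-communicator
    if MPI.COMM_WORLD.rank == 0:
        world.report()
        row.report()
```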
ISBN (print): 9781665497473
Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowing library developers to enhance the performance of their applications by harnessing the computing power offered by High Performance Computing (HPC) platforms. Efficient communication is key to scaling applications on parallel systems, and it is typically enabled by the Message Passing Interface (MPI) standard and compliant libraries on HPC hardware. mpi4py is a Python-based communication library that provides an MPI-like interface for Python applications, allowing application developers to utilize parallel processing elements including GPUs. However, there is currently no benchmark suite to evaluate the communication performance of mpi4py, and of Python MPI codes in general, on modern HPC systems. To bridge this gap, we propose OMB-Py, Python extensions to the open-source OSU Micro-Benchmark (OMB) suite, aimed at evaluating the communication performance of MPI-based parallel applications in Python. To the best of our knowledge, OMB-Py is the first communication benchmark suite for parallel Python applications. OMB-Py consists of a variety of point-to-point and collective communication benchmark tests that are implemented for a range of popular Python libraries including NumPy, CuPy, Numba, and PyCUDA. Our evaluation reveals that mpi4py introduces a small overhead compared to native MPI libraries. We plan to publicly release OMB-Py to benefit the Python HPC community.
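A minimal point-to-point latency test in the spirit of such a suite looks like the sketch below; the message sizes and iteration counts are illustrative and this is not the actual OMB-Py code. Two ranks ping-pong a NumPy buffer and report half the round-trip time.

```python
# Minimal mpi4py ping-pong latency test with NumPy buffers (run with
# "mpirun -n 2 python latency.py"); sizes and iteration counts are illustrative.
from mpi4py import MPI
import numpy as np

def latency(comm: MPI.Comm, size: int, iters: int = 1000) -> float:
    rank = comm.rank
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1, tag=0)
            comm.Recv([buf, MPI.BYTE], source=1, tag=0)
        elif rank == 1:
            comm.Recv([buf, MPI.BYTE], source=0, tag=0)
            comm.Send([buf, MPI.BYTE], dest=0, tag=0)
    # half the round-trip time, reported in microseconds
    return (MPI.Wtime() - t0) / (2 * iters) * 1e6

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    for size in (1, 1024, 65536, 1 << 20):
        t = latency(comm, size)
        if comm.rank == 0:
            print(f"{size:>8} bytes  {t:8.2f} us")
```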
ISBN (print): 9781665481069
Many data-intensive applications, such as distributed deep learning and data analytics, require moving vast amounts of data between compute servers in a distributed system. To meet the demands of these applications, datacenters are adopting Remote Direct Memory Access (RDMA), which has higher bandwidth and lower latency than traditional kernel-based networking. To ensure high performance of RDMA networks, congestion control manages queue depth on switches, and it has historically focused on moderating queue depth to ensure that small flows complete quickly. Unfortunately, one side effect of many common design decisions is that large flows are starved of bandwidth. This negatively impacts the flow completion time (FCT) of large, bandwidth-bound flows, which are integral to the performance of data-intensive applications. The FCT is particularly impacted at the tail, which is increasingly critical for predictable application performance. We identify the root causes of the poor performance for long flows and measure their impact. We then design mechanisms that improve long-flow FCT without compromising small-flow performance. Our evaluations show that these improvements reduce the 99.9th-percentile tail FCT of long flows by over 2x.
ISBN (print): 9798350311990
While quantum computers enable significant performance improvements for certain classes of applications, building a well-defined programming model has been a pressing issue. In this paper, we address some of the key limitations to realizing a generic heterogeneous parallel programming model for quantum-classical heterogeneous platforms. We discuss our experience in enabling user-level multi-threading in QCOR [1] as well as challenges that need to be addressed for programming future quantum-classical systems. Specifically, we discuss our design and implementation of C++-based parallel constructs to enable 1) parallel execution of a quantum kernel with std::thread and 2) asynchronous execution with std::async. To do so, we provide a detailed overview of the current implementation of the QCOR programming model and runtime, and discuss how we 1) add thread-safety to some of its user-facing API routines, and 2) increase parallelism in QCOR by removing data races that inhibit multi-threading, so as to better utilize available computing resources. We also present preliminary performance results with the Quantum++ [2] back end on a single-node Ryzen 9 3900X machine that has 12 physical cores (24 hardware threads) and 128 GB of RAM. The results show that running two Bell kernels with 12 threads per kernel in parallel outperforms running the kernels one after the other, each with 24 threads (a 1.63x improvement). In addition, we observe the same trend when running two Shor's algorithm kernels in parallel (1.22x faster than executing the kernels one after the other). Furthermore, the parallel version is better in terms of strong scalability. We believe that our design, implementation, and results will open up an opportunity not only for 1) enabling quicker prototyping of parallel-aware quantum-classical algorithms on quantum circuit simulators in the short term, but also for 2) realizing a generic parallel programming model for quantum-classical heterogeneous platforms in the longer term.
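The pattern being measured, launching two independent kernels asynchronously instead of back to back, can be sketched in Python as an analog of the C++ std::async construct described above; simulate_kernel is a placeholder workload, not a QCOR or Quantum++ call.

```python
# Python analog of the std::async pattern added to QCOR: two independent
# "kernels" run concurrently instead of sequentially. Placeholder workload only.
from concurrent.futures import ThreadPoolExecutor
import time

def simulate_kernel(name: str, seconds: float) -> str:
    time.sleep(seconds)              # stands in for a circuit-simulation workload
    return f"{name} done"

def sequential():
    return [simulate_kernel("bell_0", 0.5), simulate_kernel("bell_1", 0.5)]

def parallel():
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(simulate_kernel, f"bell_{i}", 0.5) for i in range(2)]
        return [f.result() for f in futures]

if __name__ == "__main__":
    for fn in (sequential, parallel):
        t0 = time.perf_counter()
        fn()
        print(f"{fn.__name__}: {time.perf_counter() - t0:.2f} s")
```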
Stealth addresses protect recipient identity privacy in blockchain systems by allowing a sender to derive a stealth address using the recipient's public key, with the receiver deriving a corresponding one-time pri...
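The sender/receiver roles described here can be illustrated with a toy dual-key derivation in a plain discrete-log group; the group parameters, hash-to-exponent step, and key names below are illustrative assumptions and not the construction analyzed in the paper. The sender publishes an ephemeral value and sends funds to a one-time address; only the recipient can derive the matching one-time private key.

```python
# Toy stealth-address derivation in a discrete-log group (demo parameters only;
# a real scheme would use a standardized elliptic-curve group).
import hashlib
import secrets

P = 2**127 - 1          # small Mersenne prime, demo only
G = 3                   # base element for the demo

def h(x: int) -> int:
    """Hash a group element to an exponent."""
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big") % (P - 1)

# Recipient key pair: 'a' acts as a scanning key, 'b' as a spending key.
a, b = secrets.randbelow(P - 1), secrets.randbelow(P - 1)
A, B = pow(G, a, P), pow(G, b, P)

# Sender: ephemeral key r, published value R, and the one-time (stealth) address.
r = secrets.randbelow(P - 1)
R = pow(G, r, P)
s_sender = h(pow(A, r, P))
stealth_addr = (pow(G, s_sender, P) * B) % P

# Receiver: recovers the shared secret from R and derives the one-time private key.
s_recv = h(pow(R, a, P))
one_time_priv = (s_recv + b) % (P - 1)

assert pow(G, one_time_priv, P) == stealth_addr   # receiver controls the stealth address
```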