ISBN (Print): 9798350311990
As we enter the exascale era, the energy efficiency and performance of High-Performance Computing (HPC) systems, especially those running Machine Learning (ML) applications, are becoming increasingly important. Nvidia recently released its 9th-generation HPC-grade Graphics Processing Unit (GPU) microarchitecture, Ampere, claiming significant improvements over the previous generation's Volta architecture. In this paper, we perform fine-grained power collection and assess the performance of these two HPC architectures by profiling ML benchmarks. In addition, we analyze various hyperparameters, primarily the batch size and the number of GPUs, to determine their impact on these systems' performance and power efficiency. While Ampere is 3.16x more energy-efficient than Volta in isolation, this advantage is counteracted by the PCIe interconnects of the A100s as the ML tasks are parallelized across more GPUs.
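The energy-efficiency comparison in this abstract boils down to work per joule: sample board power at a fixed interval while the benchmark runs, integrate the samples into energy, and divide throughput by average power. A minimal sketch of that arithmetic (the throughput and power figures below are hypothetical placeholders, not the paper's measurements):

```python
def energy_joules(power_samples_w, interval_s):
    """Integrate evenly spaced power samples (watts) into energy (joules)."""
    return sum(power_samples_w) * interval_s

def energy_efficiency(throughput_items_per_s, avg_power_w):
    """Work done per joule: throughput divided by average power draw."""
    return throughput_items_per_s / avg_power_w

# Hypothetical readings for one training step (not the paper's data):
a100 = energy_efficiency(3000.0, 250.0)   # images/s at 250 W
v100 = energy_efficiency(1200.0, 300.0)   # images/s at 300 W
print(f"A100 is {a100 / v100:.2f}x more energy-efficient")
```

In a real measurement pipeline, the power samples would come from a fine-grained collector (e.g. polling the GPU's on-board sensor) rather than being supplied by hand.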
ISBN (Print): 9798350330991; 9798350331004
Computation of inner products is frequently used in machine learning (ML) algorithms, apart from signal processing and communication applications. Distributed arithmetic (DA) has been frequently employed for area-time-efficient inner-product implementations. In conventional DA-based architectures, one of the vectors is constant and known a priori. Hence, the traditional DA architectures are not suitable when both vectors are variable. However, computing the inner product of a pair of variable vectors is frequently required for matrix multiplication of various forms and for convolutional neural networks. In this paper, we present a novel DA-based architecture for computing the inner product of variable vectors. To derive the proposed architecture, the inner product of any given length is decomposed into a set of short-length inner products, such that the inner product can be computed by successive accumulation of the results of the short-length inner products. We have designed a DA-based architecture for the computation of the short-length inner product of variable vectors and used it in successive clock cycles to compute the whole inner product by successive accumulation. The post-layout synthesis results using Cadence Innovus with a GPDK 90 nm technology library show that the proposed DA-based parallel architecture offers significant advantages in area-delay product and energy consumption over the bit-serial DA architecture.
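The decomposition described here can be illustrated in software: a long inner product is split into fixed-size short-length blocks whose partial results are accumulated, mirroring one DA-unit evaluation per clock cycle in the hardware. A minimal sketch (the block size and vectors are illustrative, not taken from the paper):

```python
def blocked_inner_product(a, b, block=4):
    """Compute <a, b> by successive accumulation of short-length partial
    inner products, one block at a time (one clock cycle in hardware)."""
    assert len(a) == len(b)
    acc = 0
    for i in range(0, len(a), block):
        # Short-length inner product handled by the DA unit in one cycle.
        acc += sum(x * y for x, y in zip(a[i:i + block], b[i:i + block]))
    return acc

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
print(blocked_inner_product(a, b))  # 120
```

The result is identical to a direct inner product; the point of the decomposition is that the hardware only needs a short-length DA unit plus an accumulator, regardless of the vector length.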
ISBN (Print): 9798350387117; 9798350387124
The interconnection network in HPC is becoming a bottleneck due to increasing traffic load. We model adaptive routing mechanisms and prove that even with advanced adaptive routing, static networks like Dragonfly cannot handle non-uniform traffic efficiently, let alone frequently changing non-uniform traffic. Network-wide improvement therefore requires architectural changes, e.g., reconfigurable networks. Existing reconfigurable networks hardly support agile reaction to traffic changes with little impact on the network. Therefore, we propose MUSE, a Dragonfly-based, runtime incrementally reconfigurable network that uses an optical circuit switch (OCS) to make a small number of link adjustments per reconfiguration, providing agility with little impact on in-flight flows. Simulations with both synthetic traffic and real-world workloads show that MUSE can prevent saturation under typical traffic patterns that cause congestion in static Dragonfly. MUSE is 30-55% better than static Dragonfly and Flexfly w.r.t. commonly used performance metrics such as flow completion time (FCT). We also build a MUSE prototype and demonstrate that MUSE reduces application finish time (AFT) by 20-30%.
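The core idea of incremental reconfiguration — adjust only a few links at a time, chosen from observed traffic — can be sketched as a toy selection policy (this is an illustrative stand-in, not MUSE's actual reconfiguration algorithm; the demand matrix is hypothetical):

```python
def pick_reconfigurations(traffic, k=2):
    """Pick the k group pairs with the heaviest measured traffic as
    candidates for a direct optical circuit. Limiting the count to k keeps
    each reconfiguration small, so few in-flight flows are disturbed."""
    pairs = sorted(traffic, key=traffic.get, reverse=True)
    return pairs[:k]

# Hypothetical inter-group demand (arbitrary units):
demand = {(0, 1): 9.0, (0, 2): 1.5, (1, 3): 7.0, (2, 3): 0.5}
print(pick_reconfigurations(demand))  # [(0, 1), (1, 3)]
```

A runtime system would re-run such a selection whenever the traffic matrix drifts, steering the OCS toward the currently hot pairs.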
ISBN (Print): 9798350387117; 9798350387124
Simultaneous multithreading (SMT) processors improve throughput over single-threaded processors by sharing internal core resources among instructions from distinct threads. However, resource sharing introduces inter-thread interference within the core, which has a negative impact on individual application performance and can significantly increase the turnaround time of multi-program workloads. The severity of the interference effects depends on the competing co-runners sharing the core. Thus, it can be mitigated by applying a thread-to-core allocation policy that smartly selects applications to be run on the same core to minimize their interference. This paper presents SYNPA, a simple approach that dynamically allocates threads to cores in an SMT processor based on their run-time dynamic behavior. The approach uses a regression model to select synergistic pairs to mitigate intra-core interference. The main novelty of SYNPA is that it uses just three variables, collected at the dispatch stage from the performance counters available in current ARM processors. Experimental results show that SYNPA outperforms the default Linux scheduler by around 36%, on average, in terms of turnaround time on 8-application workloads combining frontend-bound and backend-bound benchmarks.
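The pairing step can be sketched as a greedy matcher driven by a predicted-interference model. Everything below is a hypothetical stand-in — a dot-product "model" and made-up counter rates — for SYNPA's actual regression over three dispatch-stage counters:

```python
from itertools import combinations

def predicted_interference(t1, t2):
    """Toy model: interference is assumed high when two threads stress the
    same resources, approximated as the dot product of their counter-rate
    vectors (a placeholder for the paper's regression model)."""
    return sum(a * b for a, b in zip(t1, t2))

def pair_threads(threads):
    """Greedily pick, among the still-unassigned threads, the pair with the
    lowest predicted interference, until all threads are paired."""
    remaining = dict(threads)  # name -> counter-rate tuple
    pairs = []
    while len(remaining) > 1:
        best = min(combinations(remaining, 2),
                   key=lambda p: predicted_interference(remaining[p[0]],
                                                        remaining[p[1]]))
        pairs.append(best)
        for name in best:
            del remaining[name]
    return pairs

# Hypothetical counter rates: "fe" = frontend-bound, "be" = backend-bound.
threads = {"fe1": (0.9, 0.1, 0.2), "fe2": (0.8, 0.2, 0.1),
           "be1": (0.1, 0.9, 0.7), "be2": (0.2, 0.8, 0.9)}
print(pair_threads(threads))  # [('fe1', 'be1'), ('fe2', 'be2')]
```

Under this toy model the matcher mixes frontend-bound with backend-bound threads on each core, which is the synergy the paper's workloads are designed around.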
ISBN (Print): 9798350337662
Finding the maximum cut of a graph (MAXCUT) is a classic optimization problem that has motivated parallel algorithm development. While approximation algorithms for MAXCUT offer attractive theoretical guarantees and demonstrate compelling empirical performance, such approaches can shift the dominant computational cost to the stochastic sampling operations. Neuromorphic computing, which uses the organizing principles of the nervous system to inspire new parallel-computing architectures, offers a possible solution. One ubiquitous feature of natural brains is stochasticity: the individual elements of biological neural networks possess an intrinsic randomness that serves as a resource enabling their unique computational capacities. By designing circuits and algorithms that make use of randomness similarly to natural brains, we hypothesize that the intrinsic randomness in microelectronic devices could be turned into a valuable component of a neuromorphic architecture enabling more efficient computations. Here, we present neuromorphic circuits that transform the stochastic behavior of a pool of random devices into useful correlations that drive stochastic solutions to MAXCUT. We show that these circuits perform favorably in comparison to software solvers and argue that this neuromorphic hardware implementation provides a path to scaling advantages. This work demonstrates the utility of combining neuromorphic principles with intrinsic randomness as a computational resource for new computational architectures.
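A software analogue of a stochastic MAXCUT solver makes the role of randomness concrete: random initial partitions (here from a pseudorandom generator; in the paper's circuits, from device physics) feed a cheap local search, and the best cut across restarts is kept. A simplified heuristic, not the paper's circuit:

```python
import random

def local_search_cut(edges, side):
    """Greedy single-flip improvement until a local optimum is reached;
    returns the resulting cut size."""
    improved = True
    while improved:
        improved = False
        for v in range(len(side)):
            # Cut change if v flips: same-side incident edges become cut,
            # currently cut incident edges are lost.
            delta = sum((1 if side[a] == side[b] else -1)
                        for a, b in edges if v in (a, b))
            if delta > 0:
                side[v] ^= 1
                improved = True
    return sum(1 for a, b in edges if side[a] != side[b])

def maxcut_random_restarts(edges, n, restarts=50, seed=0):
    """Stochastic heuristic: random restarts supply the randomness that
    escapes poor local optima of the greedy search."""
    rng = random.Random(seed)
    return max(local_search_cut(edges, [rng.randint(0, 1) for _ in range(n)])
               for _ in range(restarts))

# 4-cycle: an alternating partition cuts all 4 edges.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(maxcut_random_restarts(cycle, 4))
```

Pure greedy search from a single start can get stuck (on the 4-cycle, the partition {0,1}|{2,3} is a local optimum cutting only 2 edges); the stochastic restarts are what recover the full cut.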
NVIDIA's H100 Confidential Computing (CC) counters the security hazards inherent in cloud AI workloads. It enforces data encryption to achieve data confidentiality, which leads to substantial throughput reductions...
ISBN (Print): 9798350337662
High performance is needed in many computing systems, from batch-managed supercomputers to general-purpose cloud platforms. However, scientific clusters lack elastic parallelism, while clouds cannot offer competitive costs for high-performance applications. In this work, we investigate how modern cloud programming paradigms can bring the elasticity needed to allocate idle resources, decreasing computation costs and improving overall data center efficiency. Function-as-a-Service (FaaS) brings the pay-as-you-go execution of stateless functions, but its performance characteristics cannot match coarse-grained cloud and cluster allocations. To make serverless computing viable for high-performance and latency-sensitive applications, we present rFaaS, an RDMA-accelerated FaaS platform. We identify critical limitations of serverless computing (centralized scheduling and inefficient network transport) and improve the FaaS architecture with allocation leases and microsecond invocations. We show that our remote functions add only negligible overhead on top of the fastest available networks, and we decrease execution latency by orders of magnitude compared to contemporary FaaS systems. Furthermore, we demonstrate the performance of rFaaS by evaluating real-world FaaS benchmarks and parallel applications. Overall, our results show that new allocation policies and remote memory access help FaaS applications achieve high performance and bring serverless computing to HPC.
ISBN (Print): 9798350337662
As burst buffers are widely deployed in HPC (High-Performance Computing) systems, the distributed file system layer is taking on the role of campaign storage, where scalability and cost-effectiveness are of paramount importance. However, centralized metadata management in the distributed file system layer poses a scalability challenge. The object storage system has emerged as an alternative thanks to its simplified interface and scale-out architecture. Despite this, the HPC communities are used to working with the POSIX interface to organize their files into a global directory hierarchy and control access through access control lists. In this paper, we present ArkFS, a near-POSIX-compliant and scalable distributed file system implemented on top of an object storage system. ArkFS achieves high scalability without any centralized metadata servers. Instead, ArkFS lets each client manage a portion of the file system metadata on a per-directory basis. ArkFS supports any distributed object storage system, such as Ceph RADOS or an S3-compatible system, with an appropriate API translation module. Our experimental results indicate that ArkFS shows significant performance improvement under metadata-intensive workloads while exhibiting near-linear scalability. We also demonstrate that ArkFS is suitable for handling the bursty I/O traffic coming from the burst buffer layer to archive cold data.
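The per-directory metadata idea can be sketched in a few lines: each directory's entry table lives in its own object in a flat store, so any client can read or update it without consulting a central metadata server. This is a simplified illustration with an in-memory stand-in for the object store, not ArkFS's actual on-object format or consistency protocol:

```python
class ObjectStore:
    """In-memory stand-in for a flat object store (e.g. RADOS or S3)."""
    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects.get(key)

class DirMetadata:
    """Client-side, per-directory metadata management: one metadata object
    per directory, no centralized metadata server involved."""
    def __init__(self, store):
        self.store = store

    def mkdir(self, path):
        self.store.put("meta:" + path, {})   # empty entry table

    def create(self, dirpath, name):
        entries = self.store.get("meta:" + dirpath)
        entries[name] = {"type": "file"}
        self.store.put("meta:" + dirpath, entries)

    def listdir(self, dirpath):
        return sorted(self.store.get("meta:" + dirpath))

store = ObjectStore()
fs = DirMetadata(store)
fs.mkdir("/data")
fs.create("/data", "a.txt")
fs.create("/data", "b.txt")
print(fs.listdir("/data"))  # ['a.txt', 'b.txt']
```

Because metadata is sharded by directory, clients working in different directories touch disjoint objects, which is where the near-linear metadata scalability comes from.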
ISBN (Print): 9798350311990
As part of a larger effort, this work-in-progress reports the possible advantages of modifying conventional workflows used to generate labelled training samples and train machine learning (ML) models on them. We compare results from three different workflows, using neutron scattering data analysis as the motivating application, and report roughly a 20% speedup, with no appreciable loss of model accuracy, over a baseline workflow.
ISBN (Print): 9798400704130
The growing development of HPC systems used in a plethora of domains (healthcare, financial services, government and defense, energy) triggers an urgent demand for simulation frameworks that can simulate, in an integrated manner, both the processing and network components of an HPC system-under-design (SuD). The main problem, however, is that there is currently a shortage of simulation frameworks that can handle the simulation of actual HPC systems, including the hardware, the complete software stack, and network dynamics in an integrated manner. In this work we start from the first known open-source, fully-distributed Cloud simulation framework, COSSIM, and, as part of the RED-SEA and Vitamin-V European projects, we extend it so as to accurately simulate HPC systems. The extended simulator has been evaluated executing the widely used HPCG & LAMMPS benchmarks on both ARM & RISC-V architectures; the results demonstrate that the presented approach has up to 95% accuracy in the reported SuD aspects.