In this article we present PARSIR (PARallel SImulation Runner), a package that enables the effective exploitation of shared-memory multi-processor machines for running discrete event simulation models. PARSIR is a com...
ISBN:
(Print) 9781665497473
The emerging trend of the convergence of high performance computing (HPC), machine learning/deep learning (ML/DL), and big data analytics presents a host of challenges for large-scale computing campaigns that seek best practices to interleave traditional scientific simulation-based workloads with ML/DL models. A portfolio of systematic approaches to incorporating deep learning into modeling and simulation serves a vital need when supporting AI for science at a computing facility. In this paper, we evaluate several strategies for deploying deep learning surrogate models in a representative physics application on supercomputers at the Oak Ridge Leadership Computing Facility (OLCF). We discuss a set of recommended deployment architectures and implementation approaches. We analyze and evaluate these alternatives and show their performance and scalability up to 1000 GPUs on two mainstream platforms equipped with different deep learning hardware and software stacks.
ISBN:
(Print) 9781665497473
We develop a family of parallel algorithms for the SpKAdd operation that adds a collection of k sparse matrices. SpKAdd is a much-needed operation in many applications, including distributed-memory sparse matrix-matrix multiplication (SpGEMM), streaming accumulations of graphs, and algorithmic sparsification of the gradient updates in deep learning. While adding two sparse matrices is a common operation in Matlab, Python, Intel MKL, and various GraphBLAS libraries, these implementations do not perform well when adding a large collection of sparse matrices. We develop a series of algorithms using tree merging, heap, sparse accumulator, hash table, and sliding hash table data structures. Among them, hash-based algorithms attain the theoretical lower bounds on both the computational and I/O complexities and perform the best in practice. The newly developed hash-based SpKAdd makes the computation of a distributed-memory SpGEMM algorithm at least 2x faster than previous state-of-the-art algorithms.
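The hash-table accumulator idea behind SpKAdd can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: it represents each sparse matrix as a `{(row, col): value}` dict and accumulates all k matrices in a single hash table, so each nonzero is touched exactly once.

```python
from collections import defaultdict

def spkadd(matrices):
    """Add a collection of sparse matrices, each given as a
    {(row, col): value} dict, using a hash table as the accumulator.
    Illustrative sketch of the hash-based approach, not the paper's code."""
    acc = defaultdict(float)
    for m in matrices:
        for idx, v in m.items():
            acc[idx] += v
    # Drop entries that cancelled out to keep the result sparse.
    return {idx: v for idx, v in acc.items() if v != 0.0}

A = {(0, 0): 1.0, (1, 2): 2.0}
B = {(0, 0): 3.0, (2, 1): 4.0}
C = {(1, 2): -2.0}
print(spkadd([A, B, C]))  # {(0, 0): 4.0, (2, 1): 4.0} — the (1, 2) entry cancels
```

Each nonzero is read once and written once into the accumulator, which is the intuition behind the I/O lower bound the hash-based algorithms attain.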
ISBN:
(Print) 9781665497473
In this paper we present a performance study of multidimensional Fast Fourier Transforms (FFT) with GPU accelerators on modern hybrid architectures, such as those expected for upcoming exascale systems. We assess and leverage features from traditional implementations of parallel FFTs and provide an algorithm that encompasses a wide range of their parameters and adds novel developments such as FFT grid shrinking and batched transforms. Next, we create a bandwidth model to quantify the computational costs and analyze the well-known communication bottleneck for All-to-All and Point-to-Point MPI exchanges. Then, using a tuning methodology, we are able to accelerate the FFT computation and reduce the communication cost, achieving linear scalability on a large-scale system with GPU accelerators. Finally, our performance analysis is extended to show that carefully tuning the algorithm can further accelerate applications that rely heavily on FFTs, as is the case for molecular dynamics software. Our experiments were performed on the Summit and Spock supercomputers with IBM Power9 cores, over 3000 NVIDIA V100 GPUs, and AMD MI100 GPUs.
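The batched multidimensional transforms mentioned above can be illustrated with NumPy on the CPU. This is a minimal sketch of the concept only (the paper targets GPU FFT libraries): a whole batch of independent 3-D grids is transformed in one call, which is what lets a library amortize plan/setup cost across transforms.

```python
import numpy as np

# A batch of 8 independent 16x16x16 grids; one call transforms them all.
batch = np.random.rand(8, 16, 16, 16)

# axes=(1, 2, 3) applies a 3-D FFT to every grid in the batch at once.
freq = np.fft.fftn(batch, axes=(1, 2, 3))

# Round trip: the inverse transform recovers the original real grids.
back = np.fft.ifftn(freq, axes=(1, 2, 3)).real
assert np.allclose(back, batch)
```

On a distributed system the grids would additionally be decomposed across ranks, with the All-to-All or Point-to-Point exchanges discussed above redistributing pencils between the per-axis transform stages.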
ISBN:
(Print) 9781665481069
Processing-in-memory (PIM) is promising for solving the well-known data movement challenge by performing in-situ computations near the data. Leveraging PIM features can substantially boost the energy efficiency of applications. Early studies mainly focus on improving the programmability of computation offloading on PIM architectures. They lack a comprehensive analysis of computation locality and hence fail to accelerate a wide variety of applications. In this paper, we present a general-purpose instruction-level offloading technique for near-DRAM PIM architectures, named IOTPIM, that exploits PIM features comprehensively. IOTPIM is novel with two technical advances: 1) a new instruction offloading policy that fully considers the locality of the whole on-chip cache hierarchy, and 2) an offloading performance benefit prediction model that directly predicts the offloading performance benefit of an instruction based on the input dataset characteristics, preserving low analysis overheads. The evaluation demonstrates that IOTPIM can be applied to accelerate a wide variety of applications, including graph processing, machine learning, and image processing. IOTPIM outperforms state-of-the-art PIM offloading techniques by 1.28x-1.51x while ensuring offloading accuracy as high as 91.89% on average.
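The flavor of a locality-aware offloading policy can be conveyed with a toy decision rule. This is a hypothetical model, not IOTPIM's actual predictor: it offloads an instruction to near-memory compute only when its expected host-side memory latency, given cache locality, exceeds the near-memory execution latency. All the names and the latency constant are illustrative assumptions.

```python
def should_offload(cache_hit_rate, host_hit_latency_ns, pim_latency_ns,
                   dram_latency_ns=100.0):
    """Toy locality-aware offloading rule (hypothetical, not IOTPIM's model):
    offload when the expected host memory latency, weighted by cache
    locality, is worse than executing near the DRAM."""
    expected_host = (cache_hit_rate * host_hit_latency_ns
                     + (1.0 - cache_hit_rate) * dram_latency_ns)
    return expected_host > pim_latency_ns

# Poor cache locality (e.g. random graph accesses): offloading pays off.
print(should_offload(0.1, 10.0, 60.0))   # True
# Cache-friendly access stream: keep the instruction on the host cores.
print(should_offload(0.9, 10.0, 60.0))   # False
```

The real policy described above additionally reasons about the whole on-chip cache hierarchy rather than a single hit rate.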
ISBN:
(Print) 9781665497473
Genomic data leaks are irreversible. Leaked DNA cannot be changed, stays disclosed indefinitely, and affects the owner's family members as well. The recent large-scale genomic data collections [1], [2] render traditional privacy protection mechanisms, like the Health Insurance Portability and Accountability Act (HIPAA), inadequate for protection against novel security attacks [3]. On the other hand, data access restrictions hinder important clinical research that requires large datasets to operate [4]. These concerns can be naturally addressed by the employment of privacy-enhancing technologies, such as secure multiparty computation (MPC) [5]–[10]. Secure MPC enables computation on data without disclosing the data itself by dividing the data and computation between multiple computing parties in a distributed manner, preventing individual computing parties from accessing raw data. MPC systems are being increasingly adopted in fields that operate on sensitive datasets [11]–[13], such as computational genomics and biomedical research [14]–[22].
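The MPC principle described above — computing on divided data so that no single party sees the raw values — can be sketched with additive secret sharing, one of the standard MPC building blocks (a minimal didactic example, not any specific cited system).

```python
import random

P = 2**61 - 1  # public prime modulus; all arithmetic is done mod P

def share(secret, n_parties=3):
    """Split a value into n additive shares: any n-1 shares look uniformly
    random and reveal nothing about the secret."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Each party holds one share of each input and adds its shares locally;
# only the combined local sums reveal the (aggregate) result.
a_shares = share(42)
b_shares = share(100)
local_sums = [(x + y) % P for x, y in zip(a_shares, b_shares)]
assert reconstruct(local_sums) == 142
```

Additions cost no communication in this scheme; multiplications and comparisons require interactive protocols, which is where the engineering effort of practical MPC systems goes.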
ISBN:
(Print) 9781665497473
Tree-based search algorithms applied to combinatorial optimization problems are highly irregular and time-consuming when solving instances of NP-hard problems. Due to their parallel nature, algorithms for this class of complexity have been revisited for different architectures over the years. However, parallelization efforts have always been guided by the performance objective, setting aside productivity. Using Chapel's high productivity for the design and implementation of distributed tree search algorithms keeps the programmer away from lower-level details, such as communication and load balancing. However, the parameterization of such parallel applications is complex, consisting of several parameters, even if a high-productivity language is used in their conception. This work presents a local search-based heuristic for the automatic parameterization of ChapelBB, a distributed tree search application for solving combinatorial optimization problems written in Chapel. The main objective of the proposed heuristic is to overcome the limitation of manual parameterization, which covers only a limited portion of the feasible space. The reported results show that heuristic-based parameterization increases the performance of ChapelBB by up to 30% on 2048 cores (4096 threads) solving the N-Queens problem and by up to 31% solving instances of the Flow-shop scheduling problem to optimality.
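The shape of such a local search-based parameter tuner can be sketched as coordinate hill climbing over a small parameter space. Everything here is a hypothetical stand-in: the `throughput` objective substitutes for actually running ChapelBB and measuring performance, and the two parameters (a chunk size and a depth cutoff) are invented for illustration.

```python
import random

def throughput(params):
    """Stand-in objective (hypothetical): a real tuner would launch the
    application with these parameters and measure its performance."""
    chunk, depth = params
    return -(chunk - 64) ** 2 - (depth - 5) ** 2

def local_search(start, steps=200, seed=0):
    """Hill climbing: perturb one parameter at a time, keep improvements."""
    rng = random.Random(seed)
    best, best_val = start, throughput(start)
    for _ in range(steps):
        cand = list(best)
        i = rng.randrange(2)
        cand[i] += rng.choice([-8, 8]) if i == 0 else rng.choice([-1, 1])
        cand = tuple(cand)
        val = throughput(cand)
        if val > best_val:
            best, best_val = cand, val
    return best

tuned = local_search((8, 1))
assert throughput(tuned) > throughput((8, 1))
```

The appeal over manual parameterization is exactly the one stated above: the search explores the feasible space systematically instead of relying on a handful of hand-picked configurations, at the cost of extra evaluation runs.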
ISBN:
(Print) 9781665481069
Disaggregated architecture brings new opportunities to memory-consuming applications like graph processing. It allows one to spread memory access pressure from local to far memory, providing an attractive alternative to disk-based processing. Although existing works on general-purpose far memory platforms show great potential for application expansion, it is unclear how graph processing applications could benefit from disaggregated architecture, and how different optimization methods influence overall performance. In this paper, we take the first step toward analyzing the impact of graph processing workloads on disaggregated architecture by extending the GridGraph framework on top of an RDMA-based far memory system. We design Fargraph, a far memory coordination strategy for enhancing graph processing workloads. Specifically, Fargraph reduces overall data movement through a well-crafted, graph-aware data segment offloading mechanism. In addition, we use optimal data segment splitting and asynchronous data buffering to achieve graph iteration-friendly far memory access. We show that Fargraph achieves near-oracle performance for typical in-local-memory graph processing systems. Fargraph shows up to 8.3x speedup compared to Fastswap, the state-of-the-art general-purpose far memory platform.
ISBN:
(Print) 9781665497473
To amortize the cost of MPI communications, distributed parallel HPC applications can overlap network communications with computations in the hope of improving global application performance. When using this technique, both computations and communications run at the same time. But computation usually also performs data movement. Since data for computations and for communications use the same memory system, memory contention may occur when computations are memory-bound and large messages are transmitted through the network at the same time. In this paper we propose a model to predict memory bandwidth for computations and for communications when they are executed side by side, according to data locality and taking contention into account. Elaborating the model allowed us to better understand the locations of bottlenecks in the memory system and the strategies the memory system applies in case of contention. The model was evaluated on many platforms with different characteristics, and showed an average prediction error lower than 4%.
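A much-simplified picture of bandwidth contention can be given with a proportional-sharing toy model. This is not the paper's model, which accounts for data locality and the memory system's actual arbitration strategies; it only illustrates the basic phenomenon that co-running streams degrade each other once their combined demand exceeds a shared link's capacity.

```python
def shared_bandwidth(demands, capacity):
    """Toy contention model (illustrative only, not the paper's):
    if the summed demands of concurrent memory streams fit within the
    shared capacity, each stream gets what it asks for; otherwise each
    gets a proportional share."""
    total = sum(demands)
    if total <= capacity:
        return list(demands)
    return [d * capacity / total for d in demands]

# A memory-bound compute stream wants 30 GB/s, a concurrent MPI transfer
# wants 20 GB/s, but the shared memory link sustains only 40 GB/s.
print(shared_bandwidth([30.0, 20.0], 40.0))  # [24.0, 16.0]
```

Real memory systems are not this fair — which is precisely why a measured, locality-aware model like the one above is needed to predict the split accurately.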
ISBN:
(Print) 9781665497473
Hardware memory disaggregation is an emerging trend in datacenters that provides access to remote memory as part of a shared pool or as unused memory on machines across the network. Memory disaggregation aims to improve memory utilization and scale memory-intensive applications. Current state-of-the-art prototypes have shown that hardware-disaggregated memory is a reality at rack scale. However, the memory utilization benefits of memory disaggregation can only be fully realized at larger scales enabled by a datacenter-wide network. Introduction of a datacenter network results in new performance and reliability failures that may manifest as higher network latency. Additionally, sharing the network introduces new points of contention between multiple applications. In this work, we characterize the impact of variable network latency and contention in an open-source hardware-disaggregated memory prototype, ThymesisFlow. To support our characterization, we have developed a delay injection framework that introduces delays into remote memory accesses to emulate network latency. Based on the characterization results, we develop insights into how reliability and resource allocation mechanisms should evolve to support hardware memory disaggregation beyond rack scale in datacenters.
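The delay-injection idea can be sketched in a few lines: wrap every remote-memory access so it pays an extra configurable latency, emulating a slower datacenter network. This is a conceptual sketch only — the framework described above injects delays at the hardware prototype level, not by wrapping Python functions.

```python
import time

def with_injected_delay(access_fn, delay_s):
    """Wrap an access function so every call pays an extra latency,
    emulating added network delay on the path to remote memory (sketch)."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)           # the injected network delay
        return access_fn(*args, **kwargs)
    return delayed

# Hypothetical "remote memory": a dict mapping page ids to page contents.
remote = {0: b"page"}
slow_read = with_injected_delay(remote.get, 0.001)  # inject 1 ms per access

t0 = time.perf_counter()
assert slow_read(0) == b"page"
assert time.perf_counter() - t0 >= 0.001  # the access paid the delay
```

Sweeping the injected delay and replaying an application's access pattern is what lets a characterization study separate the cost of latency itself from the cost of contention.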