ISBN (digital): 9798350352917
ISBN (print): 9798350352924; 9798350352917
This paper presents techniques for theoretically and practically efficient and scalable Schrödinger-style quantum circuit simulation. Our approach partitions a quantum circuit into a hierarchy of subcircuits and simulates the subcircuits on multi-node GPUs, exploiting available data parallelism while minimizing communication costs. To minimize those costs, we formulate an Integer Linear Program that rewards simulation of "nearby" gates on "nearby" GPUs. To maximize throughput, we use a dynamic programming algorithm to compute the subcircuit simulated by each kernel at a GPU. We realize these techniques in Atlas, a distributed, multi-GPU quantum circuit simulator. Our evaluation on a variety of quantum circuits shows that Atlas outperforms state-of-the-art GPU-based simulators by more than 2x on average and is able to run larger circuits via offloading to DRAM, outperforming other large-circuit simulators by two orders of magnitude.
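To make the placement objective concrete, below is a minimal sketch, assuming the PuLP library with its bundled CBC solver, of an ILP with this flavor of objective; the gate set, GPU nearness weights, and the linearization are illustrative, not Atlas's actual formulation.

    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

    gates = ["g0", "g1", "g2", "g3"]
    gpus = [0, 1]
    adjacent = [("g0", "g1"), ("g1", "g2"), ("g2", "g3")]            # gates sharing qubits
    nearness = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.2, (1, 0): 0.2}  # reward per GPU pair

    prob = LpProblem("gate_placement", LpMaximize)
    x = LpVariable.dicts("x", (gates, gpus), cat=LpBinary)                       # gate -> GPU
    y = LpVariable.dicts("y", (range(len(adjacent)), gpus, gpus), cat=LpBinary)  # linearization

    # Objective: reward placing adjacent gates on nearby GPUs.
    prob += lpSum(nearness[d, e] * y[k][d][e]
                  for k in range(len(adjacent)) for d in gpus for e in gpus)

    # Each gate is simulated on exactly one GPU.
    for g in gates:
        prob += lpSum(x[g][d] for d in gpus) == 1

    # y[k][d][e] may be 1 only if both endpoints of pair k are placed accordingly.
    for k, (g, h) in enumerate(adjacent):
        for d in gpus:
            for e in gpus:
                prob += y[k][d][e] <= x[g][d]
                prob += y[k][d][e] <= x[h][e]

    prob.solve(PULP_CBC_CMD(msg=0))
    for g in gates:
        print(g, "->", next(d for d in gpus if x[g][d].value() == 1))

The auxiliary variables y stand in for the nonlinear product x[g][d] * x[h][e]; bounding y above by both factors is a tight linearization here because the objective maximizes the reward.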
ISBN (digital): 9798350352917
ISBN (print): 9798350352924; 9798350352917
High energy physics experiments produce petabytes of data annually that must be reduced to gain insight into the laws of nature. Early-stage reduction executes long-running, high-throughput workflows across thousands of nodes spanning multiple facilities to produce shared datasets. Later stages are typically written by individuals or small groups and must be refined and re-run many times for correctness. Reducing the iteration time of these later stages is key to accelerating discovery. We describe our experience reshaping late-stage analysis applications to run on thousands of nodes. It is not enough merely to increase scale: changes are needed throughout the stack, including storage systems, data management, task scheduling, and application design. We demonstrate these changes as applied to two analysis applications built on open-source data analysis frameworks (Coffea, Dask, TaskVine). We evaluate the performance of the applications on opportunistic campus clusters, showing effective scaling up to 7,200 cores and significant speedups.
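For intuition, here is a minimal sketch, using Dask alone, of the map-reduce shape of such a late-stage step: per-file histogram fills reduced into one result. `make_events` is a hypothetical stand-in for reading real event files; the actual applications run Coffea tasks through schedulers such as TaskVine.

    import dask
    import numpy as np

    def make_events(seed, n=100_000):
        # Stand-in for reading one input file of events.
        rng = np.random.default_rng(seed)
        return rng.exponential(scale=25.0, size=n)  # a pT-like falling spectrum

    @dask.delayed
    def partial_hist(seed, bins):
        # One task per "file": fill a partial histogram.
        counts, _ = np.histogram(make_events(seed), bins=bins)
        return counts

    bins = np.linspace(0.0, 200.0, 51)
    parts = [partial_hist(seed, bins) for seed in range(64)]
    total = dask.delayed(sum)(parts)        # reduce partial counts into one histogram

    (counts,) = dask.compute(total)
    print(counts[:5])

With a distributed scheduler attached, the same graph fans the per-file tasks out across cluster nodes; only the reduction is serial in this sketch.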
Relational Hoare logic [18] extends the applicability of modular deductive verification to encompass the verification of crucial two-run properties, such as confidentiality. Most of the current research on rel...
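For context, the judgment at the core of such logics relates two runs; a generic presentation (standard notation, not necessarily this paper's) is:

    % Relational Hoare quadruple: if runs of c_1 and c_2 start in states
    % related by precondition P, the final states are related by Q.
    \{P\}\; c_1 \sim c_2 \;\{Q\}

    % Confidentiality (noninterference) as a two-run property of one program c:
    % low-equivalent initial states must yield low-equivalent final states.
    \{\sigma_1 =_L \sigma_2\}\; c \sim c \;\{\sigma_1' =_L \sigma_2'\}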
HPC is a widely used term, often referring to the applications, architectures, programming models, and tools targeting highly parallel machines such as those of the *** lists. Recent advances in computing hardware re...
Author:
Marzolla, Moreno
Center for Inter-Department Industrial Research ICT Bologna Italy
Mini-applications are widely used in parallel computing for testing and benchmarking purposes. However, many existing mini-applications are not suitable for teaching, since they require advanced knowledge of algebra, ...
Hardware Transactional Memory (HTM) offers the opportunity to ease parallel programming. However, driven by hardware limitations, commercial implementations eschew the complexity involved in early sophisticated propos...
The US Department of Energy's fastest supercomputers and forthcoming exascale systems employ Graphics Processing Units (GPUs) to increase the computational performance of compute nodes. However, the complexity of GPU architectures makes tailoring sophisticated applications to achieve high performance on GPU-accelerated systems a major challenge. At best, prior performance tools for GPU code provide only coarse-grained tuning advice at the kernel level. In this article, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code. At the time of this writing, GPU instruction sampling is available only on NVIDIA GPUs. To understand performance losses, GPA uses data-flow analysis to approximately attribute measured instruction stalls back to their causes. It then analyzes stall patterns using information about a program's structure and the GPU architecture to identify optimization strategies that address the observed inefficiencies, and employs detailed performance models to estimate the potential speedup each optimization might provide. Experiments with benchmarks and applications show that GPA provides useful advice for tuning GPU code. We applied GPA to analyze and tune a collection of codes on NVIDIA V100 and A100 GPUs. GPA suggested optimizations that it estimated would accelerate the set of codes by a geometric mean of 1.21x; applying them accelerated the codes by a geometric mean of 1.19x.
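A minimal sketch of the general idea behind the attribution step follows; the sampled data, def-use table, and addresses are all hypothetical, and GPA's real data-flow analysis over GPU binaries is far more involved.

    # Sampled stalls: program counter -> (stall reason, sample count).
    samples = {
        0x120: ("memory_dependency", 940),
        0x128: ("execution_dependency", 310),
    }

    # Def-use edges recovered from the binary: stalled pc -> pc of the
    # instruction that defines the operand being waited on.
    def_site = {
        0x120: 0x100,   # 0x100 is a global load feeding 0x120
        0x128: 0x11c,   # 0x11c is an FMA feeding 0x128
    }

    # Walk each stall back to its defining instruction and accumulate blame.
    blame = {}
    for pc, (reason, count) in samples.items():
        cause_pc = def_site.get(pc, pc)   # fall back to the stalled pc itself
        blame[cause_pc] = blame.get(cause_pc, 0) + count

    for pc, count in sorted(blame.items(), key=lambda kv: -kv[1]):
        print(f"instruction {pc:#x} blamed for {count} stall samples")

The point of shifting blame from the stalled instruction to its definition site is that the fix (e.g., hoisting or prefetching a load) belongs at the producer, not at the consumer where the stall is observed.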
Modern industrial robots can work alongside human workers and coordinate with other robots. This lets them perform complex tasks, but doing so requires complex programming. Robots are therefore typically programmed by experts, and there are not enough experts to meet the growing demand for robots. To reduce the need for experts, researchers have tried to make robot programming accessible to factory workers without programming experience. However, none of that previous work supports coordinating multiple robot arms that work on the same task. In this paper we present four block-based programming language designs that enable end-users to program two-armed robots. We analyze the benefits and trade-offs of each design with respect to expressiveness and user cognition, and evaluate the designs through a survey of 273 professional participants, 110 of whom had no previous programming experience. We further present an interactive experiment based on a prototype implementation of the design we deem best. This experiment confirmed that novices can successfully use our prototype to complete realistic robotics tasks. This work contributes to making coordinated programming of robots accessible to end-users, and further explores how visual programming elements can make traditionally challenging programming tasks more beginner-friendly.
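As a minimal sketch in Python, not one of the paper's actual designs, the following shows how an explicit synchronization block can coordinate two linear per-arm block programs; the block names and the mini-interpreter are hypothetical.

    import threading
    import time

    sync = threading.Barrier(2)   # both arms rendezvous at each "sync" block

    def run_arm(name, blocks):
        # Interpret a linear program of (block, argument) tuples for one arm.
        for block in blocks:
            if block[0] == "move":
                print(f"{name}: move to {block[1]}")
                time.sleep(0.1)           # stand-in for motion
            elif block[0] == "grip":
                print(f"{name}: {block[1]} gripper")
            elif block[0] == "sync":
                sync.wait()               # wait until the other arm arrives

    left = [("move", "above_part"), ("grip", "close"), ("sync",), ("move", "handoff")]
    right = [("move", "handoff"), ("sync",), ("grip", "close"), ("move", "bin")]

    threads = [threading.Thread(target=run_arm, args=(n, p))
               for n, p in (("left", left), ("right", right))]
    for t in threads: t.start()
    for t in threads: t.join()

The barrier guarantees that the right arm does not close its gripper for the handoff until the left arm has picked up the part, which is exactly the class of cross-arm ordering that single-arm end-user tools cannot express.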
The NAS Parallel Benchmarks (NPB) are a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available in parallel programming models beyond the original OpenMP and MPI versions. This work joins those efforts by providing a new CUDA implementation of NPB. Our contribution covers several aspects beyond the implementation. First, we define design principles based on best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide easy-to-use parametrization support for configuring the number of threads per block. Third, we conduct a broad study of the impact of the number of threads per block on the benchmarks. Fourth, we propose and evaluate five strategies for finding a better threads-per-block configuration. The results reveal that changing the number of threads per block alone yields performance improvements of 8% up to 717% across the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, required code refactoring, and parallelism implementations. The performance results show improvements of up to 267% over the best benchmark versions available. We also identify the best and worst design choices with respect to the trade-off between code size and performance. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how their computations affect GPU behavior.
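Below is a minimal sketch, assuming CuPy on a CUDA-capable machine, of the kind of threads-per-block sweep such a study performs; the saxpy kernel is illustrative, not one of the NPB kernels.

    import cupy as cp

    saxpy_src = r'''
    extern "C" __global__
    void saxpy(const float a, const float* x, float* y, const int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    '''
    saxpy = cp.RawKernel(saxpy_src, 'saxpy')

    n = 1 << 24
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)

    # Warm-up launch so JIT compilation is excluded from the timings.
    saxpy((1,), (32,), (cp.float32(0.0), x, y, cp.int32(0)))

    for threads in (32, 64, 128, 256, 512, 1024):
        blocks = (n + threads - 1) // threads
        start, end = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, cp.int32(n)))
        end.record()
        end.synchronize()
        ms = cp.cuda.get_elapsed_time(start, end)
        print(f"{threads:4d} threads/block: {ms:.3f} ms")

Even for a kernel this simple, the sweep typically shows measurable differences across block sizes, since the choice affects occupancy and scheduling; for the compute-heavier NPB kernels the paper reports much larger swings.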