ISBN (digital): 9798331506476
ISBN (print): 9798331506483
Quantum computing architectures based on neutral atoms offer large scale and high-fidelity operations. They can be heterogeneous, with different zones for storage, entangling operations, and readout. Zoned architectures improve computation fidelity by shielding idling qubits in storage from side-effect noise, unlike monolithic architectures where all operations occur in a single zone. However, supporting these flexible architectures with efficient compilation remains challenging. In this paper, we propose ZAC, a scalable compiler for zoned architectures. ZAC minimizes data movement overhead between zones through qubit reuse, i.e., keeping qubits in the entanglement zone when an immediate entangling operation is pending. Other innovations in ZAC include novel data placement and instruction scheduling strategies, a flexible specification of zoned architectures, and ZAIR, an intermediate representation for zoned architectures. Our evaluation shows that zoned architectures equipped with ZAC achieve a 22x fidelity improvement over monolithic architectures. Moreover, ZAC is shown to have only a 10% fidelity gap on average compared to the ideal solution. This significant performance enhancement enables more efficient and reliable quantum circuit execution, paving the way for advances in quantum algorithms and applications. ZAC is open source at https://***/UCLAVAST/ZAC
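To make the qubit-reuse idea concrete, here is a minimal sketch of the movement decision it implies: after an entangling gate, a qubit stays in the entanglement zone only if its next pending gate is also entangling; otherwise it is parked in storage. The Gate type and zone names are illustrative assumptions, not ZAC's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Gate:
    qubits: tuple        # qubit indices the gate acts on
    is_entangling: bool  # True for two-qubit (CZ-style) gates

def place_after_gate(qubit, remaining_gates):
    """Return the zone a qubit should occupy after its current gate.

    Qubit reuse: keep the qubit in the entanglement zone if its next
    gate is another entangling gate (saving a round trip to storage);
    otherwise move it to storage, where idling qubits are shielded
    from side-effect noise.
    """
    for g in remaining_gates:
        if qubit in g.qubits:
            return "entanglement" if g.is_entangling else "storage"
    return "storage"  # no more gates on this qubit: shield it

# Example: qubit 0 has a pending entangling gate, so it stays put;
# qubit 1's next gate is single-qubit, so it moves back to storage.
circuit_tail = [Gate(qubits=(0, 2), is_entangling=True),
                Gate(qubits=(1,), is_entangling=False)]
assert place_after_gate(0, circuit_tail) == "entanglement"
assert place_after_gate(1, circuit_tail) == "storage"
```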
ISBN (digital): 9798331506476
ISBN (print): 9798331506483
The GPU has been successfully used for diverse emerging compute-intensive applications, including imaging, computer vision, and, more recently, deep learning, to name a few. To offer high performance for such applications, it is provisioned with massive Register Files (RFs) to exploit high Thread-Level Parallelism (TLP). As RFs are designed to store thousands of contexts and provide very high bandwidth for an uninterrupted supply of operands to hundreds of compute units, they have become one of the most power-hungry components in the GPU. Meanwhile, faced with the end of Dennard scaling, it is more important than ever to innovate the microarchitecture of (power-constrained) GPUs to continuously improve the performance of future compute-intensive applications. In this work, we propose a GPU microarchitecture, WarpedCompaction, designed to utilize given RFs more efficiently, instead of relying on larger-capacity and/or higher-bandwidth RFs to further improve GPU performance. Specifically, first, we reverse-engineer the latest GPU’s RF organization through microbenchmarking. This uncovers that each sub-core within a streaming multiprocessor contains only two dual-ported RF banks, and that accesses to these banks are arbitrated solely based on register IDs. We also reveal that, despite this modest configuration, RF banks are largely underutilized, staying inactive 33.5% of the time. Second, we observe that previously proposed RF optimization techniques, data forwarding and dead register elimination, cannot address this underutilization problem. This is mainly due to (R1) insufficient RF access requests from a limited number of Operand Collector Units (OCUs) and (R2) inefficient operand distribution by the conventional RF bank arbitration. Third, we present two architectural solutions to tackle the observed inefficiency: (S1) OCU sharing and (S2) a skewed arbitrator. Building on enhanced OCU early allocation for partially ready instructions and operand forwarding…
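A toy model of the register-ID-based bank arbitration the paper reverse-engineers helps show why operand distribution can leave banks idle (R2). The mapping, port count, and single-cycle arbitration here are illustrative simplifications, not the actual hardware behavior.

```python
# Toy model: two dual-ported RF banks per sub-core, arbitration based
# solely on register IDs (bank = reg_id % 2). Skewed register usage
# can saturate one bank while the other sits idle.

NUM_BANKS = 2   # RF banks per sub-core
PORTS = 2       # ports per bank (dual-ported)

def bank_of(reg_id):
    # Arbitration depends only on the register ID (assumed modulo mapping).
    return reg_id % NUM_BANKS

def serve_one_cycle(requests):
    """Serve one cycle of operand reads; return (served, stalled, idle_banks)."""
    used = {b: 0 for b in range(NUM_BANKS)}
    served, stalled = [], []
    for reg in requests:
        b = bank_of(reg)
        if used[b] < PORTS:
            used[b] += 1
            served.append(reg)
        else:
            stalled.append(reg)  # port conflict: must retry next cycle
    idle_banks = sum(1 for b in used if used[b] == 0)
    return served, stalled, idle_banks

# All even register IDs map to bank 0, so bank 1 is idle even though
# two requests are left waiting -- the underutilization the paper targets.
print(serve_one_cycle([0, 2, 4, 6]))  # ([0, 2], [4, 6], 1)
```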
ISBN (digital): 9798331506476
ISBN (print): 9798331506483
Integer Linear Programming (ILP) is an important mathematical approach for solving time-sensitive real-life optimization problems, including network routing, map routing, traffic scheduling, etc. However, the algorithms for solving ILPs are typically sparse and branch-intensive, and not CPU/GPU friendly. In the paper “What could a million cores do to solve Integer programs”, Koch et al. [40] presented data illustrating that Integer Linear Programming (ILP) applications take tens of hours of execution time even on the largest parallel computers. Long execution time is a problem because many real-life applications need a decision in seconds or minutes. Widely used ILP solvers, like Gurobi (optimized for CPUs), perform software-based optimizations to handle the inherent sparsity in ILPs but still do not meet decision thresholds because of the limited throughput of CPUs. GPUs are suited for large dot-product computation; however, GPU-based ILP solvers also fail to meet decision thresholds because (i) the GPU is not sparsity friendly and (ii) the GPU incurs thread divergence for branching, resulting in under-utilization of streaming engines and periodic host-GPU interaction. We propose SPARK, a sparsity-aware, reuse-aware, energy-efficient, reconfigurable, near-cache ILP architecture that (i) reconfigures the existing L1 cache present in CPUs to perform near-cache acceleration, with easy integration into the baseline CPU pipeline and minimal area overhead (~1.4% of a CPU), (ii) performs near-cache sparsity detection and sparsity-aware compute, reducing the number of insignificant computations and data-movement energy overheads, (iii) leverages the computational patterns present in algorithms used for solving ILPs to realize a reuse-aware architecture, and (iv) is applicable to solving sparse and dense ILPs and LPs (Linear Programs). We observe 15x/20x and 152x/740x performance/energy improvement over AMD’s Zen3 CPU and Nvidia’s Tesla V100 GPU for sparse rea…
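The core of the sparsity-aware compute idea can be sketched in software: detect zero coefficients before multiplying, so insignificant computations and the operand fetches they would trigger are skipped. This is purely illustrative of the principle, not SPARK's near-cache hardware.

```python
# Sketch: evaluating one ILP constraint row against the variable vector,
# skipping zero coefficients the way a sparsity detector would.

def sparse_dot(row, x):
    """Dot product of a constraint row with x, counting skipped work."""
    acc = 0
    skipped = 0
    for coeff, val in zip(row, x):
        if coeff == 0:      # sparsity detection: no multiply, no operand fetch
            skipped += 1
            continue
        acc += coeff * val
    return acc, skipped

# ILP constraint matrices are typically very sparse, so most of the
# multiplies (and their data movement) disappear.
row = [0, 0, 3, 0, 0, 0, -1, 0]
x   = [1, 1, 2, 1, 1, 1,  5, 1]
value, skipped = sparse_dot(row, x)
print(value, skipped)  # 1, with 6 of 8 multiplies avoided
```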
This workshop focuses on understanding the implications of accelerators on the architectures and programming environments of future systems. It seeks to ground accelerator research through studies of application kernels or whole applications on such systems, as well as tools and libraries that improve the performance or productivity of applications trying to use these systems. The goal of this workshop is to bring together researchers and practitioners who are involved in application studies for accelerators and other hybrid systems, to learn the opportunities and challenges in future design trends for HPC applications and systems.
ISBN (digital): 9798331502812
ISBN (print): 9798331502829
Arbitrary-precision integer multiplication serves as the core kernel in many applications, such as cryptographic algorithms and scientific computing. To compute arbitrary-precision integer multiplication using low-bit function units (32/64-bit) on existing hardware, decomposition methods like Karatsuba and schoolbook are usually adopted. In general, a decomposition method uses two steps to finish the calculation. First, it decomposes the two large integers into many smaller integers and generates a group of low-bit multiplications that can be calculated in a spatial or sequential manner. Second, the results of the low-bit multiplications are shifted and added together to get the final result. The first step involves massive parallel byte-level processing, while the second step requires a long propagation chain, which involves bit-level processing. Prior works have leveraged vector instructions on CPUs, CUDA cores on GPUs, and DSPs on FPGAs to accelerate arbitrary-precision multiplication. We use the state-of-the-art FPGA accelerator and libraries on GPUs and CPUs, and find that the FPGA has the lowest energy efficiency. We identify that the dedicated vector units on CPUs and GPUs bring the largest energy-efficiency gains in the first computation step. DSPs and LUTs on FPGAs introduce extra energy overhead in the first step compared to dedicated vector units but are more suitable for the second computation step. To benefit both steps, we propose the AIM framework to generate efficient arbitrary-precision integer multiplication accelerators on the AMD Versal adaptive compute acceleration platform (ACAP) VCK190, which comprises 400 AI Engine (AIE) ASIC processors, an FPGA, and an ARM CPU. AIM uses the 400 AIEs to compute the first step and the FPGA to process the second step. Our experimental results show that AIM achieves up to 12.6x and 2.1x energy-efficiency gains over the Intel Xeon Ice Lake 6346 CPU and Nvidia A5000 GPU, respectively, with respect to the multipl…
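The two-step schoolbook decomposition described above is easy to illustrate. In this sketch, step 1 produces independent low-bit limb products (the massively parallel part AIM maps to the AIEs), and step 2 shifts and accumulates them (the carry-propagation chain mapped to the FPGA). The 32-bit limb width and helper names are illustrative, not AIM's configuration.

```python
# Sketch of schoolbook decomposition into low-bit limbs.

LIMB_BITS = 32
MASK = (1 << LIMB_BITS) - 1

def to_limbs(n, count):
    """Decompose a large integer into low-bit limbs, least significant first."""
    return [(n >> (LIMB_BITS * i)) & MASK for i in range(count)]

def schoolbook_mul(a, b, count):
    xs, ys = to_limbs(a, count), to_limbs(b, count)
    # Step 1: independent low-bit multiplications -- massively parallel.
    partials = [(i + j, xs[i] * ys[j])
                for i in range(count) for j in range(count)]
    # Step 2: shift-and-add accumulation -- a long serial propagation chain.
    result = 0
    for pos, product in partials:
        result += product << (LIMB_BITS * pos)
    return result

a, b = 2**100 + 12345, 2**90 + 67890
assert schoolbook_mul(a, b, count=4) == a * b
```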
HIPS-HPGC 2005 is a full-day workshop focusing on high-performance grid computing and high-level parallel programming models. The papers deal with component models and service-based systems for grids, emphasizing experiences with existing systems. The papers also report on the state of the art in grid applications, for both academic and industrial problems.
ISBN (print): 9783540680673
The proceedings contain 84 papers. The special focus in this conference is on Architectures, Networks, Languages and Algorithms. The topics include: the impact of multicore on math software and exploiting single precision computing to obtain double precision results; trends in high performance computing for industry and research; emerging technologies and large scale facilities in HPC; architecture and performance of dynamic offloader for cluster network; dynamic anchor based mobility management scheme for mobile IP networks; choosing a load balancing scheme for agent-based digital libraries; the optimum network on chip architectures for video object plane decoder design; a hardware NIC scheduler to guarantee QoS on high performance servers; multihomed routing in multicast service overlay network; region-based multicast routing protocol with dynamic address allocation scheme in mobile ad-hoc networks; limiting the effects of deafness and hidden terminal problems in directional communications; enforcing dimension-order routing in on-chip torus networks without virtual channels; a distributed backbone formation algorithm for mobile ad hoc networks; randomized leader election protocols in noisy radio networks with a single transceiver; a hybrid intelligent preventive fault-tolerant QoS unicast routing scheme in IP over DWDM optical internet; adaptive technique for automatic communication access pattern discovery applied to data prefetching in distributed applications using neural networks and stochastic models; process scheduling using ant colony optimization techniques; and interference-aware dynamic subchannel allocation in a multi-cellular OFDMA system based on traffic situation.