ISBN (Print): 9781665420273
Quantum computers promise computational advantages for many important problems across various application domains. Unfortunately, physical quantum devices are highly susceptible to errors that prevent us from running most of these quantum applications. Quantum Error Correction (QEC) codes are required to implement Fault-Tolerant Quantum Computers (FTQC) on which computations can be performed without encountering errors. Error decoding is a critical component of quantum error correction and is responsible for transforming a set of qubit measurements generated by the QEC code, called the syndrome, into error locations and error types. To be feasible to implement, error decoders must not only identify errors with high accuracy, but also be fast and scalable to a large number of qubits. Unfortunately, most prior work on error decoding has focused primarily on accuracy and has relied on software implementations that are too slow to be of practical use. Furthermore, these studies only look at designing a single decoder and do not analyze the challenges involved in scaling the storage and bandwidth requirements when performing error correction in large systems with thousands of qubits. In this paper, we present AFS, an accurate, fast, and scalable decoder architecture designed to operate in the context of systems with hundreds of logical qubits. We present the hardware implementation of AFS, which is based on the Union-Find decoding algorithm and employs a three-stage pipelined design. AFS provides orders of magnitude higher accuracy than recent SFQ-based hardware decoders (logical error rate of 6 × 10^-10 for a physical error rate of 10^-3) and low decoding latency (42 ns on average), while being robust to measurement errors introduced while extracting syndromes during the QEC cycles. We also reduce the amount of decoding hardware required to perform QEC simultaneously on all the logical qubits by co-designing the micro-architecture...
Message Passing Interface (MPI) is a well-known standard for programming distributed and HPC systems. While the community has been continuously improving MPI to address the requirements of next-generation architecture...
This paper presents our implementation of an Out-of-Order RISC-V core design. The increasing relevance of open-standard computer architectures and corresponding tools makes the development of custom hardware an increa...
GPUs are often underutilized due to applications’ inability to fully exploit resources. Existing API remoting techniques for GPU virtualization are fragile, requiring interception of over 2,000 complex, rapidly evolv...
ISBN (Print): 9781665483322
FPGA is a promising platform for designing hardware accelerators due to its design flexibility and fast development cycle, despite the device's limited hardware resources. To address this limitation, the latest FPGAs have adopted a multi-die architecture that provides abundant hardware resources with high yield and cost benefit. However, the multi-die architecture causes critical timing issues when signal paths cross die-to-die boundaries, adding another design challenge to using FPGAs. We propose OpenMDS, an open-source shell-generation framework for high-performance design on multi-die FPGAs. Based on the user's design requirements, it generates an optimized shell for the target FPGA via automated bus pipelining, customized floorplanning, and a scalable clocking scheme.
Modern data center network often possesses multiple end-to-end parallel paths, which undertake the crucial task of transmitting vast heterogeneous data traffic generated by a wide variety of applications. To fully uti...
ISBN (Print): 9781665420273
The slowdown of Moore's Law, combined with advances in 3D stacking of logic and memory, has pushed architects to revisit the concept of processing-in-memory (PIM) to overcome the memory-wall bottleneck. This PIM renaissance finds itself in a very different computing landscape from the one twenty years ago, as more and more computation shifts to the cloud. Most PIM architecture papers still focus on best-effort applications, while PIM's impact on latency-critical cloud applications is not well understood. This paper explores how datacenters can exploit PIM architectures in the context of latency-critical applications. We adopt a general-purpose cloud server with HBM-based, 3D-stacked logic+memory modules and study the impact of PIM on six diverse interactive cloud applications. We reveal the previously neglected opportunity that PIM presents to these services, and show the importance of properly managing PIM-related resources to meet the QoS targets of interactive services and maximize resource efficiency. Then, we present PIMCloud, a QoS-aware resource manager designed for cloud systems with PIM that allows colocation of multiple latency-critical and best-effort applications. We show that PIMCloud efficiently manages PIM resources: it (1) improves effective machine utilization by up to 70% and 85% (average 24% and 33%) under 2-app and 3-app mixes, compared to the best state-of-the-art manager; (2) helps latency-critical applications meet QoS; and (3) adapts to varying load patterns.
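PIMCloud's actual allocation policies are not detailed in the abstract. The Python sketch below only illustrates the general shape of QoS-aware colocation that the abstract describes: latency-critical (LC) applications get the resources their QoS targets demand first, and best-effort (BE) applications share whatever remains. The function name, the bandwidth units, and all workload numbers are made up for illustration.

```python
def allocate_pim_bandwidth(apps, total_bw):
    """Toy QoS-first split of a shared PIM resource (e.g. memory
    bandwidth): LC apps are satisfied in order, BE apps split the rest."""
    lc = [a for a in apps if a["cls"] == "LC"]
    be = [a for a in apps if a["cls"] == "BE"]
    alloc = {}
    remaining = total_bw
    for a in lc:  # meet latency-critical demands first
        give = min(a["demand"], remaining)
        alloc[a["name"]] = give
        remaining -= give
    share = remaining / len(be) if be else 0
    for a in be:  # best-effort jobs share the leftover equally
        alloc[a["name"]] = share
    return alloc

# Hypothetical 3-app mix on a machine with 100 units of PIM bandwidth.
apps = [
    {"name": "search", "cls": "LC", "demand": 40},
    {"name": "ads",    "cls": "LC", "demand": 30},
    {"name": "batch",  "cls": "BE", "demand": 50},
]
alloc = allocate_pim_bandwidth(apps, 100)
print(alloc)  # {'search': 40, 'ads': 30, 'batch': 30.0}
```

A real manager would additionally monitor tail latency online and shift resources between the PIM stacks and the host cores as load varies, which this static split does not capture.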
Malicious domains serve as significant resources for adversaries to execute cyber attacks and are crucial indicators for detecting network intrusions. In practical scenarios, malicious domains associated with various ...
ISBN (Digital): 9798350355543
ISBN (Print): 9798350355550
The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.
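The interface of the paper's jpwr power-measurement tool is not described in the abstract, so the sketch below shows only the generic idea behind sampling-based energy measurement: poll an instantaneous power reading on a background thread at a fixed interval and integrate power × time into joules. The `EnergyMeter` class and the constant-250 W reader are stand-ins invented for this example; a real tool would query device counters such as NVML or ROCm-SMI.

```python
import threading
import time

class EnergyMeter:
    """Generic sketch: sample a power sensor periodically and
    accumulate energy as power (W) x interval (s) = joules."""
    def __init__(self, read_power_w, interval_s=0.1):
        self.read_power_w = read_power_w   # callable returning watts
        self.interval_s = interval_s
        self.energy_j = 0.0
        self._stop = threading.Event()

    def _loop(self):
        while not self._stop.is_set():
            # Rectangle-rule integration of the power trace.
            self.energy_j += self.read_power_w() * self.interval_s
            time.sleep(self.interval_s)

    def __enter__(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

# Usage with a constant 250 W stub in place of a real sensor; the
# sleep stands in for a training step being measured.
with EnergyMeter(lambda: 250.0, interval_s=0.05) as meter:
    time.sleep(0.5)
print(f"{meter.energy_j:.1f} J accumulated")
```

Sampling at a fixed interval trades accuracy for overhead: a shorter interval tracks power spikes during training more faithfully but costs more sensor queries.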