ISBN (print): 9798350364613; 9798350364606
In the model of measurement-based quantum computing (MBQC), computations are performed via sequential measurements on a highly entangled graph state. MBQC is a natural model for photonic quantum computing and has been shown to be useful for tasks like optimization and verification of general quantum computations. It is therefore often necessary to translate between MBQC and the predominantly used quantum circuit model in a fast and reliable way. While there are algorithms with linear complexity that extract quantum circuits from measurement patterns using additional ancilla qubits, efficient ancilla-free extraction has been shown to be more costly. We develop strategies to parallelize an existing extraction algorithm based on the ZX-calculus by exploiting the graph structure of measurement patterns and evaluate the performance on patterns obtained from a benchmark set of quantum circuits. Our results suggest that possible parallelization speedups are closely related to the graph structure of a pattern.
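As a hedged illustration of the circuit-extraction workflow this abstract refers to, the sketch below uses the open-source pyzx library (assumed to be installed, with its generate.cliffordT, full_reduce, and extract_circuit API) to simplify a random Clifford+T circuit into a graph-like ZX-diagram, report a few simple structural statistics, and run the standard sequential ancilla-free extraction. It is not the parallelized algorithm described in the paper, only a baseline plus the kind of graph metrics that the reported speedups appear to correlate with.

```python
# Hedged sketch: sequential ZX-calculus circuit extraction plus simple graph
# statistics. Assumes the `pyzx` package is installed; this is NOT the paper's
# parallel extraction algorithm.
import pyzx as zx

# Build a random Clifford+T circuit and turn it into a ZX-diagram.
qubits, depth = 6, 40
g = zx.generate.cliffordT(qubits, depth)

# Simplify to a graph-like diagram (comparable to a measurement pattern).
zx.full_reduce(g)

# Simple structural statistics; the paper suggests such structure correlates
# with how well extraction can be parallelized.
degrees = [len(list(g.neighbors(v))) for v in g.vertices()]
print(f"vertices={g.num_vertices()}, edges={g.num_edges()}, "
      f"max_degree={max(degrees)}")

# Sequential ancilla-free extraction back to a circuit (baseline).
circuit = zx.extract_circuit(g.copy())
print(circuit.stats())
```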
ISBN (digital): 9781665488020
ISBN (print): 9781665488020
Extreme-scale graph analytics is imperative for several real-world Big Data applications whose underlying graph structure contains millions or billions of vertices and edges. Since such huge graphs cannot fit into the memory of a single computer, distributed processing of the graph is required. Several frameworks have been developed for performing graph processing on distributed systems. These frameworks focus primarily on choosing the right computation model and partitioning scheme, under the assumption that such design choices will automatically reduce communication overheads. For any computation model and partitioning scheme, the communication scheme, that is, the data to be communicated and the virtual interconnection network among the nodes, has a significant impact on performance. To analyze this impact, in this work we identify widely used communication schemes and estimate their performance. Analyzing the trade-offs between the number of compute nodes and the communication costs of various schemes on a distributed platform by brute-force experimentation can be prohibitively expensive. Thus, our performance estimation models provide an economical way to perform these analyses given the partitions and the communication scheme as input. We validate our model on a local HPC cluster as well as the cloud-hosted NSF Chameleon cluster. Using our estimates as well as actual measurements, we compare the communication schemes and provide conditions under which one scheme should be preferred over the others.
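To make the idea of a communication-scheme cost model concrete, here is a minimal sketch assuming a toy setting with an edge list and a vertex-to-node partition as input. It compares two illustrative schemes (one message per cut edge versus values batched per node pair); the constants and scheme names are hypothetical, and this is not the paper's estimation model.

```python
# Hedged sketch (not the paper's model): estimate communication volume for two
# illustrative schemes, given a vertex partition and an edge list.
# - "per-edge": one message per cut edge (point-to-point along each edge)
# - "per-pair": cut-edge values batched per (source node, destination node)
#   pair, so each node pair exchanges at most one message per superstep.
from collections import defaultdict

def estimate_comm(edges, part, msg_bytes=8, header_bytes=64):
    per_edge = 0
    batches = defaultdict(int)          # (src_node, dst_node) -> values batched
    for u, v in edges:
        pu, pv = part[u], part[v]
        if pu != pv:                    # only cut edges cost communication
            per_edge += header_bytes + msg_bytes
            batches[(pu, pv)] += 1
    per_pair = sum(header_bytes + n * msg_bytes for n in batches.values())
    return per_edge, per_pair

# Tiny example: 6 vertices split across 2 compute nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 3), (1, 4)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(estimate_comm(edges, part))
```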
ISBN (print): 9798350326598; 9798350326581
Trusted execution environments (TEEs) promise strong security guarantees with hardware extensions for security-sensitive tasks. Due to its numerous benefits, TEE has gained widespread adoption and has been extended from CPU-only TEEs to FPGA and GPU TEE systems. However, existing TEE systems exhibit inadequate and inefficient support for an emerging (and significant) processing unit, the NPU. For instance, commercial TEE systems resort to coarse-grained and static protection approaches for NPUs, resulting in notable performance degradation (10%-20%), limited (or no) multitasking capabilities, and suboptimal resource utilization. In this paper, we present a secure NPU architecture, known as sNPU, which aims to mitigate vulnerabilities inherent to the design of NPU architectures. First, sNPU proposes NPU Guarder to enhance the NPU's access control. Second, sNPU identifies new attack surfaces that leverage in-NPU structures like the scratchpad and NoC, and designs NPU Isolator to guarantee the isolation of scratchpad and NoC routing. Third, our system introduces a trusted software module called NPU Monitor to minimize the software TCB. Our prototype, evaluated on FPGA, demonstrates that sNPU significantly mitigates the runtime costs associated with security checking (from up to 20% to 0%) while incurring less than 1% resource costs.
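The access-control role of a component like NPU Guarder can be pictured with the toy software model below. It is purely illustrative: the region table, owner IDs, and check_access helper are hypothetical, and sNPU enforces such checks in hardware rather than in Python.

```python
# Hedged sketch: a toy software model of the kind of bounds/ownership check a
# hardware NPU access guard might enforce per DMA or scratchpad request.
# All names and the table layout are hypothetical.
from dataclasses import dataclass

@dataclass
class Region:
    base: int
    size: int
    owner: int        # task/enclave id allowed to touch this region

REGIONS = [Region(0x0000, 0x4000, owner=1),   # scratchpad slice for task 1
           Region(0x4000, 0x4000, owner=2)]   # scratchpad slice for task 2

def check_access(task_id: int, addr: int, length: int) -> bool:
    """Return True iff [addr, addr+length) lies inside a region owned by task_id."""
    for r in REGIONS:
        if r.owner == task_id and r.base <= addr and addr + length <= r.base + r.size:
            return True
    return False

print(check_access(1, 0x0100, 256))   # True: inside task 1's slice
print(check_access(1, 0x4100, 256))   # False: task 2's slice, access denied
```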
ISBN (print): 9798350364613; 9798350364606
Hardware accelerators have always been difficult to approach. In recent years, we have seen great efforts to simplify their programming paradigms, especially on CPUs. This led to the development of various domain-specific frameworks and microarchitectural features that ease some aspects of this multifaceted problem. One such feature is the Unified Virtual Memory (UVM) oversubscription mechanism, which allows developers to handle datasets whose memory footprint exceeds the HW accelerator's physical memory. Although promising, current UVM incurs extreme overheads when running large workloads whose oversubscription factor (allocated vs. available memory) exceeds a per-workload threshold. In this work, we propose GrOUT, a language- and domain-agnostic framework that tackles the slowdowns introduced by the UVM oversubscription mechanism. In particular, we highlight how a scale-out approach is a feasible solution to the slowdowns brought by UVM on workloads from various domains. Moreover, we design a framework capable of autonomously scaling out user-provided workloads, reaching a speedup of more than 24.42x with minimal changes to the application logic.
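A minimal sketch of the scale-out reasoning, under assumed (hypothetical) workload and device-memory sizes: it computes the oversubscription factor and how many devices a scale-out run would need so that each partition fits in device memory. It is not GrOUT's actual partitioning logic.

```python
# Hedged sketch (illustrative only): estimate the UVM oversubscription factor
# and the number of devices a scale-out run would need to keep each device's
# working set within its physical memory. Numbers are hypothetical.
import math

def oversubscription_factor(workload_gb: float, device_mem_gb: float) -> float:
    return workload_gb / device_mem_gb

def devices_needed(workload_gb: float, device_mem_gb: float,
                   replication_overhead: float = 1.1) -> int:
    """Devices so that each partition (plus halo/replication overhead) fits in memory."""
    return math.ceil(workload_gb * replication_overhead / device_mem_gb)

workload_gb, device_mem_gb = 96.0, 16.0
print(f"oversubscription: {oversubscription_factor(workload_gb, device_mem_gb):.2f}x")
print(f"devices needed:   {devices_needed(workload_gb, device_mem_gb)}")
```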
ISBN (print): 9781665484855
Timely processing in database systems is imperative to reveal hidden information. The JOIN operation is critical in data analysis, as it occupies almost half of the average execution time in the standard TPC-H benchmark for database processing. In modern databases, transferring data between computing engines and system memory has become one of the main performance challenges. Previous Near Memory Computing (NMC) designs alleviated the costly data transfers; however, they remain inefficient in terms of processing flow and data management. In this paper, we propose FG-SMJ: a highly parallel, fine-grained sort-merge join based on near memory computing. The novel data layout allows us to access data from memory chips with fine-grained chip-level parallelism and exploit the available memory bandwidth. Compared with previous NMC designs, the proposed FG-SMJ attains a 3.08x speedup.
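For readers unfamiliar with the join variant being accelerated, below is a plain software sort-merge join, shown only to illustrate the algorithm; it does not model FG-SMJ's chip-level parallel data layout or near-memory execution.

```python
# Hedged sketch: a textbook sort-merge join over (key, payload) tuples.
def sort_merge_join(left, right):
    """Join two lists of (key, payload) tuples on key; returns matched pairs."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Equal keys: emit the cross product of the two matching runs.
            run_start = j
            while i < len(left) and left[i][0] == lk:
                j = run_start
                while j < len(right) and right[j][0] == lk:
                    out.append((left[i], right[j]))
                    j += 1
                i += 1
    return out

print(sort_merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")]))
```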
ISBN (print): 9798400701559
In the exascale computing era, optimizing MPI collective performance in high-performance computing (HPC) applications is critical. Current algorithms face performance degradation due to system call overhead, page faults, or data-copy latency, affecting HPC applications' efficiency and scalability. To address these issues, we propose PiP-MColl, a Process-in-Process-based Multi-object Interprocess MPI Collective design that maximizes small-message MPI collective performance at scale. PiP-MColl features efficient multiple-sender and multiple-receiver collective algorithms and leverages Process-in-Process shared memory techniques to eliminate unnecessary system calls, page fault overhead, and extra data copies, improving intra- and inter-node message rates and throughput. Our design also boosts performance for larger messages, resulting in comprehensive improvements across various message sizes. Experimental results show that PiP-MColl outperforms popular MPI libraries, including OpenMPI, MVAPICH2, and Intel MPI, by up to 4.6X for MPI collectives like MPI_Scatter and MPI_Allgather.
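The following is a hedged baseline microbenchmark of the kind PiP-MColl is compared against: a small-message MPI_Allgather loop written with mpi4py on a stock MPI library. The message size and iteration count are arbitrary, and the code does not use Process-in-Process techniques itself.

```python
# Hedged sketch: small-message MPI_Allgather latency with mpi4py + stock MPI.
# Run with e.g.:  mpiexec -n 4 python allgather_bench.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

msg_bytes = 8                                   # small-message regime
iters = 10_000
send = np.full(msg_bytes, rank, dtype=np.uint8)
recv = np.empty(msg_bytes * size, dtype=np.uint8)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    comm.Allgather(send, recv)
t1 = MPI.Wtime()

if rank == 0:
    print(f"Allgather({msg_bytes} B) avg latency: {(t1 - t0) / iters * 1e6:.2f} us")
```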
The rise of deep learning methods has ignited interest in efficient hardware and software systems for tensor-based computing. A question worth investigating is whether other areas in computing can benefit as well from...
ISBN (print): 9781665481069
As the number of edge devices with computing resources (e.g., embedded GPUs, mobile phones, and laptops) increases, recent studies demonstrate that it can be beneficial to collaboratively run convolutional neural network (CNN) inference on more than one edge device. However, these studies make strong assumptions about the devices' conditions, and their application is far from practical. In this work, we propose a general method, called DistrEdge, to provide CNN inference distribution strategies in environments with multiple IoT edge devices. By addressing heterogeneity in devices, network conditions, and the nonlinear characteristics of CNN computation, DistrEdge adapts to a wide range of cases (e.g., different network conditions, various device types) using deep reinforcement learning. We utilize the latest embedded AI computing devices (e.g., NVIDIA Jetson products) to construct heterogeneous device configurations in our experiments. Based on our evaluations, DistrEdge can properly adjust the distribution strategy according to the devices' computing characteristics and the network conditions, achieving a 1.1x to 3x speedup compared to state-of-the-art methods.
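A toy latency model helps illustrate the compute/transfer trade-off that a distribution strategy must balance; the sketch below brute-forces a single split point between two heterogeneous devices. All per-layer FLOP counts, device speeds, and link bandwidths are hypothetical, and this is not DistrEdge's reinforcement-learning approach.

```python
# Hedged sketch: brute-force layer-split latency model for two heterogeneous
# edge devices. Numbers are hypothetical.
layer_gflops = [0.2, 0.8, 1.6, 1.6, 0.4]        # per-layer compute cost
layer_out_mb = [6.0, 3.0, 1.5, 0.8, 0.1]        # activation size after each layer
dev_gflops_s = (5.0, 20.0)                       # device A (slow), device B (fast)
link_mb_s = 40.0                                 # network bandwidth between them

def pipeline_latency(split):
    """Device A runs layers [0, split); device B runs the rest."""
    t_a = sum(layer_gflops[:split]) / dev_gflops_s[0]
    t_b = sum(layer_gflops[split:]) / dev_gflops_s[1]
    t_net = (layer_out_mb[split - 1] / link_mb_s) if 0 < split < len(layer_gflops) else 0.0
    return t_a + t_net + t_b

best = min(range(len(layer_gflops) + 1), key=pipeline_latency)
print(f"best split after layer {best}, latency {pipeline_latency(best) * 1e3:.1f} ms")
```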
ISBN (print): 9781665499248
Privacy-preserving data clustering is a useful method for extracting intrinsic cluster structures from distributed databases while preserving personal privacy. In previous research, a model for performing Fuzzy c-Lines clustering was proposed, in which a privacy-preserving scheme for k-means-type models was adopted using cryptographic calculation. This paper further improves the model to handle incomplete data while ignoring the influence of missing values. The element-wise clustering criterion makes it possible to derive local principal component vectors in each data source by minimizing a low-rank approximation error over the observed elements only. Then, fuzzy memberships of each object are calculated in a collaborative manner among organizations, where partial distances between objects and prototypes are derived within a cryptographic framework so that intra-organization information is kept secret. The characteristic features of the proposed method are demonstrated through numerical experiments.
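The partial-distance idea for incomplete data can be sketched in the clear as follows; the rescaling rule, membership update, and point prototypes are simplifications (the paper clusters lines and performs the corresponding computations under a cryptographic scheme so that each organization's data stays private).

```python
# Hedged sketch: "partial distance" handling of missing values, computed in the
# clear for illustration only.
import numpy as np

def partial_distance(x, prototype):
    """Squared distance over observed entries only, rescaled by the observed fraction."""
    mask = ~np.isnan(x)
    if mask.sum() == 0:
        return np.inf
    d = np.sum((x[mask] - prototype[mask]) ** 2)
    return d * x.size / mask.sum()          # rescale to the full dimensionality

def fuzzy_memberships(x, prototypes, m=2.0):
    """Standard fuzzy c-means membership update built from the partial distances."""
    d = np.array([partial_distance(x, p) for p in prototypes]) + 1e-12
    w = d ** (-1.0 / (m - 1.0))
    return w / w.sum()

x = np.array([1.0, np.nan, 2.0])             # one incomplete observation
prototypes = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 2.0]])
print(fuzzy_memberships(x, prototypes))
```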
Emerging technologies such as cloud computing and artificial intelligence raise significant concerns about data security and privacy. Homomorphic encryption (HE) is a promising technique that enables computation...