ISBN:
(print) 9798400703836
We prove a tight parallel repetition theorem for 3-message computationally secure quantum interactive protocols between an efficient challenger and an efficient adversary. We also prove, under plausible assumptions, that the security of 4-message computationally secure protocols does not generally decrease under parallel repetition. These mirror the classical results of Bellare, Impagliazzo, and Naor. Finally, we prove that all quantum argument systems can be generically compiled to an equivalent 3-message argument system, mirroring the transformation for quantum proof systems. As immediate applications, we show how to derive hardness amplification theorems for quantum bit commitment schemes (answering a question of Yan), EFI pairs (answering a question of Brakerski, Canetti, and Qian), public-key quantum money schemes (answering a question of Aaronson and Christiano), and quantum zero-knowledge argument systems. We also derive an XOR lemma for quantum predicates as a corollary.
Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this article, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using discrete Fourier transform preconditioning on a Krylov method to achieve a direct solver that is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10^7 cells for around 10^5 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3x to 8.5x faster for aperiodic stencil problems.
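The core idea of evolving many timesteps at once with FFTs can be illustrated with a minimal sketch (this is not the paper's algorithm, only the underlying observation for the periodic 1-D case): T steps of a linear stencil equal one circular convolution whose Fourier symbol is the per-step symbol raised to the T-th power.

```python
import numpy as np

def evolve_periodic_stencil(u0, coeffs, T):
    """Apply T steps of u[i] <- sum_off coeffs[off] * u[i+off] (periodic)
    in one shot. In Fourier space each step multiplies every mode by the
    stencil's symbol, so T steps multiply by symbol**T: O(N log N) total
    instead of O(N*T)."""
    n = len(u0)
    kernel = np.zeros(n)
    for off, c in coeffs.items():
        # Placing c at index (-off) % n makes circular convolution with
        # this kernel compute sum_off c * u[(i + off) % n].
        kernel[(-off) % n] = c
    symbol = np.fft.fft(kernel)          # per-mode, per-step multiplier
    return np.fft.ifft(np.fft.fft(u0) * symbol**T).real

# Hypothetical example: a 3-point averaging stencil on an 8-cell ring.
u = np.zeros(8)
u[0] = 1.0
fast = evolve_periodic_stencil(u, {-1: 0.25, 0: 0.5, 1: 0.25}, T=100)

slow = u.copy()                          # naive step-by-step reference
for _ in range(100):
    slow = 0.25 * np.roll(slow, 1) + 0.5 * slow + 0.25 * np.roll(slow, -1)
assert np.allclose(fast, slow)
```

The stencil coefficients and grid size here are arbitrary illustrations; the paper's algorithms additionally handle aperiodic boundaries and multidimensional grids.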
ISBN:
(print) 9781450395458
We present comparison-based parallel algorithms for sorting n comparable items subject to comparison errors. We consider errors that occur according to a well-studied framework, where the comparison of two elements returns the wrong answer with a fixed probability. In the persistent model, the comparison of two given elements, x and y, always returns the same result, independent of all other pairs of elements. In the non-persistent model, the result of the comparison of each pair of elements, x and y, is independent of all prior comparisons, including those for x and y. It is not possible to always correctly sort a given input set in the persistent model, so we study algorithms that achieve a small maximum dislocation and small total dislocation of the elements in the output permutation. In this paper, we provide parallel algorithms for sorting with comparison errors in the persistent and non-persistent models. Our algorithms are asymptotically optimal in terms of their span, work, and, in the case of persistent errors, maximum and total dislocation. The main results are algorithms for the binary-forking parallel model with atomics, but we also provide algorithms for the CREW PRAM model. Our algorithms include a number of novel techniques and analysis tools, including a PRAM-to-binary-forking-model simulation result, and are the first optimal parallel algorithms for the persistent model and the non-persistent model in the binary-forking parallel model with atomics. In particular, our algorithms have O(log n) span, O(n log n) work, and, in the case of the persistent model, O(log n) maximum dislocation and O(n) total dislocation, with high probability. We achieve similar results for the CREW PRAM model, which are the first optimal methods for the persistent model and the first optimal results for the non-persistent model with reasonable constant factors in the performance bounds.
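The persistent error model and the dislocation measures can be made concrete with a small sequential sketch (an illustration of the problem setup only, not the paper's parallel algorithms; the error rate and input size are arbitrary):

```python
import functools
import random

def dislocations(perm):
    """Max and total dislocation of a permutation of 0..n-1: how far each
    element sits from its position in the correctly sorted order."""
    d = [abs(pos - val) for pos, val in enumerate(perm)]
    return max(d), sum(d)

def persistent_noisy_cmp(p, seed=0):
    """Persistent error model: each unordered pair {x, y} gets a fixed
    outcome (wrong with probability p) the first time it is compared,
    and every later comparison of x and y returns that same outcome."""
    rng = random.Random(seed)
    flipped = {}
    def cmp(x, y):
        if x == y:
            return 0
        pair = (min(x, y), max(x, y))
        if pair not in flipped:
            flipped[pair] = rng.random() < p
        truth = -1 if x < y else 1
        return -truth if flipped[pair] else truth
    return cmp

items = list(range(200))
random.Random(1).shuffle(items)
# Feeding the noisy comparator to an ordinary sort yields an output
# permutation whose quality we judge by its dislocation; the algorithms
# in the paper guarantee O(log n) max and O(n) total dislocation whp.
out = sorted(items, key=functools.cmp_to_key(persistent_noisy_cmp(0.05)))
max_d, total_d = dislocations(out)
assert sorted(out) == list(range(200))   # output is still a permutation
```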
ISBN:
(print) 9798350326598; 9798350326581
The growing interest in novel dataflow architectures and streaming execution paradigms has created the need for a simulator optimized for modeling dataflow systems. To fill this need, we present three new techniques that make it feasible to simulate complex systems consisting of thousands of components. First, we introduce an interface based on Communicating Sequential Processes which allows users to simultaneously describe functional and timing characteristics. Second, we introduce a scalable point-to-point synchronization scheme that avoids global synchronization. Finally, we demonstrate a technique to exploit slack in the simulated system, such as FIFOs, to increase simulation parallelism. We implement these techniques in the Dataflow Abstract Machine (DAM), a parallel simulator framework for dataflow systems. We demonstrate the benefits of using DAM by highlighting three case studies using the framework. First, we use DAM directly as an exploration tool for streaming algorithms on dataflow hardware. We simulate two different implementations of the attention algorithm used in large language models, and use DAM to show that the second implementation only requires a constant amount of local memory. Second, we re-implement a simulator for a sparse tensor algebra accelerator, resulting in 57% less code and a simulation speedup of up to four orders of magnitude. Finally, we demonstrate a general technique for time-multiplexing real hardware to simulate multiple virtual copies of the hardware using DAM.
ISBN:
(print) 9798400704352
This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed, dynamic, ordered set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared with cache-optimized trees, PMAs were slower to update but faster to scan. The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3x faster batch-insert throughput and 4x faster range-query throughput compared with compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees. We further evaluate the CPMA compared with compressed PaC-trees and Aspen, a state-of-the-art system, on a real-world application of dynamic-graph processing. The CPMA is on average 1.2x faster on a suite of graph algorithms and 2x faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3x faster on graph algorithms and 2x faster on batch inserts compared with Aspen.
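A standard way to compress sorted runs like those a PMA stores is delta encoding with variable-length byte codes. The sketch below illustrates that general idea only; it is not the CPMA's actual encoding, and the key values are arbitrary:

```python
def encode_sorted(keys):
    """Delta-compress a sorted list of non-negative ints: store each gap
    from the previous key as a little-endian base-128 varint (7 payload
    bits per byte; the high bit marks continuation). Dense sorted runs
    shrink far below a fixed 8 bytes per key."""
    out = bytearray()
    prev = 0
    for k in keys:
        gap = k - prev
        prev = k
        while True:
            byte = gap & 0x7F
            gap >>= 7
            if gap:
                out.append(byte | 0x80)   # more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def decode_sorted(buf):
    """Invert encode_sorted by summing the decoded gaps."""
    keys, cur, shift, acc = [], 0, 0, 0
    for byte in buf:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += acc
            keys.append(cur)
            acc = shift = 0
    return keys

keys = [3, 7, 8, 150, 151, 4000]
blob = encode_sorted(keys)
assert decode_sorted(blob) == keys
assert len(blob) < 8 * len(keys)   # much smaller than fixed-width ints
```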
ISBN:
(print) 9781450395458
We consider the classic k-center problem in a parallel setting, on the low-local-space Massively Parallel Computation (MPC) model, with local space per machine of O(n^δ), where δ ∈ (0, 1) is an arbitrary constant. As a central clustering problem, the k-center problem has been studied extensively. Still, until very recently, all parallel MPC algorithms required Ω(k) or even Ω(kn^δ) local space per machine. While this setting covers the case of small values of k, for a large number of clusters these algorithms require large local memory, making them poorly scalable. The case of large k, k ≥ Ω(n^δ), was considered recently for the low-local-space MPC model by Bateni et al. (2021), who gave an O(log log n)-round MPC algorithm that produces k(1 + o(1)) centers whose cost has a multiplicative approximation of O(log log log n). In this paper we extend the algorithm of Bateni et al. and design a low-local-space MPC algorithm that in O(log log n) rounds returns a clustering with k(1 + o(1)) clusters that is an O(log* n)-approximation for k-center.
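For reference, the classic sequential baseline for k-center is Gonzalez's farthest-first traversal, a 2-approximation. The sketch below shows that baseline only (the MPC algorithm in the abstract is a different, distributed construction; the points and metric are arbitrary illustrations):

```python
def greedy_k_center(points, k, dist):
    """Gonzalez's farthest-first traversal: repeatedly open a center at
    the point farthest from all centers opened so far. A classic
    sequential 2-approximation for the k-center objective."""
    centers = [points[0]]
    d = [dist(p, centers[0]) for p in points]   # distance to nearest center
    while len(centers) < k:
        i = max(range(len(points)), key=d.__getitem__)
        centers.append(points[i])
        d = [min(dj, dist(pj, points[i])) for dj, pj in zip(d, points)]
    return centers, max(d)   # max(d) is the clustering radius

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(0, 0), (10, 0), (0, 10), (1, 1), (9, 1)]
centers, radius = greedy_k_center(pts, 2, euclid)
assert radius == 10.0   # (0, 10) sits 10 away from its nearest center
```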
ISBN:
(print) 9798400714436
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Testing (BGT) is the state-of-the-art approach, which integrates prior risk information into a Bayesian Boolean Lattice framework to minimize test counts and reduce false classifications. However, BGT, like other existing group testing techniques, struggles with multinomial group testing, where samples have multiple binary-classifiable attributes that can be individually distinguished simultaneously. We address this need by proposing Bayesian Multinomial Group Testing (BMGT), which includes a new Bayesian-based model and supporting theorems for an efficient and precise multinomial pooling strategy. We further design and develop SBMGT, a high-performance and scalable framework to tackle BMGT's computational challenges by proposing three key innovations: 1) a parallel binary-encoded product lattice model with up to 99.8% efficiency; 2) the Bayesian Balanced Partitioning Algorithm (BBPA), a multinomial pooling strategy optimized for parallel computation with up to 97.7% scaling efficiency on 4096 cores; and 3) a scalable multinomial group testing analytics framework, demonstrated in a real-world disease surveillance case study using AIDS and STDs datasets from Uganda, where SBMGT reduced tests by up to 54% and lowered false classification rates by 92% compared to BGT.
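The basic pooling idea underlying all of this can be seen in the textbook adaptive binary-splitting scheme sketched below; it is only the classical baseline that Bayesian approaches like BGT/BMGT refine with prior risk information, and the sample IDs and infection set are arbitrary:

```python
def group_test(samples, is_positive):
    """Classic adaptive binary splitting: test a pool; if it tests
    positive, split it in half and recurse on both halves. Returns the
    identified positive samples and the number of pooled tests used."""
    tests = 0
    positives = []
    stack = [list(samples)]
    while stack:
        pool = stack.pop()
        tests += 1
        if is_positive(pool):
            if len(pool) == 1:
                positives.append(pool[0])
            else:
                mid = len(pool) // 2
                stack.append(pool[:mid])
                stack.append(pool[mid:])
    return sorted(positives), tests

# Hypothetical screening of 100 samples with two true positives.
infected = {13, 57}
found, n_tests = group_test(range(100),
                            lambda pool: any(s in infected for s in pool))
assert found == [13, 57]
assert n_tests < 100   # far fewer tests than testing every sample
```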
We design a parallel algorithm for the Constrained Shortest Path (CSP) problem. The CSP problem is known to be NP-hard and there exists a pseudo-polynomial time sequential algorithm that solves it. To design the paral...
ISBN:
(print) 9781450395458
The paper presents parallel algorithms for multiplying implicit simple unit-Monge matrices (Krusche and Tiskin, PPAM 2009) of size n × n in the EREW PRAM model. We show that implicit simple unit-Monge matrix multiplication of size n × n can be achieved by a deterministic EREW PRAM algorithm with O(n log n log log n) total work and O(log^3 n) span. This implies that there is a deterministic EREW PRAM algorithm solving the longest increasing subsequence (LIS) problem in O(n log^2 n log log n) work and O(log^4 n) span. Furthermore, with randomization and bitwise operations, implicitly multiplying two simple unit-Monge matrices can be improved to O(n log n) work and O(log^3 n) span, which leads to a randomized EREW PRAM algorithm obtaining LIS in O(n log^2 n) work and O(log^4 n) span with high probability. In the regime where the LIS has length k = Ω(log^3 n), our results improve the span from Õ(n^(2/3)) (Krusche and Tiskin, SPAA 2010) and O(k log n) (Gu, Men, Shen, Sun, and Wan, SPAA 2023) to O(log^4 n) while the total work remains near optimal at Õ(n).
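For context, the standard sequential LIS baseline that these PRAM algorithms parallelize is patience sorting, which runs in O(n log n) work; a minimal sketch (not the Monge-matrix construction from the paper):

```python
import bisect

def lis_length(seq):
    """Patience-sorting LIS: tails[k] holds the smallest possible tail
    value of a strictly increasing subsequence of length k+1, so the
    final length of tails is the LIS length. O(n log n) sequential work."""
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)   # strict increase: replace ties
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

assert lis_length([3, 1, 4, 1, 5, 9, 2, 6]) == 4   # e.g. 1, 4, 5, 6
assert lis_length(list(range(10))) == 10
assert lis_length([5, 4, 3]) == 1
```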
ISBN:
(print) 9798350326598; 9798350326581
Various near-data processing (NDP) designs have been proposed to alleviate the memory wall challenge for data-intensive applications. Among them, near-DRAM-bank NDP architectures, by incorporating logic near each DRAM bank, promise the highest efficiency and are already commercially available. However, due to physical isolation, fast and direct cross-bank communication is impossible in these architectures, limiting their usage to only simple parallel patterns. Applications may also suffer from severe load imbalance if each bank contains data with diverse computation loads. We thus propose NDPBridge, a novel hardware-software co-design that enables cross-bank communication and dynamic load balancing for near-bank NDP systems. We introduce hardware bridges along the DRAM hierarchy to coordinate message transfers among banks. The hardware changes are constrained and do not disrupt the existing DDR links and protocols. We further enable hierarchical and data-transfer-aware load balancing, built upon the above hardware communication path and a task-based programming model. The data transfer overheads are minimized with several novel optimizations to hide latency, avoid congestion, and reduce traffic. Our evaluation shows that NDPBridge significantly outperforms existing NDP designs by 2.23x to 2.98x on average.