ISBN:
(print) 9798400703836
We prove a tight parallel repetition theorem for 3-message computationally secure quantum interactive protocols between an efficient challenger and an efficient adversary. We also prove, under plausible assumptions, that the security of 4-message computationally secure protocols does not generally decrease under parallel repetition. These mirror the classical results of Bellare, Impagliazzo, and Naor. Finally, we prove that all quantum argument systems can be generically compiled to an equivalent 3-message argument system, mirroring the transformation for quantum proof systems. As immediate applications, we show how to derive hardness amplification theorems for quantum bit commitment schemes (answering a question of Yan), EFI pairs (answering a question of Brakerski, Canetti, and Qian), public-key quantum money schemes (answering a question of Aaronson and Christiano), and quantum zero-knowledge argument systems. We also derive an XOR lemma for quantum predicates as a corollary.
Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this article, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using discrete Fourier transform preconditioning on a Krylov method to achieve a direct solver that is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10^7 cells for around 10^5 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3x to 8.5x faster for aperiodic stencil problems.
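The core idea of evolving many timesteps at once with FFTs can be illustrated with a minimal sketch (this is not the paper's algorithm, only the underlying observation for the periodic 1-D case): T steps of a linear stencil equal one circular convolution whose Fourier symbol is the per-step symbol raised to the T-th power.

```python
import numpy as np

def evolve_periodic_stencil(u0, coeffs, T):
    """Apply T steps of u[i] <- sum_off coeffs[off] * u[i+off] (periodic)
    in one shot. In Fourier space each step multiplies every mode by the
    stencil's symbol, so T steps multiply by symbol**T: O(N log N) total
    instead of O(N*T)."""
    n = len(u0)
    kernel = np.zeros(n)
    for off, c in coeffs.items():
        # Placing c at index (-off) % n makes circular convolution with
        # this kernel compute sum_off c * u[(i + off) % n].
        kernel[(-off) % n] = c
    symbol = np.fft.fft(kernel)          # per-mode, per-step multiplier
    return np.fft.ifft(np.fft.fft(u0) * symbol**T).real

# Hypothetical example: a 3-point averaging stencil on an 8-cell ring.
u = np.zeros(8)
u[0] = 1.0
fast = evolve_periodic_stencil(u, {-1: 0.25, 0: 0.5, 1: 0.25}, T=100)

slow = u.copy()                          # naive step-by-step reference
for _ in range(100):
    slow = 0.25 * np.roll(slow, 1) + 0.5 * slow + 0.25 * np.roll(slow, -1)
assert np.allclose(fast, slow)
```

The stencil coefficients and grid size here are arbitrary illustrations; the paper's algorithms additionally handle aperiodic boundaries and multidimensional grids.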
ISBN:
(print) 9781450395458
We present comparison-based parallel algorithms for sorting n comparable items subject to comparison errors. We consider errors that occur according to a well-studied framework, where the comparison of two elements returns the wrong answer with a fixed probability. In the persistent model, the comparison of two given elements, x and y, always returns the same result, independent of all other pairs of elements. In the non-persistent model, the result of the comparison of each pair of elements, x and y, is independent of all prior comparisons, including those for x and y. It is not possible to always correctly sort a given input set in the persistent model, so we study algorithms that achieve a small maximum dislocation and small total dislocation of the elements in the output permutation. In this paper, we provide parallel algorithms for sorting with comparison errors in the persistent and non-persistent models. Our algorithms are asymptotically optimal in terms of their span, work, and, in the case of persistent errors, maximum and total dislocation. The main results are algorithms for the binary-forking parallel model with atomics, but we also provide algorithms for the CREW PRAM model. Our algorithms include a number of novel techniques and analysis tools, including a PRAM-to-binary-forking-model simulation result, and are the first optimal parallel algorithms for the persistent model and the non-persistent model in the binary-forking parallel model with atomics. In particular, our algorithms have O(log n) span, O(n log n) work, and, in the case of the persistent model, O(log n) maximum dislocation and O(n) total dislocation, with high probability. We achieve similar results for the CREW PRAM model, which are the first optimal methods for the persistent model and the first optimal results for the non-persistent model with reasonable constant factors in the performance bounds.
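The persistent error model and the dislocation measures can be made concrete with a small sequential sketch (an illustration of the problem setup only, not the paper's parallel algorithms; the error rate and input size are arbitrary):

```python
import functools
import random

def dislocations(perm):
    """Max and total dislocation of a permutation of 0..n-1: how far each
    element sits from its position in the correctly sorted order."""
    d = [abs(pos - val) for pos, val in enumerate(perm)]
    return max(d), sum(d)

def persistent_noisy_cmp(p, seed=0):
    """Persistent error model: each unordered pair {x, y} gets a fixed
    outcome (wrong with probability p) the first time it is compared,
    and every later comparison of x and y returns that same outcome."""
    rng = random.Random(seed)
    flipped = {}
    def cmp(x, y):
        if x == y:
            return 0
        pair = (min(x, y), max(x, y))
        if pair not in flipped:
            flipped[pair] = rng.random() < p
        truth = -1 if x < y else 1
        return -truth if flipped[pair] else truth
    return cmp

items = list(range(200))
random.Random(1).shuffle(items)
# Feeding the noisy comparator to an ordinary sort yields an output
# permutation whose quality we judge by its dislocation; the algorithms
# in the paper guarantee O(log n) max and O(n) total dislocation whp.
out = sorted(items, key=functools.cmp_to_key(persistent_noisy_cmp(0.05)))
max_d, total_d = dislocations(out)
assert sorted(out) == list(range(200))   # output is still a permutation
```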
ISBN:
(print) 9798350326598; 9798350326581
The growing interest in novel dataflow architectures and streaming execution paradigms has created the need for a simulator optimized for modeling dataflow systems. To fill this need, we present three new techniques that make it feasible to simulate complex systems consisting of thousands of components. First, we introduce an interface based on Communicating Sequential Processes which allows users to simultaneously describe functional and timing characteristics. Second, we introduce a scalable point-to-point synchronization scheme that avoids global synchronization. Finally, we demonstrate a technique to exploit slack in the simulated system, such as FIFOs, to increase simulation parallelism. We implement these techniques in the Dataflow Abstract Machine (DAM), a parallel simulator framework for dataflow systems. We demonstrate the benefits of using DAM by highlighting three case studies using the framework. First, we use DAM directly as an exploration tool for streaming algorithms on dataflow hardware. We simulate two different implementations of the attention algorithm used in large language models, and use DAM to show that the second implementation only requires a constant amount of local memory. Second, we re-implement a simulator for a sparse tensor algebra accelerator, resulting in 57% less code and a simulation speedup of up to four orders of magnitude. Finally, we demonstrate a general technique for time-multiplexing real hardware to simulate multiple virtual copies of the hardware using DAM.
ISBN:
(print) 9798400704352
This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed, dynamic, ordered set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared with cache-optimized trees, PMAs were slower to update but faster to scan. The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3x faster batch-insert throughput and 4x faster range-query throughput compared with compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees. We further evaluate the CPMA compared with compressed PaC-trees and Aspen, a state-of-the-art system, on a real-world application of dynamic-graph processing. The CPMA is on average 1.2x faster on a suite of graph algorithms and 2x faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3x faster on graph algorithms and 2x faster on batch inserts compared with Aspen.
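A standard way to compress sorted runs like those a PMA stores is delta encoding with variable-length byte codes. The sketch below illustrates that general idea only; it is not the CPMA's actual encoding, and the key values are arbitrary:

```python
def encode_sorted(keys):
    """Delta-compress a sorted list of non-negative ints: store each gap
    from the previous key as a little-endian base-128 varint (7 payload
    bits per byte; the high bit marks continuation). Dense sorted runs
    shrink far below a fixed 8 bytes per key."""
    out = bytearray()
    prev = 0
    for k in keys:
        gap = k - prev
        prev = k
        while True:
            byte = gap & 0x7F
            gap >>= 7
            if gap:
                out.append(byte | 0x80)   # more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def decode_sorted(buf):
    """Invert encode_sorted by summing the decoded gaps."""
    keys, cur, shift, acc = [], 0, 0, 0
    for byte in buf:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += acc
            keys.append(cur)
            acc = shift = 0
    return keys

keys = [3, 7, 8, 150, 151, 4000]
blob = encode_sorted(keys)
assert decode_sorted(blob) == keys
assert len(blob) < 8 * len(keys)   # much smaller than fixed-width ints
```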
ISBN:
(print) 9781450395458
We consider the classic k-center problem in a parallel setting, on the low-local-space Massively Parallel Computation (MPC) model, with local space per machine of O(n^δ), where δ ∈ (0, 1) is an arbitrary constant. As a central clustering problem, the k-center problem has been studied extensively. Still, until very recently, all parallel MPC algorithms required Ω(k) or even Ω(kn^δ) local space per machine. While this setting covers the case of small values of k, for a large number of clusters these algorithms require large local memory, making them poorly scalable. The case of large k, k ≥ Ω(n^δ), was considered recently for the low-local-space MPC model by Bateni et al. (2021), who gave an O(log log n)-round MPC algorithm that produces k(1 + o(1)) centers whose cost has a multiplicative approximation of O(log log log n). In this paper we extend the algorithm of Bateni et al. and design a low-local-space MPC algorithm that in O(log log n) rounds returns a clustering with k(1 + o(1)) clusters that is an O(log* n)-approximation for k-center.
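For reference, the classic sequential baseline for k-center is Gonzalez's farthest-first traversal, a 2-approximation. The sketch below shows that baseline only (the MPC algorithm in the abstract is a different, distributed construction; the points and metric are arbitrary illustrations):

```python
def greedy_k_center(points, k, dist):
    """Gonzalez's farthest-first traversal: repeatedly open a center at
    the point farthest from all centers opened so far. A classic
    sequential 2-approximation for the k-center objective."""
    centers = [points[0]]
    d = [dist(p, centers[0]) for p in points]   # distance to nearest center
    while len(centers) < k:
        i = max(range(len(points)), key=d.__getitem__)
        centers.append(points[i])
        d = [min(dj, dist(pj, points[i])) for dj, pj in zip(d, points)]
    return centers, max(d)   # max(d) is the clustering radius

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(0, 0), (10, 0), (0, 10), (1, 1), (9, 1)]
centers, radius = greedy_k_center(pts, 2, euclid)
assert radius == 10.0   # (0, 10) sits 10 away from its nearest center
```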
ISBN:
(print) 9798400714436
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Testing (BGT) is the state-of-the-art approach, which integrates prior risk information into a Bayesian Boolean Lattice framework to minimize test counts and reduce false classifications. However, BGT, like other existing group testing techniques, struggles with multinomial group testing, where samples have multiple binary-classifiable attributes that can be individually distinguished simultaneously. We address this need by proposing Bayesian Multinomial Group Testing (BMGT), which includes a new Bayesian-based model and supporting theorems for an efficient and precise multinomial pooling strategy. We further design and develop SBMGT, a high-performance and scalable framework to tackle BMGT's computational challenges by proposing three key innovations: 1) a parallel binary-encoded product lattice model with up to 99.8% efficiency; 2) the Bayesian Balanced Partitioning Algorithm (BBPA), a multinomial pooling strategy optimized for parallel computation with up to 97.7% scaling efficiency on 4096 cores; and 3) a scalable multinomial group testing analytics framework, demonstrated in a real-world disease surveillance case study using AIDS and STDs datasets from Uganda, where SBMGT reduced tests by up to 54% and lowered false classification rates by 92% compared to BGT.
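The basic pooling idea underlying all of this can be seen in the textbook adaptive binary-splitting scheme sketched below; it is only the classical baseline that Bayesian approaches like BGT/BMGT refine with prior risk information, and the sample IDs and infection set are arbitrary:

```python
def group_test(samples, is_positive):
    """Classic adaptive binary splitting: test a pool; if it tests
    positive, split it in half and recurse on both halves. Returns the
    identified positive samples and the number of pooled tests used."""
    tests = 0
    positives = []
    stack = [list(samples)]
    while stack:
        pool = stack.pop()
        tests += 1
        if is_positive(pool):
            if len(pool) == 1:
                positives.append(pool[0])
            else:
                mid = len(pool) // 2
                stack.append(pool[:mid])
                stack.append(pool[mid:])
    return sorted(positives), tests

# Hypothetical screening of 100 samples with two true positives.
infected = {13, 57}
found, n_tests = group_test(range(100),
                            lambda pool: any(s in infected for s in pool))
assert found == [13, 57]
assert n_tests < 100   # far fewer tests than testing every sample
```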
We design a parallel algorithm for the Constrained Shortest Path (CSP) problem. The CSP problem is known to be NP-hard and there exists a pseudo-polynomial time sequential algorithm that solves it. To design the paral...
ISBN:
(print) 9781450395458
The paper presents parallel algorithms for multiplying implicit simple unit-Monge matrices (Krusche and Tiskin, PPAM 2009) of size n × n in the EREW PRAM model. We show that implicit simple unit-Monge matrix multiplication of size n × n can be achieved by a deterministic EREW PRAM algorithm with O(n log n log log n) total work and O(log^3 n) span. This implies that there is a deterministic EREW PRAM algorithm solving the longest increasing subsequence (LIS) problem in O(n log^2 n log log n) work and O(log^4 n) span. Furthermore, with randomization and bitwise operations, implicitly multiplying two simple unit-Monge matrices can be improved to O(n log n) work and O(log^3 n) span, which leads to a randomized EREW PRAM algorithm obtaining LIS in O(n log^2 n) work and O(log^4 n) span with high probability. In the regime where the LIS has length k = Ω(log^3 n), our results improve the span from Õ(n^(2/3)) (Krusche and Tiskin, SPAA 2010) and O(k log n) (Gu, Men, Shen, Sun, and Wan, SPAA 2023) to O(log^4 n) while the total work remains near optimal at Õ(n).
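For context, the standard sequential LIS baseline that these PRAM algorithms parallelize is patience sorting, which runs in O(n log n) work; a minimal sketch (not the Monge-matrix construction from the paper):

```python
import bisect

def lis_length(seq):
    """Patience-sorting LIS: tails[k] holds the smallest possible tail
    value of a strictly increasing subsequence of length k+1, so the
    final length of tails is the LIS length. O(n log n) sequential work."""
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)   # strict increase: replace ties
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

assert lis_length([3, 1, 4, 1, 5, 9, 2, 6]) == 4   # e.g. 1, 4, 5, 6
assert lis_length(list(range(10))) == 10
assert lis_length([5, 4, 3]) == 1
```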
ISBN:
(print) 9798350326598; 9798350326581
Various near-data processing (NDP) designs have been proposed to alleviate the memory wall challenge for data-intensive applications. Among them, near-DRAM-bank NDP architectures, by incorporating logic near each DRAM bank, promise the highest efficiency and are already commercially available. However, due to physical isolation, fast and direct cross-bank communication is impossible in these architectures, limiting their usage to only simple parallel patterns. Applications may also suffer from severe load imbalance if each bank contains data with diverse computation loads. We thus propose NDPBridge, a novel hardware-software co-design that enables cross-bank communication and dynamic load balancing for near-bank NDP systems. We introduce hardware bridges along the DRAM hierarchy to coordinate message transfers among banks. The hardware changes are constrained and do not disrupt the existing DDR links and protocols. We further enable hierarchical and data-transfer-aware load balancing, built upon the above hardware communication path and a task-based programming model. The data transfer overheads are minimized with several novel optimizations to hide latency, avoid congestion, and reduce traffic. Our evaluation shows that NDPBridge significantly outperforms existing NDP designs by 2.23x to 2.98x on average.