Search Results - Inner Mongolia University Library

30th Symposium on Principles and Practice of Parallel Programming

Authors: Borsotti, Angelo; Breveglieri, Luca; Morzenti, Angelo; Reghizzi, Stefano Crespi. Affiliations: Politecn Milan, Milan, Italy; CNR IEIIT, Milan, Italy

ISBN: (print) 9798400714436

Speculative data-parallel algorithms for language recognition have been widely experimented with for various types of finite-state automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived from regular expressions (RE). Such an algorithm cuts the input string into chunks, independently recognizes each chunk in parallel by means of identical FAs, and finally joins the chunk results and checks the overall consistency. In chunk recognition, it is necessary to speculatively start the FAs in any state, which causes an overhead that reduces the speedup over a serial algorithm. The existing data-parallel DFA-based recognizers suffer from an excessive number of starting states, and the NFA-based ones suffer from the number of nondeterministic transitions. Our data-parallel algorithm is based on a new FA type called reduced-interface DFA (RI-DFA), which minimizes the speculation overhead without incurring the penalty of nondeterministic transitions or of impractically enlarged DFA machines. The algorithm is theoretically efficient, because it combines the state reduction of an NFA with the speed of deterministic transitions, thus improving on both DFA-based and NFA-based existing implementations. The practical applicability of the RI-DFA approach is confirmed by a quantitative comparison of the number of starting states for a large public benchmark of complex FAs. On multi-core computing architectures, the RI-DFA recognizer is considerably faster than the NFA-based one on all benchmarks, while it matches the DFA-based one on some benchmarks and performs much better on others. The extra time needed to construct an RI-DFA vs. a DFA is moderate and compatible with practical use. The full paper with all details is in [4]. © 2025 Copyright held by the owner/author(s).
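The chunk-and-join scheme described in the abstract can be sketched for a plain DFA as follows. This is only a minimal illustration of baseline speculative recognition, not the RI-DFA construction itself. The example automaton (`STATES`, `DELTA`, even number of `a` symbols) and the function names are assumptions made for the sketch, and Python threads are used only to show the structure, not to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical example DFA over {a, b}: accepts strings with an even number of 'a'.
STATES = [0, 1]
START, ACCEPTING = 0, {0}
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}


def run_chunk(chunk):
    """Speculatively run the chunk from every state; return a start -> end map."""
    mapping = {}
    for s in STATES:
        state = s
        for ch in chunk:
            state = DELTA[(state, ch)]
        mapping[s] = state
    return mapping


def recognize(text, n_chunks=4):
    """Split text into chunks, process them in parallel, then join the maps."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor() as pool:
        maps = list(pool.map(run_chunk, chunks))
    state = START
    for m in maps:  # join phase: thread the actual state through each map
        state = m[state]
    return state in ACCEPTING
```

Each chunk costs one run per speculative starting state, which is the overhead the abstract refers to: fewer starting states means less wasted work per chunk.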

Keywords: regular language recognition data-parallel recognition algorithm minimal speculation speedup on multi-core architecture multi-entry DFA reduced-interface DFA


SPAA 2023 - Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures


35th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2023

ISBN: (print) 9781450395458

The proceedings contain 47 papers. The topics discussed include: Quancurrent: a concurrent quantiles sketch; an efficient scheduler for task-parallel interactive applications; efficient synchronization-light work stealing; balanced allocations in batches: the tower of two choices; massively parallel tree embeddings for high dimensional spaces; deterministic massively parallel symmetry breaking for sparse graphs; an associativity threshold phenomenon in set-associative caches; increment-and-freeze: every cache, everywhere, all of the time; multidimensional approximate agreement with asynchronous fallback; a tight characterization of fast failover routing: resiliency to two link failures is possible; releasing memory with optimistic access: a hybrid approach to memory reclamation and allocation in lock-free programs; transactional composition of nonblocking data structures; applying hazard pointers to more concurrent data structures; and nearly optimal parallel algorithms for longest increasing subsequence.


Brief Announcement: A Parallel (Δ, Γ)-Stepping Algorithm for the Constrained Shortest Path Problem


34th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2022

Authors: Bahreini, Tayebeh; Fisher, Nathan; Grosu, Daniel. Affiliation: Wayne State University, Detroit, MI, United States

ISBN: (print) 9781450391467

We design a parallel algorithm for the Constrained Shortest Path (CSP) problem. The CSP problem is known to be NP-hard, and there exists a pseudo-polynomial time sequential algorithm that solves it. To design the parallel algorithm, we extend the techniques used in the design of the Δ-stepping algorithm for the single-source shortest paths problem. © 2022 Owner/Author.
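To make the "pseudo-polynomial time sequential algorithm" concrete, here is a minimal sketch of one standard dynamic program for CSP. It is an illustration under stated assumptions, not the algorithm of the announcement: edges carry a cost and an integer weight of at least 1, the goal is the cheapest path whose total weight stays within `budget`, and all names are hypothetical.

```python
import math


def constrained_shortest_path(n, edges, source, target, budget):
    """Min cost of a source -> target path whose total weight is <= budget.

    edges: list of (u, v, cost, weight) with integer weight >= 1 (assumed).
    Runs in O(budget * (n + len(edges))) time: pseudo-polynomial in budget.
    Returns math.inf when no feasible path exists.
    """
    # best[b][v] = cheapest cost of reaching v using total weight at most b
    best = [[math.inf] * n for _ in range(budget + 1)]
    for b in range(budget + 1):
        best[b][source] = 0
    for b in range(1, budget + 1):
        for v in range(n):
            # anything reachable within weight b - 1 is reachable within b
            best[b][v] = min(best[b][v], best[b - 1][v])
        for u, v, cost, weight in edges:
            # rows below b are final because every weight is at least 1
            if weight <= b and best[b - weight][u] + cost < best[b][v]:
                best[b][v] = best[b - weight][u] + cost
    return best[budget][target]
```

The running time depends on the numeric value of `budget` rather than on its bit length, which is why such an algorithm does not contradict NP-hardness.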

Keywords: parallel algorithms


SPAA 2022 - Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures


34th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2022

ISBN: (print) 9781450391467

The proceedings contain 44 papers. The topics discussed include: deterministic distributed sparse and ultra-sparse spanners and connectivity certificates; fully polynomial-time distributed computation in low-treewidth graphs; adaptive massively parallel algorithms for cut problems; preparing for disaster: leveraging precomputation to efficiently repair graph structures upon failures; the energy complexity of Las Vegas leader election; a fully-distributed peer-to-peer protocol for Byzantine-resilient distributed hash tables; brief announcement: the (limited) power of multiple identities: asynchronous Byzantine reliable broadcast with improved resilience through collusion; brief announcement: composable dynamic secure emulation; and robust and optimal contention resolution without collision detection.


Simple Parallel Algorithms for Single-Site Dynamics


54th Annual ACM SIGACT Symposium on Theory of Computing (STOC)

Authors: Liu, Hongyang; Yin, Yitong. Affiliation: Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China

ISBN: (print) 9781450392648

The single-site dynamics are a canonical class of Markov chains for sampling from high-dimensional probability distributions, e.g. the ones represented by graphical models. We give a simple and generic parallel algorithm that can faithfully simulate single-site dynamics. When the chain asymptotically satisfies the $\ell_p$-Dobrushin's condition, specifically, when the Dobrushin's influence matrix has constantly bounded $\ell_p$-induced operator norm for an arbitrary $p \in [1, \infty]$, the parallel simulation of $N$ steps of single-site updates succeeds within $O(N/n + \log n)$ depth of parallel computing using $\tilde{O}(m)$ processors, where $n$ is the number of sites and $m$ is the size of the graphical model. Since the Dobrushin's condition is almost always satisfied asymptotically by mixing chains, this parallel simulation algorithm essentially transforms single-site dynamics with optimal $O(n \log n)$ mixing time to RNC algorithms for sampling. In particular, we obtain RNC samplers for the Ising models on general graphs in the uniqueness regime, and for satisfying solutions of CNF formulas in a local lemma regime. With non-adaptive simulated annealing, these RNC samplers can be transformed routinely to RNC algorithms for approximate counting. A key step in our parallel simulation algorithm is a so-called "universal coupling" procedure, which tries to simultaneously couple all distributions over the same sample space. We construct such a universal coupling such that, for every pair of distributions, the coupled probability is at least their Jaccard similarity. We also prove that this is optimal in the worst case. The universal coupling and its applications are of independent interest.
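One way to picture a coupling that works for all distributions at once is the shared-randomness rejection sampler sketched below. It is a standard correlated-sampling construction and may differ from the paper's own procedure. It assumes a finite sample space, and the names `coupled_sample`, `universe`, and `seed` are hypothetical. Two callers using the same seed return the same outcome with probability at least $\sum_x \min(p_x, q_x) / \sum_x \max(p_x, q_x)$, which is one common definition of the Jaccard similarity of two distributions.

```python
import random


def coupled_sample(dist, seed, universe):
    """Draw one sample from `dist` using a shared random stream.

    dist: dict mapping outcomes to probabilities (summing to 1).
    universe: list of all possible outcomes, identical for every caller.
    Callers that pass the same seed see the same proposal stream.
    """
    rng = random.Random(seed)
    while True:
        x = rng.choice(universe)   # shared uniform proposal
        u = rng.random()           # shared uniform threshold
        if u < dist.get(x, 0.0):   # accept x with probability dist[x]
            return x
```

Each caller's output follows its own distribution, because a proposal `x` is accepted with probability proportional to `dist[x]`. Whenever the first accepted proposal is accepted by both distributions, the two outputs coincide.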

Keywords: Single-site dynamics Markov chain Monte Carlo sampling


Training one DeePMD Model in Minutes: a Step towards Online Learning


29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Hu, Siyu; Zhao, Tong; Sha, Qiuchen; Li, Enji; Meng, Xiangyu; Liu, Liping; Wang, Lin-Wang; Tan, Guangming; Jia, Weile. Affiliations: Chinese Acad Sci, Inst Comp Technol, State Key Lab Proc, Beijing, Peoples R China; Univ Chinese Acad Sci, Beijing, Peoples R China; China Univ Petr, Qingdao Inst Software, Coll Comp Sci & Technol, Qingdao, Peoples R China; Chinese Acad Sci, Inst Semicond, Beijing, Peoples R China

ISBN: (print) 9798400704352

Neural Network Molecular Dynamics (NNMD) has become a major approach in material simulations. It can speed up molecular dynamics (MD) simulation by thousands of times while maintaining ab initio accuracy, and thus has the potential to fundamentally change the paradigm of material simulations. However, there are two time-consuming bottlenecks in NNMD development. One is the data access of ab initio calculation results. The other, which is the focus of the current work, is the training time of the NNMD model. The training of an NNMD model differs from most other neural network training because the atomic force (which is related to the gradient of the network) is an important physical property to be fit. Tests show that traditional stochastic gradient methods, like the Adam algorithm, cannot efficiently deploy the multisample minibatch algorithm. As a result, a typical training (taking Deep Potential Molecular Dynamics (DeePMD) as an example) can take many hours. In this work, we designed a heuristic minibatch quasi-Newtonian optimizer based on the Extended Kalman Filter method. An early reduction of gradient and error is adopted to reduce memory footprint and communication. The memory footprint, communication, and hyper-parameter settings of this new method are analyzed in detail. Computational innovations such as customized kernels of the symmetry-preserving descriptor are applied to exploit the computing power of the heterogeneous architecture. Experiments are performed on 8 different datasets representing different real case situations, and numerical results show that our new method has an average speedup of 32.2 compared to the Reorganized Layer-wised Extended Kalman Filter with 1 GPU, reducing the absolute training time of one DeePMD model from hours to several minutes, making it one step toward online training.
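The basic idea of using a Kalman filter as an optimizer can be sketched on a toy linear model. This is only the textbook single-sample update, not the minibatch or layer-wise variant described in the abstract. The function name `ekf_step`, the noise variance `r`, and the initial covariance used in the usage note are assumptions made for illustration.

```python
def ekf_step(w, P, x, y, r=1.0):
    """One Kalman-filter update of weights w for the model y ~ w . x.

    w: list of n weights; P: symmetric n x n covariance (list of lists);
    x: input vector (for this linear model it is also the gradient dy/dw);
    r: assumed measurement-noise variance.
    """
    n = len(w)
    err = y - sum(wi * xi for wi, xi in zip(w, x))            # prediction error
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    s = r + sum(x[i] * Px[i] for i in range(n))               # innovation variance
    k = [Px[i] / s for i in range(n)]                         # Kalman gain
    w_new = [w[i] + k[i] * err for i in range(n)]
    # P - k (x^T P); x^T P equals Px transposed because P is symmetric
    P_new = [[P[i][j] - k[i] * Px[j] for j in range(n)] for i in range(n)]
    return w_new, P_new
```

Starting from `w = [0.0, 0.0]` and `P = [[1000.0, 0.0], [0.0, 1000.0]]`, repeated calls on noiseless samples of a linear target move `w` close to the true weights after only a few samples, because the covariance `P` acts as curvature information in the way a quasi-Newton method would use it.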

Keywords: parallel training Molecular dynamics First principle ab initio GPU


MLFormer: a high performance MPC linear inference framework for transformers


Journal of Cryptographic Engineering, 2025, Vol. 15, No. 1, pp. 1-20

Authors: Liu, Siqi; Liu, Zhusen; Chen, Donglong; Dai, Wangchen; Zhou, Lu; Liu, Zhe; Cheung, Ray C. C.; Koc, Cetin Kaya. Affiliations: BNU HKBU United Int Coll, Guangdong Prov Key Lab IRADS, Zhuhai 519000, Peoples R China; Hangzhou Innovat Inst, Beihang Univ, Hangzhou 311121, Peoples R China; Zhejiang Lab, Hangzhou 310000, Peoples R China; Sun Yat Sen Univ, Shenzhen 518107, Peoples R China; Nanjing Univ Aeronaut & Astronaut, Nanjing 210000, Peoples R China; City Univ Hong Kong, Hong Kong 310000, Peoples R China; Igdir Univ, Igdir, Turkiye; Univ Calif Santa Barbara, St Barbara, CA, USA

Transformer-based models are widely used in natural language processing tasks, and their application has been further extended to computer vision. In their usage, data security has become a crucial concern when deploying deep learning services on cloud platforms. To address these security concerns, multi-party computation (MPC) is employed to prevent data and model leakage during the inference process. However, the Transformer model introduces several challenges for MPC computation, including the time overhead of the Softmax (normalized exponential) function, the accuracy issue caused by the "dynamic range" of approximated division and exponential, and the high memory overhead when processing long sequences. To overcome these challenges, we propose MLformer, an MPC-based inference framework for transformer models in the semi-honest adversary model, based on CrypTen (Knott et al., Adv Neural Inf Process Syst 34: 4961-4973, 2021), a secure machine learning framework from the Facebook AI Research group. In this framework, we replace the softmax attention with linear attention, which has linear time and memory complexity in the input length. The modification eliminates the softmax function entirely, resulting in lower time and memory overhead. To ensure the accuracy of linear attention, we propose scaled linear attention to address the dynamic range issue caused by the MPC division used, and a new approximate division function to reduce the computational time of the attention block. Furthermore, to improve the efficiency and accuracy of the MPC exponential and reciprocal, which are commonly used in transformer models, we propose a novel MPC exponential protocol and integrate, for the first time, the efficient reciprocal protocol of Bar-Ilan and Beaver (in Proceedings of the 8th Annual ACM Symposium on Principles of Distributed Computing, pp. 201-209, 1989) into our framework. Additionally, we optimize the computation of causal linear attention, which is utilized in private in
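The reason linear attention scales linearly with sequence length can be seen in a plaintext sketch with no MPC involved. It is a generic, non-causal linear attention, not the scaled variant proposed in the paper. The feature map `phi` (elu(x) + 1) is an assumption borrowed from common linear-attention formulations, and all names are hypothetical.

```python
import math


def phi(v):
    """Assumed positive feature map elu(x) + 1, applied elementwise."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]


def linear_attention(Q, K, V):
    """Softmax-free attention: out_i = phi(q_i)^T S / (phi(q_i)^T z).

    S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) are built once,
    so the cost grows linearly with the sequence length.
    """
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))  # positive because phi > 0
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out
```

The only nonlinear operations left are the feature map and one division per query, which suggests why removing softmax is attractive when exponentials and divisions are expensive protocols.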

Keywords: Multi-party computation Linear transformer Private inference parallel processing GPU


A Study of Work Distribution and Contention in Database Primitives on Heterogeneous CPU/GPU Architectures


36th Annual ACM Symposium on Applied Computing (SAC)

Authors: Gowanlock, Michael; Fink, Zane; Karsin, Ben; Wright, Jordan. Affiliations: No Arizona Univ, Sch Informat Comp & Cyber Syst, Flagstaff, AZ 86011, USA; Univ Illinois, Dept Comp Sci, Urbana, IL, USA; Univ Libre Bruxelles, Dept Comp Sci, Brussels, Belgium

ISBN: (print) 9781450381048

Graphics Processing Units (GPUs) provide very high on-card memory bandwidth, which can be exploited to address data-intensive workloads. To maximize algorithm throughput, it is important to concurrently utilize both the CPU and GPU to carry out database queries. We select data-intensive algorithms that are common in databases and data analytic applications, including: (i) scan; (ii) batched predecessor searches; (iii) multiway merging; and (iv) partitioning. For each algorithm, we examine the performance of parallel CPU-only, GPU-only, and hybrid CPU/GPU approaches. There are several challenges to combining the CPU and GPU for query processing, including distributing work between architectures. We demonstrate that despite being able to accurately split the work between the CPU and GPU, contention for memory bandwidth is a major limiting factor for hybrid CPU/GPU data-intensive algorithms. We employ performance models that allow us to explore several research questions. We find that while hybrid data-intensive algorithms may be limited by contention, these algorithms are more robust to workload characteristics; therefore, they are preferable to CPU-only or GPU-only approaches. We also find that hybrid algorithms achieve good performance when there is low memory contention between the CPU and GPU, such that the GPU can perform its operations without significantly reducing CPU throughput.
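The interplay between work splitting and memory contention can be illustrated with a toy throughput model. This is not the performance model of the paper. The functions, the throughput figures in the usage note, and the single `contention` factor are all hypothetical simplifications.

```python
def hybrid_time(n_bytes, cpu_bw, gpu_bw, gpu_fraction, contention=0.0):
    """Estimated time when CPU and GPU work concurrently on a split input.

    cpu_bw, gpu_bw: standalone throughputs in bytes/second (hypothetical).
    contention: assumed fraction of CPU throughput lost while the GPU
    moves data through the same host memory (0.0 means none).
    """
    cpu_time = (1.0 - gpu_fraction) * n_bytes / (cpu_bw * (1.0 - contention))
    gpu_time = gpu_fraction * n_bytes / gpu_bw
    return max(cpu_time, gpu_time)  # both sides must finish


def balanced_fraction(cpu_bw, gpu_bw):
    """Share of the input for the GPU so both finish together (no contention)."""
    return gpu_bw / (cpu_bw + gpu_bw)
```

With `cpu_bw=10`, `gpu_bw=30` and 400 bytes, the balanced split gives the GPU three quarters of the input and both sides finish in 10 time units. Setting `contention=0.5` with the same split doubles the CPU side to 20 units, so a split that is accurate for standalone throughputs can still be slowed down by contention.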

Keywords: GPGPU Heterogeneous Systems Hybrid algorithms In-memory Database Memory-Bound algorithms Multiway Merge Partitioning Predecessor Search Scan


Accelerating Multi-Process Communication for Parallel 3-D FFT


8th Workshop on Exascale MPI (ExaMPI)

Authors: Ayala, Alan; Tomov, Stan; Stoyanov, Miroslav; Haidar, Azzam; Dongarra, Jack. Affiliations: Univ Tennessee, Knoxville, TN 37996, USA; Oak Ridge Natl Lab, Oak Ridge, TN, USA; Nvidia Corp, Santa Clara, CA, USA; Univ Manchester, Manchester, Lancs, England

ISBN: (print) 9781665411080

Today's largest and most powerful supercomputers are built on heterogeneous platforms, and using the combined power of multi-core CPUs and GPUs has had a great impact on accelerating large-scale applications. However, on these architectures, inter-processor communication becomes a bottleneck for parallel algorithms such as the Fast Fourier Transform (FFT) and limits their scalability. In this paper, we present techniques for reducing multi-process communication cost during the computation of FFTs, considering hybrid network connections such as those expected on upcoming exascale machines. Among our techniques, we present algorithmic tuning, making use of phase diagrams; parametric tuning, using different FFT settings; and MPI distribution tuning based on FFT size and computational resources available. We present several experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 40,960 IBM Power9 cores and 6,144 NVIDIA V-100 GPUs.

Keywords: Exascale FFT Hybrid systems Scalability MPI tuning


Proceedings of the 11th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures / 9th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2020


8th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2017

ISBN: (print) 9781450375450

The proceedings contain 5 papers. The topics discussed include: sparse matrix-dense matrix multiplication on heterogeneous CPU+FPGA embedded system; run-time power modeling in embedded GPUs with dynamic voltage and frequency scaling; fault-tolerant online scheduling algorithms for CubeSats; an OpenMP parallel genetic algorithm for design space exploration of heterogeneous multi-processor embedded systems; and automated precision tuning in activity classification systems: a case study.

