ISBN (Print): 9781450383318
Advances in deep neural networks have provided a significant improvement in accuracy and speed across a large range of Computer Vision (CV) applications. However, our ability to perform real-time CV on edge devices is severely restricted by their limited computing capabilities. In this paper we employ Vega, a parallel graph-based framework, to study the performance limitations of four heterogeneous edge-computing platforms while running 12 popular deep learning CV applications. We expand the framework's capabilities, introducing two new performance enhancements: 1) an adaptive stage instance controller (ASI-C) that can improve performance by dynamically selecting the number of instances for a given stage of the pipeline; and 2) an adaptive input resolution controller (AIR-C) to improve responsiveness and enable real-time performance. These two solutions are integrated together to provide a robust real-time solution. Our experimental results show that ASI-C improves run-time performance by 1.4x on average across all heterogeneous platforms, achieving a maximum speedup of 4.3x while running face detection on a high-end edge device. We demonstrate that our integrated optimization framework improves the performance of applications and is robust to changing execution patterns.
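The abstract does not give pseudocode for the ASI-C; a minimal sketch of the idea, with illustrative class names and thresholds that are not from the paper, might look like:

```python
# Hypothetical sketch of an adaptive stage instance controller: grow or
# shrink the worker count of one pipeline stage based on the observed
# backlog of its input queue. All names and thresholds are illustrative.

class StageInstanceController:
    def __init__(self, min_instances=1, max_instances=8,
                 high_water=32, low_water=4):
        self.instances = min_instances
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.high_water = high_water   # backlog above this: add an instance
        self.low_water = low_water     # backlog below this: drop an instance

    def update(self, queue_depth):
        """Adapt the instance count to the stage's input-queue depth."""
        if queue_depth > self.high_water and self.instances < self.max_instances:
            self.instances += 1
        elif queue_depth < self.low_water and self.instances > self.min_instances:
            self.instances -= 1
        return self.instances
```

A real controller would also need hysteresis and per-platform resource limits; this sketch only shows the feedback loop.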
ISBN (Print): 9781665432832
Iterative parallel algorithms can be implemented by synchronizing after each round. This bulk-synchronous parallel (BSP) pattern is inefficient when strict synchronization is not required: global synchronization is costly at scale and prohibits amortizing load imbalance over the entire execution, and termination detection is challenging with irregular data-dependent communication. We present an asynchronous communication protocol that efficiently interleaves communication with computation. The protocol includes global termination detection without obstructing computation and communication between nodes. The user's computational primitive only needs to indicate when local work is done; our algorithm detects when all processors reach this state. We do not assume that global work decreases monotonically, allowing processors to create new work. We illustrate the utility of our solution through experiments, including two large data analysis and visualization codes: parallel particle advection and distributed union-find. Our asynchronous algorithm is several times faster than the synchronous approach, with better strong-scaling efficiency.
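The paper's protocol is not reproduced here, but the property it must handle can be shown in a toy sequential simulation (all names illustrative): executing a task may spawn new work on another worker, so global work is not monotone, and the system is quiescent only when every queue has drained.

```python
# Toy model of non-monotone distributed work: each queue holds tasks
# (target, n_children); executing a task may create new leaf tasks on
# another worker's queue. A real detector must recognise the state in
# which all queues are empty AND no messages are in flight; here, with
# sequential polling, the two conditions coincide.

from collections import deque

def run_until_quiescent(queues):
    """Execute all work, including work created during execution.
    Returns the total number of tasks executed at quiescence."""
    executed = 0
    while any(queues):                      # poll every worker in rounds
        for i, q in enumerate(queues):
            if q:
                target, children = q.popleft()
                executed += 1
                for _ in range(children):
                    queues[target].append((i, 0))  # newly created leaf work
    return executed
```

For example, one task on worker 0 that spawns two children on worker 1 yields three executed tasks in total before quiescence.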
ISBN (Print): 9781450380539
In this paper we show that the Minimum Spanning Tree problem (MST) can be solved deterministically in O(1) rounds of the Congested Clique model. In the Congested Clique model there are n players that perform computation in synchronous rounds. Each round consists of a phase of local computation and a phase of communication, in which each pair of players is allowed to exchange O(log n)-bit messages. The study of this model began with the MST problem: in the paper by Lotker, Pavlov, Patt-Shamir, and Peleg [SPAA'03, SICOMP'05] that defined the Congested Clique model, the authors gave a deterministic O(log log n) round algorithm that improved over a trivial O(log n) round adaptation of Borůvka's algorithm. There was a sequence of gradual improvements to this result: an O(log log log n) round algorithm by Hegeman, Pandurangan, Pemmaraju, Sardeshmukh, and Scquizzato [PODC'15], an O(log* n) round algorithm by Ghaffari and Parter [PODC'16], and an O(1) round algorithm by Jurdzinski and Nowicki [SODA'18], but all those algorithms were randomized. Therefore, the question of the existence of any deterministic o(log log n) round algorithm for the Minimum Spanning Tree problem had remained open since the seminal paper by Lotker, Pavlov, Patt-Shamir, and Peleg [SPAA'03, SICOMP'05]. Our result resolves this question and establishes that O(1) rounds suffice to solve the MST problem in the Congested Clique model, even if we are not allowed to use any randomness. Furthermore, the amount of communication needed by the algorithm makes it applicable to a variant of the MPC model using machines with local memory of size O(n).
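For reference, the Borůvka's algorithm mentioned above (whose O(log n) rounds the Congested Clique line of work successively compressed) works by letting every component pick its lightest outgoing edge each round; a plain sequential sketch:

```python
# Sequential Borůvka's algorithm: each round, every component selects its
# cheapest outgoing edge and components merge, halving (at least) the
# component count, hence O(log n) rounds. This is the textbook algorithm,
# not the paper's O(1)-round Congested Clique construction.

def boruvka_mst(n, edges):
    """n vertices (0..n-1), edges as (u, v, weight) with distinct weights.
    Returns the total MST weight."""
    parent = list(range(n))

    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst_weight, components = 0, n
    while components > 1:
        cheapest = {}                  # component root -> lightest out-edge
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in cheapest or w < cheapest[r][2]:
                        cheapest[r] = (u, v, w)
        for u, v, w in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:               # skip edges already merged this round
                parent[ru] = rv
                mst_weight += w
                components -= 1
    return mst_weight
```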
ISBN (Print): 9781450390682
In this paper, we propose a parallel algorithm for computing all-pairs shortest paths (APSP) for sparse graphs on the distributed memory system with p processors. To exploit the graph sparsity, we first preprocess the graph by utilizing several known algorithmic techniques in linear algebra such as fill-in reducing ordering and elimination tree parallelism. Then we map the preprocessed graph onto the distributed memory system for both load balancing and communication reduction. Finally, we design a new scheduling strategy to minimize the communication cost. The bandwidth cost (communication volume) and the latency cost (number of messages) of our algorithm are O(n^2 log^2 p / p + |S|^2 log^2 p) and O(log^2 p), respectively, where S is a minimal vertex separator that partitions the graph into two components of roughly equal size. Compared with the state-of-the-art result for dense graphs, where the bandwidth and latency costs are O(n^2/√p) and O(√p log^2 p), respectively, our algorithm reduces the latency cost by a factor of O(√p), and reduces the bandwidth cost by a factor of O(√p / log^2 p) for sparse graphs with |S| = O(n/√p). We also present lower bounds on the bandwidth and latency costs of computing APSP on sparse graphs, which are Ω(n^2/p + |S|^2) and Ω(log^2 p), respectively. This implies that the bandwidth cost of our algorithm is nearly optimal and the latency cost is optimal.
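As a point of reference for the dense baseline the authors compare against, APSP can be computed with the Floyd–Warshall loop nest below; distributed dense variants block this computation over a √p × √p process grid, which is where the O(n^2/√p) bandwidth cost comes from. (Plain sequential sketch, not the paper's sparse algorithm.)

```python
# Sequential Floyd-Warshall APSP on an adjacency matrix; INF marks a
# missing edge. Shown only as the dense baseline that separator-based
# sparse APSP algorithms improve upon.

INF = float("inf")

def floyd_warshall(dist):
    """In-place APSP on an n x n distance matrix; returns the matrix."""
    n = len(dist)
    for k in range(n):
        row_k = dist[k]
        for i in range(n):
            dik = dist[i][k]
            if dik == INF:
                continue                      # no path through k from i
            row_i = dist[i]
            for j in range(n):
                nd = dik + row_k[j]
                if nd < row_i[j]:
                    row_i[j] = nd             # relax i -> k -> j
    return dist
```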
ISBN (Print): 9781450383356
Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP), representing latent grammatical and syntactical information in human text with dense vectors in a low-dimensional space. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
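The data-reuse opportunity the abstract refers to is visible even in a plain sequential sketch of the skip-gram negative-sampling update: the center word's vector is read once and reused against the positive pair and every negative sample, which is the kind of value FULL-W2V keeps in registers and shared memory instead of refetching from global memory. (Pure-Python illustration, not the paper's GPU kernel.)

```python
import math

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One skip-gram negative-sampling update on list-of-list embedding
    matrices. The center vector h is loaded once and reused for the
    positive pair and all negative samples."""
    h = W_in[center][:]                     # read once, reused below
    grad_h = [0.0] * len(h)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        v = W_out[word]
        dot = sum(a * b for a, b in zip(h, v))
        score = 1.0 / (1.0 + math.exp(-dot))        # sigmoid
        g = lr * (label - score)
        for j in range(len(h)):
            grad_h[j] += g * v[j]
            v[j] += g * h[j]                # W_out[word] updated in place
    for j in range(len(h)):
        W_in[center][j] += grad_h[j]        # apply accumulated gradient
```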
ISBN (Print): 9781665435772
Randomized algorithms often outperform their deterministic counterparts in terms of simplicity and efficiency. In this paper, we consider Randomized Incremental Constructions (RICs), which are very popular, in particular in combinatorial optimization and computational geometry. Our contribution is Collaborative Parallel RIC (CPRIC), a novel approach to parallelizing RIC for modern parallel architectures like vector processors and GPUs. We show that our approach, based on a work-stealing mechanism, avoids the control-flow divergence of parallel threads, thus improving the performance of the parallel implementation. Our extensive experiments on CPU and GPU demonstrate the advantages of our CPRIC approach, which achieves an average speedup between 4x and 5x compared to the naively parallelized RIC.
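A canonical RIC of the kind this paper parallelizes is Welzl's smallest-enclosing-circle algorithm: points are inserted in random order, and the structure is rebuilt only when a new point violates it. The sequential sketch below (textbook algorithm, not the paper's code) shows the data-dependent control flow that causes the divergence CPRIC targets.

```python
import random

def welzl(points):
    """Smallest enclosing circle by randomized incremental construction;
    expected O(n) time. Returns (cx, cy, r)."""

    def circle_two(a, b):                   # circle with a, b on boundary
        cx, cy = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
        r = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 / 2
        return cx, cy, r

    def circle_three(a, b, c):              # circumcircle of a, b, c
        ax, ay, bx, by, cx, cy = *a, *b, *c
        d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
        if abs(d) < 1e-12:
            return None                     # collinear: no circumcircle
        ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
              + (cx**2 + cy**2) * (ay - by)) / d
        uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
              + (cx**2 + cy**2) * (bx - ax)) / d
        return ux, uy, ((ux - ax) ** 2 + (uy - ay) ** 2) ** 0.5

    def inside(circ, p, eps=1e-9):
        x, y, r = circ
        return (p[0] - x) ** 2 + (p[1] - y) ** 2 <= (r + eps) ** 2

    pts = points[:]
    random.shuffle(pts)                     # randomized insertion order
    circ = (pts[0][0], pts[0][1], 0.0)
    for i, p in enumerate(pts):
        if inside(circ, p):
            continue                        # cheap path: no rebuild
        circ = (p[0], p[1], 0.0)            # p must lie on the boundary
        for j in range(i):
            q = pts[j]
            if inside(circ, q):
                continue
            circ = circle_two(p, q)         # p, q on the boundary
            for k in range(j):
                s = pts[k]
                if not inside(circ, s):
                    circ = circle_three(p, q, s) or circ
    return circ
```

The common case (point already inside) is cheap while violations trigger nested rebuild loops; on a GPU, threads taking these different paths diverge, which is what the work-stealing scheme is designed to avoid.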
ISBN (Print): 9781728162744
Flash memory device operations such as read, program, and erase are performed by sequences called "algorithms". An algorithm is mainly composed of phases where accurate voltages are computed and applied to the array cells, and phases where data are moved through different latches in data buffers. The fast-increasing complexity of multi-bit NAND memory algorithms made it natural to realize the control logic on a microprocessor, in the form of firmware routines. The HW/FW co-design enables safer complexity management, allows development of algorithms in parallel with physical design, and accelerates development time to fit aggressive time-to-market requirements. Moreover, to keep up with increasing performance requirements, different kinds of multi-threaded microprocessor solutions have been proposed over the years to get the best performance-power-area (PPA) trade-off. This article proposes one possible approach to performance optimization through multi-threading, without an immediate downside for area and power. The most innovative point of the new architecture is the close correspondence between the intrinsically parallelizable physical processes inside a NAND Flash and the number of threads, buses and physical executors. As shown, this solution introduces tangible advantages in terms of performance.
The multi-scale character of skeletal muscle models requires simulations with high spatial resolution to capture all relevant effects. This naturally involves high computational load that can only be tackled by parallel computations. We simulate electrophysiology and muscle contraction using a state-of-the-art, biophysical chemo-electro-mechanical model that requires meshes of the 3D domain with embedded, aligned 1D meshes for muscle fibers. We present novel algorithms to construct highly-resolved meshes with robust properties for real muscle geometries from surface triangulations. We demonstrate their use and suitability in a simulation of the biceps brachii muscle and tendons. In addition, the respective simulations showcase several functional enhancements of our simulation framework OpenDiHu.
ISBN (Print): 9781665432818
In this work, we present methods for distributed domain generation within the constraints of our decentralized domain management concept. Here, all participating actors only have knowledge of their immediate neighbours, which are defined by geometric and hierarchical relations between nodes that represent subsets of the computational domain. We generate this domain following a hierarchical spacetree refinement. First, an initial tree is generated on every participating process. Second, this tree is distributed locally following a space-filling curve linearisation. Every process is assigned at least one leaf node of the initial tree, which acts as a starting point for the subsequent domain generation. From here, every process independently refines a subdomain using a decomposition method, which transforms a triangular surface-based geometry description into a volume-based one, using increasingly complex intersection tests. The resulting domain tree is distributed, yet neighbourhood references of neighbouring subtrees are not resolved. We combine the resolution of these relations with a 2:1 tree balancing, which involves the transfer of the surface of neighbouring subtrees. We provide results of a domain generation test case, using an input geometry with 84,072 triangles on up to 896 processes of the CoolMUC-2 cluster segment of LRZ's Linux Cluster System. Here, we bring down the overall time it takes to generate an adaptively refined and balanced octree with depth d = 7 from 5.5 hours on one process to two seconds on 896 processes.
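The space-filling curve linearisation step can be illustrated with a Morton (Z-order) curve, a common choice for octrees: interleaving the bits of a cell's (x, y, z) indices gives a 1D key, and sorting leaves by that key yields contiguous, spatially compact chunks per process. (Illustrative sketch; the paper's curve and implementation may differ.)

```python
# Morton (Z-order) linearisation of octree cells, a common way to order
# spacetree leaves for distribution; names and partitioning are illustrative.

def morton3d(x, y, z, depth):
    """Interleave the bits of integer cell indices at the given tree depth
    to obtain the cell's position along the Morton curve."""
    code = 0
    for bit in range(depth):
        code |= ((x >> bit) & 1) << (3 * bit)       # x -> bits 0, 3, 6, ...
        code |= ((y >> bit) & 1) << (3 * bit + 1)   # y -> bits 1, 4, 7, ...
        code |= ((z >> bit) & 1) << (3 * bit + 2)   # z -> bits 2, 5, 8, ...
    return code

def partition_leaves(leaves, depth, nprocs):
    """Sort leaves along the curve and cut the sequence into nprocs
    contiguous, balanced chunks, one per process."""
    ordered = sorted(leaves, key=lambda c: morton3d(*c, depth))
    n = len(ordered)
    return [ordered[p * n // nprocs:(p + 1) * n // nprocs]
            for p in range(nprocs)]
```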
For distributions over discrete product spaces ∏_{i=1}^{n} Ω_i, Glauber dynamics is a Markov chain that at each step resamples a random coordinate conditioned on the other coordinates. We show that k-Glauber dynamics, whi...
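The single-coordinate (k = 1) baseline that k-Glauber dynamics generalizes is easy to state concretely for the Ising model, where the conditional distribution of one spin given the rest depends only on its neighbours' spins (illustrative sketch, not from the abstract's paper):

```python
import math
import random

def glauber_step(spins, neighbors, beta, rng):
    """One step of single-site Glauber dynamics for an Ising model with
    +1/-1 spins: pick a uniformly random coordinate and resample it from
    its conditional distribution given all other spins."""
    i = rng.randrange(len(spins))
    field = sum(spins[j] for j in neighbors[i])      # local field at site i
    p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))  # P(spin_i=+1 | rest)
    spins[i] = 1 if rng.random() < p_up else -1
    return spins
```

At beta = 0 the conditional is uniform; at large beta the resampled spin aligns with its neighbours with overwhelming probability.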