Search results - Inner Mongolia University Library

A Native Tensor-Vector Multiplication Algorithm for High Performance Computing

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, vol. 33, no. 12, pp. 3363-3374

Authors: Martinez-Ferrer, Pedro J.; Yzelman, A. N.; Beltran, Vicenc. Affiliations: Barcelona Supercomp Ctr BSC, Barcelona 08034, Spain; Univ Politecn Catalunya UPC, Barcelona 08034, Spain; Huawei Technol Switzerland AG, Comp Syst Lab, CH-3097 Zurich, Switzerland

Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor-vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This article proposes an open-source TVM algorithm which is much simpler and more efficient than previous approaches, making it suitable for integration in the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization and multi-threading as well as NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors up to order 10 and on various compilers and hardware architectures equipped with traditional DDR and high bandwidth memory (HBM). For large tensors the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth for NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM the TVM exhibits some mode dependency but manages to reach performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers on average between 58% and 69% of the theoretical bandwidth for large tensors.

Keywords: Tensors; Kernel; Libraries; Bandwidth; Virtual machine monitors; Layout; Benchmark testing; parallel algorithms; shared memory; tensor computations; high bandwidth memory; NUMA
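The mode-k contraction named in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: it is a plain triple loop over a dense tensor stored as a flat row-major list, and the function name `tvm` and its argument layout are assumptions made for this example. The innermost loop walks contiguous elements, which is one simple way to obtain unit-stride accesses for any contraction mode.

```python
from math import prod


def tvm(tensor, shape, vector, mode):
    """Contract mode `mode` of a dense row-major tensor with `vector`.

    `tensor` is a flat list of length prod(shape) (hypothetical layout
    chosen for this sketch). The result is a flat row-major list whose
    shape is `shape` with the contracted mode removed.
    """
    n = shape[mode]
    if len(vector) != n:
        raise ValueError("vector length must match the contracted mode")
    if len(tensor) != prod(shape):
        raise ValueError("tensor length must match the shape")

    # View the tensor as a 3-D array (left, n, right).
    left = prod(shape[:mode])
    right = prod(shape[mode + 1:])

    out = [0.0] * (left * right)
    for l in range(left):
        dst = l * right
        for j in range(n):
            scale = vector[j]
            src = (l * n + j) * right
            # Innermost loop reads and writes contiguous elements.
            for r in range(right):
                out[dst + r] += tensor[src + r] * scale
    return out
```

For a 2 x 3 matrix stored as `[1, 2, 3, 4, 5, 6]`, contracting mode 1 with a vector of ones sums each row, and contracting mode 0 sums each column.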


Triangle Counting Through Cover-Edges


IEEE High Performance Extreme Computing Virtual Conference (HPEC)

Authors: Bader, David A.; Li, Fuhuan; Ganeshan, Anya; Gundogdu, Ahmet; Lew, Jason; Rodriguez, Oliver Alvarado; Du, Zhihui. Affiliation: New Jersey Inst Technol, Dept Data Sci, Newark NJ 07102, USA

ISBN: (print) 9798350308600

Counting and finding triangles in graphs is often used in real-world analytics to characterize cohesiveness and identify communities in graphs. In this paper, we propose the novel concept of a cover-edge set that can be used to find triangles more efficiently. We use a breadth-first search (BFS) to quickly generate a compact cover-edge set. Novel sequential and parallel triangle counting algorithms are presented that employ cover-edge sets. The sequential algorithm avoids unnecessary triangle-checking operations, and the parallel algorithm is communication-efficient. The parallel algorithm can asymptotically reduce communication on massive graphs, such as those from real social networks and synthetic graphs from the Graph500 Benchmark. In our estimate from massive-scale Graph500 graphs, our new parallel algorithm can reduce the communication on a scale 36 graph by 1156x and on a scale 42 graph by 2368x.

Keywords: Graph algorithms; Triangle Counting; parallel algorithms; High Performance Data Analytics
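A sequential sketch of the BFS-based idea in the abstract above follows. It assumes, as an interpretation that the abstract does not spell out, that the cover-edge set consists of the edges whose endpoints share a BFS level. Adjacent vertices differ by at most one BFS level, so two of a triangle's three vertices must share a level, and every triangle therefore contains at least one such edge. The function names are hypothetical, and this is not presented as the authors' implementation.

```python
from collections import deque


def bfs_levels(adj):
    """Return a BFS level for every vertex, restarting in each component."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def count_triangles(edges):
    """Count triangles by inspecting only same-level (cover) edges."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    level = bfs_levels(adj)
    total = 0
    for u in adj:
        for v in adj[u]:
            # Visit each same-level edge once, with u < v.
            if u < v and level[u] == level[v]:
                for w in adj[u] & adj[v]:
                    if level[w] != level[u]:
                        # Triangle with exactly one same-level edge.
                        total += 1
                    elif w > v:
                        # All three vertices share a level: count the
                        # triangle only from its smallest edge (u < v < w).
                        total += 1
    return total
```

Edges that join different BFS levels are never used as a starting point, which is where the reduction in triangle checks comes from in this sketch.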


Fast, parallel, and cache-friendly suffix array construction


ALGORITHMS FOR MOLECULAR BIOLOGY, 2024, vol. 19, no. 1, p. 16

Authors: Khan, Jamshed; Rubel, Tobias; Molloy, Erin; Dhulipala, Laxman; Patro, Rob. Affiliation: Univ Maryland, Dept Comp Sci, College Pk MD 20742, USA

Purpose: String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and *** In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, CaPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache *** We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://***/jamshed/CaPS-SA.

Keywords: Suffix array; Longest common prefix; Data structures; Indexing; parallel algorithms


VDS: A Variant of Δ-stepping Algorithm for Parallel SSSP Problem


2022 IEEE International Conference on Data Science and Information System, ICDSIS 2022

Authors: Kumar, Praveen; Singh, Anil Kumar. Affiliation: MNNIT Allahabad, Computer Science and Engineering Department, Prayagraj, India

ISBN: (digital) 9781665498012

ISBN: (print) 9781665498012

Δ-stepping is a well-known parallel algorithm for the single-source shortest path problem. It requires a tuning parameter (delta) to achieve a good trade-off between parallelism and work efficiency. The performance of Δ-stepping changes drastically with the value of delta, and a poor choice of delta leads to an inefficient Δ-stepping algorithm. For large graphs, finding the best-performing value of delta is difficult. This paper proposes a variant of the Δ-stepping algorithm (VDS). We have evaluated the proposed algorithm on Graph500 data sets. Our results show that the proposed algorithm is as work-efficient and scalable as Δ-stepping, and its performance remains almost stable as delta changes. Against the best-performing value of delta, VDS's performance on different deltas varies up to 136%, whereas Δ-stepping's performance varies up to 430%. For the best-performing value of delta, the proposed algorithm is competitive with or slightly more efficient than Δ-stepping. For the least efficient delta, the proposed algorithm is 2.8-3.6x faster than Δ-stepping. © 2022 IEEE.

Keywords: parallel algorithms
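For readers unfamiliar with the baseline that the abstract above compares against, here is a minimal sequential sketch of classic Δ-stepping. It is not the VDS variant, whose details the abstract does not give. The graph representation (a dict mapping each node to a list of `(neighbor, weight)` pairs with non-negative weights) and the function name are assumptions made for this example.

```python
import math


def delta_stepping(graph, source, delta):
    """Sequential sketch of classic delta-stepping SSSP.

    graph: dict node -> list of (neighbor, weight), weights >= 0.
    delta: bucket width, must be positive.
    Returns a dict of shortest distances (math.inf if unreachable).
    """
    if delta <= 0:
        raise ValueError("delta must be positive")

    nodes = set(graph)
    for edges in graph.values():
        nodes.update(v for v, _ in edges)
    dist = {v: math.inf for v in nodes}
    dist.setdefault(source, math.inf)
    buckets = {}  # bucket index -> set of nodes; empty buckets are removed

    def relax(v, d):
        if d < dist[v]:
            if dist[v] != math.inf:
                old = int(dist[v] // delta)
                bucket = buckets.get(old)
                if bucket is not None:
                    bucket.discard(v)
                    if not bucket:
                        del buckets[old]
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0)
    while buckets:
        i = min(buckets)
        settled = set()
        # Light edges (weight <= delta) may put nodes back into bucket i.
        while i in buckets:
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in graph.get(u, ()):
                    if w <= delta:
                        relax(v, dist[u] + w)
        # Heavy edges are relaxed once, after bucket i is final.
        for u in settled:
            for v, w in graph.get(u, ()):
                if w > delta:
                    relax(v, dist[u] + w)
    return dist
```

The returned distances do not depend on `delta`. What changes is how much work each bucket holds and how often nodes are re-relaxed, which is the sensitivity the abstract describes.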


A Parallelized Self-Driving Vehicle Controller Using Unsupervised Machine Learning


IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, 2022, vol. 58, no. 4, pp. 5148-5156

Author: Abegaz, Brook W. Affiliation: Loyola Univ, Dept Engn, Chicago IL 60626, USA

In this article, a self-driving vehicle controller that optimizes the path a vehicle follows from its initial position to its destination is presented. The methods include clustering-based k-means, hierarchical, Gaussian matrix model, and self-organizing mapping. The real-time parallel implementation of the unsupervised machine learning algorithms could provide fast response times of under one microsecond during the lateral, longitudinal, and angular motion control of the autonomous vehicle. It was observed that a random selection of one of the machine learning methods may not always guarantee the optimality of the position and velocity variables as compared to the desired values. The proposed parallel implementation and optimization of the algorithms could have a significant contribution towards making transportation mobility more reliable and sustainable for future vehicular systems.

Keywords: Machine learning; Sensors; Friction; Wheels; Torque; Unified modeling language; Tires; Autonomous vehicles (AV); parallel algorithms; robot operating system (ROS); unsupervised machine learning (UML)


Performance portable Vlasov code with C++ parallel algorithm

5th IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC, P3HPC 2022

Authors: Asahi, Yuuichi; Padioleau, Thomas; Latu, Guillaume; Bigot, Julien; Grandgirard, Virginie; Obrejan, Kevin. Affiliations: Japan Atomic Energy Agency, CCSE, Chiba, Japan; Maison de la Simulation, Université Paris-Saclay, UVSQ, CNRS, CEA, Gif-sur-Yvette, France; CEA DES/IRESNE/DEC, St.Paul-lez-Durance, France; CEA IRFM, St.Paul-lez-Durance, France

ISBN: (print) 9781665460217

This paper presents a performance-portable implementation of a kinetic plasma simulation code that uses C++ parallel algorithms to run across multiple CPUs and GPUs. Relying on the language-standard parallelism stdpar and the proposed language-standard multi-dimensional array support mdspan, we demonstrate that a performance-portable implementation is possible without harming readability and productivity. For a mini-application, we obtain good overall performance, within 20% of the Kokkos version, on Intel Icelake, NVIDIA V100, and A100 GPUs. Our conclusion is that stdpar can be a good candidate for developing performance-portable and productive code targeting Exascale-era platforms, assuming this approach will be available on AMD and/or Intel GPUs in the future. © 2022 IEEE.

Keywords: parallel algorithms


Design of GPU Parallel Algorithm for Landscape Index Based on Virtual Reality Technology


2022 IEEE International Conference on Knowledge Engineering and Communication Systems, ICKES 2022

Authors: Fu, Yi; Bao, Runbo. Affiliation: Shenyang Jianzhu University, Shenyang, China

ISBN: (print) 9781665456371

With the rapid development of society and the continuous improvement of science and technology, people have increasingly high requirements for quality of life. At the same time, they have put forward higher, stricter and more scientific requirements for the landscape index. Its main purpose is to simulate real scenes, and to model and analyze the simulation objects through virtual reality technology (VRT). Landscape index parallel planning based on VRT is an important research direction for future development. This paper summarizes the existing achievements of various disciplines in the VR field through the analysis of related topics at home and abroad, and establishes a GPU parallel algorithm platform based on real data. The virtual model is used to combine the real scene with the virtual environment to reflect the visual simulation. Finally, according to the weight coefficient, the optimal planning scheme and the optimal design landscape index are calculated to verify the feasibility and effectiveness of the algorithm. The test results show that the running time of the GPU parallel algorithm fluctuates around 20 seconds, the running efficiency of the algorithm is as high as 90%, and the parallel capacity is about 3500k. This shows that the performance of the GPU parallel algorithm meets user requirements. © 2022 IEEE.

Keywords: parallel algorithms


On Network Reliability Evaluation by Monte Carlo Method Using High-Performance Computing


LOBACHEVSKII JOURNAL OF MATHEMATICS, 2023, vol. 44, no. 8, pp. 3122-3129

Authors: Migov, D. A.; Weins, D. V. Affiliations: Novosibirsk State Tech Univ, Novosibirsk 630073, Russia; Russian Acad Sci, Inst Computat Math & Math Geophys, Siberian Branch, Novosibirsk 630090, Russia

The paper considers the NP-hard problem of calculating the reliability of a network whose elements are subject to accidental failures. By network reliability, we mean the probabilistic connectivity of a random graph with unreliable edges. To evaluate the reliability of a network, a parallel Monte Carlo method is used, improved by checking the connectivity of a particular graph realization simultaneously with the generation of this realization. Based on multi-agent simulation, we study the scalability of this algorithm and tune the parameters for execution on high-performance supercomputers.

Keywords: network reliability; parallel algorithms; random graph connectivity; Monte Carlo methods; MPI; multi-agent simulation
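The estimator described in the abstract above can be sketched sequentially as follows. The abstract does not say how connectivity is checked during generation. This sketch assumes a union-find structure that merges components as each edge is sampled and stops a trial as soon as the graph is connected. It also assumes every edge survives independently with the same probability `p`, and all function names are hypothetical.

```python
import random


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def sample_is_connected(n, edges, p, rng):
    """Generate one random realization and test connectivity on the fly."""
    parent = list(range(n))
    components = n
    for u, v in edges:
        if rng.random() < p:  # the edge survives in this realization
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
                if components == 1:
                    return True  # connected; remaining edges are irrelevant
    return components == 1


def estimate_reliability(n, edges, p, trials, seed=0):
    """Monte Carlo estimate of the probability that the graph is connected."""
    rng = random.Random(seed)
    hits = sum(sample_is_connected(n, edges, p, rng) for _ in range(trials))
    return hits / trials
```

Trials are independent, so a parallel version could give each worker its own seeded generator and average the per-worker estimates. That partitioning is a common pattern, not something the abstract specifies.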


Scalability analysis of a two level domain decomposition approach in space and time solving data assimilation models


Authors: Cacciapuoti, Rosalba; D'Amore, Luisa. Affiliation: Department of Mathematics and Applications Renato Caccioppoli, University of Naples Federico II, Naples, Italy

We are concerned with the mapping onto high-performance hybrid architectures of parallel software implementing a two-level overlapping domain decomposition, that is, along the space and time directions, of the four-dimensional variational data assimilation model. The reference architecture belongs to the SCoPE (Sistema Cooperativo Per Elaborazioni scientifiche multidisciplinari) data center, located at the University of Naples Federico II. We consider the initial boundary problem of the shallow water equation and analyse both strong and weak scaling. Keeping the efficiency always greater than (Formula presented.) and about (Formula presented.) in most cases, we experimentally find that the isoefficiency function grows a little more than linearly with respect to the number of processes. Results, obtained by using the Parallel Computing Toolbox of MATLAB R2013a, are in agreement with the algorithm's performance prediction based on the scale-up factor, confirming the appropriate mapping of the algorithm on the hybrid architecture. © 2023 The Authors. Concurrency and Computation: Practice and Experience published by John Wiley & Sons Ltd.

Keywords: parallel algorithms


Fractional Linear Matroid Matching Is in Quasi-NC

32nd Annual European Symposium on Algorithms, ESA 2024

Authors: Gurjar, Rohit; Oki, Taihei; Raj, Roshan. Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India; Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Japan

ISBN: (print) 9783959773386

The matching and linear matroid intersection problems are solvable in quasi-NC, meaning that there exist deterministic algorithms that run in polylogarithmic time and use quasi-polynomially many parallel processors. However, such a parallel algorithm is unknown for linear matroid matching, which generalizes both of these problems. In this work, we propose a quasi-NC algorithm for fractional linear matroid matching, which is a relaxation of linear matroid matching and commonly generalizes fractional matching and linear matroid intersection. Our algorithm builds upon the connection of fractional matroid matching to the non-commutative Edmonds' problem recently revealed by Oki and Soma (2023). As a corollary, we also solve the black-box non-commutative Edmonds' problem with rank-two skew-symmetric coefficients. © Rohit Gurjar, Taihei Oki, and Roshan Raj; licensed under Creative Commons License CC-BY 4.0.

Keywords: parallel algorithms
