Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor-vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This article proposes an open-source TVM algorithm that is much simpler and more efficient than previous approaches, making it suitable for integration into the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization, and multi-threading, as well as NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors up to order 10 with various compilers and on hardware architectures equipped with traditional DDR and high-bandwidth memory (HBM). For large tensors, the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth on NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM, the TVM exhibits some mode dependency but reaches performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers on average between 58% and 69% of the theoretical bandwidth for large tensors.
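To make the operation concrete, here is a minimal pure-Python sketch of a mode-k tensor-vector multiplication. It is not the article's kernel: the function name `tvm`, the flat row-major storage, and the 0-based `mode` argument are assumptions chosen for illustration. It shows only one idea from the abstract, namely that viewing the tensor as an (outer, n, inner) block lets the innermost loop run over contiguous memory for any contraction mode.

```python
from math import prod

def tvm(a, shape, b, mode):
    """Contract mode `mode` (0-based) of a dense tensor with the vector b.

    `a` is a flat list holding the tensor in row-major order (an assumption
    of this sketch). Returns the flat result and its shape.
    """
    shape = tuple(shape)
    n = shape[mode]
    if len(a) != prod(shape) or len(b) != n:
        raise ValueError("tensor or vector size does not match the shape")
    outer = prod(shape[:mode])       # product of the modes before `mode`
    inner = prod(shape[mode + 1:])   # product of the modes after `mode`
    c = [0.0] * (outer * inner)
    for o in range(outer):
        dst = o * inner
        for k in range(n):
            bk = b[k]
            src = (o * n + k) * inner
            # Innermost loop walks contiguous elements of both a and c.
            for i in range(inner):
                c[dst + i] += a[src + i] * bk
    return c, shape[:mode] + shape[mode + 1:]
```

For a 2x3 tensor stored as `[1, 2, 3, 4, 5, 6]`, contracting mode 1 with `[1, 1, 1]` yields the row sums, and contracting mode 0 with `[1, 10]` yields a weighted sum of the two rows. A production kernel would add blocking, vectorization, and threading on top of this loop structure.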
ISBN: 9798350308600 (print)
Counting and finding triangles in graphs is often used in real-world analytics to characterize cohesiveness and identify communities in graphs. In this paper, we propose the novel concept of a cover-edge set that can be used to find triangles more efficiently. We use a breadth-first search (BFS) to quickly generate a compact cover-edge set. Novel sequential and parallel triangle counting algorithms are presented that employ cover-edge sets. The sequential algorithm avoids unnecessary triangle-checking operations, and the parallel algorithm is communication-efficient. The parallel algorithm can asymptotically reduce communication on massive graphs such as real social networks and synthetic graphs from the Graph500 Benchmark. By our estimates for massive-scale Graph500 graphs, our new parallel algorithm can reduce the communication on a scale 36 graph by 1156x and on a scale 42 graph by 2368x.
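The following sequential sketch illustrates how a BFS can yield a set of edges that covers every triangle. It assumes the cover-edge set is taken to be the "horizontal" edges whose endpoints share a BFS level; every triangle must contain such an edge, because adjacent vertices differ by at most one level, so two of a triangle's three vertices share a level. The function names and the integer vertex labels are assumptions of this example, and it is not the paper's implementation.

```python
from collections import deque

def bfs_levels(adj):
    """Return a BFS level for every vertex, restarting in each component."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level

def count_triangles(edges):
    """Count triangles by inspecting only same-level (horizontal) edges."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    level = bfs_levels(adj)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v and level[u] == level[v]:
                for w in adj[u] & adj[v]:
                    # A triangle with one horizontal edge is seen once.
                    # A triangle with all vertices on one level is seen
                    # from three edges, so keep only the u < v < w view.
                    if level[w] != level[u] or w > v:
                        total += 1
    return total
```

Only horizontal edges trigger a neighborhood intersection, which is where a compact cover-edge set saves work compared with checking every edge.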
Purpose String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and *** In this paper we present caps-sa, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, caps-sa has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache *** We show that despite its simple design, caps-sa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that caps-sa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://***/jamshed/CaPS-SA.
Δ-stepping is a famous parallel algorithm for the single-source shortest path problem. It requires a tuning parameter (delta) to achieve a good trade-off between parallelism and work efficiency. The performance of Δ...
In this article, a self-driving vehicle controller that optimizes the path a vehicle follows from its initial position to its destination is presented. The methods include clustering-based k-means, hierarchical, Gaussian matrix model, and self-organizing mapping. The real-time parallel implementation of the unsupervised machine learning algorithms could provide fast response times of under one microsecond during the lateral, longitudinal, and angular motion control of the autonomous vehicle. It was observed that a random selection of one of the machine learning methods may not always guarantee the optimality of the position and velocity variables as compared to the desired values. The proposed parallel implementation and optimization of the algorithms could have a significant contribution towards making transportation mobility more reliable and sustainable for future vehicular systems.
This paper presents the performance portable implementation of a kinetic plasma simulation code with C++ parallel algorithm to run across multiple CPUs and GPUs. Relying on the language standard parallelism stdpar and...
With the rapid development of society and the continuous improvement of science and technology, people have higher and higher requirements for the quality of life. At the same time, they have put forward higher, stric...
The paper considers the NP-hard problem of calculating the reliability of a network whose elements are subject to accidental failures. By network reliability, we mean the probabilistic connectivity of a random graph with unreliable edges. To evaluate the reliability of a network, a parallel Monte Carlo method is used, improved by checking the connectivity of a particular graph realization simultaneously with the generation of this realization. Based on multi-agent simulation, we study the scalability of this algorithm and tune the parameters for execution on high-performance supercomputers.
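One plausible way to check connectivity while a realization is being generated is to feed each surviving edge into a union-find structure and stop as soon as a single component remains. The sequential sketch below takes that approach. The union-find choice, the function names, and the single shared edge-survival probability `p` are all assumptions of this example; the abstract does not say how the authors interleave the two steps.

```python
import random

def find(parent, x):
    """Find the set representative, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def sample_is_connected(n, edges, p, rng):
    """Draw one realization edge by edge and test connectivity on the fly."""
    parent = list(range(n))
    components = n
    for u, v in edges:
        if rng.random() < p:            # the edge survives in this realization
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
                if components == 1:
                    return True         # connected; remaining edges are irrelevant
    return components == 1

def estimate_reliability(n, edges, p, trials, seed=0):
    """Monte Carlo estimate of the probability that the graph is connected."""
    rng = random.Random(seed)
    hits = sum(sample_is_connected(n, edges, p, rng) for _ in range(trials))
    return hits / trials
```

Because trials are independent, they can be divided among workers with separate random streams. For a triangle with `p = 0.5`, the exact reliability is 0.5 (at least two of the three edges must survive), which gives a quick sanity check for the estimate.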
We are concerned with the mapping on high performance hybrid architectures of a parallel software implementing a two level overlapping domain decomposition, that is, along space and time directions, of the four dimens...
The matching and linear matroid intersection problems are solvable in quasi-NC, meaning that there exist deterministic algorithms that run in polylogarithmic time and use quasi-polynomially many parallel processors. H...