We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms and inner products, orthonormalization, and rounding (rank truncation). These are the kernel operations for applications such as iterative Krylov solvers that exploit the TT structure. The parallel algorithms are designed for distributed-memory computation, and we propose a data distribution and strategy that parallelizes computations for individual cores within the TT format. We analyze the computation and communication costs of the proposed algorithms to show their scalability, and we present numerical experiments that demonstrate their efficiency on both shared-memory and distributed-memory parallel systems. For example, we observe better single-core performance than the existing MATLAB TT-Toolbox in rounding a 2GB TT tensor, and our implementation achieves a 34x speedup using all 40 cores of a single node. We also show nearly linear parallel scaling on larger TT tensors up to over 10,000 cores for all mathematical operations.
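To illustrate one of the kernel operations named above, the following is a minimal pure-Python sketch of TT addition. It is not the paper's distributed implementation. It assumes each TT core is stored as a nested list indexed as core[left_rank][mode_index][right_rank], with boundary ranks equal to 1 and at least two cores; the names tt_entry, tt_add and random_tt are illustrative. The sum is formed by concatenating the first and last cores and placing the middle cores block-diagonally, so the TT ranks of the result are the sums of the input ranks. This rank growth is why addition is usually followed by rounding.

```python
import itertools
import random


def tt_entry(cores, index):
    """Evaluate one tensor entry as a product of matrix slices, one per core."""
    row = [1.0]  # 1 x r_0 row vector, with r_0 = 1
    for core, i in zip(cores, index):
        r_next = len(core[0][i])
        row = [sum(row[a] * core[a][i][b] for a in range(len(core)))
               for b in range(r_next)]
    return row[0]


def tt_add(x, y):
    """Add two TT tensors with equal mode sizes (assumes at least two cores)."""
    d = len(x)
    out = []
    for k, (cx, cy) in enumerate(zip(x, y)):
        n = len(cx[0])
        rx1, ry1 = len(cx[0][0]), len(cy[0][0])
        if k == 0:
            # First core: concatenate along the right rank, [X Y].
            core = [[cx[0][i] + cy[0][i] for i in range(n)]]
        elif k == d - 1:
            # Last core: stack along the left rank, [X; Y].
            core = ([[row[:] for row in slab] for slab in cx]
                    + [[row[:] for row in slab] for slab in cy])
        else:
            # Middle cores: block-diagonal slices, [[X, 0], [0, Y]].
            core = ([[slab[i] + [0.0] * ry1 for i in range(n)] for slab in cx]
                    + [[[0.0] * rx1 + slab[i] for i in range(n)] for slab in cy])
        out.append(core)
    return out


def random_tt(dims, ranks, rng):
    """Random TT tensor; ranks has len(dims) + 1 entries, first and last equal to 1."""
    return [[[[rng.uniform(-1.0, 1.0) for _ in range(ranks[k + 1])]
              for _ in range(dims[k])]
             for _ in range(ranks[k])]
            for k in range(len(dims))]


if __name__ == "__main__":
    rng = random.Random(0)
    dims = [2, 3, 2]
    x = random_tt(dims, [1, 2, 3, 1], rng)
    y = random_tt(dims, [1, 3, 2, 1], rng)
    z = tt_add(x, y)
    for idx in itertools.product(*[range(n) for n in dims]):
        print(idx, tt_entry(z, idx), tt_entry(x, idx) + tt_entry(y, idx))
```

In a distributed setting each core would be partitioned across processes; this sketch keeps everything in one process to show only the algebra.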
ISBN: (print) 9781450395458
I've spent my career trying to make parallel algorithms accessible to the masses, working from the programming language, systems and algorithms sides. For much of this time, unfortunately, parallel machines were not ready for prime time. They were expensive, hard to access and quirky, and there was a lack of software support. Parallel algorithms and programming were reserved for a small cadre of experts. However, with advances over the past fifteen or so years we have gone from a situation where all commodity machines had a single processor to one in which all but perhaps a toaster has multiple processors (cores), some with hundreds or more. Given the state of modern machines, one should wonder whether we are at a point where parallel algorithms are ready for prime time and can replace sequential algorithms. Or perhaps a better question is whether we are at a point where algorithms should be algorithms, some with more parallelism than others. In the talk I argued that parallel algorithms are indeed ready for prime time. In particular, they supply useful abstractions, are broadly applicable, support general techniques, lead to interesting theoretical questions, are elegant, are easy to program, rely on a simple cost model, and importantly can lead to good efficiency on modern multicore machines, very much more so than sequential algorithms. As some evidence, I described our experience implementing a set of 60+ parallel algorithms across a wide set of domains.(1) With the right abstractions and techniques, the code is rarely significantly more complicated than the sequential counterpart, and sometimes simpler. Furthermore, the algorithms often get near perfect speedup relative to good sequential algorithms on a modern multicore. As such, it could be that the limiting factor in broadly adopting parallel algorithms is now more a social one than a technical one.
Analysis of parallel algorithms for graphics processors allows you to determine the bottlenecks of the algorithm that affect its performance on a particular computing system. Algorithms can be analyzed both at the AGM...
This study explores the application of parallel algorithms to large-scale sorting, focusing on the QuickSort method. QuickSort is implemented in both sequential and parallel forms, and the paper provides a detailed comparison of their performance. The study investigates the efficacy of both techniques through the lens of array generation and pivot selection to manage datasets of varying sizes. Timed with the C++ chrono library, sorting an array took 16,499.2 milliseconds with the serial implementation and 16,339 milliseconds with the parallel implementation, a difference of about 160 milliseconds, or roughly 1%. These results suggest that while the performance gain of the parallel approach over its serial counterpart is small for smaller datasets, the benefits are expected to be more substantial as the dataset size increases.
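The kind of comparison described above can be sketched as follows. This is an illustrative Python sketch, not the paper's C++ code: time.perf_counter stands in for the C++ chrono library, the median-of-three pivot rule is an assumption, and the function names are hypothetical. The parallel variant partitions once and sorts the two halves in separate threads. Under CPython's global interpreter lock this will typically show little or no speedup, so it illustrates the structure of the experiment rather than its results.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def partition(a):
    """Split a non-empty list around a median-of-three pivot (assumed pivot rule)."""
    pivot = sorted((a[0], a[len(a) // 2], a[-1]))[1]
    less = [v for v in a if v < pivot]
    equal = [v for v in a if v == pivot]
    greater = [v for v in a if v > pivot]
    return less, equal, greater


def quicksort(a):
    """Sequential quicksort that returns a new sorted list."""
    if len(a) <= 1:
        return list(a)
    less, equal, greater = partition(a)
    return quicksort(less) + equal + quicksort(greater)


def parallel_quicksort(a, pool):
    """Partition once, then sort the two sides as separate tasks."""
    if len(a) <= 1:
        return list(a)
    less, equal, greater = partition(a)
    left = pool.submit(quicksort, less)
    right = pool.submit(quicksort, greater)
    return left.result() + equal + right.result()


def timed_ms(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.randint(0, 10**6) for _ in range(200_000)]
    _, serial_ms = timed_ms(quicksort, data)
    with ThreadPoolExecutor(max_workers=2) as pool:
        _, parallel_ms = timed_ms(parallel_quicksort, data, pool)
    print(f"serial: {serial_ms:.1f} ms, parallel: {parallel_ms:.1f} ms")
```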
Computer tomography has a wide field of applicability; however, most of its applications assume that the data obtained from scans of the examined object meet expectations regarding amount and quality. Unfortunately, sometimes such data cannot be obtained, and we must then deal with an incomplete data set. In this paper we consider an unusual case of this situation, which may occur when access to the examined object is difficult. Previous research conducted by the author showed that CT algorithms can be used successfully in this case as well, but the reconstruction time is problematic. One way to reduce the reconstruction time is to perform the calculations in parallel. In the analyzed approach, the system of linear equations is divided into blocks such that each block is processed by a different thread. Until now, this approach had been investigated only theoretically. The current paper examines the usefulness of the parallel-block approach proposed by the author. The research shows that, for an incomplete data set as well, optimal values of the reconstruction parameters can be selected for the analyzed algorithm. For a given number of pixels, a reconstruction with a given maximum error can also be obtained. The paper indicates the differences between the classical CT problem and the one examined here. The results confirm that the actual implementation of the parallel algorithm also converges, which means it is useful.
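The abstract does not specify the author's parallel-block algorithm, so the following is only a generic sketch of the idea of splitting a linear system into blocks handled by separate threads. It assumes a Kaczmarz-type (ART-style) row projection within each block and simple averaging of the block results; the function names, the interleaved block layout and the relaxation parameter are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def kaczmarz_sweep(rows, rhs, x, relax=1.0):
    """One pass of row projections over a block, starting from x."""
    x = list(x)
    for a, b in zip(rows, rhs):
        norm2 = sum(v * v for v in a)
        if norm2 == 0.0:
            continue  # skip empty rows
        step = relax * (b - sum(ai * xi for ai, xi in zip(a, x))) / norm2
        x = [xi + step * ai for xi, ai in zip(x, a)]
    return x


def block_parallel_solve(A, b, n_blocks=2, iterations=200):
    """Sweep each block in its own thread, then average the block results."""
    n = len(A[0])
    # Assumed layout: rows are dealt to blocks in an interleaved fashion.
    blocks = [(A[k::n_blocks], b[k::n_blocks]) for k in range(n_blocks)]
    x = [0.0] * n
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        for _ in range(iterations):
            results = list(pool.map(
                lambda blk: kaczmarz_sweep(blk[0], blk[1], x), blocks))
            x = [sum(col) / len(results) for col in zip(*results)]
    return x


if __name__ == "__main__":
    # Small consistent system whose exact solution is (1, 2).
    A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0], [3.0, 2.0]]
    b = [4.0, 7.0, -1.0, 7.0]
    print(block_parallel_solve(A, b))
```

In a real CT reconstruction the matrix would be large and sparse, and the number of blocks and the relaxation parameter would be among the reconstruction parameters to tune.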
In hydrological modeling, the longest flow path is an important feature used to characterize a catchment. Many existing GIS platforms offer dedicated software tools for its identification and delineation, generally implementing methods based on searching through the flow direction data. Unfortunately, currently available algorithms for this task often turn out to be inefficient, especially when working with modern large datasets. Moreover, existing methods often rely on incorrect assumptions or perform calculations in a way that can lead to precision issues. In this work, new parallel algorithms were developed, tested and presented. Measurements show that two of the newly proposed implementations are able to identify the longest flow paths in significantly less time compared with other existing methods.
In this paper, we study the problem of maximizing a monotone normalized one-sided σ-smooth (OSS for short) function F(x), subject to a convex polytope (not necessarily downward-closed [1]). A function F(x) is one-sided ...
Given a set of vectors X = {x_1, ..., x_n} ⊂ R^d, the Euclidean max-cut problem asks to partition the vectors into two parts so as to maximize the sum of Euclidean distances which cross the partition. We design new alg...
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree T, the SLD of T is a binary dendrogram that summarizes the n...
ISBN: (digital) 9798331509118
ISBN: (print) 9798331509125
This work investigates how graph characteristics affect the quality of derived graphs, specifically focusing on graph spanners. Graph spanners retain all vertices and a subset of edges while preserving shortest distances with an allowable stretch, making them essential for efficiently approximating graph structures. We emphasize recent advancements in parallel algorithms for constructing spanners in sparse graphs, building on the work of Miller et al. and Forster et al. By extracting key graph properties and employing data analysis techniques—such as correlation analysis, linear regression, and random forest regression—we examine the relationships between these characteristics and the size of the derived graphs, which is vital for optimizing spanner construction in real-world applications.
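To make the notion of a spanner with an allowable stretch concrete, here is a minimal sketch of the classical sequential greedy construction. It is not one of the parallel algorithms of Miller et al. or Forster et al. referenced above. It assumes an undirected graph given as (u, v, weight) triples over vertices 0..n-1; the function names are illustrative. An edge is kept only if the spanner built so far does not already connect its endpoints within stretch times the edge weight.

```python
import heapq


def shortest_distance(adj, source, target, limit):
    """Dijkstra from source to target, ignoring paths longer than limit."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd <= limit and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def greedy_spanner(n, edges, stretch):
    """Return the edges of a greedy spanner with the given stretch."""
    adj = {v: [] for v in range(n)}
    kept = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        # Keep the edge only if the current spanner path is too long.
        if shortest_distance(adj, u, v, stretch * w) > stretch * w:
            adj[u].append((v, w))
            adj[v].append((u, w))
            kept.append((u, v, w))
    return kept


if __name__ == "__main__":
    triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
    print(greedy_spanner(3, triangle, stretch=2.0))
    print(greedy_spanner(3, triangle, stretch=1.5))
```

The number of kept edges is the spanner size that the study relates to graph characteristics.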