Author:
Träff, Jesper Larsson, TU Wien
Faculty of Informatics, Institute of Computer Engineering, Research Group Parallel Computing 191-4, Treitlstrasse 3, 5th Floor, 1040 Vienna, Austria
These lecture notes are designed to accompany an imaginary, virtual, undergraduate, one- or two-semester course on the fundamentals of parallel computing, as well as to serve as background and reference for graduate courses on high-performance computing, parallel algorithms, and shared-memory multiprocessor programming. They introduce theoretical concepts and tools for expressing, analyzing, and judging parallel algorithms and, in detail, cover the two most widely used concrete frameworks, OpenMP and MPI, as well as the threading interface pthreads, for writing parallel programs for either shared- or distributed-memory parallel computers, with emphasis on general concepts and principles. Code examples are given in a C-like style, and many are actual, correct C code. The lecture notes deliberately do not cover GPU architectures and GPU programming, but the general concerns, guidelines, and principles (time, work, cost, efficiency, scalability, memory structure and bandwidth) are just as relevant for efficiently utilizing various GPU architectures. Likewise, the lecture notes focus on deterministic algorithms only and do not use randomization. Slides or blackboard drawings are imagined to be worked out for the actual lectures by the lecturer, so the lecture notes deliberately do not provide such important visual aids; some are available from the author on request. The student of this material will also find it instructive to take the time to understand concepts and algorithms visually. The exercises can be used for self-study and as inspiration for small implementation projects in OpenMP and MPI that can and should accompany any serious course on parallel computing. The student will benefit from actually implementing and carefully benchmarking the suggested algorithms on the parallel computing system that may or should be made available as part of such a course. In class, the exercises can be used as the basis for hand-ins and small programming projects, for which su
Given multiple data sets, the problem of record linkage is to cluster them such that each cluster has all the information pertaining to a single entity and does not contain any other information. This problem has nume...
ISBN:
(digital) 9798331504205
ISBN:
(print) 9798331504212
In order to accurately and quickly find the network structure in big data, this paper proposes a big data clustering algorithm based on community maximal cliques. To address the time cost caused by the uncertainty of the initial nodes and by the evaluation of the fitness function, local key nodes are introduced and the fitness formula is improved. For the formation of the initial community, the concept of the maximal clique is introduced. By analyzing the characteristics of maximal cliques, it is concluded that the core category of a community is composed of maximal cliques. A method to obtain local core categories through maximal clique discovery is then proposed, along with a parallel strategy for the maximal clique discovery algorithm. Finally, a parallel strategy for the whole algorithm is proposed and experiments are conducted on real datasets. The experimental results show that the proposed algorithm is feasible, effective, and applicable to the discovery of network structures in large-scale data.
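The maximal-clique-discovery step that the abstract builds on can be illustrated by the classic Bron-Kerbosch enumeration. This is a minimal sequential sketch (the function names and toy graph are illustrative, not the paper's implementation); in a parallel strategy, the top-level branches of the recursion are natural independent tasks.

```python
def bron_kerbosch(R, P, X, adj, cliques):
    """Enumerate maximal cliques: R is the growing clique, P the candidate
    vertices, X the already-explored vertices (no-pivot variant)."""
    if not P and not X:
        cliques.append(sorted(R))
        return
    for v in list(P):
        # each branch is independent, so branches can run in parallel
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P.remove(v)
        X.add(v)

def maximal_cliques(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return cliques

# Two triangles sharing the edge (2, 3)
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
print(sorted(maximal_cliques(edges)))  # [[1, 2, 3], [2, 3, 4]]
```

The two maximal cliques found here would be the "local core categories" seeds in the abstract's terminology.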
ISBN:
(digital) 9798350376647
ISBN:
(print) 9798350376654
Sparse matrix computations are an important class of algorithms. One important topic in this field is SPCA (Sparse Principal Component Analysis), a variant of PCA used to compute sparse principal components of a matrix. There are various methods for computing the sparse principal components of a dataset; one of them is congradU (conditional gradient algorithm with unit step size), an iterative method that performs a matrix-vector multiplication in each iteration, so accelerating this multiplication accelerates the whole method. We propose a parallel algorithm for congradU that uses a master/worker model to distribute the rows of the matrix among the cores or processors so as to balance the workload among them; balancing the workload reduces the overall execution time. The proposed algorithm has been tested on randomly generated matrices with different sizes and sparsity percentages. We compare the time to find the first principal component using the proposed algorithm against the SVD algorithm, and observe that as the size and sparsity percentage of the matrix increase, the proposed algorithm finds the first principal component faster than SVD. We also compare the time of the multiplication operation in one iteration of the proposed algorithm against the dot operator (in Python), and observe that as the sparsity percentage increases, the proposed algorithm outperforms the dot operator.
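The row-distribution idea can be sketched as follows. This is a pure-Python stand-in, not the paper's implementation: the dict-of-rows sparse format, the function names, and the greedy longest-processing-time balancing heuristic are all assumptions made for illustration.

```python
def partition_rows(row_nnz, n_workers):
    """Greedily assign rows to workers, balancing total nonzero count
    (longest-processing-time-first heuristic)."""
    loads = [0] * n_workers
    parts = [[] for _ in range(n_workers)]
    for r in sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r]):
        w = loads.index(min(loads))  # least-loaded worker
        parts[w].append(r)
        loads[w] += row_nnz[r]
    return parts

def spmv_rows(rows, A, x):
    """Worker task: y[r] = A[r] . x for the assigned rows
    (A is a dict mapping row -> {col: value})."""
    return {r: sum(v * x[c] for c, v in A.get(r, {}).items()) for r in rows}

A = {0: {0: 2.0, 2: 1.0}, 1: {1: 3.0}, 2: {0: 1.0, 1: 1.0, 2: 1.0}}
x = [1.0, 2.0, 3.0]
parts = partition_rows([len(A.get(r, {})) for r in range(3)], 2)
y = [0.0] * 3
for part in parts:  # each chunk could be dispatched to a worker process
    for r, val in spmv_rows(part, A, x).items():
        y[r] = val
print(y)  # [5.0, 6.0, 6.0]
```

In a real master/worker setup, each `spmv_rows` call would run on a separate core and the master would merge the partial results, as in the merge loop above.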
Computing strongly connected components (SCC) is among the most fundamental problems in graph analytics. Given the large size of today's real-world graphs, parallel SCC implementation is increasingly important. SCC is challenging in the parallel setting and is particularly hard on large-diameter graphs; many existing parallel SCC implementations can be even slower than Tarjan's sequential algorithm on such graphs. To tackle this challenge, we propose an efficient parallel SCC implementation using a new parallel reachability approach. Our solution is based on a novel idea referred to as vertical granularity control (VGC), which breaks synchronization barriers to increase parallelism and hide scheduling overhead. To use VGC in our SCC algorithm, we also design an efficient data structure called the parallel hash bag, which uses parallel dynamic resizing to avoid redundant work in maintaining frontiers (vertices processed in a round). We implement the parallel SCC algorithm by Blelloch et al. (J. ACM, 2020) using our new parallel reachability approach. We compare our implementation to state-of-the-art systems, including GBBS, iSpan, Multi-step, and our highly optimized Tarjan's (sequential) algorithm, on 18 graphs, including social, web, k-NN, and lattice graphs. On a machine with 96 cores, our implementation is the fastest on 16 of the 18 graphs. On average (geometric mean) over all graphs, our SCC is 6.0× faster than the best previous parallel code (GBBS), 12.8× faster than Tarjan's sequential algorithm, and 2.7× faster than the best existing implementation on each graph. We believe that our techniques are of independent interest. We also apply our parallel hash bag and VGC scheme to other graph problems, including connectivity and least-element lists (LE-lists); our implementations improve on the state-of-the-art parallel implementations for these two problems.
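The reachability-based decomposition underlying such SCC algorithms can be shown in a minimal sequential sketch: the SCC of a pivot vertex is the intersection of its forward and backward reachable sets, and the three remaining parts each contain whole SCCs, so they can be processed recursively (concurrently, in a parallel implementation). This sketch deliberately omits VGC, hash bags, and all parallel machinery; the toy graph is illustrative.

```python
from collections import deque

def reach(src, adj, allowed):
    """BFS from src, restricted to the 'allowed' vertex set.
    Each BFS level corresponds to one frontier round."""
    seen, frontier = {src}, deque([src])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

def fw_bw_scc(vertices, adj, radj):
    """Forward-backward SCC decomposition."""
    if not vertices:
        return []
    pivot = next(iter(vertices))
    fwd = reach(pivot, adj, vertices)   # forward-reachable from pivot
    bwd = reach(pivot, radj, vertices)  # backward-reachable from pivot
    scc = fwd & bwd                     # pivot's SCC
    out = [scc]
    # the three remainders contain whole SCCs: recurse (parallelizable)
    for part in (fwd - scc, bwd - scc, vertices - fwd - bwd):
        out += fw_bw_scc(part, adj, radj)
    return out

# Toy graph: cycle 1->2->3->1, bridge 3->4, cycle 4<->5
adj = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
radj = {}
for u, vs in adj.items():
    for v in vs:
        radj.setdefault(v, []).append(u)
sccs = fw_bw_scc(set(adj), adj, radj)
print(sorted(sorted(s) for s in sccs))  # [[1, 2, 3], [4, 5]]
```

The abstract's contribution is precisely in making the `reach` step fast on large-diameter graphs, where the number of BFS rounds (and hence synchronization barriers) is large.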
ISBN:
(digital) 9798350390315
ISBN:
(print) 9798350390322
To reduce the high time cost of power flow calculation for urban rail transit traction power supply networks, we investigate an acceleration algorithm. As the complexity and scale of these systems continue to increase, traditional serial calculation methods can no longer meet practical needs, and parallel algorithms have become an important way to improve the speed and efficiency of power flow calculations. In the proposed algorithm, the traction power supply network data in the database are read in batches, and, based on the train's running time, the MapReduce model and a process pool are used to process and compute the partitioned data in parallel; the results are then merged and output. This approach makes full use of parallel computing hardware, such as multi-core CPUs, while reducing the time spent on data transmission and processing, thereby effectively improving the speed and efficiency of power flow calculations. Experimental results show that the speedup is closely related to the number of parallel processes and CPU cores: the more cores, the greater the speedup. In our experiments, the speedup is best when the number of CPU cores equals the number of processes, and the calculation time after optimization is about 1/6 of that before optimization.
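The process-pool MapReduce pattern described above can be sketched as follows. The per-record "solve" is a placeholder (a trivial P = V * I computation on fabricated records), since the actual power-flow equations are not given in the abstract; only the split/map/merge structure is the point.

```python
from multiprocessing import Pool

def solve_chunk(chunk):
    """Map step: stand-in for a per-timestep power-flow solve.
    Each record is (timestep, voltage, current); output is (timestep, power)."""
    return [(t, v * i) for t, v, i in chunk]

def chunked(records, n):
    """Split the batch of records into roughly n chunks."""
    k = max(1, len(records) // n)
    return [records[i:i + k] for i in range(0, len(records), k)]

def parallel_power_flow(records, n_procs=4):
    with Pool(n_procs) as pool:                         # process pool
        partials = pool.map(solve_chunk, chunked(records, n_procs))
    return sorted(r for part in partials for r in part)  # reduce: merge

if __name__ == "__main__":
    # fabricated sample data: 8 timesteps at 750 V
    records = [(t, 750.0, 0.1 * t) for t in range(8)]
    print(parallel_power_flow(records))
```

As the abstract notes, the speedup of such a scheme saturates once the number of pool processes exceeds the number of physical cores, which is why cores == processes was optimal in their experiments.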
In this paper, we propose a collision detection algorithm for a dynamic simulation system. This algorithm first conducts a global search for objects, optimizes global detection through spatial decomposition, and uses s...
Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of th...
ISBN:
(digital) 9798350351019
ISBN:
(print) 9798350351026
Computational electromagnetics methods for analysing nonlinear systems, such as the harmonic balance (HB) method, are computationally complex, especially when dealing with a large number of frequency points. In this paper, we propose a fast parallel algorithm for the HB method to accelerate electromagnetic simulation. The new algorithm parallelizes the construction of the nonlinear Jacobian matrix, utilizing a graphics processing unit (GPU) to accelerate the simulation. We present the formulations of the parallel HB method and subsequently provide its implementation details on a mixed GPU/CPU platform. Experimental results from several industrial cases show that the new parallel algorithm achieves a 3× speedup over the conventional HB method while maintaining comparable accuracy, with the GPU-accelerated part about 10 times faster than its CPU counterpart.
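Why Jacobian construction parallelizes well can be seen even in a CPU-side sketch: in a finite-difference approximation, each column j depends only on an independent evaluation of f at a perturbed point, so the columns are embarrassingly parallel tasks (the kind of independence a GPU kernel exploits). The function below is a generic illustration, not the paper's HB formulation.

```python
def jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m.
    Column j needs only f(x + h*e_j), so each column is an
    independent task that could run on its own thread/GPU lane."""
    fx = f(x)
    n, m = len(x), len(fx)
    cols = []
    for j in range(n):  # independent tasks: one per column
        xp = list(x)
        xp[j] += h
        fxp = f(xp)
        cols.append([(fxp[i] - fx[i]) / h for i in range(m)])
    # transpose the column list into a row-major matrix
    return [[cols[j][i] for j in range(n)] for i in range(m)]

# f(x0, x1) = (x0^2, x0*x1), exact J = [[2*x0, 0], [x1, x0]]
f = lambda x: [x[0] ** 2, x[0] * x[1]]
J = jacobian(f, [3.0, 2.0])
print(J)  # approximately [[6.0, 0.0], [2.0, 3.0]]
```

For an analytic (rather than finite-difference) Jacobian, as in the HB method, the same structure holds: each entry is computed from local data, which is what makes the GPU parallelization of the construction step effective.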
This paper presents a reexamination of the research paper titled "Communication-Avoiding parallel algorithms for TRSM" by Wicky et al. We focus on the communication bandwidth cost analysis presented in the o...