ISBN: (print) 9781509036530
The single-linkage (SLINK) hierarchical clustering algorithm is often preferred over traditional partitioning-based clustering because it does not require the number of clusters as input. However, its high time complexity and inherent data dependencies prevent it from scaling well to large datasets. In this paper, we parallelize an efficient implementation of the SLINK algorithm to leverage a commodity cluster of multicore workstations. We present dGridSlink, a distributed algorithm that outperforms the best existing parallel solution in the literature on all the real datasets considered. We also propose a hybrid parallel algorithm, hGridSLINK, for a cluster of multicore nodes. The proposed parallel algorithms are scalable and can cluster several million data points efficiently without compromising clustering quality.
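For orientation, sequential single-linkage clustering can be sketched as a Kruskal-style merge over sorted pairwise distances. This is a minimal illustration only, not dGridSlink or hGridSLINK; the helper name `single_linkage`, the cut `threshold`, and the sample points are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def single_linkage(points, threshold):
    """Return one cluster label per point (hypothetical helper).

    Two points share a label when a chain of hops, each no longer than
    `threshold`, connects them. Cost is O(n^2 log n), so this is a
    reference sketch and not a scalable implementation.
    """
    parent = list(range(len(points)))

    def find(i):
        # Walk to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visiting pairs in order of increasing distance reproduces the
    # single-linkage merge order; stopping at `threshold` cuts the dendrogram.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    for d, i, j in edges:
        if d > threshold:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    return [find(i) for i in range(len(points))]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
labels = single_linkage(points, threshold=1.5)
```

No cluster count is passed in: the number of clusters follows from the distance cut, which is the property the abstract highlights.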
ISBN: (print) 9781509052141
Fortran coarrays have been used as an extension to the standard for over 20 years, mostly on Cray systems. Their appeal to users increased substantially when they were standardised in 2010. In this work we show that coarrays offer simple and intuitive data structures for 3D cellular automata (CA) modelling of material microstructures. We show how coarrays can be used together with an MPI finite element (FE) library to create a two-way concurrent, hierarchical and scalable multi-scale CAFE deformation and fracture framework. The design of CGPACK, a coarray cellular automata microstructure evolution library, is described. ParaFEM, a highly portable MPI FE library, was used in this work. We show that, run independently, CGPACK and ParaFEM programs scale well into tens of thousands of cores. Strong scaling of a hybrid ParaFEM/CGPACK MPI/coarray multi-scale framework was measured on a practically important solid mechanics example: the fracture of a steel round bar under tension. That program did not scale beyond 7 thousand cores. Excessive synchronisation might be one contributing factor to the relatively poor scaling. We therefore conclude with a comparative analysis of synchronisation requirements in MPI and coarray programs. Specific challenges of synchronising a coarray library are discussed.
ISBN: (print) 9781509052523
Computing power for High Performance Computing (HPC) and scientific computing is now commonly delivered through parallelism, on multi-core systems or heterogeneous architectures. Although many algorithms have been proposed and implemented sequentially, parallel alternatives can provide more suitable, higher-performance solutions to the same problems. In this paper, three parallelization strategies are proposed and implemented for a dynamic-programming-based cloud smoothing application, using both shared-memory and non-shared-memory approaches. The experiments are performed on the NVIDIA GeForce GT750m and Tesla K20m, two GPU accelerators of the Kepler architecture. A detailed performance analysis covers partition granularity at block and thread levels, memory access efficiency, and computational complexity. The evaluations show that the parallel implementations approximate the results closely and run with high efficiency, and that these strategies can be adopted in similar data analysis and processing applications.
ISBN: (print) 9783319449142; 9783319449135
We consider a routing problem with constraints. To solve this problem, we employ a variant of the dynamic programming method in which the significant part of the Bellman function (that is, the part that matters in view of the precedence constraints) is calculated by means of an independent calculations scheme. We propose a parallel implementation of the algorithm for a supercomputer, where the position space layers for the hypothetical processors are constructed using the apparatus of discrete dynamical systems.
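The idea of evaluating the Bellman function only on states that respect the precedence constraints can be illustrated with a small memoized dynamic program over visited-sets. This is a sequential sketch under assumed conventions (node 0 is the base, the route is open, and `precedence` is a list of "a before b" pairs); it is not the authors' independent calculations scheme or their layer construction, and `min_route_cost` is a hypothetical name.

```python
from functools import lru_cache


def min_route_cost(cost, precedence):
    """Cheapest open route from node 0 through all nodes.

    `precedence` holds pairs (a, b) meaning a must be visited before b.
    Top-down memoization touches only feasible states, so infeasible
    visited-sets are never evaluated.
    """
    n = len(cost)
    full = (1 << n) - 1

    # must_precede[b] is a bitmask of nodes required before b.
    must_precede = [0] * n
    for a, b in precedence:
        must_precede[b] |= 1 << a

    @lru_cache(maxsize=None)
    def best(visited, last):
        if visited == full:
            return 0
        result = float("inf")
        for nxt in range(n):
            bit = 1 << nxt
            if visited & bit:
                continue  # already on the route
            if must_precede[nxt] & ~visited:
                continue  # a required predecessor is still unvisited
            result = min(result, cost[last][nxt] + best(visited | bit, nxt))
        return result

    return best(1, 0)  # start at the base with only node 0 visited


cost = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
unconstrained = min_route_cost(cost, [])        # route 0-1-2-3
constrained = min_route_cost(cost, [(3, 1)])    # node 3 must come before node 1
```

In this toy instance the constraint raises the optimal cost from 4 to 6 while also shrinking the set of states the recursion visits.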
ISBN: (print) 9783319446363; 9783319446356
In this research, parallel versions of two existing algorithms, which implement the Maximum Likelihood Scale Invariant Map (MLHL-SIM) and the Scale Invariant Map (SIM), are proposed. By using OpenMP to distribute the independent iterations of for-loops among the available threads, a significant reduction in computation time is achieved in all the experiments. The larger the map, the greater the reduction in computation time achieved by the parallel algorithm. For two given datasets, the measured times for the MLHL-SIM algorithm are as low as 29.45% and 36.21% of the sequential time. For the SIM algorithm, the computation time is likewise reduced, to 42.09% and 36.72% of the sequential version for the two datasets respectively. The results confirm the speed-up of the parallel version.
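The percentages above are parallel run times expressed as fractions of the sequential time, so the implied speed-up is the reciprocal of each fraction. The short calculation below makes that conversion explicit; the dictionary layout and the labels "dataset 1" and "dataset 2" are assumptions, since the abstract does not name the datasets.

```python
# Parallel time as a fraction of sequential time, taken from the abstract.
time_fraction = {
    ("MLHL-SIM", "dataset 1"): 0.2945,
    ("MLHL-SIM", "dataset 2"): 0.3621,
    ("SIM", "dataset 1"): 0.4209,
    ("SIM", "dataset 2"): 0.3672,
}

# Speed-up = sequential time / parallel time = 1 / fraction.
speedup = {key: 1.0 / fraction for key, fraction in time_fraction.items()}

for (algorithm, dataset), s in speedup.items():
    print(f"{algorithm:9s} {dataset}: {s:.2f}x")
```

Read this way, the best reported case corresponds to a speed-up of roughly 3.4 and the weakest to roughly 2.4.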
ISBN: (print) 9781509021406
In this era of Big Data, large graphs appear in many scientific domains. To extract the hidden knowledge and correlations in these graphs, novel methods need to be developed to analyse them quickly. In this paper, we present a unified framework of stochastic matrix-function estimators, which allows one to compute a subset of the elements of the matrix f(A), where f is an arbitrary function and A is the adjacency matrix of the graph. The new framework has a computational cost proportional to the size of the subset; for example, obtaining the diagonal of f(A) for a matrix of size N costs a time proportional to N, in contrast to the traditional N^3 of diagonalization. Furthermore, we show that the new framework allows implementations of the algorithm that scale naturally with the number of compute nodes and are easily ported to accelerators, where the kernels perform very well.
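To give a feel for how the diagonal of f(A) can be estimated without diagonalization, the sketch below uses a generic random-probe (Hutchinson-style) diagonal estimator with Rademacher vectors. It is a stand-in chosen for illustration, not necessarily the estimator of the paper; the choice f(A) = A^2, the 4-vertex path graph, and the names `matvec` and `estimate_diagonal` are all assumptions.

```python
import random


def matvec(A, v):
    """Dense matrix-vector product on plain lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def estimate_diagonal(apply_f, n, num_probes, rng):
    """Estimate diag(f(A)) from products f(A) v with random +/-1 vectors.

    For a probe v with entries +/-1, v_i * (f(A) v)_i equals f(A)_ii plus
    zero-mean cross terms, so averaging over probes converges to the diagonal.
    """
    acc = [0.0] * n
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        w = apply_f(v)
        for i in range(n):
            acc[i] += v[i] * w[i]
    return [a / num_probes for a in acc]


# Adjacency matrix of the path graph 0 - 1 - 2 - 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def apply_f(v):
    # f(A) = A^2, applied as two matrix-vector products (never formed explicitly).
    return matvec(A, matvec(A, v))


estimate = estimate_diagonal(apply_f, n=4, num_probes=20000, rng=random.Random(0))
exact = [1, 2, 2, 1]  # diag(A^2) is the vertex degree for a simple graph
```

Only matrix-vector products are required, and the probes are independent of one another, which is why estimators of this kind distribute easily across compute nodes.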
ISBN: (print) 9781450340731
Massively parallel architectures such as the GPU are becoming increasingly important due to the recent proliferation of data. In this paper, we propose a key class of hybrid parallel graphlet algorithms that leverage multiple CPUs and GPUs simultaneously for computing k-vertex induced subgraph statistics (called graphlets). In addition to the hybrid multi-core CPU-GPU framework, we also investigate single-GPU methods (using multiple cores) and multi-GPU methods that leverage all available GPUs simultaneously for computing induced subgraph statistics. Both of these methods use GPU devices only, whereas the hybrid multi-core CPU-GPU framework leverages all available multi-core CPUs and multiple GPUs for computing graphlets in large networks. Compared to recent approaches, our methods are orders of magnitude faster, while also being more cost-effective, enjoying superior performance per capita and per watt. In particular, the methods are up to 300 times faster than a recent state-of-the-art method. To the best of our knowledge, this is the first work to leverage multiple CPUs and GPUs simultaneously for computing induced subgraph statistics.
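As a concrete instance of k-vertex induced subgraph statistics, the sketch below counts the two connected graphlets for k = 3 (triangles and induced 3-vertex paths) on a small undirected graph. It is a plain sequential CPU illustration, not the hybrid CPU-GPU method of the paper; the adjacency-set representation and the name `count_3_graphlets` are assumptions.

```python
from itertools import combinations


def count_3_graphlets(adj):
    """Count connected induced subgraphs on 3 vertices.

    `adj` maps each vertex to the set of its neighbours (undirected graph,
    no self-loops). Every pair of neighbours of a centre vertex forms either
    a triangle (if the pair is adjacent) or an induced path.
    """
    triangle_corners = 0
    paths = 0
    for centre, neighbours in adj.items():
        for u, w in combinations(sorted(neighbours), 2):
            if w in adj[u]:
                triangle_corners += 1  # seen once from each of its 3 corners
            else:
                paths += 1             # centre is the unique middle vertex
    return {"triangle": triangle_corners // 3, "path": paths}


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}
counts = count_3_graphlets(adj)
```

The work per centre vertex is independent of the other vertices, which is the kind of structure that lets such counts be split across CPU cores and GPUs.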
ISBN: (print) 9781509036820
The processing of graphs is of increasing importance in many applications, and the size of such graphs is growing rapidly. As with scientific computing, there is a growing need to understand the relationship between system architectures and graph algorithms, especially as both the scale of the system and the size of the graph increase. To date, only one such graph benchmark has several hundred comparative reports available, namely Breadth First Search, which over the last few years has fueled new algorithms that have improved typical performance very significantly. This paper suggests an additional benchmark, based on the computation of neighborhoods and Jaccard coefficients, that has a different intrinsic complexity and can be recast in multiple ways that may suit different classes of real-world applications.
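For readers unfamiliar with the quantity being benchmarked, the Jaccard coefficient of two vertices is the size of the intersection of their neighborhoods divided by the size of the union. The sketch below computes it for every vertex pair with at least one common neighbor; the adjacency-set representation, the sample graph, and the name `all_jaccard` are assumptions, and this brute-force loop is only a reference formulation, not the benchmark kernel itself.

```python
from itertools import combinations


def all_jaccard(adj):
    """Jaccard coefficient |N(u) & N(v)| / |N(u) | N(v)| for vertex pairs.

    `adj` maps each vertex to the set of its neighbours. Pairs without a
    common neighbour have coefficient 0 and are omitted from the result.
    """
    result = {}
    for u, v in combinations(sorted(adj), 2):
        common = len(adj[u] & adj[v])
        if common:
            result[(u, v)] = common / len(adj[u] | adj[v])
    return result


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}
coefficients = all_jaccard(adj)
```

Unlike Breadth First Search, which touches each edge a bounded number of times, this computation works on pairs of neighborhoods, which gives it a different cost profile.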
ISBN: (print) 9781450341219
Modern applications such as graph and data analytics, when operating on real-world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much more slowly than the per-CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw only a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap. In this paper, we introduce MILK, a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP. We evaluate the MILK compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3x.
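The general idea of turning random indirect references into batched accesses can be sketched as a two-pass scatter: first append each update to the bucket that owns its target range, then apply the buckets one at a time. This is a conceptual illustration only; it is not MILK syntax or the code the MILK compiler generates, Python will not show the cache effect, and the function names and `bucket_size` parameter are assumptions.

```python
def scatter_add_direct(n, indices, values):
    """Baseline: each update lands at an arbitrary position in `out`."""
    out = [0] * n
    for i, x in zip(indices, values):
        out[i] += x
    return out


def scatter_add_batched(n, indices, values, bucket_size):
    """Same result, with updates grouped by the target range they touch."""
    num_buckets = (n + bucket_size - 1) // bucket_size
    buckets = [[] for _ in range(num_buckets)]

    # Pass 1: stream through the updates and append each one to the bucket
    # covering its target index. Appends are sequential writes.
    for i, x in zip(indices, values):
        buckets[i // bucket_size].append((i, x))

    # Pass 2: drain one bucket at a time. All updates in a bucket fall in a
    # window of `bucket_size` slots, which can be sized to fit in cache.
    out = [0] * n
    for bucket in buckets:
        for i, x in bucket:
            out[i] += x
    return out


indices = [7, 0, 3, 7, 2, 5, 0]
values = [1, 2, 3, 4, 5, 6, 7]
direct = scatter_add_direct(8, indices, values)
batched = scatter_add_batched(8, indices, values, bucket_size=4)
```

Both functions produce the same array; the batched form trades an extra pass over the data for locality in the second pass.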
ISBN: (print) 9781450341325
We introduce the sparsified Cholesky and sparsified multigrid algorithms for solving systems of linear equations. These algorithms accelerate Gaussian elimination by sparsifying the nonzero matrix entries created by the elimination process. We use these new algorithms to derive the first nearly linear time algorithms for solving systems of equations in connection Laplacians, a generalization of Laplacian matrices that arise in many problems in image and signal processing. We also prove that every connection Laplacian has a linear-sized approximate inverse. This is an LU factorization with a linear number of nonzero entries that is a strong approximation of the original matrix. Using such a factorization, one can solve systems of equations in a connection Laplacian in linear time. Such a factorization was unknown even for ordinary graph Laplacians.