The Tucker tensor decomposition is a natural extension of the singular value decomposition (SVD) to multiway data. We propose to accelerate Tucker tensor decomposition algorithms by using randomization and parallelization. We present two algorithms that scale to large data and many processors, significantly reduce both computation and communication cost compared to previous deterministic and randomized approaches, and obtain nearly the same approximation errors. The key idea in our algorithms is to perform randomized sketches with Kronecker-structured random matrices, which reduces computation compared to unstructured matrices and can be implemented using a fundamental tensor computational kernel. We provide probabilistic error analysis of our algorithms and implement a new parallel algorithm for the structured randomized sketch. Our experimental results demonstrate that our combination of randomization and parallelization achieves accurate Tucker decompositions much faster than alternative approaches. We observe up to a 16X speedup over the fastest deterministic parallel implementation on 3D simulation data.
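To make the key idea concrete, here is a minimal NumPy sketch (our own illustration, not the paper's implementation: the function names, sketch sizes, and the dense test tensor are all hypothetical, and the paper's parallel TTM-based kernel is not reproduced). Contracting every mode except n with a small Gaussian is equivalent to multiplying the mode-n unfolding by a Kronecker-structured random matrix, but far cheaper than an unstructured sketch of the same size:

```python
import numpy as np

def mode_unfold(T, n):
    # Mode-n unfolding: mode n indexes the rows, all other modes are flattened.
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def kron_sketch_factor(T, n, sketch_sizes, rank, rng):
    """Randomized range finder for the mode-n unfolding of T. Contracting each
    mode m != n with a small Gaussian G_m is equivalent to right-multiplying
    the unfolding by the Kronecker product of the G_m's."""
    Y = T
    for m in range(T.ndim):
        if m == n:
            continue  # sketch_sizes[n] is ignored
        G = rng.standard_normal((sketch_sizes[m], Y.shape[m]))
        Y = np.moveaxis(np.tensordot(G, Y, axes=(1, m)), 0, m)
    Q, _ = np.linalg.qr(mode_unfold(Y, n))  # orthonormal basis of sketched range
    return Q[:, :rank]

rng = np.random.default_rng(0)
T = rng.standard_normal((60, 50, 40))  # toy dense tensor
U0 = kron_sketch_factor(T, n=0, sketch_sizes=(10, 10, 10), rank=5, rng=rng)
print(U0.shape)  # (60, 5)
```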
Most tensor decomposition algorithms were developed for in-memory computation on a single machine. There are a few recent exceptions that were designed for parallel and distributed computation, but these cannot easily...
We investigate an efficient parallelization of a class of algorithms for the well-known Tucker decomposition of general N-dimensional sparse tensors. The targeted algorithms are iterative and use the alternating least squares method. At each iteration, for each dimension of an N-dimensional input tensor, the following operations are performed: (i) the tensor is multiplied with (N - 1) matrices (TTMc step), (ii) the product is then converted to a matrix, and (iii) a few leading left singular vectors of the resulting matrix are computed (TRSVD step) to update one of the matrices for the next TTMc step. We propose an efficient parallelization of these algorithms for current parallel platforms with multicore nodes. We discuss a set of preprocessing steps that take all computational decisions out of the main iteration of the algorithm and provide an intuitive shared-memory parallelism for the TTMc and TRSVD steps. We propose coarse-grain and fine-grain parallel algorithms in a distributed-memory environment, investigate data dependencies, and identify efficient communication schemes. We demonstrate how the computation of singular vectors in the TRSVD step can be carried out efficiently following the TTMc step. Finally, we develop a hybrid MPI-OpenMP implementation of the overall algorithm and report scalability results on up to 4096 cores on 256 nodes of an IBM BlueGene/Q supercomputer.
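As a rough illustration of the per-iteration structure described above (TTMc followed by TRSVD for each mode), here is a dense, serial NumPy sketch; the paper targets sparse tensors on distributed-memory platforms, and all names below are our own:

```python
import numpy as np

def ttmc(T, factors, skip):
    # TTMc step: multiply T with the transposes of all (N - 1) factor
    # matrices except the one for mode `skip`; factors[m] is I_m x r_m.
    Y = T
    for m, U in enumerate(factors):
        if m == skip:
            continue
        Y = np.moveaxis(np.tensordot(U.T, Y, axes=(1, m)), 0, m)
    return Y

def als_sweep(T, factors):
    # One alternating-least-squares sweep: for each mode, (i) TTMc,
    # (ii) matricize the product, (iii) keep its leading left singular
    # vectors (the TRSVD step) to update that mode's factor matrix.
    for n in range(T.ndim):
        Y = ttmc(T, factors, skip=n)
        Yn = np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)
        U, _, _ = np.linalg.svd(Yn, full_matrices=False)
        factors[n] = U[:, :factors[n].shape[1]]
    return factors

rng = np.random.default_rng(1)
T = rng.standard_normal((30, 20, 10))
ranks = (5, 4, 3)
factors = [np.linalg.qr(rng.standard_normal((d, r)))[0] for d, r in zip(T.shape, ranks)]
factors = als_sweep(T, factors)
```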
beta-skeletons, prominent members of the neighborhood graph family, have interesting geometric properties and various applications ranging from geographic networks to archeology. This paper focuses on computing the beta-spectrum, a labeling of the edges of the Delaunay triangulation DT(V), which makes it possible to quickly find the lune-based beta-skeleton of V for any query value beta in [1,2]. We consider planar n-point sets V with the L_p metric, 1 < p < infinity. We present an O(n log^2 n) time sequential, and an O(log^4 n) time parallel, beta-spectrum labeling. We also show a parallel algorithm which, for a given beta in [1,2], finds the lune-based beta-skeleton in O(log^2 n) time. The parallel algorithms use O(n) processors in the CREW-PRAM model.
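For reference, the lune-based beta-skeleton for beta in [1,2] can be stated in a few lines of brute-force code (our own O(n^3) illustration in the Euclidean metric; the paper's contribution is precisely avoiding this cost via the beta-spectrum labeling of DT(V)):

```python
import numpy as np
from itertools import combinations

def beta_skeleton(points, beta):
    """Lune-based beta-skeleton, beta in [1, 2]: edge (i, j) survives iff no
    third point lies strictly inside the lune, i.e. the intersection of the
    two disks of radius beta*|ab|/2 centered at (1 - beta/2)*a + (beta/2)*b
    and (beta/2)*a + (1 - beta/2)*b. beta = 1 gives the Gabriel graph,
    beta = 2 the relative neighborhood graph."""
    P = np.asarray(points, dtype=float)
    h = beta / 2.0
    edges = []
    for i, j in combinations(range(len(P)), 2):
        a, b = P[i], P[j]
        r = h * np.linalg.norm(b - a)
        c1, c2 = (1 - h) * a + h * b, h * a + (1 - h) * b
        if all(k in (i, j)
               or np.linalg.norm(P[k] - c1) >= r
               or np.linalg.norm(P[k] - c2) >= r
               for k in range(len(P))):
            edges.append((i, j))
    return edges

pts = np.random.default_rng(2).random((20, 2))
print(beta_skeleton(pts, beta=1.5))
```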
ISBN (Print): 9781509028245
The k-center problem is a classic NP-hard clustering question. For contemporary massive data sets, RAM-based algorithms become impractical. Although good algorithms exist for k-center, they are all inherently sequential. In this paper, we design and implement parallel approximation algorithms for k-center. We observe that Gonzalez's greedy algorithm can be efficiently parallelized in a small number of MapReduce rounds; in practice, we find that two rounds suffice, leading to a 4-approximation. We find this parallel scheme to be about 100 times faster than the sequential Gonzalez algorithm, with barely any compromise in solution quality. We contrast this with an existing parallel algorithm for k-center that offers a 10-approximation. Our analysis reveals that this scheme is often slow, and that its sampling procedure runs only if k is sufficiently small relative to the input size. In practice it is slightly more effective than Gonzalez's approach, but slow. To trade off runtime against the approximation guarantee, we parameterize this sampling algorithm. We prove a lower bound on the parameter for effectiveness, and find experimentally that, even with values below the bound, the algorithm is not only faster but sometimes more effective.
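A serial simulation of the two-round scheme is easy to sketch (our own illustration; `np.array_split` stands in for the map phase's data partitioning, and all names are hypothetical):

```python
import numpy as np

def gonzalez(points, k):
    # Gonzalez's greedy 2-approximation: repeatedly add the point farthest
    # from the current set of centers.
    centers = [points[0]]
    dist = np.linalg.norm(points - centers[0], axis=1)
    for _ in range(k - 1):
        far = int(np.argmax(dist))
        centers.append(points[far])
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return np.array(centers)

def two_round_kcenter(points, k, num_parts):
    # Round 1 (map): run Gonzalez on each partition independently.
    # Round 2 (reduce): run Gonzalez on the union of the local centers.
    # This composition yields a 4-approximation.
    chunks = np.array_split(points, num_parts)
    candidates = np.vstack([gonzalez(chunk, k) for chunk in chunks])
    return gonzalez(candidates, k)

pts = np.random.default_rng(3).random((10_000, 2))
centers = two_round_kcenter(pts, k=10, num_parts=8)
```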
ISBN (Print): 9781509058983
Mapping parallel algorithms to parallel computing platforms requires several activities: analyzing the parallel algorithm, defining the logical configuration of the platform, mapping the algorithm to that logical configuration, and implementing the source code. Applying this process from scratch for each parallel algorithm is usually time-consuming and cumbersome; moreover, for large platforms the overall process becomes intractable for a human engineer. To support systematic reuse, we propose adopting a model-driven product line engineering approach for mapping parallel algorithms to parallel computing platforms. Using model-driven transformation patterns, we support the generation of logical configurations of the computing platform and of the parallel source code that runs on the platform nodes. The overall approach is illustrated by mapping an example parallel algorithm to parallel computing platforms.
ISBN (Print): 9781509036837
Analyzing large dynamic networks is an important problem with applications in a wide range of disciplines. A key operation is updating the network properties as its topology changes. In this paper we present graph sparsification as an efficient abstraction for updating the properties of dynamic networks. We demonstrate the applicability of graph sparsification to updating the connected components in random and scale-free networks on shared-memory systems. Our results show that the updating is scalable (a 10X speedup on 16 processors for larger networks). To the best of our knowledge, this is the first parallel implementation of graph sparsification. Based on these initial results, we discuss how the current implementation can be further improved and how graph sparsification can be applied to updating other network properties.
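The core of the insertion-only case can be illustrated with a union-find that retains only a sparse certificate for connectivity, namely a spanning forest (a minimal serial sketch under our own naming; the paper's parallel shared-memory implementation and its handling of general topology changes are more involved):

```python
class UnionFind:
    # Union-find with path halving; one find/union per inserted edge.
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False  # endpoints already connected: edge is redundant
        self.parent[ry] = rx
        return True

def update_components(n, edge_stream):
    # Keep only the sparse certificate (a spanning forest): edges that do
    # not change the connected components are discarded on arrival.
    uf = UnionFind(n)
    forest = [e for e in edge_stream if uf.union(*e)]
    return uf, forest

uf, forest = update_components(5, [(0, 1), (1, 2), (0, 2), (3, 4)])
print(forest)  # [(0, 1), (1, 2), (3, 4)] -- (0, 2) was redundant
```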
ISBN (Print): 9781509054121
Clustering is a popular data mining technique which discovers structure in unlabeled data by grouping objects together on the basis of a similarity criterion. Traditional similarity measures lose their meaning as the number of dimensions increases, and as a consequence distance- or density-based clustering algorithms become less meaningful. Shared Nearest Neighbor (SNN) is a solution to clustering high-dimensional data with the ability to find clusters of varying density; SNN assigns to a cluster objects that share a large number of their nearest neighbors. However, SNN is compute- and memory-intensive for data of large size and/or dimensionality. Nearest neighbor queries are responsible for a major proportion of the computation in SNN, resulting in lower efficiency for higher values of the number of nearest neighbors (k). The main motivation of this work is to improve the efficiency of SNN and to parallelize it so that it can be used for clustering large high-dimensional datasets and for large values of k, situations in which existing SNN algorithms become inefficient. In this paper, we present a new sequential SNN algorithm, R-SNN, which uses an R-tree for executing neighborhood queries efficiently and exploits spatial locality to minimize memory usage. R-SNN is benchmarked against the best available implementation of SNN and is found to be up to 77 times faster when tested on various real datasets. R-SNN is parallelized for distributed-memory, shared-memory, and hybrid systems. The significant speedup and scalability achieved can be attributed to parallelization, good load-balancing strategies, and the exploitation of spatial locality; experimental results demonstrate this for datasets of varying dimensionality and size. The maximum speedups achieved for the shared, distributed, and hybrid models are 427.19 using 48 threads, 394.24 using 32 processes, and 1380.69 on 32 nodes (each node spawning 4 threads), respectively. Super-linear speedup for some datasets is attributed
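As a point of reference for what the SNN neighborhood computation involves, here is a minimal serial sketch of one common mutual-k-NN formulation of SNN similarity (our own illustration: SciPy's cKDTree stands in for the paper's R-tree index, and this is not the R-SNN algorithm itself):

```python
import numpy as np
from scipy.spatial import cKDTree

def snn_similarity(points, k):
    """SNN similarity: points i and j get a nonzero score, the overlap of
    their k-NN lists, only if each appears in the other's k-NN list."""
    tree = cKDTree(points)               # spatial index (k-d tree here)
    _, idx = tree.query(points, k=k + 1)
    knn = [set(row[1:]) for row in idx]  # drop each point from its own list
    n = len(points)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in knn[i]:
            if i in knn[j]:              # mutual-neighbor requirement
                sim[i, j] = sim[j, i] = len(knn[i] & knn[j])
    return sim

pts = np.random.default_rng(4).random((200, 8))  # 200 points in 8 dimensions
S = snn_similarity(pts, k=10)
```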
ISBN (Print): 9781467389204
A program that works inefficiently leads to inevitable losses in computer performance. These losses should be avoided, or at least minimized. To do this we need to apply established research and development techniques, together with equivalence-preserving algorithm transformations. In the present paper we propose a technique for modifying a parallel algorithm to improve its efficiency through balanced processor load. The technique consists in rearranging processes among processors and coarsening (enlarging) the algorithm's operations. All information dependencies of the algorithm are preserved, while its running time and the number of processors involved can only be reduced.
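The paper's specific transformations are not reproducible from the abstract, but the goal of balanced processor load can be illustrated with a standard greedy heuristic, longest-processing-time (LPT) scheduling (our own stand-in example, not the authors' technique):

```python
import heapq

def lpt_rebalance(task_costs, num_procs):
    # Longest-processing-time scheduling: assign the most expensive remaining
    # task to the currently least-loaded processor.
    heap = [(0.0, p, []) for p in range(num_procs)]
    heapq.heapify(heap)
    for cost in sorted(task_costs, reverse=True):
        load, p, tasks = heapq.heappop(heap)
        tasks.append(cost)
        heapq.heappush(heap, (load + cost, p, tasks))
    return sorted(heap, key=lambda entry: entry[1])  # (load, id, tasks) per processor

for load, proc, tasks in lpt_rebalance([7, 3, 3, 2, 2, 2, 1], num_procs=3):
    print(f"processor {proc}: load {load}, tasks {tasks}")
```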
Minimum cut/maximum flow (min-cut/max-flow) algorithms solve a variety of problems in computer vision, and significant effort has therefore been put into developing fast min-cut/max-flow algorithms. As a result, it is difficult to choose an ideal algorithm for a given problem. Furthermore, parallel algorithms have not been thoroughly compared. In this paper, we evaluate the state-of-the-art serial and parallel min-cut/max-flow algorithms on the largest set of computer vision problems yet. We focus on generic algorithms, i.e., for unstructured graphs, but also compare with the specialized GridCut implementation. When applicable, GridCut performs best. Otherwise, the two pseudoflow algorithms, Hochbaum pseudoflow and excesses incremental breadth-first search, achieve the overall best performance. The most memory-efficient implementation tested is the Boykov-Kolmogorov algorithm. Amongst generic parallel algorithms, we find the bottom-up merging approach by Liu and Sun to be best, but no method is dominant. Of the generic parallel methods, only the parallel preflow push-relabel algorithm is able to scale efficiently with many processors across problem sizes, yet no generic parallel method consistently outperforms the serial algorithms. Finally, we provide and evaluate strategies for algorithm selection to obtain good expected performance. We make our dataset and implementations publicly available for further research.
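For readers new to the problem, a minimal serial max-flow implementation makes the min-cut/max-flow setting concrete (Edmonds-Karp on a dense capacity matrix, our own illustrative choice; the solvers surveyed above are far faster on vision-scale graphs):

```python
from collections import deque

def max_flow(capacity, s, t):
    # Edmonds-Karp: BFS for a shortest augmenting path, push the bottleneck,
    # repeat until no augmenting path remains (the resulting flow value
    # equals the capacity of a minimum s-t cut).
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return flow
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 2],
       [0, 0, 0, 0]]
print(max_flow(cap, s=0, t=3))  # 4
```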