We introduce a new highly parallel and memory efficient deformable image registration algorithm to handle challenging clinical applications. The algorithm is based on the normalized gradient fields (NGF) distance meas...
详细信息
ISBN:
(纸本)9781479923748
We introduce a new highly parallel and memory efficient deformable image registration algorithm to handle challenging clinical applications. The algorithm is based on the normalized gradient fields (NGF) distance measure and Gauss-Newton numerical optimization. By carefully analyzing the mathematical structure of the problem, a matrix-free Hessian-vector multiplication for NGF is derived, giving a highly integrated formulation. Embedding the new scheme in a full, non-linear image registration algorithm enables fast calculations on high resolutions with dramatically reduced memory consumption. The new approach provides linear scalability compared with a traditional sparse-matrix-based scheme. The algorithm is evaluated on a challenging problem from radiotherapy, where pelvis cone-beam CT and planning CT images are registered. Speedups up to a factor of 149.3 for a single Hessian-vector multiplication and of 20.3 for a complete non-linear registration are achieved.
We propose introducing modern parallel programming paradigms to secure computation, enabling their secure execution on large datasets. To address this challenge, we present GraphSC, a framework that (i) provides a pro...
详细信息
ISBN:
(纸本)9781467369497
We propose introducing modern parallel programming paradigms to secure computation, enabling their secure execution on large datasets. To address this challenge, we present GraphSC, a framework that (i) provides a programming paradigm that allows non-cryptography experts to write secure code;(ii) brings parallelism to such secure implementations;and (iii) meets the needs for obliviousness, thereby not leaking any private information. Using GraphSC, developers can efficiently implement an oblivious version of graph-based algorithms (including sophisticated data mining and machine learning algorithms) that execute in parallel with minimal communication overhead. Importantly, our secure version of graph-based algorithms incurs a small logarithmic overhead in comparison with the non-secure parallel version. We build GraphSC and demonstrate, using several algorithms as examples, that secure computation can be brought into the realm of practicality for big data analysis. Our secure matrix factorization implementation can process 1 million ratings in 13 hours, which is a multiple order-of-magnitude improvement over the only other existing attempt, which requires 3 hours to process 16K ratings.
We introduce Cloud DIKW (Data, Information, Knowledge, Wisdom) as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative ...
详细信息
ISBN:
(纸本)9781479980062
We introduce Cloud DIKW (Data, Information, Knowledge, Wisdom) as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. In this context, recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. However, due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts in meeting the constraints of real-time social media stream clustering through parallelization in Cloud DIKW. Specifically, we focus on two system-level issues. Firstly, most stream processing engines such as Apache Storm organize distributed workers in the form of a directed acyclic graph (DAG), which makes it difficult to dynamically synchronize the state of parallel clustering workers. We tackle this challenge by creating a separate synchronization channel using a pub-sub messaging system (ActiveMQ in our case). Secondly, due to the sparsity of the high-dimensional vectors, the size of centroids grows quickly as new data points are assigned to the clusters. As a result, traditional synchronization that directly broadcasts cluster centroids becomes too expensive and limits the scalability of the parallel algorithm. We address this problem by communicating only dynamic changes of the clusters rather than the whole centroid vectors. Our algorithm under Cloud DIKW can process the Twitter 10% data stream ("gardenhose") in real-time with 96-way parallelism. By natural improvements to Cloud DIKW, including advanced collective communication techniques developed in our Harp project, we will be able to process the full Twitter data stream in real-time with 1000-way parallelism. Our use of powerful gene
Finding the number of triangles in a graph (network) is an important problem in graph analysis. The number of triangles also has important applications in graph mining. Big graphs emerging from numerous application ar...
详细信息
ISBN:
(纸本)9781479999255
Finding the number of triangles in a graph (network) is an important problem in graph analysis. The number of triangles also has important applications in graph mining. Big graphs emerging from numerous application areas pose a significant challenge for the analysis and mining since these graphs consist of millions, or even billions, of nodes and edges. Graphs of such scale necessitate the development of efficient parallel algorithms. Existing distributed memory parallel algorithms for counting exact triangles are either Map-Reduce or message passing interface (MPI) based. Map-Reduce based algorithms generate prohibitively large intermediate data and do not demonstrate reasonably good runtime efficiency. The MPI based algorithms offer fast computation of the number of triangles. However, the partitioning and load balancing schemes these algorithms employ are static in nature- the partitions are precomputed based on some estimations. In this paper, we present an efficient MPI-based parallel algorithm for counting triangles in large graph. We consider the case where the main memory of each compute node is large enough to contain the entire graph. We observe that for such a case, computation load can be balanced dynamically and present a dynamic load balancing scheme which improves the performance of the algorithm significantly. Our algorithm demonstrates very good speedups and scales to a large number of processors. The algorithm computes the exact number of triangles in a network with 1 billion edges in 2 minutes with only 100 processors. Our results demonstrate that the algorithm is significantly faster than the related algorithms with static partitioning. In fact, for the real-world networks we experimented on, our algorithm achieves at least 2 times runtime efficiency over the fastest algorithm with static load balancing.
Semisorting is the problem of reordering an input array of keys such that equal keys are contiguous but different keys are not necessarily in sorted order. Semisorting is important for collecting equal values and is w...
详细信息
ISBN:
(纸本)9781450335881
Semisorting is the problem of reordering an input array of keys such that equal keys are contiguous but different keys are not necessarily in sorted order. Semisorting is important for collecting equal values and is widely used in practice. For example, it is the core of the MapReduce paradigm, is a key component of the database join operation, and has many other applications. We describe a (randomized) parallel algorithm for the problem that is theoretically efficient (linear work and logarithmic depth), but is designed to be more practically efficient than previous algorithms. We use ideas from the parallel integer sorting algorithm of Rajasekaran and Reif, but instead of processing bits of a integers in a reduced range in a bottom-up fashion, we process the hashed values of keys directly top-down. We implement the algorithm and experimentally show on a variety of input distributions that it outperforms a similarly-optimized radix sort on a modern 40-core machine with hyper-threading by about a factor of 1.7-1.9, and achieves a parallel speedup of up to 38x. We discuss the various optimizations used in our implementation and present an extensive experimental analysis of its performance.
GPUs have been gaining acceptance in the electronic design automation field as attractive platforms for implementing and accelerating computationally extensive applications. Researchers agree that it is critical that ...
详细信息
ISBN:
(纸本)9781479988280
GPUs have been gaining acceptance in the electronic design automation field as attractive platforms for implementing and accelerating computationally extensive applications. Researchers agree that it is critical that EDA algorithms exploit future platforms and explore the use of parallel algorithms as we move to the manycore era. This paper describes the implementation of the TimberWolf placement algorithm using CUDA and demonstrates the applicability of GPUs in accelerating electronic design automation tools. The algorithm has been implemented on a Xeon Workstation using C, and achieved a substantial acceleration on an Nvidia Tesla C2070 card.
In this paper we design and implement an algorithm for finding the biconnected components of a given graph. Our algorithm is based on experimental evidence that finding the bridges of a graph is usually easier and fas...
详细信息
ISBN:
(纸本)9781467376846
In this paper we design and implement an algorithm for finding the biconnected components of a given graph. Our algorithm is based on experimental evidence that finding the bridges of a graph is usually easier and faster in the parallel setting. We use this property to first decompose the graph into independent and maximal 2-edge-connected subgraphs. To identify the articulation points in these 2-edge connected subgraphs, we again convert this into a problem of finding the bridges on an auxiliary graph. It is interesting to note that during the conversion process, the size of the graph may increase. However, we show that this small increase in size and the run time is offset by the consideration that finding bridges is easier in a parallel setting. We implement our algorithm on an Intel i7 980X CPU running 12 threads. We show that our algorithm is on average 2.45x faster than the best known current algorithms implemented on the same platform.
We study compression techniques for parallel in-memory graph algorithms, and show that we can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncom...
详细信息
ISBN:
(纸本)9781479984305
We study compression techniques for parallel in-memory graph algorithms, and show that we can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncompressed graphs. We integrate the compression techniques into Ligra, a recent shared-memory graph processing system. This system, which we call Ligra+, is able to represent graphs using about half of the space for the uncompressed graphs on average. Furthermore, Ligra+ is slightly faster than Ligra on average on a 40-core machine with hyper-threading. Our experimental study shows that Ligra+ is able to process graphs using less memory, while performing as well as or faster than Ligra.
Finding the number of triangles in a network (graph) is an important problem in mining and analysis of complex networks. Massive networks emerging from numerous application areas pose a significant challenge in networ...
详细信息
ISBN:
(纸本)9781479989379
Finding the number of triangles in a network (graph) is an important problem in mining and analysis of complex networks. Massive networks emerging from numerous application areas pose a significant challenge in network analytics since these networks consist of millions, or even billions, of nodes and edges. Such massive networks necessitate the development of efficient parallel algorithms. There exist several MapReduce and an only MPI (Message Passing Interface) based distributed-memory parallel algorithms for counting triangles. MapReduce based algorithms generate prohibitively large intermediate data. The MPI based algorithm can work on quite large networks, however, the overlapping partitions employed by the algorithm limit its capability to deal with very massive networks. In this paper, we present a space-efficient MPI based parallel algorithm for counting exact number of triangles in massive networks. The algorithm divides the network into non-overlapping partitions. Our results demonstrate up to 25-fold space saving over the algorithm with overlapping partitions. This space efficiency allows the algorithm to deal with networks which are 25 times larger. We present a novel approach that reduces communication cost drastically (up to 90%) leading to both a space-and runtime-efficient algorithm. Our adaptation of a parallel partitioning scheme by computing a novel weight function adds further to the efficiency of the algorithm. Denoting average degree of nodes and the number of partitions by (d) over bar and P, respectively, our algorithm achieves up to O(P-2)-factor space efficiency over existing MapReduce based algorithms and up to (d) over bar -factor (approx.) over the algorithm with overlapping partitioning.
A compressed suffix tree usually consists of three components: a compressed suffix array, a compressed LCP-array, and a succinct representation of the suffix tree topology. There are parallel algorithms that construct...
详细信息
ISBN:
(纸本)9783319238265;9783319238258
A compressed suffix tree usually consists of three components: a compressed suffix array, a compressed LCP-array, and a succinct representation of the suffix tree topology. There are parallel algorithms that construct the suffix array and the LCP-array, but none for the third component. In this paper, we present parallel algorithms on shared memory architectures that construct the enhanced balanced parentheses representation (BPR). The enhanced BPR is an implicit succinct representation of the suffix tree topology, which supports all navigational operations on the suffix tree. It can also be used to efficiently construct the BPS, an explicit succinct representation of the suffix tree topology.
暂无评论