ISBN: 9781479980062 (print)
We introduce Cloud DIKW (Data, Information, Knowledge, Wisdom) as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. In this context, recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. However, due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts in meeting the constraints of real-time social media stream clustering through parallelization in Cloud DIKW. Specifically, we focus on two system-level issues. Firstly, most stream processing engines such as Apache Storm organize distributed workers in the form of a directed acyclic graph (DAG), which makes it difficult to dynamically synchronize the state of parallel clustering workers. We tackle this challenge by creating a separate synchronization channel using a pub-sub messaging system (ActiveMQ in our case). Secondly, due to the sparsity of the high-dimensional vectors, the size of centroids grows quickly as new data points are assigned to the clusters. As a result, traditional synchronization that directly broadcasts cluster centroids becomes too expensive and limits the scalability of the parallel algorithm. We address this problem by communicating only dynamic changes of the clusters rather than the whole centroid vectors. Our algorithm under Cloud DIKW can process the Twitter 10% data stream ("gardenhose") in real-time with 96-way parallelism. By natural improvements to Cloud DIKW, including advanced collective communication techniques developed in our Harp project, we will be able to process the full Twitter data stream in real-time with 1000-way parallelism. Our use of powerful gene
ISBN: 9781479999255 (print)
Finding the number of triangles in a graph (network) is an important problem in graph analysis. The number of triangles also has important applications in graph mining. Big graphs emerging from numerous application areas pose a significant challenge for analysis and mining, since these graphs consist of millions, or even billions, of nodes and edges. Graphs of such scale necessitate the development of efficient parallel algorithms. Existing distributed-memory parallel algorithms for exact triangle counting are based on either MapReduce or the Message Passing Interface (MPI). MapReduce-based algorithms generate prohibitively large intermediate data and do not demonstrate reasonably good runtime efficiency. The MPI-based algorithms compute the number of triangles quickly. However, the partitioning and load balancing schemes these algorithms employ are static in nature: the partitions are precomputed based on estimates. In this paper, we present an efficient MPI-based parallel algorithm for counting triangles in large graphs. We consider the case where the main memory of each compute node is large enough to contain the entire graph. We observe that in this case the computation load can be balanced dynamically, and we present a dynamic load balancing scheme that improves the performance of the algorithm significantly. Our algorithm demonstrates very good speedups and scales to a large number of processors. The algorithm computes the exact number of triangles in a network with 1 billion edges in 2 minutes with only 100 processors. Our results demonstrate that the algorithm is significantly faster than the related algorithms with static partitioning. In fact, for the real-world networks we experimented on, our algorithm runs at least 2 times faster than the fastest algorithm with static load balancing.
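To make the underlying kernel concrete, here is a minimal sequential sketch of exact triangle counting. It is not the MPI algorithm from the abstract: the function name `count_triangles` and the degree-based ordering are illustrative assumptions, and node identifiers are assumed to be mutually comparable (for example, integers).

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Rank nodes by (degree, id) so that every edge gets exactly one direction.
    def rank(x):
        return (len(adj[x]), x)

    # For each node keep only the neighbors of strictly higher rank.
    higher = {u: {v for v in nbrs if rank(v) > rank(u)} for u, nbrs in adj.items()}

    total = 0
    for u, nbrs in higher.items():
        for v in nbrs:
            # A common higher-ranked neighbor of u and v closes one triangle.
            total += len(nbrs & higher[v])
    return total
```

Each iteration of the outer loop touches only read-only adjacency data, so the per-node work items are independent. That independence is what allows a scheme to hand out node ranges to workers, whether statically or on demand.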
ISBN: 9781450335881 (print)
Semisorting is the problem of reordering an input array of keys such that equal keys are contiguous but different keys are not necessarily in sorted order. Semisorting is important for collecting equal values and is widely used in practice. For example, it is the core of the MapReduce paradigm, is a key component of the database join operation, and has many other applications. We describe a (randomized) parallel algorithm for the problem that is theoretically efficient (linear work and logarithmic depth), but is designed to be more practically efficient than previous algorithms. We use ideas from the parallel integer sorting algorithm of Rajasekaran and Reif, but instead of processing bits of integers in a reduced range in a bottom-up fashion, we process the hashed values of keys directly top-down. We implement the algorithm and experimentally show on a variety of input distributions that it outperforms a similarly optimized radix sort on a modern 40-core machine with hyper-threading by about a factor of 1.7-1.9, and achieves a parallel speedup of up to 38x. We discuss the various optimizations used in our implementation and present an extensive experimental analysis of its performance.
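The semisorting contract itself can be illustrated with a short sequential sketch. This is only a hash-bucket illustration of the problem statement, not the randomized parallel algorithm described in the abstract; the name `semisort` and its `key` parameter are assumptions made for the example.

```python
def semisort(records, key=lambda r: r):
    """Reorder records so equal keys are contiguous; groups are not sorted."""
    buckets = {}  # the dict hashes each key to its group of records
    for rec in records:
        buckets.setdefault(key(rec), []).append(rec)

    out = []
    for group in buckets.values():
        out.extend(group)  # emit each group as one contiguous run
    return out
```

For instance, `semisort([3, 1, 3, 2, 1, 3])` places the three 3s together, then the two 1s, then the 2, without ever comparing keys for order.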
ISBN: 9781467376846 (print)
In this paper we design and implement an algorithm for finding the biconnected components of a given graph. Our algorithm is based on experimental evidence that finding the bridges of a graph is usually easier and faster in the parallel setting. We use this property to first decompose the graph into independent and maximal 2-edge-connected subgraphs. To identify the articulation points in these 2-edge-connected subgraphs, we again convert this into a problem of finding the bridges of an auxiliary graph. It is interesting to note that during the conversion process, the size of the graph may increase. However, we show that this small increase in size and run time is offset by the consideration that finding bridges is easier in a parallel setting. We implement our algorithm on an Intel i7 980X CPU running 12 threads. We show that our algorithm is on average 2.45x faster than the best currently known algorithms implemented on the same platform.
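Since the approach reduces the problem to bridge finding, a small reference implementation of that primitive may help. The sketch below is the classic sequential depth-first low-link method, not the parallel algorithm from the abstract; `find_bridges` is a hypothetical name, and nodes are assumed to be numbered 0 to n-1.

```python
def find_bridges(n, edges):
    """Return the bridges of an undirected graph with nodes 0..n-1."""
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))

    disc = [-1] * n  # discovery time, -1 means unvisited
    low = [0] * n    # earliest discovery time reachable from the subtree
    bridges = []
    timer = 0

    # Recursive DFS: suitable for small graphs only (Python recursion limit).
    def dfs(u, parent_edge):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        for v, idx in adj[u]:
            if idx == parent_edge:
                continue  # do not walk back along the tree edge itself
            if disc[v] == -1:
                dfs(v, idx)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    bridges.append(edges[idx])  # no back edge bypasses (u, v)
            else:
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if disc[s] == -1:
            dfs(s, -1)
    return bridges
```

Removing the returned edges leaves exactly the maximal 2-edge-connected subgraphs that the abstract uses as its first decomposition step.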
ISBN: 9781479988280 (print)
GPUs have been gaining acceptance in the electronic design automation (EDA) field as attractive platforms for implementing and accelerating computationally intensive applications. Researchers agree that it is critical that EDA algorithms exploit future platforms and explore the use of parallel algorithms as we move to the manycore era. This paper describes the implementation of the TimberWolf placement algorithm using CUDA and demonstrates the applicability of GPUs in accelerating electronic design automation tools. The algorithm has been implemented on a Xeon workstation using C, and achieved a substantial acceleration on an Nvidia Tesla C2070 card.
ISBN: 9781479984305 (print)
We study compression techniques for parallel in-memory graph algorithms, and show that we can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncompressed graphs. We integrate the compression techniques into Ligra, a recent shared-memory graph processing system. This system, which we call Ligra+, is able to represent graphs using about half of the space for the uncompressed graphs on average. Furthermore, Ligra+ is slightly faster than Ligra on average on a 40-core machine with hyper-threading. Our experimental study shows that Ligra+ is able to process graphs using less memory, while performing as well as or faster than Ligra.
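The abstract does not name the encoding it uses. One common way to compress adjacency lists is to sort each list, store the gaps between consecutive neighbors, and write each gap as a variable-length byte code. The sketch below assumes that approach for illustration; the function names are hypothetical and node IDs are assumed to be non-negative integers.

```python
def encode_neighbors(neighbors):
    """Delta-encode a neighbor list into a variable-length byte string."""
    out = bytearray()
    prev = 0
    for v in sorted(neighbors):
        gap = v - prev  # gaps stay small when neighbor IDs are close together
        prev = v
        while gap >= 0x80:  # 7 payload bits per byte, high bit means "more"
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_neighbors(data):
    """Invert encode_neighbors and return the sorted neighbor list."""
    result, prev, gap, shift = [], 0, 0, 0
    for b in data:
        gap |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            prev += gap
            result.append(prev)
            gap, shift = 0, 0
    return result
```

With this scheme the list `[1000, 1003, 1010, 5000]` takes 6 bytes instead of four fixed-width integers, at the cost of decoding each list sequentially.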
ISBN: 9781479989379 (print)
Finding the number of triangles in a network (graph) is an important problem in mining and analysis of complex networks. Massive networks emerging from numerous application areas pose a significant challenge in network analytics since these networks consist of millions, or even billions, of nodes and edges. Such massive networks necessitate the development of efficient parallel algorithms. There exist several MapReduce-based and only one MPI (Message Passing Interface) based distributed-memory parallel algorithms for counting triangles. MapReduce-based algorithms generate prohibitively large intermediate data. The MPI-based algorithm can work on quite large networks; however, the overlapping partitions employed by the algorithm limit its capability to deal with very massive networks. In this paper, we present a space-efficient MPI-based parallel algorithm for counting the exact number of triangles in massive networks. The algorithm divides the network into non-overlapping partitions. Our results demonstrate up to 25-fold space saving over the algorithm with overlapping partitions. This space efficiency allows the algorithm to deal with networks which are 25 times larger. We present a novel approach that reduces communication cost drastically (up to 90%), leading to both a space- and runtime-efficient algorithm. Our adaptation of a parallel partitioning scheme by computing a novel weight function adds further to the efficiency of the algorithm. Denoting the average degree of nodes and the number of partitions by $\bar{d}$ and $P$, respectively, our algorithm achieves up to $O(P^2)$-factor space efficiency over existing MapReduce-based algorithms and up to (approximately) $\bar{d}$-factor over the algorithm with overlapping partitioning.
ISBN: 9783319238265; 9783319238258 (print)
A compressed suffix tree usually consists of three components: a compressed suffix array, a compressed LCP-array, and a succinct representation of the suffix tree topology. There are parallel algorithms that construct the suffix array and the LCP-array, but none for the third component. In this paper, we present parallel algorithms on shared memory architectures that construct the enhanced balanced parentheses representation (BPR). The enhanced BPR is an implicit succinct representation of the suffix tree topology, which supports all navigational operations on the suffix tree. It can also be used to efficiently construct the BPS, an explicit succinct representation of the suffix tree topology.
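To show what a parentheses-based topology encoding looks like, the sketch below writes a generic rooted ordered tree as a plain balanced-parentheses string. It is not the enhanced BPR construction from the abstract, which is specific to suffix trees; the function name and the `children` dictionary layout are assumptions made for the example.

```python
def balanced_parentheses(children, root):
    """Return the balanced-parentheses string of a rooted ordered tree.

    `children` maps a node to the list of its children; a node that is
    missing from the mapping is treated as a leaf.
    """
    out = []
    stack = [(root, False)]
    while stack:  # iterative DFS avoids hitting the recursion limit
        node, closing = stack.pop()
        if closing:
            out.append(")")
            continue
        out.append("(")
        stack.append((node, True))  # close this node after its subtree
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return "".join(out)
```

A tree with n nodes becomes a string of 2n parentheses, so the topology costs two bits per node once the parentheses are stored as bits.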
ISBN: 9781467367981 (print)
This paper investigates the application of multiprocessor systems to the reconciliation of genome assemblies. Many algorithmic approaches aim to solve the task of de novo assembly from short reads, but their results on the same raw data often differ substantially. A parallel algorithm for merging two or more assemblies without relying on a reference genome is presented. Due to the large data volume, the computations must be carried out in the distributed-memory model on a computational cluster. The proposed method integrates a combination of draft assemblies, reducing the fragmentation of the resulting contigs. A sequential version of the algorithm is implemented in C/C++ and is available at https:***/kromanenkov/gar.
Formal modeling of the cost of MPI primitives allows a machine-independent representation, comparison and performance analysis of their underlying algorithms. Currently accepted methods are all offspring of LogP, conceived to model the cost of inter-node point-to-point messages in networks of single-processor machines. As new supercomputers are built upon cheap commodity boards with a growing number of cores accessing hierarchical memories, intra-node communication becomes progressively more relevant. Techniques for shared memory communication, such as message segmentation and collectives not based on point-to-point operations, are substantively different from their inter-node counterparts. This paper unveils the reasons for the poor fit of LogGP and the most recent models in this domain, log(n)P and mlog(n)P, and proposes a new model named tau-Lop, rooted in them, but addressing the challenge of accurately modeling shared memory MPI communications. Broadcast algorithms of mainstream MPI implementations, MPICH and Open MPI, are modeled and analyzed. (C) 2015 Elsevier B.V. All rights reserved.
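For context, the LogGP model mentioned above estimates the cost of one k-byte point-to-point message as shown below. The symbols follow the standard LogGP definitions, which the abstract does not spell out: L is the network latency, o the per-message processor overhead (paid once by the sender and once by the receiver), and G the gap per byte for long messages.

```latex
T_{\mathrm{p2p}}(k) = L + 2o + (k - 1)\,G
```

The formula charges a single end-to-end transfer per message, which is the inter-node point-to-point view that the abstract argues fits shared memory communication poorly.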