An effective technique for solving optimization problems over massive data sets is to partition the data into smaller pieces, solve the problem on each piece and compute a representative solution from it, and finally ...
详细信息
ISBN:
(纸本)9781450335362
An effective technique for solving optimization problems over massive data sets is to partition the data into smaller pieces, solve the problem on each piece and compute a representative solution from it, and finally obtain a solution inside the union of the representative solutions for all pieces. This technique can be captured via the concept of composable core-sets, and has been recently applied to solve diversity maximization problems as well as several clustering problems [7, 15, 8]. However, for coverage and submodular maximization problems, impossibility bounds are known for this technique [15]. In this paper, we focus on efficient construction of a randomized variant of composable core-sets where the above idea is applied on a random clustering of the data. We employ this technique for the coverage, monotone and non-monotone submodular maximization problems. Our results significantly improve upon the hardness results for non-randomized core-sets, and imply improved results for sub modular maximization in a distributed and streaming settings. The effectiveness of this technique has been confirmed empirically for several machine learning applications [22], and our proof provides a theoretical foundation to this idea. In summary, we show that a simple greedy algorithm results in a 1/3-approximate randomized composable core set for submodular maximization under a cardinality constraint. Our result also extends to non-monotone submodular functions, and leads to the first 2-round mapreduce-based constant-factor approximation algorithm with O(n) total communication complexity for either monotone or non monotone functions. Finally, using an improved analysis technique and a new algorithm PseudoGreedy, we present an improved 0.545-approximation algorithm for monotone sub modular maximization, which is in turn the first mapreduce-based algorithm beating factor 1/2 in a constant number of rounds.
As a fundamental tool in modeling and analyzing social, and information networks, large-scale graph mining is an important component of any tool set for big data analysis. Processing graphs with hundreds of billions o...
详细信息
ISBN:
(纸本)9781450333177
As a fundamental tool in modeling and analyzing social, and information networks, large-scale graph mining is an important component of any tool set for big data analysis. Processing graphs with hundreds of billions of edges is only possible via developing distributed algorithms under distributed graph mining frameworks such as mapreduce, Pregel, Gigraph, and alike. For these distributed algorithms to work well in practice, we need to take into account several metrics such as the number of rounds of computation and the communication complexity of each round. For example, given the popularity and ease-of-use of mapreduce framework, developing practical algorithms with good theoretical guarantees for basic graph algorithms is a problem of great importance. In this tutorial, we first discuss how to design and implement algorithms based on traditional mapreduce architecture. In this regard, we discuss various basic graph theoretic problems such as computing connected components, maximum matching, MST, counting triangle and overlapping or balanced clustering. We discuss a computation model for mapreduce and describe the sampling, filtering, local random walk, and core-set techniques to develop efficient algorithms in this framework. At the end, we explore the possibility of employing other distributed graph processing frameworks. In particular, we study the effect of augmenting mapreduce with a distributed hash table (DHT) service and also discuss the use of a new graph processing framework called ASYMP based on asynchronous message-passing method. In particular, we will show that using ASyMP, one can improve the CPU usage, and achieve significantly improved running time.
We present two Massively Parallel Computation (MPC) algorithms for the Minimum Cut problem: an O(1)-round exact algorithm with (O) over tilde (n) memory per machine, and an O(logn . log logn) round (2+epsilon) approxi...
详细信息
ISBN:
(纸本)9781450375825
We present two Massively Parallel Computation (MPC) algorithms for the Minimum Cut problem: an O(1)-round exact algorithm with (O) over tilde (n) memory per machine, and an O(logn . log logn) round (2+epsilon) approximation with (O) over tilde (n(alpha)) memory per machine, for any positive constant alpha < 1. Both algorithms use <(O)over tilde>(m) global memory.
暂无评论