ISBN (print): 9783642364242; 9783642364235
Determining termination in dynamic environments is hard because nodes join and leave. Previous studies on termination detection rely on structures such as a spanning tree or a computational tree. In this work, we present an unstructured termination detection algorithm that uses a gossip-based scheme to cope with scalability and fault-tolerance issues. This approach frees the algorithm from maintaining specific structures even when nodes join and leave at runtime. Such dynamic behavior is prevalent in cloud computing environments, yet it has received little attention from existing approaches. To measure the complexity of the proposed algorithm, we use a new metric, self-centered message complexity. Our evaluation over scalable settings shows that an unstructured approach has significant merit in addressing scalability and fault-tolerance problems, with lower message complexity than existing algorithms.
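To illustrate why a gossip-based scheme needs no spanning tree, here is a minimal sketch of push gossip, assuming a fixed set of n nodes and uniformly random peer selection. It is not the termination detection algorithm from the abstract; the function name `gossip_rounds` and the parameters `fanout` and `seed` are hypothetical and chosen only for this example.

```python
import random

def gossip_rounds(n, fanout=1, seed=0):
    """Simulate push gossip among n nodes (assumed static for this sketch).

    Node 0 starts with the information. In every round, each informed
    node pushes it to `fanout` peers chosen uniformly at random.
    Returns the number of rounds until all nodes are informed.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        newly_informed = set()
        for _ in informed:
            for _ in range(fanout):
                # The chosen peer may already be informed; that push is wasted.
                newly_informed.add(rng.randrange(n))
        informed |= newly_informed
        rounds += 1
    return rounds

if __name__ == "__main__":
    print(gossip_rounds(64))
```

With a fanout of 1 the informed set can at most double per round, so at least log2(n) rounds are needed. No node keeps any structure beyond knowing how to pick a random peer.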
To detect deadlock in distributed systems, the initiator should construct an efficient explicit or implicit global wait-for graph. In this paper, we present an unstructured deadlock detection algorithm that uses a gossip protocol in cloud computing environments, where constituent nodes may join and leave at any time. Because of the inherent properties of a gossip protocol, we argue that our proposed deadlock detection algorithm is scalable, fault-tolerant, and efficient while retaining safety and liveness properties. The correctness proof of the algorithm is also provided. The message complexity of our proposed algorithm is O(n), where n is the number of nodes. Our performance evaluation with scalable settings shows that our approach has a significant advantage over previous deadlock detection algorithms in terms of scalability, fault tolerance, and complexity efficiency. Copyright (c) 2013 John Wiley & Sons, Ltd.
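As a point of reference for the wait-for graph mentioned above, the following is a minimal sketch of a centralized cycle check, assuming the graph has already been collected in one place and that a cycle implies deadlock (the single-resource or AND request model). It is not the gossip-based algorithm from the abstract; `has_deadlock` and the dictionary layout are hypothetical.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle.

    `wait_for` maps each node to the nodes it is waiting on.
    Uses an iterative-free recursive depth-first search with three colors.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(u):
        color[u] = GRAY
        for v in wait_for.get(u, ()):
            state = color.get(v, WHITE)
            if state == GRAY:
                return True  # back edge: v is on the current path
            if state == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(wait_for))

if __name__ == "__main__":
    print(has_deadlock({"A": ["B"], "B": ["C"], "C": ["A"]}))
    print(has_deadlock({"A": ["B"], "B": ["C"]}))
```

A distributed detector has to reach the same verdict without any single node holding the whole dictionary, which is where the message complexity discussed in the abstract comes in.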
Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem that has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory and operation efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm that combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems. (C) 2017 Elsevier B.V. All rights reserved.
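For readers unfamiliar with the SGD variant named above, here is a minimal serial sketch of SGD for a three-way low-rank (CP) factorization fitted only to observed entries. It assumes a plain squared-error loss with optional L2 regularization and omits the stratification, asynchronous communication, and parallelism described in the abstract. The function names and all hyperparameter defaults are hypothetical.

```python
import random

def sgd_tensor_completion(observed, shape, rank=2, lr=0.05, reg=0.0,
                          epochs=200, seed=0):
    """Fit one factor matrix per mode to the observed entries of a 3-way tensor.

    `observed` maps (i, j, k) index tuples to known values.
    Returns a list of three factor matrices (lists of rows of length `rank`).
    """
    rng = random.Random(seed)
    factors = [[[rng.uniform(-0.5, 0.5) for _ in range(rank)]
                for _ in range(dim)] for dim in shape]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)  # visit observed entries in random order
        for (i, j, k), x in entries:
            a, b, c = factors[0][i], factors[1][j], factors[2][k]
            err = x - sum(a[r] * b[r] * c[r] for r in range(rank))
            for r in range(rank):
                ar, br, cr = a[r], b[r], c[r]  # use pre-update values
                a[r] += lr * (err * br * cr - reg * ar)
                b[r] += lr * (err * ar * cr - reg * br)
                c[r] += lr * (err * ar * br - reg * cr)
    return factors

def predict(factors, index):
    """Estimate the tensor value at `index` from the factor matrices."""
    i, j, k = index
    a, b, c = factors[0][i], factors[1][j], factors[2][k]
    return sum(x * y * z for x, y, z in zip(a, b, c))

if __name__ == "__main__":
    known = {(0, 0, 0): 1.0, (1, 1, 1): 2.0, (0, 1, 0): 1.5}
    model = sgd_tensor_completion(known, shape=(2, 2, 2))
    print(predict(model, (1, 0, 1)))  # estimate for an unobserved entry
```

Each update touches only the three factor rows indexed by one observed entry, which is why conflicting updates to shared rows become the central issue once the loop is parallelized.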