Distributed deep learning systems commonly use synchronous data parallelism to train models. However, communication overhead can be costly in distributed environments with limited communication bandwidth. To reduce co...
ISBN: (Print) 9798350395679; 9798350395662
Federated Learning (FL) is an emerging machine learning paradigm that enables the collaborative training of a shared global model across distributed clients while keeping the data decentralized. Recent works on designing systems for efficient FL have shown that utilizing serverless computing technologies, particularly Function-as-a-Service (FaaS) for FL, can enhance resource efficiency, reduce training costs, and alleviate the complex infrastructure management burden on data holders. However, current serverless FL systems still suffer from the presence of stragglers, i.e., slow clients that impede the collaborative training process. While strategies aimed at mitigating stragglers in these systems have been proposed, they overlook the diverse hardware resource configurations among FL clients. To this end, we present Apodotiko, a novel asynchronous training strategy designed for serverless FL. Our strategy incorporates a scoring mechanism that evaluates each client's hardware capacity and dataset size to intelligently prioritize and select clients for each training round, thereby minimizing the effects of stragglers on system performance. We comprehensively evaluate Apodotiko across diverse datasets, considering a mix of CPU and GPU clients, and compare its performance against five other FL training strategies. Results from our experiments demonstrate that Apodotiko outperforms other FL training strategies, achieving an average speedup of 2.75x and a maximum speedup of 7.03x. Furthermore, our strategy significantly reduces cold starts by a factor of four on average, demonstrating suitability in serverless environments.
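The scoring mechanism described above combines each client's hardware capacity and dataset size into a single priority for round selection. A minimal Python sketch of that idea follows; the linear weighting, the normalization by the fleet maximum, and the field names (flops, dataset_size) are illustrative assumptions, not Apodotiko's actual formula.

def score_client(flops, dataset_size, max_flops, max_size, w_compute=0.5, w_data=0.5):
    # Illustrative score: faster hardware and more local data both raise
    # a client's chance of being selected for the next round.
    return w_compute * (flops / max_flops) + w_data * (dataset_size / max_size)

def select_clients(clients, m):
    # Rank all registered clients by score and pick the m best for the round.
    max_flops = max(c["flops"] for c in clients)
    max_size = max(c["dataset_size"] for c in clients)
    ranked = sorted(
        clients,
        key=lambda c: score_client(c["flops"], c["dataset_size"], max_flops, max_size),
        reverse=True,
    )
    return ranked[:m]

clients = [
    {"id": "gpu-a", "flops": 30e12, "dataset_size": 5_000},
    {"id": "cpu-b", "flops": 1e12, "dataset_size": 20_000},
    {"id": "cpu-c", "flops": 0.5e12, "dataset_size": 2_000},
]
print([c["id"] for c in select_clients(clients, 2)])  # ['gpu-a', 'cpu-b']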
ISBN: (Print) 9781665481069
While hierarchically low-rank compression methods are now commonly available in both dense and sparse direct solvers, their use for the direct solution of coupled sparse/dense linear systems has been little investigated. The solution of such systems is nonetheless central to the simulation of many important physics problems, such as the propagation of acoustic waves around aircraft. Indeed, the heterogeneity of the jet flow created by the reactors often requires a Finite Element Method (FEM) discretization, leading to a sparse linear system, while the rest of the space may reasonably be assumed homogeneous and hence modeled with a Boundary Element Method (BEM) discretization, leading to a dense system. In an industrial context, these simulations are often run on modern multicore workstations with fully-featured linear solvers. Exploiting their low-rank compression techniques is thus very appealing for solving larger coupled sparse/dense systems (and hence obtaining a finer solution) on a given multicore workstation, and, ideally, doing so fast. The standard method for efficiently coupling sparse and dense direct solvers is to rely on the Schur complement functionality of the sparse direct solver. However, to the best of our knowledge, modern fully-featured sparse direct solvers offering this functionality return the Schur complement as a non-compressed matrix. In this paper, we study the opportunity to process larger systems in spite of this constraint. To that end, we propose two classes of algorithms, namely multi-solve and multi-factorization, which compose existing parallel sparse and dense methods on well-chosen submatrices. An experimental study conducted on a 24-core machine equipped with 128 GiB of RAM shows that these algorithms, implemented on top of state-of-the-art sparse and dense direct solvers, together with proper low-rank assembly schemes, can respectively process systems of 9 million and 2.5 million to
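The coupling baseline mentioned in the abstract, i.e. obtaining the Schur complement of the sparse (FEM) block from a sparse direct solver and handing the resulting dense complement to a dense solver, can be sketched in a few lines of Python with SciPy. Everything here is a toy stand-in: SciPy's splu and numpy.linalg.solve replace fully-featured sparse/dense solvers, the matrices are tiny and randomly coupled, and neither the low-rank assembly schemes nor the multi-solve/multi-factorization decompositions of the paper are reproduced.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy coupled system [[A_ss, A_sd], [A_ds, A_dd]] [x_s; x_d] = [b_s; b_d]:
# A_ss is sparse (FEM-like), A_dd is dense (BEM-like), off-diagonal blocks couple them.
n_s, n_d = 200, 50
rng = np.random.default_rng(0)
A_ss = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n_s, n_s), format="csc")
A_sd = 0.01 * rng.standard_normal((n_s, n_d))
A_ds = 0.01 * rng.standard_normal((n_d, n_s))
A_dd = np.eye(n_d) + 0.01 * rng.standard_normal((n_d, n_d))
b_s, b_d = np.ones(n_s), np.ones(n_d)

# Standard Schur complement coupling of a sparse and a dense direct solver.
lu = spla.splu(A_ss)                           # factorize the sparse block
S = A_dd - A_ds @ lu.solve(A_sd)               # dense, non-compressed Schur complement
x_d = np.linalg.solve(S, b_d - A_ds @ lu.solve(b_s))   # dense solve on the complement
x_s = lu.solve(b_s - A_sd @ x_d)               # back-substitute for the sparse unknowns

residual = np.linalg.norm(np.concatenate(
    [A_ss @ x_s + A_sd @ x_d - b_s, A_ds @ x_s + A_dd @ x_d - b_d]))
print("coupled residual:", residual)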
ISBN: (Print) 9781665481069
Finding patterns in large highly connected datasets is critical for value discovery in business development and scientific research. This work focuses on the problem of subgraph matching on streaming graphs, which provides utility in a myriad of real-world applications ranging from social network analysis to cybersecurity. Each application poses a different set of control parameters, including the restrictions for a match, type of data stream, and search granularity. The problem-driven design of existing subgraph matching systems makes them challenging to apply for different problem domains. This paper presents Mnemonic, a programmable system that provides a high-level API and democratizes the development of a wide variety of subgraph matching solutions. Importantly, Mnemonic also delivers key data management capabilities and optimizations to support real-time processing on long-running, high-velocity multi-relational graph streams. The experiments demonstrate the versatility of Mnemonic, as it outperforms several state-of-the-art systems by up to two orders of magnitude.
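Mnemonic's own API is not shown in the abstract, so the snippet below only illustrates the underlying problem it targets: incremental subgraph matching on an edge stream, here with a toy triangle pattern where each arriving edge is checked against previously seen edges to see whether it completes a match. This is a generic Python sketch, not Mnemonic's interface, data management layer, or optimizations.

from collections import defaultdict

def stream_triangle_matches(edge_stream):
    # Incremental matching of a triangle pattern on a streaming graph:
    # a new edge (u, v) completes a triangle with every common neighbor w.
    adj = defaultdict(set)
    for u, v in edge_stream:
        for w in adj[u] & adj[v]:
            yield tuple(sorted((u, v, w)))
        adj[u].add(v)
        adj[v].add(u)

edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
print(list(stream_triangle_matches(edges)))  # [(1, 2, 3), (1, 3, 4)]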
ISBN: (Print) 9781665497473
Computer simulations of physical phenomena, such as heat transfer, often require the solution of linear equations. These linear equations occur in the form Ax = b, where A is a matrix, b is a vector, and x is the vector of unknowns. Iterative methods are the best adapted to solving large linear systems because they can be easily parallelized. This paper presents a variant of the multisplitting iterative method with convergence acceleration using a Krylov-based minimization method. It particularly focuses on improving the convergence speed of the method with an implementation based on the PETSc (Portable, Extensible Toolkit for Scientific Computation) library. This is achieved by reducing the need for synchronization (data exchange) during the minimization process and by adding a preconditioner before the multisplitting method. All experiments were performed over one or two sites of the Grid5000 platform, using up to 128 cores. The results for solving a 2D Laplacian problem with 1024^2 unknowns show speedups of up to 23x and 86x compared, respectively, to the algorithm in [8] and to the general multisplitting implementation.
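As a rough illustration of the multisplitting idea alone (without the Krylov-based minimization, preconditioning, or PETSc implementation described above), the Python sketch below applies a block-Jacobi-style multisplitting to a small 1D Laplacian: each block solves its own diagonal subsystem exactly while taking the coupling terms from the previous global iterate. The 1D operator, block layout, and iteration count are arbitrary choices for the example.

import numpy as np

def laplacian_1d(n):
    # 1D Laplacian stencil (-1, 2, -1); a dense stand-in for the sparse 2D operator.
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def multisplitting(A, b, blocks, iters=200):
    # Block-Jacobi-style multisplitting: every block solves its diagonal
    # subsystem exactly, using the previous global iterate for the coupling
    # terms; the block solutions are then merged into the next iterate.
    x = np.zeros(len(b))
    for _ in range(iters):
        x_new = x.copy()
        for idx in blocks:
            rest = np.setdiff1d(np.arange(len(b)), idx)
            rhs = b[idx] - A[np.ix_(idx, rest)] @ x[rest]
            x_new[idx] = np.linalg.solve(A[np.ix_(idx, idx)], rhs)
        x = x_new
    return x

n = 64
A, b = laplacian_1d(n), np.ones(n)
blocks = [np.arange(0, n // 2), np.arange(n // 2, n)]   # two subdomains
x = multisplitting(A, b, blocks)
print("residual norm:", np.linalg.norm(A @ x - b))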
Software development of high-performance graph algorithms is difficult on modern parallel computers. To simplify this task, we have designed and implemented a collection of C++ graph primitives, basic building blocks,...
ISBN: (Print) 9781665481069
Edge computing has drawn a lot of recent research interest because of the performance improvements obtained by offloading many workloads from the remote data center to nearby edge nodes. Nonetheless, one open challenge of this emerging paradigm lies in the potential security issues on edge nodes. This paper proposes a cooperative protocol, namely DEAN, equipped with a unique resource-efficient quorum-building mechanism to seamlessly adopt blockchain in an edge computing infrastructure, preventing data manipulation and allowing fair data sharing with quick recovery under the resource constraints of limited storage, computing, and network capacity. Specifically, DEAN leverages a parallel mechanism with three independent core components, effectively achieving low resource consumption while allowing secured parallel block processing on edge nodes. We have implemented a system prototype based on DEAN and experimentally verified its effectiveness in comparison with four popular blockchain implementations: Ethereum, Parity, IOTA, and Hyperledger Fabric. Experimental results show that the system prototype exhibits high resilience to arbitrary failures. Performance-wise, the DEAN-based blockchain implementation outperforms state-of-the-art blockchain systems with up to 88.6x higher throughput and 26x lower latency.
ISBN: (Digital) 9798350364606; (Print) 9798350364613
Data stores utilized in modern data-intensive applications are expected to demonstrate rapid read and write capabilities and robust fault tolerance. A Byzantine fault-tolerant database (BFT database) can execute transactions concurrently and tolerate arbitrary (Byzantine) faults. We identify cryptographic and communication processing as performance bottlenecks in the transaction processing of BFT databases. This paper presents a transaction reconstruction method that reconstructs a single transaction from multiple transactions to streamline cryptographic and communication processing. We evaluated the proposed method against Basil (a state-of-the-art BFT database) in experiments. In an environment where nodes are geographically centralized, the proposed method achieves up to approximately 2.5 times higher throughput and reduces latency by up to about 30% compared to vanilla Basil. In an environment where nodes are geographically distributed, it achieves up to approximately 50 times higher throughput and reduces latency by up to about 75% compared to vanilla Basil.
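The abstract states only that several transactions are reconstructed into a single one to streamline cryptographic and communication processing; the exact reconstruction rules are not given. Purely as an illustration of that general idea, the Python sketch below folds non-conflicting read/write sets into one batched transaction (which could then be signed and replicated once) and defers conflicting ones; the Txn structure and the conflict rule are assumptions, not the paper's method.

from dataclasses import dataclass, field

@dataclass
class Txn:
    reads: dict = field(default_factory=dict)    # key -> observed version
    writes: dict = field(default_factory=dict)   # key -> new value

def reconstruct(txns):
    # Fold non-conflicting transactions into one batch so a single signature
    # and a single replica round-trip cover all of them; conflicting ones
    # (touching a key the batch already writes) wait for the next batch.
    merged, deferred = Txn(), []
    for t in txns:
        if merged.writes.keys() & (t.reads.keys() | t.writes.keys()):
            deferred.append(t)
            continue
        merged.reads.update(t.reads)
        merged.writes.update(t.writes)
    return merged, deferred

batch = [
    Txn(reads={"a": 3}, writes={"b": 10}),
    Txn(reads={"c": 1}, writes={"d": 7}),
    Txn(reads={"b": 9}, writes={"e": 2}),   # reads a key the batch writes
]
merged, deferred = reconstruct(batch)
print(merged.writes, len(deferred))          # {'b': 10, 'd': 7} 1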
ISBN: (Print) 9798400704130
Data-parallel deep neural network (DNN) training systems deployed across nodes have been widely used in various domains, while system performance is often bottlenecked by the communication overhead among workers for synchronizing gradients. Top-k sparsification compression is the de facto approach to alleviating the communication bottleneck, which truncates the gradient to its largest k elements before sending it to other nodes. However, we observe that the traditional Top-k still has performance issues: i) the gradient at each layer of a DNN is typically represented as a tensor of multiple dimensions, and the largest k elements selected by the traditional Top-k are concentrated in only some of the dimensions, so training may miss many dimensions (we call this dimension missing), which leads to poor convergence performance; ii) the traditional Top-k performs the selection by globally sorting the gradient elements in each layer (we call this single global sorting), which leads to low GPU core parallelism and hence low training throughput. In this paper, we propose an all-dimension Top-k sparsification scheme, called ADTopk, which selects the largest k elements from all dimensions of the gradient tensor in each layer, meaning that each dimension must provide some elements, so as to avoid dimension missing. Further, ADTopk lets each dimension sort locally within its own elements, so all dimensions can perform multiple local sortings independently and in parallel, instead of a single global sorting over the entire gradient tensor in each layer. On top of ADTopk, we further propose an interleaving compression scheme and an efficient threshold estimation algorithm to enhance the performance of ADTopk. We build a sparsification compression data-parallel DNN training framework and implement a compression library containing state-of-the-art sparsification algorithms. Experiments on a local cluster and Alibaba Cloud show that compa
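The Python sketch below contrasts traditional global Top-k with an all-dimension selection in the spirit of ADTopk on a 2D gradient tensor: the budget k is split across rows so that every row contributes its own largest-magnitude entries via an independent local selection instead of one global sort. The even per-row budget and treating rows as the "dimensions" are simplifying assumptions, not ADTopk's exact scheme, interleaving, or threshold estimation.

import numpy as np

def global_topk(grad, k):
    # Traditional Top-k: keep the k largest-magnitude entries of the whole
    # flattened gradient tensor; everything else is zeroed out.
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

def all_dimension_topk(grad, k):
    # All-dimension selection: split k evenly across rows and run an
    # independent local selection per row, so no row is missed entirely.
    per_row = max(1, k // grad.shape[0])
    out = np.zeros_like(grad)
    for r in range(grad.shape[0]):
        idx = np.argpartition(np.abs(grad[r]), -per_row)[-per_row:]
        out[r, idx] = grad[r, idx]
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal((8, 256))
g[0] *= 10.0                                 # one dominant row
kept_rows = lambda t: int((np.abs(t).sum(axis=1) > 0).sum())
print(kept_rows(global_topk(g, 64)))         # global Top-k collapses onto the dominant row
print(kept_rows(all_dimension_topk(g, 64)))  # all-dimension keeps entries in every row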