The Industrial Metaverse is driving a new wave of revolution in the smart manufacturing domain by reproducing the real industrial environment in a virtual space. Real-time synchronization and rendering of all industrial factors result in numerous time-sensitive and computation-intensive tasks, especially matrix multiplication. Distributed edge computing (DEC) can be exploited to handle these tasks owing to its low latency and powerful computing capability. In this paper, we propose an efficient and reliable coded DEC framework to compute large-scale matrix multiplication tasks. However, the existence of stragglers causes high computation latency that seriously limits the application of DEC in the Industrial Metaverse. To mitigate the impact of stragglers, we design a secure and flexible PolyDot (SFPD) code, which enables information-theoretic security (ITS) protection. Several improvements can be achieved with the proposed SFPD code. First, it can achieve a smaller recovery threshold than that of the existing codes in almost all settings. Second, compared with the original PolyDot codes, our SFPD code accounts for the extra workers required to add ITS protection. Third, it provides a flexible tradeoff between the recovery threshold and the communication and computation loads by simply adjusting two given storage parameters p and t. Furthermore, as an important application scenario, the SFPD code is employed to secure model training in machine learning, which can alleviate the straggler effects and protect the ITS of raw data. The experiments demonstrate that the SFPD code can significantly speed up the training process while providing ITS of data. Finally, we provide a comprehensive performance analysis that shows the superiority of the SFPD code.
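To make the idea of combining straggler tolerance with information-theoretic secrecy concrete, the following is a minimal sketch and not the SFPD construction itself. It assumes a much simpler setting: a matrix-vector product over a prime field, a single random mask, a public vector `b`, and security against one curious worker only. The names `P`, `make_shares`, and `recover` are hypothetical and introduced only for illustration.

```python
import random

P = 2_147_483_647  # prime modulus (2**31 - 1); an illustrative choice

def mat_vec(M, v):
    """Matrix-vector product over the integers mod P."""
    return [sum(a * b for a, b in zip(row, v)) % P for row in M]

def make_shares(A, xs, rng):
    """Return one masked copy A + x * R (mod P) per evaluation point x.

    R is uniform over the field, so a single share with x != 0 is itself
    uniform and reveals nothing about A.
    """
    R = [[rng.randrange(P) for _ in row] for row in A]
    return [[[(a + x * r) % P for a, r in zip(row_a, row_r)]
             for row_a, row_r in zip(A, R)] for x in xs]

def recover(x1, y1, x2, y2):
    """Interpolate f(x) = A b + x R b at x = 0 from two worker results."""
    inv = pow(x2 - x1, -1, P)  # modular inverse (Python 3.8+)
    return [((x2 * u - x1 * v) * inv) % P for u, v in zip(y1, y2)]

A = [[1, 2], [3, 4]]
b = [5, 6]
xs = [1, 2, 3]  # three workers; any two results suffice
# random.Random is used for reproducibility only; a real deployment would
# need a cryptographically secure source of randomness.
shares = make_shares(A, xs, random.Random(0))
results = [mat_vec(S, b) for S in shares]
# Suppose the first worker straggles: decode from the other two.
decoded = recover(xs[1], results[1], xs[2], results[2])
```

In this toy setting one extra worker result is needed purely to cancel the mask, which is the kind of overhead the abstract refers to when it mentions the extra workers required for ITS protection.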
Coded distributed computing (CDC), introduced by Li et al. in 2015, offers an efficient approach to trade computing power for a reduced communication load in general distributed computing frameworks such as MapReduce and Spark. In particular, increasing the computation load in the Map phase by a factor of r can create coded multicasting opportunities that reduce the communication load in the Shuffle phase by the same factor. However, the CDC scheme is designed for homogeneous settings, where each node maps the same number of files and is assigned the same number of reduce functions. It also requires an exponentially large number of input files (data batches), reduce functions, and multicasting groups relative to the number of nodes to achieve the promised gain. We address these limitations by proposing a novel CDC approach based on a combinatorial design, which accommodates heterogeneous networks and maintains a multiplicative computation-communication trade-off. In addition, the proposed approach requires exponentially fewer input files than the original CDC scheme proposed by Li et al. Finally, we derive a new information-theoretic converse for general heterogeneous CDC and show that the communication load of the proposed design is optimal within a constant factor.
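The multiplicative trade-off described above can be tabulated with a short sketch. It assumes the standard homogeneous model with K nodes, in which each file is mapped at r nodes and the communication load is normalized by the total number of intermediate values; the uncoded baseline 1 - r/K is an assumption taken from that model, not something stated in the abstract. The function names are hypothetical.

```python
from fractions import Fraction

def uncoded_load(K, r):
    """Normalized Shuffle load without coding: each node already holds
    a fraction r/K of the intermediate values it needs (assumed model)."""
    return 1 - Fraction(r, K)

def coded_load(K, r):
    """Coded multicasting reduces the uncoded load by the factor r."""
    return uncoded_load(K, r) / r

K = 10
for r in range(1, K + 1):
    print(f"r={r:2d}  uncoded={uncoded_load(K, r)}  coded={coded_load(K, r)}")
```

Exact fractions are used so that the factor-of-r gap between the two columns is visible without rounding noise.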
In recent years, coded distributed computing (CDC) has attracted significant attention because it can efficiently protect many delay-sensitive computation tasks against unexpected latencies in distributed computing systems. Despite this salient feature, many design challenges and opportunities remain. In this paper, we focus on practical computing systems with heterogeneous computing resources and design a novel CDC approach, called batch-processing based coded computing (BPCC), which exploits the fact that every computing node can obtain some coded results before it completes the whole task. To this end, we first describe the main idea of the BPCC framework, and then formulate an optimization problem for BPCC to minimize the task completion time by configuring the computation load. Through formal theoretical analyses, extensive simulation studies, and comprehensive real experiments on Amazon EC2 computing clusters, we demonstrate the promising performance of the proposed BPCC scheme in terms of high computational efficiency and robustness to uncertain disturbances.
ISBN: 9781538674628 (print)
In distributed computing systems, computation redundancy is used to mitigate the adverse effect of stragglers on the computation time. The redundancy can be added proactively at the beginning, or reactively after some time based on the delay pattern of the workers. While most of the existing work on reactive mitigation considers only task replication, we propose a coded reactive straggler mitigation strategy with an uncoded phase and a coded phase for distributed matrix-matrix multiplication. Specifically, in the uncoded phase of the proposed reactive strategy, the master distributes the computational job among the workers without redundancy and waits for some time. After the waiting time, the master cancels the remaining tasks. It then encodes the remaining tasks and distributes them among the workers that have already completed their computations. The expected execution time of the proposed method is obtained analytically. Furthermore, the optimal waiting time for the uncoded phase and the optimal code rate for the coded phase are investigated. Our simulation results demonstrate that the proposed coded reactive mitigation strategy significantly decreases the execution time in comparison with the proactive mitigation strategy and the repetition-based reactive mitigation strategy.
In this paper, motivated by its practical importance, we consider the coded distributed matrix multiplication problem of computing $AA^{\top}$ in a distributed computing system with N worker nodes and a master node, where the input matrices $A$ and $A^{\top}$ are partitioned into m-by-p and p-by-m blocks of equal-size sub-matrices, respectively. For effective straggler mitigation, we propose a novel computation strategy, named the folded polynomial code, which is obtained by modifying the entangled polynomial codes. Moreover, we characterize a lower bound on the optimal recovery threshold among all linear computation strategies when the underlying field is the real number field, and our folded polynomial codes achieve this bound in the case of m = 1. Compared with all known computation strategies for coded distributed matrix multiplication, our folded polynomial codes perform better in terms of recovery threshold, download cost, and decoding complexity.
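For orientation, here is a minimal sketch of a basic polynomial code applied to $AA^{\top}$, not the folded polynomial code proposed in the abstract. It assumes the simplest partitioning (m row blocks and p = 1), exact rational arithmetic, and integer evaluation points; under those assumptions the recovery threshold is m*m, which is what the folded construction aims to improve. All function names are hypothetical.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def combine(blocks, weights):
    """Entrywise weighted sum of equally sized blocks."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(w * blk[i][j] for blk, w in zip(blocks, weights))
             for j in range(cols)] for i in range(rows)]

def worker(blocks, x):
    """Compute (sum_j A_j x^j) (sum_k A_k x^(k*m))^T at evaluation point x."""
    m = len(blocks)
    a_enc = combine(blocks, [x ** j for j in range(m)])
    b_enc = transpose(combine(blocks, [x ** (k * m) for k in range(m)]))
    return matmul(a_enc, b_enc)

def solve(V, y):
    """Solve V c = y by Gauss-Jordan elimination over exact fractions."""
    n = len(V)
    M = [[Fraction(v) for v in row] + [Fraction(t)] for row, t in zip(V, y)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def decode(results, m):
    """Rebuild A A^T from any m*m pairs (x, worker output)."""
    V = [[x ** d for d in range(m * m)] for x, _ in results]
    rows, cols = len(results[0][1]), len(results[0][1][0])
    coeffs = [[[None] * cols for _ in range(rows)] for _ in range(m * m)]
    for i in range(rows):
        for j in range(cols):
            c = solve(V, [blk[i][j] for _, blk in results])
            for d in range(m * m):
                coeffs[d][i][j] = c[d]
    # The coefficient of x^(j + k*m) is the block A_j A_k^T.
    return [[v for k in range(m) for v in coeffs[j + k * m][i]]
            for j in range(m) for i in range(rows)]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
m = 2
blocks = [A[:2], A[2:]]            # two row blocks of A
points = [1, 2, 3, 4, 5]           # five workers, one evaluation point each
results = [(x, worker(blocks, x)) for x in points]
survivors = [res for res in results if res[0] != 2]  # worker at x=2 straggles
C = decode(survivors, m)
```

Because the product polynomial has degree m*m - 1, any m*m of the five worker outputs determine it, so one straggler is tolerated in this example.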
Recently, coding has become a useful technique for mitigating the effect of stragglers in distributed computing. However, coding in this context has mainly been explored under the assumption of homogeneous workers, although real-world clusters often consist of heterogeneous workers with different computing capabilities. Uniform load allocation that ignores this heterogeneity can cause a significant loss in latency. In this article, we suggest the optimal load allocation for coded distributed computing with heterogeneous workers. Specifically, we focus on the scenario in which some workers have the same computing capability and can therefore be regarded as a group for analysis. We rely on a lower bound on the expected latency and obtain the optimal load allocation by showing that our load allocation achieves the minimum of the lower bound for a sufficiently large number of workers. Given the proposed optimal load allocation, we derive the optimal code rate that achieves the minimum expected latency. In numerical simulations under group heterogeneity, our load allocation reduces the expected latency by orders of magnitude compared with the existing scheme. Furthermore, in experiments on Amazon EC2 for scenarios with distinct straggler and heterogeneity patterns, we observe that our scheme outperforms the competing schemes, reducing the total finishing time by up to 52%.
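As a point of comparison, the homogeneous baseline that the abstract contrasts against can be sketched in a few lines. This is not the paper's heterogeneous allocation. It assumes an (n, k) MDS-coded job in which each of n identical workers processes a 1/k share, a shifted-exponential runtime model with hypothetical parameters `shift` and `mu`, and completion once the fastest k workers finish. The closed form uses the expected k-th order statistic of n i.i.d. exponentials, (H_n - H_{n-k}) / mu.

```python
from fractions import Fraction

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n as an exact fraction (H_0 = 0)."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def expected_latency(n, k, shift=1, mu=1):
    """Expected time until k of n workers finish a 1/k-sized subtask.

    Assumed model: each subtask takes (shift + X) / k with X ~ Exp(mu).
    """
    kth_order_stat = (harmonic(n) - harmonic(n - k)) / Fraction(mu)
    return (Fraction(shift) + kth_order_stat) / k

def best_k(n, shift=1, mu=1):
    """Code dimension k in 1..n that minimizes the expected latency."""
    return min(range(1, n + 1), key=lambda k: expected_latency(n, k, shift, mu))

n = 10
for k in range(1, n + 1):
    print(k, float(expected_latency(n, k)))
print("best k:", best_k(n))
```

Small k means heavy redundancy but large subtasks, while k = n means no redundancy and waiting for the slowest worker; the minimum lies in between, which is why a code rate has to be optimized at all.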
ISBN: 9781479981311 (print)
Matrix multiplication is a fundamental building block for large-scale computations arising in various applications, including machine learning. There has been significant recent interest in using coding to speed up distributed matrix multiplication and make it robust to stragglers (i.e., machines that may perform computations more slowly). In many scenarios, approximate matrix multiplication, i.e., computation that allows a tolerable error, is sufficient in place of exact computation. Such approximate schemes make use of randomization techniques to speed up the computation process. In this paper, we initiate the study of approximate coded matrix multiplication and investigate the joint synergies offered by randomization and coding. Specifically, we propose two coded randomized sampling schemes that use (a) codes to achieve a desired recovery threshold and (b) random sampling to obtain an approximation of the matrix product. Tradeoffs between the recovery threshold and the approximation error obtained through random sampling are investigated for a class of coded matrix multiplication schemes.
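The random-sampling ingredient can be illustrated on its own with a minimal sketch of column-row sampling, a standard randomized estimator for a matrix product. It is not one of the two coded schemes proposed in the abstract and involves no coding. It assumes sampling with replacement from a user-supplied probability vector; the names `sampled_product` and `probs` are hypothetical.

```python
import random
from fractions import Fraction

def exact_product(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sampled_product(A, B, indices, probs):
    """Unbiased estimate of A B from sampled column-row outer products.

    Each sampled index i contributes A[:, i] B[i, :] / (s * probs[i]),
    where s is the number of samples.
    """
    s = len(indices)
    rows, cols = len(A), len(B[0])
    C = [[Fraction(0)] * cols for _ in range(rows)]
    for i in indices:
        scale = 1 / (s * Fraction(probs[i]))
        for r in range(rows):
            for c in range(cols):
                C[r][c] += scale * A[r][i] * B[i][c]
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [0, 1], [2, 2]]
n = len(B)                                  # inner dimension
probs = [Fraction(1, n)] * n                # uniform sampling (an assumption)
rng = random.Random(0)
picked = rng.choices(range(n), weights=[float(p) for p in probs], k=2)
approx = sampled_product(A, B, picked, probs)
```

Using fewer samples than the inner dimension lowers the work per estimate at the cost of approximation error, which is the axis the abstract trades off against the recovery threshold.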
We consider a distributed computing framework where the distributed nodes have different communication capabilities, motivated by the heterogeneous networks in data centers and mobile edge computing systems. Following the structure of MapReduce, this framework consists of a Map computation phase, a Shuffle phase, and a Reduce computation phase. The Shuffle phase allows distributed nodes to exchange intermediate values in the presence of heterogeneous communication bottlenecks for different nodes (heterogeneous communication load constraints). For this setting, we characterize the minimum total computation load and the minimum worst-case computation load in some cases, under the heterogeneous communication load constraints. While the total computation load depends on the sum of the computation loads of all the nodes, the worst-case computation load depends on the computation load of the node with the heaviest job. We show that, for some cases, there is a tradeoff between the minimum total computation load and the minimum worst-case computation load, in the sense that both cannot be achieved at the same time. The achievability schemes are proposed with careful design of the file assignment and the data shuffling. Beyond the cut-set bound, a novel converse is proposed using proof by contradiction. For the general case, we identify two extreme regimes in which the scheme with coding and the scheme without coding are optimal, respectively.
The development of smart vehicles and rich cloud services has led to the emergence of vehicular edge computing. To perform distributed computation tasks efficiently, coded distributed computing (CDC) was proposed to reduce communication costs and mitigate straggler effects through the use of coding techniques. In this paper, we propose a double auction mechanism to allocate the resources of the edge servers to the vehicles in order to complete the CDC tasks. Specifically, the vehicles use PolyDot codes to manage the tradeoff between communication costs and recovery threshold. Given the requirements of the various vehicles, the double auction mechanism matches the edge servers that have the required resources to the vehicles. The double auction mechanism also determines the prices that the vehicles need to pay for the resources of the edge servers. The analyses show that the double auction mechanism satisfies the properties of individual rationality, incentive compatibility, and budget balance. In the simulation, the utility of the auctioneer increases as the number of vehicles and edge servers increases.
The implementation of many Unmanned Aerial Vehicle (UAV) applications (e.g., fire detection, surveillance, and package delivery) requires extensive computing resources to achieve reliable performance. Existing solutions that offload computation tasks to the ground may suffer from long communication delays. To address this issue, Networked Airborne Computing (NAC) is a promising technique that offers advanced onboard airborne computing capabilities by sharing resources among the UAVs via direct flight-to-flight links. However, NAC does not exist yet, and enabling it requires overcoming many technical challenges, such as high UAV mobility and the uncertain, heterogeneous, and dynamic airspace. This paper addresses these challenges by 1) developing a Dynamic Batch-Processing based Coded Computation (D-BPCC) framework for achieving robust and adaptable cooperative airborne computing, and 2) designing deep reinforcement learning (DRL) based load allocation and UAV mobility control strategies for optimizing the system performance. As the first study, to the best of our knowledge, to systematically investigate NAC, we evaluate the proposed methods by designing a NAC simulator and conducting comparative studies with four state-of-the-art distributed computing schemes. The results demonstrate the promising performance of the proposed methods.