Distributed machine learning systems train models via iterative updates between parallel workers and the parameter server. To expedite the transmissions, in-network aggregation of updates along with packet forwarding at programmable switches decreases the network traffic over these bottleneck links. However, existing in-network aggregation schemes neither prepare the most suitable switches for various worker distributions nor capture the dynamic network status. Based on the status derived from in-band network telemetry, we select the best switches by solving an optimization problem that we formulate with the objective of minimum transmission time. Although the problem is a non-linear integer program, by adopting delicate transformations, a substitute with totally unimodular constraints and a separable convex objective is solved to obtain the integral solution. We implement our in-network aggregation protocol and reconstruct the in-band network telemetry protocol on real devices, i.e., Barefoot Wedge100BF switches and Dell servers. We evaluate the performance of our proposed AGG algorithm, and the results indicate that the completion time of related coflows decreases by 40% on average compared with other strategies, improving performance by at least 30% over the state-of-the-art.
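The abstract's key trick is that an integer program whose constraint matrix is totally unimodular can be solved exactly via its LP relaxation, since every vertex of the relaxed polytope is integral. The paper's actual formulation is not given here, so the sketch below uses a toy assignment problem (whose constraint matrix is a standard example of total unimodularity) purely to illustrate the mechanism; the cost matrix and the worker/switch labels are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Toy illustration (not the paper's program): the 0/1 assignment
# problem has a totally unimodular constraint matrix, so solving only
# the LP relaxation with bounds [0, 1] already yields an integral optimum.
cost = np.array([[9.0, 2.0, 7.0],   # hypothetical cost of worker i using switch j
                 [6.0, 4.0, 3.0],
                 [5.0, 8.0, 1.0]])
n = cost.shape[0]
c = cost.flatten()

# Each worker is matched to exactly one switch and vice versa.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # row sums: worker i picks one switch
    A_eq[n + i, i::n] = 1.0            # column sums: switch i serves one worker
b_eq = np.ones(2 * n)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
x = res.x.reshape(n, n)
print(x)  # a 0/1 matrix despite never imposing integrality
```

The same reasoning carries over to the abstract's substitute program: once the constraints are totally unimodular and the objective is separable convex, standard continuous solvers recover the integral switch-selection directly, avoiding a combinatorial search.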