Structured Support Vector Machines (structured SVMs) are a fundamental machine learning algorithm with a solid theoretical foundation and high effectiveness in applications such as natural language parsing and computer vision. However, training structured SVMs is very time-consuming due to the large number of constraints and slow convergence, especially for large training data sets. The high cost of training structured SVMs has hindered their adoption in new applications. In this article, we aim to improve the efficiency of structured SVMs by proposing a parallel and distributed solution (namely FastSSVM) for training structured SVMs built on top of MPI and OpenMP. FastSSVM exploits a series of optimizations (e.g., optimizations on data storage and synchronization) to efficiently use the resources of the nodes in a cluster and the cores of each node. Moreover, FastSSVM tackles the large constraint set problem by batch processing and addresses the slow convergence challenge by adapting stop conditions based on the improvement of each iteration. We theoretically prove that our solution is guaranteed to converge to a global optimum. A comprehensive experimental study shows that FastSSVM achieves at least four times speedup over existing solutions, and in some cases two to three orders of magnitude speedup.
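To make the hybrid pattern concrete, here is a minimal sketch of data-parallel structured-SVM training in the spirit of the abstract, written with mpi4py rather than the paper's MPI/OpenMP C++ code. The toy multiclass task, the feature map joint_feature, the oracle most_violated_constraint, and the plain subgradient update are all simplifying assumptions; FastSSVM's actual solver and optimizations are not reproduced here.

# Illustrative sketch (not the authors' code): each MPI rank batch-processes its shard of
# examples, finds the most violated constraint per example, and the ranks all-reduce the
# accumulated subgradient and loss. The stop rule mirrors the idea of adapting the stop
# condition to the per-iteration improvement.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
N_CLASSES, N_FEATS = 3, 4            # toy sizes; the real tasks are parsing/vision

def joint_feature(x, y):
    # Class-indexed feature map: x placed in the block belonging to class y.
    phi = np.zeros(N_CLASSES * N_FEATS)
    phi[y * N_FEATS:(y + 1) * N_FEATS] = x
    return phi

def most_violated_constraint(w, x, y):
    # Loss-augmented inference for 0/1 loss over the toy multiclass task.
    scores = [(y_hat != y) + w @ joint_feature(x, y_hat) for y_hat in range(N_CLASSES)]
    y_hat = int(np.argmax(scores))
    margin = scores[y_hat] - w @ joint_feature(x, y)
    return y_hat, max(0.0, margin)

def train(local_examples, C=1.0, tol=1e-3, max_iter=200):
    w = np.zeros(N_CLASSES * N_FEATS)
    prev_obj = np.inf
    for it in range(max_iter):
        # Batch-process the local shard (cf. batch processing of the constraint set);
        # inside a node this loop could additionally be threaded, OpenMP-style.
        grad, loss = np.zeros_like(w), 0.0
        for x, y in local_examples:
            y_hat, margin = most_violated_constraint(w, x, y)
            if margin > 0:
                grad += joint_feature(x, y_hat) - joint_feature(x, y)
                loss += margin
        # Aggregate subgradient and loss across all nodes.
        grad = comm.allreduce(grad, op=MPI.SUM)
        loss = comm.allreduce(loss, op=MPI.SUM)
        obj = 0.5 * w @ w + C * loss
        # Stop when the per-iteration improvement becomes negligible.
        if prev_obj - obj < tol * max(1.0, abs(prev_obj)):
            break
        prev_obj = obj
        w -= (1.0 / (it + 1)) * (w + C * grad)    # simple subgradient step
    return w

if __name__ == "__main__":           # run e.g. with: mpirun -np 4 python sketch.py
    rng = np.random.default_rng(comm.Get_rank())
    shard = [(rng.normal(size=N_FEATS), int(rng.integers(N_CLASSES))) for _ in range(100)]
    w = train(shard)
    if comm.Get_rank() == 0:
        print("trained weight norm:", np.linalg.norm(w))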
Deep learning (DL) has achieved great success in recent years, leading to state-of-the-art performance in research and industrial fields such as computer vision and natural language processing. One of the reasons for this success is the huge number of parameters adopted in DL models. However, it is impractical to train even a moderately large model with many parameters on a typical single device, so DL models must be trained on clusters with distributed training algorithms. Traditional distributed training algorithms, however, are usually sub-optimal and highly customized, which makes them ill-suited to training large-scale DL models on varying computing clusters. To handle this problem, researchers have proposed auto-parallelism, which promises to train large-scale DL models efficiently and practically on various computing clusters. In this survey, we perform a broad and thorough investigation of the challenges, foundations, and strategy-searching methods of auto-parallelism in DL training. First, we abstract the basic parallelism schemes together with their communication cost and memory consumption in DL training. We then analyze and compare a series of current auto-parallelism works and investigate the strategies and searching methods commonly used in practice. Finally, we discuss several trends in auto-parallelism that are promising for further research.
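As a rough illustration of the kind of per-scheme cost abstraction the survey covers, the sketch below estimates per-step communication volume and per-device parameter memory for data, tensor (model), and pipeline parallelism. The ring-allreduce volume term 2*(p-1)/p is a standard approximation; the byte counts in the example are arbitrary, and nothing here is taken from the survey itself.

# Back-of-the-envelope cost abstraction for the three basic parallelism schemes.

def data_parallel_cost(param_bytes, p):
    # Gradients are all-reduced every step; each device keeps a full model replica.
    comm = 2 * (p - 1) / p * param_bytes
    mem = param_bytes                 # optimizer state and activations omitted here
    return comm, mem

def tensor_parallel_cost(param_bytes, act_bytes, p):
    # Parameters are sharded; activations are all-reduced inside each sharded layer.
    comm = 2 * (p - 1) / p * act_bytes
    mem = param_bytes / p
    return comm, mem

def pipeline_parallel_cost(param_bytes, boundary_act_bytes, p):
    # Each stage holds 1/p of the layers and sends boundary activations point-to-point.
    comm = boundary_act_bytes
    mem = param_bytes / p
    return comm, mem

if __name__ == "__main__":
    GB = 1 << 30
    for name, fn, args in [
        ("data",     data_parallel_cost,     (10 * GB, 8)),
        ("tensor",   tensor_parallel_cost,   (10 * GB, 2 * GB, 8)),
        ("pipeline", pipeline_parallel_cost, (10 * GB, 0.5 * GB, 8)),
    ]:
        comm, mem = fn(*args)
        print(f"{name:8s} comm/step = {comm / GB:6.2f} GB, mem/device = {mem / GB:6.2f} GB")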
ISBN: (Print) 9798350383461; 9798350383454
Deep Neural Network (DNN) frameworks need parallelism plans to execute immense models. The computed plans often combine data, model, and pipeline parallelism. Unfortunately, due to the intractability of the problem, current parallelism planners often fail to derive plans for immense DNNs. They either rely on experts to generate plans manually or on profiling for their evaluation, making planners expensive and sub-optimal. We propose RAPID, an automatic parallelism planner for immense DNNs driven by a hierarchical abstract machine model. This model enables the design of a symbolic cost model that achieves robust prediction of parallelism cost through symbolic simplification. RAPID divides the parallelization problem hierarchically and symmetrically into linear-time sub-problems. We prove that the composition of the sub-problem solutions is optimal. Large-scale cluster experiments show that RAPID can reduce the planning time of immense DNNs (e.g., BERT) by up to 67x compared to state-of-the-art planners, while exhibiting high performance that matches expert-optimized plans.
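The abstract's idea of a symbolic cost model can be sketched in a few lines with sympy: machine and model parameters stay symbolic, so plan costs can be simplified and compared without profiling runs. This is only an assumed illustration of the idea, not RAPID's actual cost model or machine abstraction.

# Keep machine/model parameters symbolic: per-link bandwidth B, parameter bytes M,
# activation bytes A, device count p.
import sympy as sp

B, M, A, p = sp.symbols("B M A p", positive=True)

# Per-step communication time of two candidate plans (ring-allreduce style terms).
data_parallel   = 2 * (p - 1) / p * M / B
tensor_parallel = 2 * (p - 1) / p * A / B

# Symbolic simplification reduces the comparison to a single factored term whose sign
# depends only on M - A, independent of the concrete bandwidth or device count.
diff = sp.simplify(data_parallel - tensor_parallel)
print(diff)

# Concrete cluster numbers can be substituted late, once a plan has been chosen
# (about 0.175 s at 100 GB/s for a 10 GB model on 8 devices).
print(data_parallel.subs({B: 100e9, M: 10e9, p: 8}))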