ISBN (print): 9781450392495
Spark SQL has been widely deployed in industry, but tuning its performance remains challenging. Recent studies try to employ machine learning (ML) to solve this problem, but suffer from two drawbacks. First, it takes a long time (high overhead) to collect training samples. Second, the optimal configuration for one input data size of an application might not be optimal for others. To address these issues, we propose a novel Bayesian Optimization (BO) based approach named LOCAT to automatically tune the configurations of Spark SQL applications online. LOCAT innovates three techniques. The first technique, named QCSA, eliminates configuration-insensitive queries by Query Configuration Sensitivity Analysis (QCSA) when collecting training samples. The second technique, dubbed DAGP, is a Datasize-Aware Gaussian Process (DAGP) which models the performance of an application as a distribution of functions of configuration parameters as well as input data size. The third technique, called IICP, Identifies Important Configuration Parameters (IICP) with respect to performance and only tunes the important ones. As such, LOCAT can tune the configurations of a Spark SQL application with low overhead and adapt to different input data sizes. We employ Spark SQL applications from the benchmark suites TPC-DS, TPC-H, and HiBench running on two significantly different clusters, a four-node ARM cluster and an eight-node x86 cluster, to evaluate LOCAT. The experimental results on the ARM cluster show that LOCAT accelerates the optimization procedures of the state-of-the-art approaches by at least 4.1x and up to 9.7x; moreover, LOCAT improves application performance by at least 1.9x and up to 2.4x. On the x86 cluster, LOCAT shows similar results to those on the ARM cluster.
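The core of the datasize-aware idea is simply to treat input data size as an extra input dimension of the performance model, so that one Gaussian Process generalizes across data sizes instead of being retrained per size. The following is a minimal sketch of that idea, not LOCAT's actual DAGP implementation: the kernel, the two example configuration parameters, and all sample values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel over rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X_train, y_train, X_test, noise=1e-4):
    """Standard GP posterior mean and variance at X_test."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_train
    # diag(K_ss) = 1 for the RBF kernel, so posterior variance is 1 - k^T K^-1 k
    var = 1.0 - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s)
    return mu, var

# Each sample: (executor memory, shuffle partitions, input data size),
# all scaled to [0, 1]; targets are hypothetical runtimes in seconds.
X = np.array([[0.2, 0.3, 0.1],
              [0.6, 0.5, 0.1],
              [0.2, 0.3, 0.9],
              [0.8, 0.7, 0.9]])
y = np.array([120.0, 90.0, 400.0, 310.0])
y_mean = y.mean()

# Predict runtime for a new configuration at an unseen, intermediate data size.
mu, var = gp_predict(X, y - y_mean, np.array([[0.5, 0.5, 0.5]]))
print(mu[0] + y_mean, var[0])
```

Because data size is a model input rather than a fixed training condition, samples collected at one data size inform predictions at another, which is what lets such a tuner avoid re-collecting samples when the input grows.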
ISBN (print): 9781728110257
Spark SQL is a big data processing tool for structured data query and analysis. However, during execution Spark SQL writes intermediate data to disk multiple times, which reduces its execution efficiency. To address this issue, we design and implement an intermediate data cache layer between the underlying file system and the upper Spark core to reduce the cost of random disk I/O. Using a query pre-analysis module, we can dynamically adjust the capacity of the cache layer for different queries, and an allocation module assigns appropriate memory to each node in the cluster. This paper develops the SSO (Spark SQL Optimizer) module and integrates it into the original Spark system to achieve the above functions. We compare query performance against the existing Spark SQL using data generated by the TPC-H tool. The experimental results show that the SSO module can effectively improve query efficiency, reduce disk I/O cost, and make full use of cluster memory resources.
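The abstract describes two decisions: sizing the cache layer from a pre-analysis estimate of a query's intermediate data, and splitting that capacity across nodes. A minimal sketch of both decisions follows; the function names, the reserve ratio, and the proportional allocation policy are illustrative assumptions, not SSO's published algorithm.

```python
def cache_capacity_mb(estimated_intermediate_mb, free_memory_mb, reserve_ratio=0.25):
    """Cap the cache at the pre-analysis estimate of intermediate data,
    while keeping a reserve of free memory for the executors themselves."""
    usable = free_memory_mb * (1.0 - reserve_ratio)
    return min(estimated_intermediate_mb, usable)

def allocate_per_node(total_cache_mb, node_weights):
    """Split the total cache capacity across nodes in proportion to the
    share of intermediate data each node is expected to produce."""
    total_w = sum(node_weights.values())
    return {node: total_cache_mb * w / total_w for node, w in node_weights.items()}

# A query estimated to shuffle ~6 GB, on a cluster with 16 GB free per query,
# where node1 is expected to produce twice the intermediate data of the others.
cap = cache_capacity_mb(estimated_intermediate_mb=6000, free_memory_mb=16000)
plan = allocate_per_node(cap, {"node1": 2.0, "node2": 1.0, "node3": 1.0})
print(cap, plan)  # 6000 {'node1': 3000.0, 'node2': 1500.0, 'node3': 1500.0}
```

The point of sizing per query rather than statically is that a light query does not evict memory the executors need, while a shuffle-heavy query can claim most of the usable headroom.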