The query efficiency of Spark SQL is significantly affected by its configuration. Configuration tuning has therefore drawn great attention, and various automatic configuration tuning methods have been proposed. However, existing methods suffer from two issues: (1) high tuning overhead: they need to execute the workloads repeatedly to obtain training samples, which is time-consuming; and (2) low throughput: they occupy resources such as CPU cores and memory for a long time, causing other Spark SQL workloads to wait and thereby reducing overall system throughput. These issues impede the use of automatic configuration tuning methods in practical systems, which have limited tuning budgets and many concurrent workloads. To address these issues, this paper proposes a Low-Overhead and Flexible approach for Spark SQL configuration Tuning, dubbed LOFTune. LOFTune reduces the tuning overhead via a sample-efficient optimization framework based on multi-task SQL representation learning and multi-armed bandits. Furthermore, LOFTune solves the low-throughput issue with a recommendation-sampling-decoupled tuning framework. Extensive experiments validate the effectiveness of LOFTune. In the sampling-allowed case, LOFTune saves up to 90% of the workload runs compared with state-of-the-art methods. Moreover, in the zero-sampling case, LOFTune reduces latency by up to 41.26%.
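The abstract does not detail LOFTune's bandit formulation, but the general idea of treating candidate configurations as arms and observed latency as the (negative) reward can be sketched with a plain epsilon-greedy bandit. Everything below — the candidate configurations, the toy latency function, and all constants — is illustrative, not taken from the paper:

```python
import random

def epsilon_greedy_tune(arms, latency_fn, rounds=200, epsilon=0.1, seed=0):
    """Pick the best configuration ("arm") by epsilon-greedy bandit search,
    minimizing mean observed latency. `arms` is a list of hypothetical
    Spark SQL configuration dicts; `latency_fn` measures one run."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    totals = [0.0] * len(arms)
    for _ in range(rounds):
        if 0 in counts:                       # try every arm at least once
            i = counts.index(0)
        elif rng.random() < epsilon:          # explore a random arm
            i = rng.randrange(len(arms))
        else:                                 # exploit the lowest mean latency
            i = min(range(len(arms)), key=lambda j: totals[j] / counts[j])
        totals[i] += latency_fn(arms[i])
        counts[i] += 1
    return min(range(len(arms)), key=lambda j: totals[j] / counts[j])

# toy workload: the third configuration is truly the fastest
noise = random.Random(42)
arms = [{"spark.sql.shuffle.partitions": p} for p in (50, 200, 400)]
true_mean = [12.0, 9.0, 7.5]
best = epsilon_greedy_tune(
    arms, lambda cfg: true_mean[arms.index(cfg)] + noise.gauss(0, 0.5)
)
print(best)
```

Each run of a workload is one "pull", so the sample budget is controlled directly by `rounds`; a sample-efficient method like LOFTune aims to make that budget small.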
Apache Spark stands out as a well-known solution for big data processing because of its efficiency and rapid processing capabilities. One of its modules, Spark SQL, serves as a prominent big data query engine. However, executing Spark SQL applications on massive data can be time-intensive, and the execution time can vary significantly depending on the configuration. Recent studies try to reduce application execution times by searching for optimal configurations. While Bayesian optimization is recognized in recent studies as a powerful method for configuration optimization, it faces challenges such as high computational cost and time-consuming computations, especially when dealing with large search spaces. To address these challenges, we propose QHB+, designed to rapidly search for optimal configurations. QHB+ applies Successive Halving Algorithm-based optimization methods, which perform well in hyperparameter optimization of machine learning models, to the configuration optimization of Spark SQL applications. Through empirical evaluations against established benchmarks, we show the efficiency of QHB+, highlighting it as a swift alternative to conventional optimization methods for tuning Spark SQL configurations.
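The Successive Halving idea the abstract references is simple to state: evaluate many candidate configurations on a small budget, discard the worse half, and repeat with a larger budget. A minimal sketch, with a made-up cost function and budget semantics (not the paper's actual algorithm):

```python
def successive_halving(configs, run_fn, min_budget=1, eta=2):
    """Successive Halving sketch: evaluate all surviving configs with the
    current budget, keep the best 1/eta fraction, multiply the budget by
    eta, and repeat until one config remains. `run_fn(cfg, budget)` returns
    a cost (lower is better) for a hypothetical partial run."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: run_fn(c, budget))
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

# toy search space: shuffle-partition counts; 200 is truly optimal, and a
# larger budget only sharpens the (here deterministic) cost estimate
configs = [50, 100, 200, 400, 800, 1600, 3200, 6400]
best = successive_halving(configs, lambda c, b: (c - 200) ** 2 / b)
print(best)  # 200
```

The appeal for configuration tuning is that most of the budget goes to promising configurations, while obviously bad ones are eliminated after only cheap, short runs.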
ISBN (Print): 9798400701030
Distributed data analytics engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL depends heavily on the choice of configurations, where the optimal ones vary with the executed workloads. Among the various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: (1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; (2) history tasks should be carefully utilized, since using dissimilar ones leads to deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address these challenges, we propose generalized transfer learning to boost tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.
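The "controlled history transfer" challenge — transfer only from similar past tasks, since dissimilar ones hurt — can be illustrated with a similarity-filtered warm start. The workload feature vectors, the distance threshold, and the config names below are all hypothetical, not Rover's actual mechanism:

```python
def warm_start_configs(history, current_features, k=3, max_dist=1.0):
    """Controlled-transfer sketch: seed the tuner only with the best
    configurations of history tasks whose (hypothetical) workload feature
    vectors lie within `max_dist` of the current task, guarding against
    negative transfer from dissimilar workloads."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    scored = [(dist(h["features"], current_features), h["best_config"])
              for h in history]
    scored = [(d, c) for d, c in scored if d <= max_dist]  # drop dissimilar
    scored.sort(key=lambda t: t[0])                        # closest first
    return [c for _, c in scored[:k]]

history = [
    {"features": (0.9, 0.1), "best_config": {"spark.executor.memory": "4g"}},
    {"features": (0.2, 0.8), "best_config": {"spark.executor.memory": "16g"}},
    # a very different workload: its config should NOT be transferred
    {"features": (5.0, 5.0), "best_config": {"spark.executor.memory": "2g"}},
]
seeds = warm_start_configs(history, (0.8, 0.2))
print(seeds)
```

A BO loop could then start from `seeds` instead of random points, which is one common way warm starting reduces the iterations needed to reach a good configuration.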
ISBN (Print): 9781450392495
Spark SQL has been widely deployed in industry, but tuning its performance is challenging. Recent studies try to employ machine learning (ML) to solve this problem but suffer from two drawbacks. First, it takes a long time (high overhead) to collect training samples. Second, the optimal configuration for one input data size of the same application might not be optimal for others. To address these issues, we propose a novel Bayesian Optimization (BO) based approach named LOCAT to automatically tune the configurations of Spark SQL applications online. LOCAT introduces three techniques. The first, named QCSA, eliminates configuration-insensitive queries through Query Configuration Sensitivity Analysis when collecting training samples. The second, dubbed DAGP, is a Datasize-Aware Gaussian Process that models the performance of an application as a distribution over functions of the configuration parameters and the input data size. The third, called IICP, Identifies Important Configuration Parameters with respect to performance and tunes only the important ones. As such, LOCAT can tune the configurations of a Spark SQL application with low overhead and adapt to different input data sizes. We employ Spark SQL applications from the benchmark suites TPC-DS, TPC-H, and HiBench running on two significantly different clusters, a four-node ARM cluster and an eight-node x86 cluster, to evaluate LOCAT. The experimental results on the ARM cluster show that LOCAT accelerates the optimization procedures of the state-of-the-art approaches by at least 4.1x and up to 9.7x; moreover, LOCAT improves application performance by at least 1.9x and up to 2.4x. On the x86 cluster, LOCAT shows similar results to those on the ARM cluster.
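The spirit of IICP — screen out configuration parameters that barely affect performance before tuning — can be shown with a crude sensitivity proxy: rank parameters by how much mean latency shifts across their observed values. The parameter names, samples, and ranking rule below are illustrative stand-ins, not LOCAT's actual method:

```python
def important_parameters(samples, params, top_k=2):
    """Rank parameters by latency spread: for each parameter, group the
    observed (config, latency) samples by the parameter's value and take
    the gap between the best and worst group means. A large gap suggests
    the parameter matters; a near-zero gap suggests it can be skipped."""
    def spread(p):
        by_value = {}
        for cfg, latency in samples:
            by_value.setdefault(cfg[p], []).append(latency)
        means = [sum(v) / len(v) for v in by_value.values()]
        return max(means) - min(means)
    return sorted(params, key=spread, reverse=True)[:top_k]

# toy samples: only shuffle.partitions really moves the latency
samples = [
    ({"shuffle.partitions": 100, "executor.cores": 2, "codegen": 1}, 30.0),
    ({"shuffle.partitions": 100, "executor.cores": 4, "codegen": 0}, 29.0),
    ({"shuffle.partitions": 400, "executor.cores": 2, "codegen": 0}, 12.0),
    ({"shuffle.partitions": 400, "executor.cores": 4, "codegen": 1}, 11.0),
]
ranked = important_parameters(
    samples, ["shuffle.partitions", "executor.cores", "codegen"], top_k=1
)
print(ranked)
```

Shrinking the search space this way is what lets a GP-based tuner like LOCAT converge with far fewer expensive workload runs, since BO scales poorly with dimensionality.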
In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partition algorithm is difficult and requires users to have adequate prior knowledge of the data, which makes data skew common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. Our optimization aims to avoid this additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative sizes of the data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own. The optimization achieves significant performance improvements, from 16% up to 67%, across different sizes and distributions of data. Experiments show that the overhead involved is minimal and can be considered negligible.
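Why stealing helps under skew can be seen in a toy round-based simulation: each task processes one row per round, and a task that exhausts its own small partition steals a row from the largest remaining backlog instead of idling. This is a hypothetical model of the idea, not the paper's Spark implementation:

```python
from collections import deque

def simulate_stealing(partitions):
    """Count the rounds needed to drain all partitions when idle tasks
    steal one row per round from the largest remaining backlog."""
    queues = [deque(p) for p in partitions]
    rounds = 0
    while any(queues):
        rounds += 1
        for q in queues:
            if q:
                q.popleft()                   # process a row of its own data
            else:
                donor = max(queues, key=len)  # steal from the biggest backlog
                if donor:
                    donor.popleft()
    return rounds

# skewed input: one partition of 12 rows, three partitions of 2 rows
rounds = simulate_stealing([[0] * 12, [0] * 2, [0] * 2, [0] * 2])
print(rounds)
```

Without stealing, the stage would last as long as the largest partition (12 rounds here); with stealing, the 18 rows are spread across all four tasks, so the stage finishes much earlier — the same balancing effect the segment/stealing task split provides.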
ISBN (Print): 9781728190129
In a smart grid, various types of queries, such as ad-hoc queries and analytic queries, are issued against the data. Query evaluation based on a single-node database engine is limited because queries in the smart grid are requested over large-scale data. In this paper, to improve the performance of retrieving large-scale data in the smart grid environment, we propose a DQN-based join order optimization model on Spark SQL. The model learns from the actual processing times of queries executed on Spark SQL, not from estimated costs. By learning optimal join orders from previous experience, we optimize join orders with performance similar to Spark SQL's without collecting and computing statistics of the input data set.
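The reinforcement-learning framing here is: state = the set of tables already joined, action = the next table to join, reward = negative observed cost of that step. The paper uses a DQN; as a sketch of the same state/action/reward design, tabular Q-learning on a made-up cost model (intermediate-result size instead of measured Spark SQL times) is enough:

```python
import random

def learn_join_order(sizes, episodes=2000, alpha=0.5, eps=0.2, seed=1):
    """Tabular Q-learning sketch for join ordering. `sizes` maps table
    names to toy row counts; the step cost is the size of the new
    intermediate result (a stand-in for measured execution time)."""
    rng = random.Random(seed)
    tables = sorted(sizes)
    Q = {}

    def step_cost(joined, t):
        rows = 1
        for j in joined:
            rows *= sizes[j]
        return rows * sizes[t]

    def policy(joined, greedy):
        actions = [t for t in tables if t not in joined]
        if not greedy and rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda t: Q.get((joined, t), 0.0))

    for _ in range(episodes):
        joined = frozenset()
        while len(joined) < len(tables):
            a = policy(joined, greedy=False)
            r = -step_cost(joined, a)          # reward = negative step cost
            nxt = joined | {a}
            future = max((Q.get((nxt, t), 0.0) for t in tables if t not in nxt),
                         default=0.0)
            Q[(joined, a)] = (1 - alpha) * Q.get((joined, a), 0.0) \
                + alpha * (r + future)         # standard Q-learning backup
            joined = nxt

    joined, order = frozenset(), []            # greedy rollout of the policy
    while len(joined) < len(tables):
        a = policy(joined, greedy=True)
        order.append(a)
        joined |= {a}
    return order

# joining the small tables first minimizes intermediate-result sizes
order = learn_join_order({"A": 100, "B": 2, "C": 10})
print(order)
```

Replacing `step_cost` with real measured query times is exactly what lets such a model learn good orders without maintaining the statistics a classical cost-based optimizer needs.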
In this paper, we propose a novel cost model for Spark SQL. The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account network and IO costs as well as the most relevant CPU costs. The execution cost is computed starting from a physical plan produced by Spark. The set of operations Spark adopts when executing a GPSJ query is analytically modeled based on the cluster and application parameters, together with a set of database statistics. Experimental results on three benchmarks and two clusters of different sizes and computational features show that our model can estimate the actual execution time with an average error of about 20 percent. Such accuracy is good enough to let the system choose the most effective plan even when the differences in execution time are limited. The error can be reduced to 14 percent if the analytic model is coupled with our straggler-handling strategy.
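To make the "network + IO + CPU from cluster parameters and statistics" structure concrete, here is a deliberately simplified back-of-the-envelope estimator. Every constant and term (disk and network rates, CPU throughput, the selectivity-driven shuffle volume) is an illustrative assumption, far coarser than the paper's operator-level model:

```python
def gpsj_cost_estimate(rows, row_bytes, selectivity, n_executors,
                       disk_mbs=200.0, net_mbs=100.0, cpu_rows_per_sec=2e6):
    """Toy GPSJ-style cost estimate (seconds): scan IO + shuffle network
    + CPU, each term divided evenly across executors. `selectivity` is the
    fraction of rows surviving selection and thus shuffled for the join."""
    scan_bytes = rows * row_bytes
    shuffle_bytes = scan_bytes * selectivity
    io_s = scan_bytes / (disk_mbs * 1e6) / n_executors      # read input
    net_s = shuffle_bytes / (net_mbs * 1e6) / n_executors   # shuffle for join
    cpu_s = rows / cpu_rows_per_sec / n_executors           # per-row work
    return io_s + net_s + cpu_s

# 100M rows of 100 B each, 10% selectivity, 8 executors
est = gpsj_cost_estimate(1e8, 100, 0.1, 8)
print(round(est, 2))  # 13.75
```

Even this crude additive decomposition shows why such a model can rank alternative plans: a plan that shrinks the shuffled volume (smaller `selectivity` before the join) lowers the network term without touching the scan term.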