The query efficiency of Spark SQL is significantly affected by its configuration. Configuration tuning has therefore drawn great attention, and various automatic configuration tuning methods have been proposed. However, existing methods suffer from two issues: (1) high tuning overhead: they need to execute the workloads repeatedly to obtain training samples, which is time-consuming; and (2) low throughput: they occupy resources such as CPU cores and memory for a long time, causing other Spark SQL workloads to wait and thereby reducing overall system throughput. These issues impede the use of automatic configuration tuning methods in practical systems, which have limited tuning budgets and many concurrent workloads. To address these issues, this paper proposes a Low-Overhead and Flexible approach for Spark SQL configuration Tuning, dubbed LOFTune. LOFTune reduces the tuning overhead via a sample-efficient optimization framework based on multi-task SQL representation learning and multi-armed bandits. Furthermore, LOFTune solves the low-throughput issue with a recommendation-sampling-decoupled tuning framework. Extensive experiments validate the effectiveness of LOFTune. In the sampling-allowed case, LOFTune saves up to 90% of the workload runs compared with state-of-the-art methods. Moreover, in the zero-sampling case, LOFTune reduces latency by up to 41.26%.
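The abstract does not detail LOFTune's bandit formulation, but the general idea of treating candidate configurations as arms and observed latency as the (negative) reward can be sketched with a plain epsilon-greedy bandit. Everything below — the candidate configurations, the toy latency function, and all constants — is illustrative, not taken from the paper:

```python
import random

def epsilon_greedy_tune(arms, latency_fn, rounds=200, epsilon=0.1, seed=0):
    """Pick the best configuration ("arm") by epsilon-greedy bandit search,
    minimizing mean observed latency. `arms` is a list of hypothetical
    Spark SQL configuration dicts; `latency_fn` measures one run."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    totals = [0.0] * len(arms)
    for _ in range(rounds):
        if 0 in counts:                       # try every arm at least once
            i = counts.index(0)
        elif rng.random() < epsilon:          # explore a random arm
            i = rng.randrange(len(arms))
        else:                                 # exploit the lowest mean latency
            i = min(range(len(arms)), key=lambda j: totals[j] / counts[j])
        totals[i] += latency_fn(arms[i])
        counts[i] += 1
    return min(range(len(arms)), key=lambda j: totals[j] / counts[j])

# toy workload: the third configuration is truly the fastest
noise = random.Random(42)
arms = [{"spark.sql.shuffle.partitions": p} for p in (50, 200, 400)]
true_mean = [12.0, 9.0, 7.5]
best = epsilon_greedy_tune(
    arms, lambda cfg: true_mean[arms.index(cfg)] + noise.gauss(0, 0.5)
)
print(best)
```

Each run of a workload is one "pull", so the sample budget is controlled directly by `rounds`; a sample-efficient method like LOFTune aims to make that budget small.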
Apache Spark stands out as a well-known solution for big data processing because of its efficiency and rapid processing capabilities. One of its modules, Spark SQL, serves as a prominent big data query engine. However, executing Spark SQL applications on massive data can be time-intensive, and the execution time can vary significantly depending on the configuration. Recent studies try to reduce application execution times by searching for optimal configurations. While Bayesian optimization is recognized in recent studies as a powerful method for configuration optimization, it faces challenges such as high computational cost and time-consuming computations, especially when dealing with large search spaces. To address these challenges, we propose QHB+, designed to rapidly search for optimal configurations. QHB+ applies Successive Halving Algorithm-based optimization methods, which perform well in hyperparameter optimization of machine learning models, to the configuration optimization of Spark SQL applications. Through empirical evaluations against established benchmarks, we show the efficiency of QHB+, highlighting it as a swift alternative to conventional optimization methods for tuning Spark SQL configurations.
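The Successive Halving idea the abstract references is simple to state: evaluate many candidate configurations on a small budget, discard the worse half, and repeat with a larger budget. A minimal sketch, with a made-up cost function and budget semantics (not the paper's actual algorithm):

```python
def successive_halving(configs, run_fn, min_budget=1, eta=2):
    """Successive Halving sketch: evaluate all surviving configs with the
    current budget, keep the best 1/eta fraction, multiply the budget by
    eta, and repeat until one config remains. `run_fn(cfg, budget)` returns
    a cost (lower is better) for a hypothetical partial run."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: run_fn(c, budget))
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

# toy search space: shuffle-partition counts; 200 is truly optimal, and a
# larger budget only sharpens the (here deterministic) cost estimate
configs = [50, 100, 200, 400, 800, 1600, 3200, 6400]
best = successive_halving(configs, lambda c, b: (c - 200) ** 2 / b)
print(best)  # 200
```

The appeal for configuration tuning is that most of the budget goes to promising configurations, while obviously bad ones are eliminated after only cheap, short runs.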
ISBN (Print): 9798400701030
Distributed data analytics engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL depends heavily on the choice of configurations, where the optimal ones vary with the executed workloads. Among the various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: (1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; (2) history tasks should be carefully utilized, since using dissimilar ones leads to deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address these challenges, we propose generalized transfer learning to boost tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.
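The "controlled history transfer" challenge — transfer only from similar past tasks, since dissimilar ones hurt — can be illustrated with a similarity-filtered warm start. The workload feature vectors, the distance threshold, and the config names below are all hypothetical, not Rover's actual mechanism:

```python
def warm_start_configs(history, current_features, k=3, max_dist=1.0):
    """Controlled-transfer sketch: seed the tuner only with the best
    configurations of history tasks whose (hypothetical) workload feature
    vectors lie within `max_dist` of the current task, guarding against
    negative transfer from dissimilar workloads."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    scored = [(dist(h["features"], current_features), h["best_config"])
              for h in history]
    scored = [(d, c) for d, c in scored if d <= max_dist]  # drop dissimilar
    scored.sort(key=lambda t: t[0])                        # closest first
    return [c for _, c in scored[:k]]

history = [
    {"features": (0.9, 0.1), "best_config": {"spark.executor.memory": "4g"}},
    {"features": (0.2, 0.8), "best_config": {"spark.executor.memory": "16g"}},
    # a very different workload: its config should NOT be transferred
    {"features": (5.0, 5.0), "best_config": {"spark.executor.memory": "2g"}},
]
seeds = warm_start_configs(history, (0.8, 0.2))
print(seeds)
```

A BO loop could then start from `seeds` instead of random points, which is one common way warm starting reduces the iterations needed to reach a good configuration.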
ISBN (Print): 9781450392495
Spark SQL has been widely deployed in industry, but tuning its performance is challenging. Recent studies try to employ machine learning (ML) to solve this problem but suffer from two drawbacks. First, it takes a long time (high overhead) to collect training samples. Second, the optimal configuration for one input data size of the same application might not be optimal for others. To address these issues, we propose a novel Bayesian Optimization (BO) based approach named LOCAT to automatically tune the configurations of Spark SQL applications online. LOCAT introduces three techniques. The first, named QCSA, eliminates configuration-insensitive queries through Query Configuration Sensitivity Analysis when collecting training samples. The second, dubbed DAGP, is a Datasize-Aware Gaussian Process that models the performance of an application as a distribution over functions of the configuration parameters and the input data size. The third, called IICP, Identifies Important Configuration Parameters with respect to performance and tunes only the important ones. As such, LOCAT can tune the configurations of a Spark SQL application with low overhead and adapt to different input data sizes. We employ Spark SQL applications from the benchmark suites TPC-DS, TPC-H, and HiBench running on two significantly different clusters, a four-node ARM cluster and an eight-node x86 cluster, to evaluate LOCAT. The experimental results on the ARM cluster show that LOCAT accelerates the optimization procedures of the state-of-the-art approaches by at least 4.1x and up to 9.7x; moreover, LOCAT improves application performance by at least 1.9x and up to 2.4x. On the x86 cluster, LOCAT shows similar results to those on the ARM cluster.
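The spirit of IICP — screen out configuration parameters that barely affect performance before tuning — can be shown with a crude sensitivity proxy: rank parameters by how much mean latency shifts across their observed values. The parameter names, samples, and ranking rule below are illustrative stand-ins, not LOCAT's actual method:

```python
def important_parameters(samples, params, top_k=2):
    """Rank parameters by latency spread: for each parameter, group the
    observed (config, latency) samples by the parameter's value and take
    the gap between the best and worst group means. A large gap suggests
    the parameter matters; a near-zero gap suggests it can be skipped."""
    def spread(p):
        by_value = {}
        for cfg, latency in samples:
            by_value.setdefault(cfg[p], []).append(latency)
        means = [sum(v) / len(v) for v in by_value.values()]
        return max(means) - min(means)
    return sorted(params, key=spread, reverse=True)[:top_k]

# toy samples: only shuffle.partitions really moves the latency
samples = [
    ({"shuffle.partitions": 100, "executor.cores": 2, "codegen": 1}, 30.0),
    ({"shuffle.partitions": 100, "executor.cores": 4, "codegen": 0}, 29.0),
    ({"shuffle.partitions": 400, "executor.cores": 2, "codegen": 0}, 12.0),
    ({"shuffle.partitions": 400, "executor.cores": 4, "codegen": 1}, 11.0),
]
ranked = important_parameters(
    samples, ["shuffle.partitions", "executor.cores", "codegen"], top_k=1
)
print(ranked)
```

Shrinking the search space this way is what lets a GP-based tuner like LOCAT converge with far fewer expensive workload runs, since BO scales poorly with dimensionality.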
In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partition algorithm is difficult and requires users to have adequate prior knowledge of the data, which makes data skew common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. Our optimization aims to avoid this additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative sizes of the data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own. The optimization achieves significant performance improvements, from 16% up to 67%, across different sizes and distributions of data. Experiments show that the overhead involved is minimal and can be considered negligible.
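Why stealing helps under skew can be seen in a toy round-based simulation: each task processes one row per round, and a task that exhausts its own small partition steals a row from the largest remaining backlog instead of idling. This is a hypothetical model of the idea, not the paper's Spark implementation:

```python
from collections import deque

def simulate_stealing(partitions):
    """Count the rounds needed to drain all partitions when idle tasks
    steal one row per round from the largest remaining backlog."""
    queues = [deque(p) for p in partitions]
    rounds = 0
    while any(queues):
        rounds += 1
        for q in queues:
            if q:
                q.popleft()                   # process a row of its own data
            else:
                donor = max(queues, key=len)  # steal from the biggest backlog
                if donor:
                    donor.popleft()
    return rounds

# skewed input: one partition of 12 rows, three partitions of 2 rows
rounds = simulate_stealing([[0] * 12, [0] * 2, [0] * 2, [0] * 2])
print(rounds)
```

Without stealing, the stage would last as long as the largest partition (12 rounds here); with stealing, the 18 rows are spread across all four tasks, so the stage finishes much earlier — the same balancing effect the segment/stealing task split provides.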
ISBN (Print): 9781728190129
In a smart grid, various types of queries, such as ad-hoc queries and analytic queries, are issued against the data. Query evaluation based on a single-node database engine is limited because queries in the smart grid are requested over large-scale data. In this paper, to improve the performance of retrieving large-scale data in the smart grid environment, we propose a DQN-based join order optimization model on Spark SQL. The model learns from the actual processing times of queries executed on Spark SQL, not from estimated costs. By learning optimal join orders from previous experience, we optimize join orders with performance similar to Spark SQL's without collecting and computing statistics of the input data set.
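The reinforcement-learning framing here is: state = the set of tables already joined, action = the next table to join, reward = negative observed cost of that step. The paper uses a DQN; as a sketch of the same state/action/reward design, tabular Q-learning on a made-up cost model (intermediate-result size instead of measured Spark SQL times) is enough:

```python
import random

def learn_join_order(sizes, episodes=2000, alpha=0.5, eps=0.2, seed=1):
    """Tabular Q-learning sketch for join ordering. `sizes` maps table
    names to toy row counts; the step cost is the size of the new
    intermediate result (a stand-in for measured execution time)."""
    rng = random.Random(seed)
    tables = sorted(sizes)
    Q = {}

    def step_cost(joined, t):
        rows = 1
        for j in joined:
            rows *= sizes[j]
        return rows * sizes[t]

    def policy(joined, greedy):
        actions = [t for t in tables if t not in joined]
        if not greedy and rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda t: Q.get((joined, t), 0.0))

    for _ in range(episodes):
        joined = frozenset()
        while len(joined) < len(tables):
            a = policy(joined, greedy=False)
            r = -step_cost(joined, a)          # reward = negative step cost
            nxt = joined | {a}
            future = max((Q.get((nxt, t), 0.0) for t in tables if t not in nxt),
                         default=0.0)
            Q[(joined, a)] = (1 - alpha) * Q.get((joined, a), 0.0) \
                + alpha * (r + future)         # standard Q-learning backup
            joined = nxt

    joined, order = frozenset(), []            # greedy rollout of the policy
    while len(joined) < len(tables):
        a = policy(joined, greedy=True)
        order.append(a)
        joined |= {a}
    return order

# joining the small tables first minimizes intermediate-result sizes
order = learn_join_order({"A": 100, "B": 2, "C": 10})
print(order)
```

Replacing `step_cost` with real measured query times is exactly what lets such a model learn good orders without maintaining the statistics a classical cost-based optimizer needs.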
In this paper, we propose a novel cost model for Spark SQL. The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account network and IO costs as well as the most relevant CPU costs. The execution cost is computed starting from a physical plan produced by Spark. The set of operations Spark adopts when executing a GPSJ query is analytically modeled based on the cluster and application parameters, together with a set of database statistics. Experimental results on three benchmarks and two clusters of different sizes and computational features show that our model can estimate the actual execution time with an average error of about 20 percent. Such accuracy is good enough to let the system choose the most effective plan even when the differences in execution time are limited. The error can be reduced to 14 percent if the analytic model is coupled with our straggler-handling strategy.
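To make the "network + IO + CPU from cluster parameters and statistics" structure concrete, here is a deliberately simplified back-of-the-envelope estimator. Every constant and term (disk and network rates, CPU throughput, the selectivity-driven shuffle volume) is an illustrative assumption, far coarser than the paper's operator-level model:

```python
def gpsj_cost_estimate(rows, row_bytes, selectivity, n_executors,
                       disk_mbs=200.0, net_mbs=100.0, cpu_rows_per_sec=2e6):
    """Toy GPSJ-style cost estimate (seconds): scan IO + shuffle network
    + CPU, each term divided evenly across executors. `selectivity` is the
    fraction of rows surviving selection and thus shuffled for the join."""
    scan_bytes = rows * row_bytes
    shuffle_bytes = scan_bytes * selectivity
    io_s = scan_bytes / (disk_mbs * 1e6) / n_executors      # read input
    net_s = shuffle_bytes / (net_mbs * 1e6) / n_executors   # shuffle for join
    cpu_s = rows / cpu_rows_per_sec / n_executors           # per-row work
    return io_s + net_s + cpu_s

# 100M rows of 100 B each, 10% selectivity, 8 executors
est = gpsj_cost_estimate(1e8, 100, 0.1, 8)
print(round(est, 2))  # 13.75
```

Even this crude additive decomposition shows why such a model can rank alternative plans: a plan that shrinks the shuffled volume (smaller `selectivity` before the join) lowers the network term without touching the scan term.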