Search Results - Inner Mongolia University Library

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2019, Vol. 31, No. 5, pp. 819-832

Authors: Baldacci, Lorenzo; Golfarelli, Matteo (Univ Bologna, DISI, I-40126 Bologna, Italy)

In this paper, we propose a novel cost model for Spark SQL. The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account network and I/O costs as well as the most relevant CPU costs. The execution cost is computed starting from a physical plan produced by Spark. The set of operations that Spark adopts when executing a GPSJ query is modeled analytically based on the cluster and application parameters, together with a set of database statistics. Experimental results on three benchmarks and on two clusters of different sizes and with different computation features show that our model estimates the actual execution time with an average error of about 20 percent. Such accuracy is good enough to let the system choose the most effective plan even when the execution time differences are limited. The error can be reduced to 14 percent if the analytic model is coupled with our straggler handling strategy.

Keywords: Spark; Spark SQL; cost model; query optimization

Handling Data Skew for Aggregation in Spark SQL Using Task Stealing

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2020, Vol. 48, No. 6, pp. 941-956

Authors: He, Zeyu; Huang, Qiuli; Li, Zhifang; Weng, Chuliang (East China Normal Univ, Sch Data Sci & Engn, Shanghai, Peoples R China)

In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partition algorithm is difficult and requires users to have adequate prior knowledge of the data, which makes data skew common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. Our optimization aims to avoid the additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative size of data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own. The optimization achieves significant performance improvements, from 16% up to 67%, on different sizes and distributions of data. Experiments show that the overhead involved is minimal and can be considered negligible.

Keywords: In-memory computing; Spark SQL; Aggregation; Data skew
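The task-stealing idea in the abstract above can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function name `run_stage`, the `threshold` and `chunk_size` parameters, and the round-robin scheduling are all hypothetical choices made for illustration. Partitions larger than `threshold` get a segment task whose data is split into chunks; every other partition gets a stealing task that, once its own data is done, takes chunks from the tail of the busiest segment task.

```python
from collections import Counter, deque


def run_stage(partitions, threshold, chunk_size):
    """Simulate one aggregation stage (sum of values per key) with task stealing.

    Hypothetical sketch: one task per partition, one chunk processed per
    task per round. Returns (aggregated totals, number of stolen chunks).
    """
    # Each task starts with a queue of chunks cut from its own partition.
    queues = [
        deque(p[i:i + chunk_size] for i in range(0, len(p), chunk_size))
        for p in partitions
    ]
    # Assumption: a partition larger than `threshold` gets a segment task.
    is_segment = [len(p) > threshold for p in partitions]
    partials = [Counter() for _ in partitions]
    stolen = 0

    while any(queues):
        for task, queue in enumerate(queues):
            if queue:
                chunk = queue.popleft()  # own data first
            elif not is_segment[task]:
                # Stealing task: pick the segment task with the most chunks left.
                victim = max(
                    (v for v in range(len(queues)) if is_segment[v]),
                    key=lambda v: len(queues[v]),
                    default=None,
                )
                if victim is None or not queues[victim]:
                    continue
                chunk = queues[victim].pop()  # steal from the tail
                stolen += 1
            else:
                continue  # a finished segment task does not steal
            for key, value in chunk:
                partials[task][key] += value

    # Merge the per-task partial aggregates into the final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total), stolen


skewed = [[("a", 1)] * 8, [("b", 2)], [("a", 1), ("b", 1)]]
result, stolen_chunks = run_stage(skewed, threshold=4, chunk_size=2)
```

With this input, the first partition is the only one above the threshold, so the two small tasks finish their own data in the first round and then each take one chunk from it. The aggregate is the same as without stealing; only the distribution of work changes.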

QHB⁺: Accelerated Configuration Optimization for Automated Performance Tuning of Spark SQL Applications

IEEE ACCESS, 2024, Vol. 12, pp. 60138-60148

Authors: Jang, Deokyeon; Yoon, Hyunsik; Jung, Kijung; Chung, Yon Dohn (Korea Univ, Dept Comp Sci & Engn, Seoul 02841, South Korea)

Apache Spark stands out as a well-known solution for big data processing because of its efficiency and rapid processing capabilities. One of its modules, Spark SQL, serves as a prominent big data query engine. However, executing Spark SQL applications on massive data can be time-intensive, and the execution time can vary significantly depending on the configuration. Recent studies try to reduce application execution times by searching for optimal configurations. While Bayesian optimization is recognized in recent studies as a powerful method for configuration optimization, it faces challenges such as high computational cost and time-consuming computation, especially when dealing with large search spaces. To address these challenges, we propose QHB(+), designed to search for optimal configurations rapidly. QHB(+) applies optimization methods based on the Successive Halving Algorithm, which perform well in hyperparameter optimization of machine learning models, to the configuration optimization of Spark SQL applications. Through empirical evaluations against established benchmarks, we show the efficiency of QHB(+), highlighting them as swift alternatives to conventional optimization methods for optimizing Spark SQL configurations.

Keywords: Sparks; Optimization methods; Machine learning; Tuning; Bayes methods; Upper bound; Configuration management; Structured Query Language; Big data; configuration optimization; Spark SQL; hyperband
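The abstract above builds on the Successive Halving Algorithm. As a rough illustration of that general algorithm only (not of QHB(+) itself), the sketch below evaluates all candidate configurations on a small budget, keeps the best fraction, and repeats with a larger budget. The names `successive_halving` and `toy_cost`, the meaning of "budget", and the `shuffle_partitions` key are hypothetical; in a real tuning setup the evaluation would run the application and measure its execution time.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Return the configuration that survives repeated halving.

    `evaluate(config, budget)` must return a cost (lower is better).
    After each round only the best 1/eta of the candidates is kept and
    the budget given to each survivor is multiplied by eta.
    """
    survivors = list(configs)
    if not survivors:
        raise ValueError("at least one configuration is required")
    budget = min_budget
    while len(survivors) > 1:
        # Rank the current candidates by measured cost at this budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0]


def toy_cost(config, budget):
    """Hypothetical stand-in for a measured run time: best at 200 partitions."""
    return abs(config["shuffle_partitions"] - 200) / budget


candidates = [{"shuffle_partitions": n} for n in (8, 64, 200, 400)]
best = successive_halving(candidates, toy_cost)
```

Because poor candidates are discarded after cheap evaluations, most of the budget goes to the promising ones, which is the property that makes this family of methods attractive for large search spaces.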

Stargate: A data source connector based on Spark SQL

2nd International Conference on Machine Learning and Soft Computing (ICMLSC)

Authors: Tao, Yuzheng; Wu, Gang; Kang, Yi (Shanghai Jiao Tong Univ, Dept Software Engn, Shanghai, Peoples R China; Transwarp Technol Co Ltd, Dept Framework, Shanghai, Peoples R China)

ISBN: (print) 9781450363365

Spark SQL has become a practical solution for many enterprises facing massive data analysis and processing problems. To connect computing engines quickly and conveniently to data sources on different storage engines, and to help the computing engine understand and adapt to the storage engine so as to improve computational efficiency, we present Stargate, a data source connector based on Spark SQL that provides a framework for connecting different storage engines to the Spark SQL computing engine. Experiments show that Stargate can perfectly match the Spark SQL computing engine with multiple storage engines, and that it helps computing engines understand and adapt to storage engines to improve the efficiency of computational analysis.

Keywords: Big Data; Apache Spark; Spark SQL; Data Source Connector

Indexing for Large Scale Data Querying based on Spark SQL

14th IEEE International Conference on e-Business Engineering (ICEBE)

Authors: Cui, Yi; Li, Guoqiang; Cheng, Hao; Wang, Daoyuan (Shanghai Jiao Tong Univ, Sch Software Engn, Shanghai, Peoples R China; Intel APAC Corp, Shanghai, Peoples R China)

ISBN: (print) 9781538614129

Spark SQL lets Spark programmers query structured data inside Spark programs using SQL statements. It makes it convenient for Spark programmers to leverage the benefits of relational processing, and its internal RDD-based distributed processing also accelerates queries on large data sets. However, Spark SQL is not designed for long-running services, and its built-in data source loads data from the storage system, such as HDFS or the local file system, on each table scan without a caching mechanism. Although users can keep data in memory explicitly with the "cache" command, the data cached in memory is coarse grained. In this paper, we present an indexing structure that is a pluggable component of Spark SQL based on Apache Spark. Compared with Spark SQL, it has some additional advantages. First, it allows users to create an index on the structured data to be processed, which greatly speeds up queries. Second, it enables programmers to load fine-grained data files of structured data into memory, which makes it flexible to load "hot data" into memory and to evict "cold data" from memory.

Keywords: Apache Spark; Big Data; Big Data Indexing, Searching and Querying; Spark SQL; Spark as a Service

Query Optimization Approach with Middle Storage Layer for Spark SQL

22nd IEEE International Conference on Computer Supported Cooperative Work in Design (CSCWD)

Authors: Song, Aibo; Zhai, Mingyu; Xue, Yingying; Chen, Peng; Du, Mingyang; Wan, Yutong (NARI Technol Dev Co Ltd, Nanjing, Jiangsu, Peoples R China; Southeast Univ, Sch Comp Sci & Engn, Nanjing, Jiangsu, Peoples R China)

ISBN: (print) 9781538614822

Currently, Spark SQL cannot optimize multi-query tasks: tasks submitted in a batch are translated into different Spark jobs, and these jobs cannot share input data. To solve this problem, this paper explores the optimization of translating an SQL query into Spark jobs via two strategies: 1) a Middle Storage Layer is added between the persistent file system and the Spark core to address the data sharing problem among multiple queries; 2) when loading data into the Middle Storage Layer, a selection strategy based on the cost model is proposed, in which the input data are distributed to multiple suitable nodes for processing to achieve efficient utilization of cluster resources. Based on this exploration, we develop a system called QOMS (Query Optimization approach with Middle Storage layer for Spark SQL), which achieves better query speed than Spark SQL on the TPC-H benchmark.

Keywords: Spark SQL; Query Optimization; Middle Storage Layer

Workload Driven Comparison and Optimization of Hive and Spark SQL

4th International Conference on Information Science and Control Engineering (ICISCE)

Authors: Zhang, Man; Liu, Fang; Lu, Yutong; Chen, Zhiguang (Natl Univ Def Technol, Coll Comp, Changsha, Hunan, Peoples R China)

ISBN: (print) 9781538630136

This paper describes how to optimize job performance in Hive and Spark SQL for specific workloads, and compares the two systems. First, we compare Hive and Spark SQL using ten SQL queries. By analyzing the impact of different file formats and compression strategies on performance for different query types, we conclude that Spark SQL supports Parquet better, while Parquet does not show as obvious an advantage in Hive as it does in Spark SQL. Snappy works better for intermediate data compression, and, relative to ORC, Parquet combined with Snappy gives the best performance. Second, we change the default configuration of Hive, adjust the number of MapReduce tasks, optimize the join strategy, and eliminate the effects of data skew, which improves Hive performance by 10% to 75% or more depending on the workload type. We also optimize Spark SQL by improving parallelism and join methods.

Keywords: Hive; Optimization; SQL on Hadoop; Spark SQL

DQN-based Join Order Optimization by Learning Experiences of Running Queries on Spark SQL

20th IEEE International Conference on Data Mining (ICDM)

Authors: Lee, Kyeong-Min; Kim, InA; Lee, Kyu-Chul (Chungnam Natl Univ, Dept Comp Engn, Daejeon, South Korea)

ISBN: (print) 9781728190129

In a smart grid, various types of queries, such as ad hoc queries and analytic queries, are issued against the data. Query evaluation on a single-node database engine is limited because queries in the smart grid target large-scale data. In this paper, to improve the performance of retrieving large-scale data in the smart grid environment, we propose a DQN-based join order optimization model on Spark SQL. The model learns from the actual processing time of queries evaluated on Spark SQL, not from estimated costs. By learning the optimal join orders from previous experiences, we optimize join orders with performance similar to Spark SQL, without collecting and computing statistics of the input data set.

Keywords: smart grid; join order optimization; Spark SQL; deep reinforcement learning; deep Q-network

Rover: An Online Spark SQL Tuning Service via Generalized Transfer Learning

29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

Authors: Shen, Yu; Ren, Xinyuyang; Lu, Yupeng; Jiang, Huaijun; Xu, Huanyong; Peng, Di; Li, Yang; Zhang, Wentao; Cui, Bin (Peking Univ, Sch CS, Beijing, Peoples R China; ByteDance Inc, Beijing, Peoples R China; Peking Univ, Ctr Data Sci, Beijing, Peoples R China; Mila Quebec AI Inst, Montreal, PQ, Canada; Peking Univ, Inst Computat Social Sci, Sch CS, Beijing, Peoples R China)

ISBN: (print) 9798400701030

Distributed data analytic engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL depends heavily on the choice of configurations, and the optimal ones vary with the executed workloads. Among various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; 2) history tasks should be used carefully, since using dissimilar ones leads to deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address the challenges, we propose generalized transfer learning to boost tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.

Keywords: Spark SQL; Bayesian Optimization; Transfer Learning

Optimization of Row Pattern Matching over Sequence Data in Spark SQL

30th International Conference on Database and Expert Systems Applications (DEXA)

Authors: Nakabasami, Kosuke; Kitagawa, Hiroyuki; Nasu, Yuya (Railway Tech Res Inst, Hikari Cho 2-8-38, Kokubunji, Tokyo, Japan; Univ Tsukuba, Ctr Computat Sci, Tennodai 1-1-1, Tsukuba, Ibaraki, Japan; Univ Tsukuba, Grad Sch Syst & Informat Engn, Tennodai 1-1-1, Tsukuba, Ibaraki, Japan)

ISBN: (print) 9783030276157; 9783030276140

Due to advances in information and communications technology and sensor technology, a large quantity of sequence data (time series data, log data, etc.) is generated and processed every day. Row pattern matching for sequence data stored in relational databases was standardized as SQL/RPR in 2016. Today, in addition to relational databases, there are many frameworks for processing large amounts of data in parallel and distributed computing environments, including MapReduce and Spark. Hive and Spark SQL enable us to code data analysis processes in SQL-like query languages. Row pattern matching is also beneficial in Hive and Spark SQL. However, the computational cost of row pattern matching is large, and the process needs to be made efficient. In this paper, we propose two optimization methods that reduce the computational cost of row pattern matching. We focus on Spark and show the design and implementation of the proposed methods for Spark SQL. We verify experimentally that our optimization methods reduce the processing time of Spark SQL queries that include row pattern matching.

Keywords: Sequence data; Pattern matching; Row Pattern Recognition; MATCH_RECOGNIZE; Spark SQL; Optimization
