ISBN: 9781728190747 (print)
MapReduce-based SQL processing systems, e.g., Hive and Spark SQL, are widely used for big data analytics because they automatically parallelize processing across large clusters of machines. They deliver high performance when loads are balanced across the machines. However, skewed loads are not rare in real applications. Although many efforts have been made to address skew in MapReduce-based systems, they can neither fully exploit all available computing resources nor handle skew in SQL processing. Moreover, none of them can expedite the processing of skewed partitions in case of failures. In this paper, we present SrSpark, a MapReduce-based SQL processing system that makes full use of all computing resources for both non-skewed and skewed loads. To achieve this goal, SrSpark introduces fine-grained processing and work-stealing into the MapReduce framework. More specifically, SrSpark is implemented on top of Spark SQL. In SrSpark, partitions are further divided into sub-partitions and processed at sub-partition granularity. Moreover, SrSpark adaptively uses both intra-node and inter-node parallel processing for skewed loads according to the computing resources available in real time. Such adaptive parallel processing increases the degree of parallelism and reduces the interaction overhead among cooperating worker threads. In addition, SrSpark periodically checkpoints the processing results of sub-partitions to ensure fast recovery from failures during skewed-partition processing. Our experimental results show that for skewed loads, SrSpark outperforms Spark SQL by up to 3.5x, and by 2.2x on average, while the performance overhead is only about 4% under non-skewed loads.
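
To make the idea of sub-partition granularity combined with work-stealing concrete, here is a minimal Python sketch. It is not SrSpark's implementation and uses no Spark API: it is a single-process, discrete-time simulation in which every worker owns one partition and handles one sub-partition per step. The function names (`split_into_subpartitions`, `simulate`), the fixed sub-partition size, and the "steal from the tail of the longest queue" policy are all assumptions made for illustration. Stealing cost, checkpointing, and the intra-node versus inter-node distinction are ignored.

```python
from collections import deque


def split_into_subpartitions(partition, sub_size):
    """Cut one partition (a list of rows) into chunks of at most sub_size rows."""
    return [partition[i:i + sub_size] for i in range(0, len(partition), sub_size)]


def simulate(partitions, sub_size, steal=True):
    """Simulate one worker per partition in discrete time steps.

    In each step every worker processes at most one sub-partition
    (here "processing" just sums the rows). If `steal` is True, a worker
    whose own queue is empty takes a sub-partition from the tail of the
    currently longest queue. Returns (steps_needed, sum_of_all_rows).
    """
    # Hypothetical layout: one queue of sub-partitions per worker.
    queues = [deque(split_into_subpartitions(p, sub_size)) for p in partitions]
    total = 0
    steps = 0
    while any(queues):
        # Each worker first tries to take work from the head of its own queue.
        work = [q.popleft() if q else None for q in queues]
        if steal:
            for i, chunk in enumerate(work):
                if chunk is None:
                    victim = max(queues, key=len)  # most loaded worker
                    if victim:
                        work[i] = victim.pop()  # steal from the tail
        total += sum(sum(chunk) for chunk in work if chunk is not None)
        steps += 1
    return steps, total


if __name__ == "__main__":
    # One skewed partition (80 rows) next to three small ones (10 rows each).
    parts = [list(range(80)), list(range(10)), list(range(10)), list(range(10))]
    print("without stealing:", simulate(parts, sub_size=10, steal=False))
    print("with stealing:   ", simulate(parts, sub_size=10, steal=True))
```

In this toy setup the skewed partition keeps one worker busy for 8 steps when nothing is stolen, while the same work finishes in 3 steps once idle workers take sub-partitions from it; the computed sum is identical in both cases. The numbers describe only this simulation and say nothing about the speedups reported for SrSpark.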