Load balancing of skewed data in MapReduce systems like Hadoop is a well-studied problem. Many heuristics already exist to improve the load balance of the reducers thereby reducing the overall execution time. In this ...
详细信息
ISBN:
(纸本)9781728190747
Load balancing of skewed data in MapReduce systems like Hadoop is a well-studied problem. Many heuristics already exist to improve the load balance of the reducers thereby reducing the overall execution time. In this paper, we propose a lightweight optimization approach for MapReduce systems to minimize the makespan for repetitive tasks involving a typical frequency distribution. Our idea is to analyze the observed frequency distribution for the given task so as to identify an optimal offset parameter c to add in the hash function to minimize makespan. For two different bucketing methods - modulo labeling and consecutive binning - we present efficient algorithms for finding the optimal value of c. Finally, we present simulation results for both bucketing methods. The results vary with the data distribution and the number of reducers, but generally reduce makespan by 20% on average for power-law distributions, Results are confirmed with experiments on well-known real-world data sets.
暂无评论