mapreduce is a widely used parallel computing paradigm for the big data realm on the scale of terabytes and higher. The introduction of minimal mapreduce algorithms promised efficiency in load balancing among particip...
详细信息
ISBN:
(数字)9783030399511
ISBN:
(纸本)9783030399511;9783030399504
mapreduce is a widely used parallel computing paradigm for the big data realm on the scale of terabytes and higher. The introduction of minimal mapreduce algorithms promised efficiency in load balancing among participating machines by ensuring that partition skew (where some machines end up processing a significantly larger fraction of the input than other machines) is prevented. Despite minimal mapreduce algorithms guarantee of load-balancing within constant multiplicative factors, the constants are relatively large which severely diminishes the theoretical appeal for true efficiency at scale. We introduce the notion of strongly minimal mapreduce algorithms that provide strong guarantees of parallelization up to a small additive factor that diminishes with an increasing number of machines. We show that a strongly minimalmapreduce algorithm exists for sorting;this leads to strongly minimalalgorithms for several fundamental database algorithms and operations that crucially rely on sorting as a primitive. Our techniques are general and apply beyond the analysis of strongly minimal mapreduce algorithms;we show that given a sufficiently high, but still realistic, sampling rate, the approximate partitions obtained from a particular sampling strategy are almost as good as the partitions produced by an ideal partitioning.
暂无评论