To process data from IoT and wearable devices, analysis tasks are often offloaded to the cloud. As the amount of sensing data continues to grow, optimizing data analytics frameworks is critical to the performance of processing sensed data. A key approach to speeding up data analytics frameworks in the cloud is caching intermediate data that is used repeatedly in iterative computations. Existing analytics engines implement caching in various ways: some use run-time mechanisms with dynamic profiling, while others rely on programmers to decide which data to cache. Even though caching has long been investigated in computer systems research, recent data analytics frameworks still leave room for optimization. Because sophisticated caching must consider complex execution contexts such as cache capacity, the size of data to cache, and which victims to evict, no general solution exists for data analytics frameworks. In this paper, we propose an application-specific, cost-capacity-aware caching scheme for in-memory data analytics frameworks. We use a cost model, built from multiple representative inputs, and an execution flow analysis, extracted from the DAG schedule, to select primary candidates to cache among intermediate data. Once the caching candidates are determined, the optimal caching is selected automatically during execution, so programmers no longer need to manually decide which intermediate data to cache. We implemented our scheme in Apache Spark and evaluated it experimentally on HiBench benchmarks. Compared to the caching decisions in the original benchmarks, our scheme improves performance by 27% with sufficient cache memory and by 11% with insufficient cache memory.
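The following is a minimal sketch, in Scala, of the general idea of cost-capacity-aware cache selection on Spark RDDs. The class and method names (CacheCandidate, select), the cost figures, and the input path are hypothetical illustrations, not the paper's actual model or algorithm; the sketch only shows how profiled cost, reuse count, and size could drive an automatic persist decision instead of a manual .cache() call.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical cost-model entry for one intermediate RDD: estimated
// recomputation cost (from profiled runs), reuse count (from the DAG),
// and estimated in-memory size.
case class CacheCandidate(name: String, recomputeCostSec: Double,
                          reuseCount: Int, sizeBytes: Long)

object CostCapacityCachingSketch {
  // Greedy selection under a capacity budget: prefer candidates with the
  // highest saved recomputation time per byte. Illustration only.
  def select(cands: Seq[CacheCandidate], capacityBytes: Long): Seq[CacheCandidate] = {
    val ranked = cands.sortBy(c => -(c.recomputeCostSec * (c.reuseCount - 1)) / c.sizeBytes)
    var remaining = capacityBytes
    ranked.filter { c =>
      if (c.sizeBytes <= remaining) { remaining -= c.sizeBytes; true } else false
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cost-capacity-caching-sketch")
      .master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val raw = sc.textFile("hdfs:///input/points.txt")         // hypothetical input path
    val parsed = raw.map(_.split(",").map(_.toDouble))        // reused across iterations

    // Cost estimates would normally come from representative profiled runs.
    val candidates = Seq(CacheCandidate("parsed", recomputeCostSec = 12.0,
                                        reuseCount = 10, sizeBytes = 2L << 30))
    val chosen = select(candidates, capacityBytes = 4L << 30).map(_.name).toSet

    // Persist only the selected intermediate data; no manual .cache() needed.
    if (chosen.contains("parsed")) parsed.persist(StorageLevel.MEMORY_ONLY)

    // ... iterative computation over `parsed` would follow here ...
    spark.stop()
  }
}
```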
ISBN: (Print) 9781665439022
While there has been much effort in recent years to optimise big data systems like Apache Spark and Hadoop, the all-to-all transfer of data between MapReduce computation steps, i.e., the shuffle data mechanism between cluster nodes, remains a serious bottleneck. In this work, we present Cherry, an open-source distributed task-aware Caching sHuffle sErvice for seRveRless analytics. Our thorough experiments on a cloud testbed using realistic and synthetic workloads show that Cherry achieves a 23% to 39% reduction in the completion time of the reduce stage with small shuffle block sizes and a 10% reduction in execution time on real workloads, while it efficiently handles Spark execution failures with a constant task re-computation overhead compared to existing approaches.
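As a rough illustration of the kind of component such a shuffle service is built around, the sketch below implements a small in-memory shuffle-block cache keyed the way Spark addresses shuffle output (shuffle id, map task id, reduce partition id). The class and method names are assumptions for illustration, not Cherry's actual API; a real service would also persist blocks off-executor so a failed task can re-fetch them instead of recomputing the map stage.

```scala
import scala.collection.mutable

// Hypothetical key identifying one shuffle block:
// (shuffle id, map task id, reduce partition id).
case class ShuffleBlockId(shuffleId: Int, mapId: Long, reduceId: Int)

// Minimal in-memory shuffle-block cache with LRU eviction.
class ShuffleBlockCache(capacityBytes: Long) {
  private val blocks = mutable.LinkedHashMap.empty[ShuffleBlockId, Array[Byte]]
  private var usedBytes = 0L

  def put(id: ShuffleBlockId, data: Array[Byte]): Unit = synchronized {
    while (usedBytes + data.length > capacityBytes && blocks.nonEmpty) {
      val (oldest, bytes) = blocks.head
      blocks.remove(oldest); usedBytes -= bytes.length   // evict least recently used
    }
    blocks.put(id, data); usedBytes += data.length
  }

  def get(id: ShuffleBlockId): Option[Array[Byte]] = synchronized {
    blocks.remove(id).map { data => blocks.put(id, data); data }  // refresh recency
  }
}
```

In use, a reducer would first ask the cache for its blocks and fall back to fetching or recomputing them only on a miss, which is what keeps the re-computation overhead of a failed task roughly constant.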
ISBN: (Print) 9781450350280
In this work, we investigate techniques to improve the performance of big data analytics in virtualized clusters by effectively increasing the utilization of cached data and efficiently using scarce memory resources.