ISBN: (Print) 9781665443937
Unlike event-driven stream processing systems, micro-batch stream processing systems collect input data for a certain period of time before processing it, because they focus on improving the throughput of the entire system rather than reducing the latency of each data item. However, ingesting a continuous stream of data and analyzing it in real time is also necessary in micro-batch stream processing systems, where reducing latency can matter more than improving throughput. This paper presents Q-Spark, a QoS (Quality of Service) aware micro-batch stream processing system implemented on Apache Spark. The main idea of the Q-Spark design is to set a deadline for each query and to dynamically adjust the batch size so as not to exceed it. Since Q-Spark executes a micro-batch after buffering as much data as possible without exceeding the deadline set for each query, it guarantees the QoS requirement of each query while maintaining throughput comparable to the original Spark batching mechanism. Experimental results show that the tail latency of Q-Spark is always bounded by the deadline, in contrast to the original Spark, where data is buffered using triggers for a fixed period. As a result, Q-Spark reduces the tail latency per query by up to 75% while maintaining stable throughput compared to the original Spark, which has no concept of a deadline.
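To make the deadline-driven batching idea concrete, here is a minimal sketch in Python. It is not Q-Spark's actual implementation: the class name `DeadlineBatcher`, the linear per-record cost model, and all parameter names are assumptions made for illustration. The batcher buffers records for one query and fires the micro-batch once admitting another record could push the oldest buffered record past the query's deadline, so the batch size adapts to the arrival rate.

```python
import time

class DeadlineBatcher:
    """Hypothetical sketch of deadline-aware micro-batching: buffer as much
    as possible, but never let the oldest record miss its query deadline."""

    def __init__(self, deadline_ms, est_cost_per_record_ms, process_fn):
        self.deadline_ms = deadline_ms         # per-query QoS deadline
        self.cost_ms = est_cost_per_record_ms  # assumed linear processing cost
        self.process_fn = process_fn           # downstream micro-batch executor
        self.buffer = []
        self.oldest_ts = None                  # arrival time of oldest buffered record

    def _estimated_finish_ms(self, now_ms):
        # time already spent waiting plus predicted time to process the batch
        return (now_ms - self.oldest_ts) + self.cost_ms * len(self.buffer)

    def offer(self, record):
        now_ms = time.monotonic() * 1000.0
        if not self.buffer:
            self.oldest_ts = now_ms
        self.buffer.append(record)
        # Fire while the batch can still finish inside the deadline; waiting
        # for one more record might push the oldest record past it.
        if self._estimated_finish_ms(now_ms) + self.cost_ms >= self.deadline_ms:
            self.flush()

    def flush(self):
        if self.buffer:
            self.process_fn(self.buffer)
            self.buffer = []
            self.oldest_ts = None

# Example: a 500 ms deadline with an assumed 0.2 ms/record processing cost.
batcher = DeadlineBatcher(500, 0.2, lambda batch: print(len(batch), "records"))
for rec in range(10_000):
    batcher.offer(rec)
batcher.flush()  # drain whatever remains at shutdown
```

Under this policy a fast stream yields large, throughput-friendly batches, while a slow stream is flushed early enough that no record waits longer than the deadline; a real system would additionally arm a timer so an idle buffer is flushed even when no new record arrives.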
ISBN: (Print) 9781450367356
Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved: incoming data tuples are first buffered as data blocks and then processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that conforms to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme, termed Prompt, is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.
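Since optimal partitioning is NP-hard, the abstract reaches for a workload-aware greedy heuristic. The sketch below illustrates one standard greedy scheme consistent with that description: place the heaviest keys first, each on the currently least-loaded partition. It is an illustration of the general technique, not Prompt's published algorithm, and the function and parameter names are hypothetical.

```python
import heapq
from collections import Counter

def greedy_partition(tuples, num_partitions, key_fn=lambda t: t[0]):
    """Workload-aware greedy partitioning sketch for one micro-batch:
    heaviest keys are placed first, each on the least-loaded partition,
    so Map tasks receive roughly equal work despite key skew."""
    freq = Counter(key_fn(t) for t in tuples)       # per-key frequencies
    heap = [(0, p) for p in range(num_partitions)]  # (load, partition_id)
    heapq.heapify(heap)
    assignment = {}
    for key, count in freq.most_common():           # descending frequency
        load, pid = heapq.heappop(heap)
        assignment[key] = pid
        heapq.heappush(heap, (load + count, pid))
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        partitions[assignment[key_fn(t)]].append(t)
    return partitions

# Example with a skewed batch: key 'a' dominates, yet no partition is overloaded
# beyond what whole-key placement allows.
batch = ([("a", i) for i in range(60)] + [("b", i) for i in range(25)]
         + [("c", i) for i in range(15)])
print([len(p) for p in greedy_partition(batch, 3)])  # -> [60, 25, 15]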