ISBN:
(Print) 9783030638818; 9783030638825
We propose a new model for data processing programs. Our model generalizes the dataflow programming style implemented by systems such as Apache Spark, DryadLINQ, Apache Beam, and Apache Flink. The model uses directed acyclic graphs (DAGs) to represent the main aspects of dataflow-based systems, namely, operations over data (filtering, aggregation, join) and program execution, defined by data dependence between operations. We use Monoid Algebra to model operations over distributed, partitioned datasets and Petri Nets to represent the dataflow. This approach allows the data processing program specification to remain agnostic of the target Big data processing system. As a first application of the model, we use it to formalize mutation operators for mutation testing of Big data processing programs. The testing tool TRANSMUT-Spark implements these operators.
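The key property that makes Monoid Algebra a good fit for distributed, partitioned datasets can be illustrated with a small sketch (this is an informal illustration, not the paper's formalism; the function names are hypothetical): when an aggregation operator is associative and has an identity element, each partition can be reduced independently and the partial results merged in any order, which is exactly what dataflow engines do.

```python
# Minimal sketch of monoid-style operations over a partitioned dataset.
# The combining operator must form a monoid (associative, with identity)
# so the result is independent of how the data is partitioned.
from functools import reduce

def filter_partitions(partitions, pred):
    """Apply a filter independently to every partition."""
    return [[x for x in p if pred(x)] for p in partitions]

def aggregate(partitions, op, identity):
    """Reduce each partition locally, then merge the partial results.

    `op` must be associative with `identity` as its neutral element
    for the outcome to be partition-invariant.
    """
    partials = [reduce(op, p, identity) for p in partitions]
    return reduce(op, partials, identity)

# A dataset split into three partitions, as a cluster would hold it.
data = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

evens = filter_partitions(data, lambda x: x % 2 == 0)
total = aggregate(evens, lambda a, b: a + b, 0)  # 2 + 4 + 6 + 8 = 20
```

Because `(int, +, 0)` is a monoid, repartitioning `data` differently would yield the same `total`; a non-associative operator (e.g., subtraction) would break this guarantee.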
This paper proposes a model for specifying dataflow-based parallel data processing programs that is agnostic of the target Big data processing framework. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by dataflow Big data processing frameworks. The proposed model relies on Monoid Algebra and Petri Nets to abstract Big data processing programs at two levels: a higher level representing the program dataflow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). We extend the model for data processing programs proposed in [1] to cover iterative data processing programs. A general specification of such programs, as implemented by dataflow-based parallel programming models, is essential given the democratization of iterative and greedy Big data analytics algorithms. Indeed, these algorithms call for revisiting parallel programming models to express iterations. The paper gives a comparative analysis of the iteration strategies proposed by Apache Spark, DryadLINQ, Apache Beam, and Apache Flink, and discusses how the model generalizes these strategies. (c) 2021 Elsevier B.V. All rights reserved.
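One of the iteration strategies the paper compares, the client-side driver loop used in Spark- and DryadLINQ-style programs, can be sketched as follows (an illustrative assumption, not the paper's formal model; the helper names are hypothetical): the loop lives in the client program, and each pass submits one more acyclic dataflow over the partitioned dataset until a convergence predicate holds.

```python
# Hedged sketch of driver-loop iteration over a partitioned dataset:
# each loop pass corresponds to one acyclic dataflow submitted by the client.

def map_partitions(partitions, f):
    """One dataflow pass: apply f element-wise within every partition."""
    return [[f(x) for x in p] for p in partitions]

def iterate_until(partitions, step, converged, max_iters=100):
    """Re-run a dataflow pass until the convergence predicate holds."""
    for _ in range(max_iters):
        new = map_partitions(partitions, step)
        if converged(new):
            return new
        partitions = new
    return partitions

# Example: halve every value until all values drop below 0.01.
data = [[1.0, 2.0], [4.0]]
result = iterate_until(
    data,
    step=lambda x: x / 2,
    converged=lambda parts: all(v < 0.01 for p in parts for v in p),
)
```

By contrast, systems with native iteration constructs (e.g., Flink) keep the loop inside the dataflow itself rather than in the client, which is one of the differences the paper's comparative analysis addresses.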