Although the modern dataflows are executed in parallel and distributed environments, e.g. on a multi-core machine or on the cloud, current cost models, e.g., those considered by state-of-the-art dataflow optimizatio...
详细信息
Although the modern dataflows are executed in parallel and distributed environments, e.g. on a multi-core machine or on the cloud, current cost models, e.g., those considered by state-of-the-art data flow optimization techniques, do not accurately reflect the response time of real dataflow execution in these execution environments. This is mainly due to the fact that the impact of parallelism, and more specifically, the impact of concurrent task execution on the running time is not adequately modeled in current cost models. The contribution of this work is twofold. Firstly, we propose an advanced cost model that aims to reflect the response time of a dataflow that is executed in parallel more accurately. Secondly, we show that existing optimization solutions are inadequate and develop new optimization techniques targeting the proposed cost model. We focus on the single multi-core machine environment provided by modern business intelligence tools, such as Pentaho Kettle, but our approach can be extended to massively parallel and distributed settings. The distinctive features of our proposal is that we model both time overlaps and the impact of concurrency on task running times in a combined manner;the latter is appropriately quantified and its significance is exemplified. Furthermore, we propose extensions to current optimizers that decide on the exact ordering of flow tasks taking into account the new optimization metric. Finally, we evaluate the new optimization algorithms and show up to 59% response time improvement over state-of-the-art task ordering techniques.
Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel applicat...
详细信息
Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFS). However, the heavy use of UDFS is not well taken into account for data flow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy dataflows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFS and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: we arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. We evaluate SOFA on a selection of hop-heavy dataflows from different domains and compare its performance to three other algorithms for data flow optimization. Our experiments reveal that SOFA finds efficient plans, outperforming the best plans found by its competitors by a factor of up to six. (C) 2015 Elsevier Ltd. All rights reserved.
Modern software differs significantly from traditional computer applications that mostly process reasonably small amounts of static input data-sets in batch mode. Modern software increasingly processes massive amounts...
详细信息
Modern software differs significantly from traditional computer applications that mostly process reasonably small amounts of static input data-sets in batch mode. Modern software increasingly processes massive amounts of data, whereby it is also often the case that new input data is produced and/or existing data is modified on the fly. Consequently, programming models that facilitate the development of such software are emerging. What characterizes them is that data, respectively changes thereof, implicitly flow through computation modules. The software engineer declaratively defines computations as compositions of other computations without explicitly modeling how data should flow along dependency relations between data producer and data consumer modules, letting the runtime to automatically manage and optimize dataflows.
Spark has become one of the main options for large-scale analytics running on top of shared-nothing clusters. This work aims to make a deep dive into the parallelism configuration and shed light on the behavior of par...
详细信息
Spark has become one of the main options for large-scale analytics running on top of shared-nothing clusters. This work aims to make a deep dive into the parallelism configuration and shed light on the behavior of parallel spark jobs. It is motivated by the fact that running a Spark application on all the available processors does not necessarily imply lower running time, while may entail waste of resources. We first propose analytical models for expressing the running time as a function of the number of machines employed. We then take another step, namely to present novel algorithms for configuring dynamic partitioning with a view to minimizing resource consumption without sacrificing running time beyond a user-defined limit. The problem we target is NP-hard. To tackle it, we propose a greedy approach after introducing the notions of dependency graphs and of the benefit from modifying the degree of partitioning at a stage;complementarily, we investigate a randomized approach. Our polynomial solutions are capable of judiciously use the resources that are potentially at user's disposal and strike interesting trade-offs between running time and resource consumption. Their efficiency is thoroughly investigated through experiments based on real execution data.
This paper proposes a novel sub-architecture to optimize the dataflow of REMUS-II (REconfigurable MUltimedia System 2), a dynamically coarse grain reconfigurable architecture. REMUS-II consists of a mu PU (Micro-Proc...
详细信息
This paper proposes a novel sub-architecture to optimize the dataflow of REMUS-II (REconfigurable MUltimedia System 2), a dynamically coarse grain reconfigurable architecture. REMUS-II consists of a mu PU (Micro-Processor Unit) and two RPUs (Reconfigurable Processor Unit), which are used to speeds up control-intensive tasks and data-intensive tasks respectively. The parallel computing capability and flexibility of REMUS-II makes itself an excellent candidate to process multimedia applications, which require a large amount of memory accesses. In this paper, we specifically optimize the dataflow to deal with those performance-hazard and energy-hungry memory accessing in order to meet the bandwidth requirement of parallel computing. The RPU internal memory could work in multiple modes, like 2D-access mode and transformation mode, according to different multimedia access patterns. This novel design can improve the performance up to 26% compared to traditional on-chip memory. Meanwhile, the block buffer is implemented to optimize the off-chip dataflow through reducing off-chip memory accesses, which reducing up to 43% compared to direct DDR access. Based on RTL simulation, REMUS-II can achieve 1080p@30 fps of H.264 High Profile@ Level 4 and High Level MPEG2 at 200 MHz clock frequency. The REMUS-II is implemented into 23.7 mm(2) silicon on TSMC 65 nm logic process with a 400 MHz maximum working frequency.
Our focus in this paper is on enabling the decoupling of dataflow, data exchange and management from the control flow in service compositions and choreographies through novel middleware abstractions and realization. ...
详细信息
ISBN:
(纸本)9783319387918;9783319387901
Our focus in this paper is on enabling the decoupling of dataflow, data exchange and management from the control flow in service compositions and choreographies through novel middleware abstractions and realization. This allows us to perform the dataflow of choreographies in a peer-to-peer fashion decoupled from their control flow. Our work is motivated by the increasing importance and business value of data in the fields of business process management, scientific workflows and the Internet of Things, all of which profiting from the recent advances in data science and Big data. Our approach comprises an application life cycle that inherently introduces data exchange and management as a first-class citizen and defines the functions and artifacts necessary for enabling transparent data exchange. Moreover, we present an architecture of the supporting system that contains the Transparent data Exchange middleware, which enables the data exchange and management on behalf of service choreographies and provides methods for the optimization of the data exchange during their execution.
This work is motivated by the increasing importance and business value of data in the fields of business process management, scientific workflows as a field in eScience, and Internet of Things, all of which profiting ...
详细信息
ISBN:
(纸本)9781509026753
This work is motivated by the increasing importance and business value of data in the fields of business process management, scientific workflows as a field in eScience, and Internet of Things, all of which profiting from the recent advances in data science and Big data. We introduce a management life cycle that renders data as first-class citizen in service choreographies and defines the functions and artifacts necessary for enabling transparent and efficient data exchange among choreography participants. The inherent goal of the life cycle, functions and artifacts is to help decouple the dataflow, data exchange and management from the control flow in service compositions and choreographies. This decoupling enables peer-to-peer data exchange in choreographies and provides the means for more sophisticated data management and exchange, as well as data exchange and provisioning optimization.
This paper identifies the most significant learning factors impacting undergraduate academic performance using artificial neural networks (ANNs) and controlled student data collection. As higher education becomes incr...
详细信息
暂无评论