ISBN (print): 9781538678992
Distributed dataflow systems have been developed to help users analyze and process large datasets. While they make it easier for users to develop massively-parallel programs, users still have to choose the amount of resources for the execution of their jobs. Yet, users do not necessarily understand workload and system dynamics, while they often have constraints like runtime targets and budgets. Addressing this problem, systems have been developed that automatically select the required amount of resources to fulfill the users' constraints. However, interference with co-located workloads can introduce significant variance into job runtimes and make accurate runtime prediction harder. This paper presents CoBell, a resource allocation system that incorporates information about co-located workloads to improve the runtime prediction for jobs in shared clusters. CoBell receives jobs from users with runtime and scale-out constraints and then reserves resources based on predicted runtimes. We implemented CoBell as a job submission tool for YARN; as such, it works with existing YARN cluster setups. The paper evaluates CoBell using five different distributed dataflow jobs, showing that with CoBell the runtime constraints are never exceeded by more than 7.2%.
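The abstract gives no implementation details; as a rough sketch of the underlying idea, the fragment below fits a simple scale-out model that includes a co-location interference feature, then reserves the smallest number of containers predicted to meet a runtime target. The model form and all names are illustrative assumptions, not CoBell's actual code.

```python
import numpy as np

def fit_scaleout_model(containers, interference, runtimes):
    """Least-squares fit of runtime ~ a/containers + b + c*interference."""
    X = np.column_stack([1.0 / containers, np.ones_like(containers), interference])
    coef, *_ = np.linalg.lstsq(X, runtimes, rcond=None)
    return coef  # (a, b, c)

def pick_containers(coef, interference_now, target_s, max_containers=64):
    """Smallest container count whose predicted runtime meets the target."""
    a, b, c = coef
    for n in range(1, max_containers + 1):
        if a / n + b + c * interference_now <= target_s:
            return n
    return max_containers  # target not reachable; fall back to the cap

# Toy history from previous runs of the same recurring job (hypothetical data).
containers   = np.array([2.0, 4, 8, 16, 8])
interference = np.array([0.1, 0.1, 0.2, 0.3, 0.8])    # e.g. co-located CPU load
runtimes     = np.array([900.0, 480, 270, 170, 350])  # seconds

coef = fit_scaleout_model(containers, interference, runtimes)
print("reserve", pick_containers(coef, interference_now=0.4, target_s=300.0), "containers")
```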
ISBN (print): 9781538619964
Resource management systems like YARN or Mesos enable users to share cluster infrastructures by running analytics jobs in temporarily reserved containers. These containers are typically not isolated, in order to achieve a high degree of overall resource utilization despite the often fluctuating resource usage of single analytics jobs. However, some combinations of jobs utilize the resources better and interfere less with each other when running on the same nodes than other combinations do. This paper presents an approach for improving resource utilization and job throughput when scheduling recurring data analysis jobs in shared cluster environments. Using a reinforcement learning algorithm, the scheduler continuously learns which jobs are best executed simultaneously on the cluster. Our evaluation of an implementation built on Hadoop YARN shows that this approach can increase resource utilization and decrease job runtimes. While interference between jobs can be avoided, co-locations of jobs with complementary resource usage are not yet always fully recognized. However, with a better measure of co-location goodness, our solution can be used to automatically adapt the scheduling to workloads with recurring batch jobs.
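The abstract does not spell out the reinforcement learning algorithm; a heavily simplified stand-in for the idea is an epsilon-greedy bandit over job-type pairs whose reward is some measured co-location goodness (for instance, utilization or relative speedup). Everything below, including the class and job names, is a hypothetical sketch.

```python
import random
from collections import defaultdict

class CoLocationScheduler:
    """Epsilon-greedy bandit over job pairs (illustrative, not the paper's code)."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.value = defaultdict(float)   # estimated goodness per job pair
        self.count = defaultdict(int)

    def pick_partner(self, running_job, queued_jobs):
        if random.random() < self.epsilon:               # explore
            return random.choice(queued_jobs)
        return max(queued_jobs,                          # exploit best known pair
                   key=lambda j: self.value[frozenset((running_job, j))])

    def update(self, job_a, job_b, reward):
        """Incremental mean of the observed co-location goodness."""
        key = frozenset((job_a, job_b))
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]

sched = CoLocationScheduler()
sched.update("etl", "ml-train", reward=0.8)  # measured goodness of past runs
sched.update("etl", "sql-agg", reward=0.3)
print(sched.pick_partner("etl", ["ml-train", "sql-agg"]))
```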
ISBN (print): 9781538606926
Distributed dataflow systems like MapReduce, Spark, and Flink help users analyze large datasets with a set of cluster resources. Performance modeling and runtime prediction are then used to automatically allocate resources for specific performance goals. However, the actual performance of distributed dataflow jobs can vary significantly due to factors like interference with co-located workloads, varying degrees of data locality, and failures. We address this problem with Ellis, a system that allocates an initial set of resources for a specific runtime target, yet also continuously monitors a job's progress towards the target and, if necessary, dynamically adjusts the allocation. For this, Ellis models the scale-out behavior of individual stages of distributed dataflow jobs based on previous executions. Our evaluation of Ellis with iterative Spark jobs shows that dynamic adjustments can reduce the number of constraint violations by 30.7-75.0% and the magnitude of constraint violations by 70.6-94.5%.
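A minimal sketch of the dynamic-adjustment idea, assuming per-stage runtime models of the form t_i(n) = a_i/n + b_i learned from previous executions (the abstract does not specify Ellis's actual model): after each stage, re-check whether the remaining stages still fit the runtime target at the current scale-out, and rescale if not.

```python
def predict_remaining(stage_models, n):
    """Predicted total runtime of the remaining stages at scale-out n."""
    return sum(a / n + b for a, b in stage_models)

def adjust_scaleout(stage_models, elapsed_s, target_s, current_n, max_n=64):
    """Smallest allocation expected to still meet the runtime target."""
    remaining_budget = target_s - elapsed_s
    for n in range(current_n, max_n + 1):
        if predict_remaining(stage_models, n) <= remaining_budget:
            return n
    return max_n   # best effort: the target will likely be missed anyway

# Two stages left, each modeled from previous runs as (a_i, b_i); toy numbers.
stages_left = [(1200.0, 30.0), (600.0, 20.0)]
print(adjust_scaleout(stages_left, elapsed_s=400.0, target_s=700.0, current_n=4))
```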
ISBN (print): 9781450341974
While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues behind this disconnect between research and practice: (1) low model accuracy due to the noise added to guarantee privacy, and (2) high development and runtime overhead of the private algorithms. This paper takes a first step to remedy this disconnect and proposes a private SGD algorithm that addresses both issues in an integrated manner. In contrast to the white-box approach adopted by previous work, we revisit and use the classical technique of output perturbation to devise a novel "bolt-on" approach to private SGD. While our approach trivially addresses (2), it makes (1) even more challenging. We address this challenge by providing a novel analysis of the L2-sensitivity of SGD, which allows, under the same privacy guarantees, better convergence of SGD when only a constant number of passes can be made over the data. We integrate our algorithm, as well as other state-of-the-art differentially private SGD algorithms, into Bismarck, a popular scalable SGD-based analytics system on top of an RDBMS. Extensive experiments show that our algorithm can be easily integrated, incurs virtually no overhead, scales well, and, most importantly, yields substantially better (up to 4x) test accuracy than the state-of-the-art algorithms on many real datasets.
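A minimal sketch of output perturbation as a bolt-on step, assuming a squared-loss SGD and a placeholder sensitivity constant (the paper's contribution is precisely a tighter analytical bound on this L2-sensitivity): run ordinary SGD unchanged, then add noise whose density is proportional to exp(-epsilon * ||b|| / sensitivity), sampled as a uniform direction with a Gamma-distributed norm.

```python
import numpy as np

def sgd(X, y, lr=0.05, epochs=5):
    """Plain SGD for squared loss; unchanged by the privacy step."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w -= lr * (xi @ w - yi) * xi
    return w

def l2_noise(d, sensitivity, epsilon, rng):
    """Noise with density ~ exp(-epsilon*||b||/sensitivity):
    uniform direction, Gamma(d, sensitivity/epsilon)-distributed norm."""
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=d, scale=sensitivity / epsilon) * direction

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w = sgd(X, y)
sensitivity = 0.5   # placeholder; the paper derives an analytical bound
w_private = w + l2_noise(len(w), sensitivity, epsilon=1.0, rng=rng)
print("non-private:", w, "\nprivate:    ", w_private)
```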
ISBN (print): 9781467390057
Distributed dataflow systems like Spark and Flink allow users to analyze large datasets using clusters of computers. These frameworks provide automatic program parallelization and manage distributed workers, including worker failures. Moreover, they provide high-level programming abstractions and execute programs efficiently. Yet, the programming abstractions remain textual while the dataflow model is essentially a graph of transformations, so there is a mismatch between the presented abstraction and the underlying model. One can also argue that developing dataflow programs with these textual abstractions requires needless amounts of coding and coding skill. A dedicated programming environment could instead allow constructing dataflow programs more interactively and visually. In this paper, we therefore investigate how visual programming can make the development of parallel dataflow programs more accessible. In particular, we built a prototypical visual programming environment for Flink, which we call Flision. Flision provides a graphical user interface for creating dataflow programs, a code generation engine that generates code for Flink, and seamless deployment to a connected cluster. Users of this environment can effectively create jobs by dragging, dropping, and visually connecting operator components. To evaluate the applicability of this approach, we interviewed ten potential users. Our impressions from this qualitative user testing strengthened our belief that visual programming can be a valuable tool for users of scalable data analysis tools.
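Flision's code generation engine targets Flink's actual API; purely as a toy illustration of the general canvas-to-code idea, the sketch below turns a linear list of canvas operators into a pipeline string. The operator set and the emitted calls are invented for the example, not Flision's.

```python
# Hypothetical templates for a few canvas operators; {arg} is filled from
# whatever the user typed into the operator's property field.
TEMPLATES = {
    "source": "env.read_text('{arg}')",
    "map":    ".map(lambda x: {arg})",
    "filter": ".filter(lambda x: {arg})",
    "sink":   ".write_text('{arg}')",
}

def generate(pipeline):
    """pipeline: ordered list of (operator, argument) pairs from the canvas."""
    return "".join(TEMPLATES[op].format(arg=arg) for op, arg in pipeline)

canvas = [("source", "in.txt"), ("map", "x.lower()"),
          ("filter", "'error' in x"), ("sink", "out.txt")]
print(generate(canvas))
```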
ISBN (print): 9781509052523
Distributed dataflow systems like Spark or Flink enable users to analyze large datasets. Users create programs by providing sequential user-defined functions for a set of well-defined operations, select a set of resources, and the systems automatically distribute the jobs across these resources. However, selecting resources for specific performance needs is inherently difficult, and users consequently tend to overprovision, which results in poor cluster utilization. At the same time, many important jobs are executed recurringly in production clusters. This paper presents Bell, a practical system that monitors job execution, models the scale-out behavior of jobs based on previous runs, and selects resources according to user-provided runtime targets. Bell automatically chooses between different runtime prediction models to optimally support different distributed dataflow systems. Bell is implemented as a job submission tool for YARN and, thus, works with existing cluster setups. We evaluated Bell's runtime prediction with six exemplary data analytics jobs using both Spark and Flink. We present the learned scale-out models for these jobs and evaluate the relative prediction error using cross-validation, showing that our model selection approach provides better overall performance than the individual prediction models.
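The model selection via cross-validation can be illustrated with a small sketch: compare a parametric scale-out model against a simple nonparametric one by leave-one-out error and keep whichever predicts better. Both candidate models here are assumptions for illustration, not necessarily the ones Bell uses.

```python
import numpy as np

def parametric(train_x, train_y, x):
    """runtime ~ a/x + b, fit by least squares on the training runs."""
    A = np.column_stack([1.0 / train_x, np.ones_like(train_x)])
    a, b = np.linalg.lstsq(A, train_y, rcond=None)[0]
    return a / x + b

def nearest_neighbor(train_x, train_y, x):
    """Predict the runtime of the most similar previous scale-out."""
    return train_y[np.argmin(np.abs(train_x - x))]

def loo_error(model, xs, ys):
    """Mean relative leave-one-out prediction error."""
    errs = []
    for i in range(len(xs)):
        mask = np.arange(len(xs)) != i
        errs.append(abs(model(xs[mask], ys[mask], xs[i]) - ys[i]) / ys[i])
    return np.mean(errs)

# Toy scale-out history: (containers, runtime in seconds).
xs = np.array([2.0, 4, 8, 16, 32])
ys = np.array([880.0, 460, 260, 160, 120])
best = min((parametric, nearest_neighbor), key=lambda m: loo_error(m, xs, ys))
print("selected model:", best.__name__)
```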
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format, and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering applications. The kernel k-means is an effective method for data clustering which extends the k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. This thesis defines a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization scheme. In addition, three practical methods for low-dimensional embedding that adhere to our definition of the embedding family are presented. Combining the proposed parallelization strategy with any of the three embedding methods constitutes a complete scalable and efficient MapReduce algorithm for kernel k-means. The efficiency and the scalability of the presented algorithms are demonstrated analytically and empirically.
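One well-known member of such a family of kernel-based low-dimensional embeddings is the Nystrom embedding, shown below purely as an illustration (the thesis's own embedding methods may differ): embed the data using a small set of landmark points, after which ordinary k-means, which parallelizes naturally over data partitions, can replace kernel k-means without ever forming the full kernel matrix.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian kernel matrix between the row sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_embed(X, landmarks, gamma=0.5):
    """Map X into a low-dimensional space approximating the kernel features."""
    C = rbf(X, landmarks, gamma)             # n x m cross-kernel block
    W = rbf(landmarks, landmarks, gamma)     # m x m landmark kernel
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-12, None)        # guard against tiny eigenvalues
    return C @ vecs / np.sqrt(vals)          # n x m embedding (up to rotation)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
landmarks = X[rng.choice(len(X), size=10, replace=False)]
Z = nystrom_embed(X, landmarks)
# Plain k-means (e.g. scikit-learn's) on Z now stands in for kernel k-means.
print(Z.shape)
```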
ISBN (print): 9781479915439; 9781479915460
Scalable visualization of big data is highly desired in several fields. A demonstrative example is annotated genetic data, where the DNA sequence can be better visualized using space-filling fractal curves. Though such approaches are already available in the literature, they are not applied routinely in clinical practice. The reason for this is that clinicians use well-established, but rather old, gene browsers in their work. Naturally, these browsers lack state-of-the-art programming support and also visualization features. To help with this problem, in this paper we propose extending a very widely used genome browser with scalable visualization techniques. The motivation for this work came from our clinical partner, since a good overview of the whole annotated DNA segment can lead to the recognition of currently unknown relations between genes. We explain what steps were needed to add the new visualization feature to the browser in terms of graphical elements, data extraction, and a scalable user interface.
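The paper's browser extension is not reproduced here; as an illustration of the space-filling-curve idea it builds on, the classic Hilbert-curve mapping below converts a 1-D position (such as a base-pair index) into 2-D coordinates, so that nearby sequence positions remain nearby on screen.

```python
def hilbert_d2xy(side, d):
    """Map index d on a Hilbert curve over a side x side grid to (x, y).
    side must be a power of two; standard iterative index-to-point mapping."""
    x = y = 0
    s = 1
    while s < side:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                          # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        d //= 4
        s *= 2
    return x, y

# Lay out the first 16 positions of a sequence on a 4x4 grid.
print([hilbert_d2xy(4, i) for i in range(16)])
```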