This paper focuses on parallel query optimization. We consider the operator ordering problem and introduce a new class of execution strategies called Linear-oriented Bushy Trees (LBT). Compared to the related approach of General Bushy Trees (GBT), a significant complexity reduction of the operator ordering problem can be derived theoretically and demonstrated experimentally (e.g., compared with GBTs, LBTs allow optimization-time improvements of up to 49%) without losing quality. Finally, we demonstrate that existing commercial parallel query optimizers need only minor extensions and modifications in order to handle LBTs. (C) 2000 Elsevier Science B.V. All rights reserved.
The LOGFLOW parallel Prolog system is similar to recent parallel database systems in its dataflow execution model and its ability to run on shared-nothing architectures. In this paper the abstract execution and abstract machine models of LOGFLOW are examined from a database point of view. Transformations of relational operators into the Logicflow Graph representation of Prolog programs are explained. Thus, LOGFLOW can operate as a relational database machine. (C) 2000 Published by Elsevier Science B.V. All rights reserved.
Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opportunities and challenges for developing a highly scalable and efficient distributed computation system that is easy to program and supports complex system optimization to maximize performance and reliability. In this paper, we describe a distributed computation system, Structured Computations Optimized for Parallel Execution (SCOPE), targeted for this type of massive data analysis. SCOPE combines benefits from both traditional parallel databases and MapReduce execution engines to allow easy programmability and deliver massive scalability and high performance through advanced optimization. Similar to parallel databases, the system has a SQL-like declarative scripting language with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. An optimizer is responsible for converting scripts into efficient execution plans for the distributed computation engine. A physical execution plan consists of a directed acyclic graph of vertices. Execution of the plan is orchestrated by a job manager that schedules execution on available machines and provides fault tolerance and recovery, much like MapReduce systems. SCOPE is being used daily for a variety of data analysis and data mining applications over tens of thousands of machines at Microsoft, powering Bing and other online services.
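The plan-as-a-DAG-of-vertices design described above can be illustrated with a small scheduler sketch. The Python below is only an illustration of that idea, assuming made-up names (Vertex, JobManager); it is not SCOPE's actual engine or API. A vertex runs as soon as all of its inputs have completed, which is essentially what the job manager does at cluster scale, with machine placement, fault tolerance, and recovery on top.

```python
# Illustrative sketch (not SCOPE's API): a physical plan as a DAG of vertices
# executed in dependency order by a toy job manager.
from collections import defaultdict, deque

class Vertex:
    def __init__(self, name, work):
        self.name = name
        self.work = work                      # callable: list of inputs -> output

class JobManager:
    def __init__(self):
        self.edges = defaultdict(list)        # producer name -> consumer names
        self.indeg = defaultdict(int)         # consumer name -> #unfinished inputs
        self.vertices = {}

    def add(self, v, inputs=()):
        self.vertices[v.name] = v
        for src in inputs:
            self.edges[src].append(v.name)
            self.indeg[v.name] += 1
        return v.name

    def run(self):
        results = {}
        ready = deque(n for n in self.vertices if self.indeg[n] == 0)
        while ready:                          # run a vertex once all its inputs are done
            name = ready.popleft()
            producers = [p for p in self.vertices if name in self.edges[p]]
            results[name] = self.vertices[name].work([results[p] for p in producers])
            for consumer in self.edges[name]:
                self.indeg[consumer] -= 1
                if self.indeg[consumer] == 0:
                    ready.append(consumer)
        return results

# A three-vertex plan mimicking extract -> filter -> aggregate.
jm = JobManager()
e = jm.add(Vertex("extract", lambda _: [("bing", 3), ("maps", 1), ("bing", 2)]))
f = jm.add(Vertex("filter",  lambda ins: [r for r in ins[0] if r[1] > 1]), inputs=[e])
a = jm.add(Vertex("agg",     lambda ins: sum(n for _, n in ins[0])), inputs=[f])
print(jm.run()["agg"])                        # prints 5
```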
A description and discussion of the SciDB database management system focuses on lessons learned, application areas, performance comparisons against other solutions, and additional approaches to managing data and complex analytics.
Mining association rules from large databases is very costly. We propose to develop parallel algorithms for this task on shared-memory multiprocessors (SMP). All parallel algorithms proposed for other paradigms follow the conventional level-wise approach: they need as many iterations as the length of the maximum large itemset. To make matters worse, they impose a synchronization in every iteration, which would cause serious I/O contention on a shared-memory parallel system. An adaptive asynchronous parallel mining algorithm, APM, has been proposed for SMP: all processors generate candidates dynamically and count itemset supports independently, without synchronization. Two optimization techniques have been proposed to reduce database scanning and the number of candidates. APM has been implemented on a Sun Enterprise 4000 shared-memory multiprocessor with 12 nodes. The experiments show that the optimizations are very effective and that APM has a substantial lead in performance over the other proposed level-wise algorithms.
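To make the asynchronous idea concrete, here is a hedged Python sketch of per-partition support counting, in which worker threads count candidates over their own database slices without a per-iteration barrier. It only illustrates independent, partition-local counting; it is not the APM algorithm itself (APM additionally generates candidates dynamically and applies the two optimizations mentioned above), and all names below are illustrative.

```python
# Illustrative sketch: threads count candidate-itemset supports on their own
# database partitions independently; partial counts are merged at the end.
from collections import Counter
from itertools import combinations
from threading import Thread

def count_partition(partition, candidates, out):
    """Count, for one database partition, the transactions containing each candidate."""
    local = Counter()
    for tx in partition:
        txset = set(tx)
        for cand in candidates:
            if set(cand) <= txset:
                local[cand] += 1
    out.append(local)                       # list.append is atomic in CPython

transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
partitions = [transactions[:2], transactions[2:]]          # one slice per "processor"
candidates = [tuple(c) for c in combinations("abcd", 2)]   # 2-itemset candidates

partial_counts = []
workers = [Thread(target=count_partition, args=(p, candidates, partial_counts))
           for p in partitions]
for w in workers:
    w.start()
for w in workers:
    w.join()

support = sum(partial_counts, Counter())    # merge partition-local counts
print(support.most_common(3))
```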
We have produced a suite of software tools which has proved to be highly valuable in the porting of large, complex relational database applications from conventional to parallel systems. These tools are being continually developed to extend their functionality and genericity over a range of platforms, and they provide a basis for the design, implementation, optimisation and management of any parallel database system. A considerable amount of the development has been performed using a Meiko Relational DataCache, a multi-SPARC, multi-transputer ORACLE system. In this paper we present a summary of our activities, beginning with a brief description of the Meiko system (as currently configured), followed by a more detailed account of some of the tools, and finishing with some general comments on future work.
In order to re-adjust the parallel execution of SQL queries in the case of metric estimation or discretization errors, we propose an incremental parallelization method which carries out scheduling and mapping simultaneously, in co-operation with two incremental memory allocation heuristics (ParAd: parallelism degree adjustment, and MaCRelax: mapping clues relaxation), in a dynamic multi-user context. The two incremental memory allocation heuristics are integrated in the mapping method, which attempts to avoid time-consuming multi-bucket join executions that generate numerous additional I/Os. A performance evaluation of the ParAd heuristic shows (i) significant join response time savings (from 16.11% to 35.62%) and (ii), with many complex queries, an even more significant gain in response time (from 29% to 54%). (C) 2002 Elsevier Science B.V. All rights reserved.
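A hedged sketch of what a parallelism-degree adjustment rule can look like, in the spirit of ParAd: raise the intra-operator degree of parallelism of a hash join until every node's fragment of the build relation fits in its memory budget, so the join does not degrade into a multi-bucket execution with extra I/O. The function name and the exact rule below are illustrative assumptions, not the paper's algorithm.

```python
# Illustrative heuristic (not ParAd itself): pick the smallest degree of
# parallelism whose per-node build fragment fits in the node memory budget.
import math

def adjust_parallelism_degree(build_size_mb, mem_per_node_mb,
                              max_nodes, min_degree=1):
    """Return (degree, fits_in_memory) for a parallel hash join."""
    needed = math.ceil(build_size_mb / mem_per_node_mb)
    degree = max(min_degree, needed)
    if degree > max_nodes:
        # Not enough nodes: accept a multi-bucket join on all available nodes.
        return max_nodes, False
    return degree, True

degree, in_memory = adjust_parallelism_degree(
    build_size_mb=4800, mem_per_node_mb=512, max_nodes=16)
print(degree, in_memory)   # 10 True: ten 512 MB fragments hold the 4800 MB build side
```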
We propose a new declustering scheme for allocating uniform multidimensional data among parallel disks. The scheme, aimed at reducing disk access time for range queries, is based on Golden Ratio Sequences for two dimensions and Kronecker Sequences for higher dimensions. Using exhaustive simulation, we show that, in two dimensions, the worst-case (additive) deviation of the scheme from the optimal response time for any range query is one when the number of disks M is at most 22; its worst-case deviation is two when M ≤ 94; and its worst-case deviation is four when M ≤ 550. In two dimensions, we prove that whenever M is a Fibonacci number, the average performance of the scheme is within 14 percent of the (generally unachievable) strictly optimal scheme, and its worst-case response time is within a multiplicative factor of three of the optimal response time for any query, and within a factor of 1.5 of the optimal for large queries. We also present comprehensive simulation results, on two-dimensional as well as higher-dimensional data, that compare our scheme with some recently proposed schemes in the literature and demonstrate its advantages.
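The assignment idea can be sketched as follows. This is an illustration of a golden-ratio-style declustering only: a permutation of the M disks is derived from the fractional parts of multiples of 1/φ, and each row of tiles is shifted by that permutation. It is not claimed to reproduce the exact mapping analyzed in the paper.

```python
# Illustrative golden-ratio-style declustering of a 2D tile grid over M disks.
import math

def golden_ratio_permutation(m):
    """Rank the fractional parts of i/phi, i = 0..m-1, to get a disk permutation."""
    inv_phi = 2 / (1 + math.sqrt(5))              # 1/phi ~= 0.618
    fracs = [(i * inv_phi) % 1.0 for i in range(m)]
    order = sorted(range(m), key=lambda i: fracs[i])
    rank = [0] * m
    for pos, i in enumerate(order):
        rank[i] = pos
    return rank

def disk_of(x, y, m, perm):
    """Disk for tile (x, y) on an m-disk array (illustrative assignment)."""
    return (perm[x % m] + y) % m

m = 5
perm = golden_ratio_permutation(m)
for y in range(4):                                # print a small 8x4 tile grid
    print([disk_of(x, y, m, perm) for x in range(8)])
```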
Cogset is a generic and efficient engine for reliable storage and parallel processing of distributed data sets. It supports a number of high-level programming interfaces, including a MapReduce interface compatible with Hadoop. In this paper, we present Cogset's architecture and evaluate its performance as a MapReduce engine, comparing it with Hadoop. Our results show that Cogset generally outperforms Hadoop by a significant margin. We investigate the underlying causes of this difference in performance and demonstrate some relatively minor modifications that markedly improve Hadoop's performance, closing some of the gap. Copyright (c) 2012 John Wiley & Sons, Ltd.
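For readers unfamiliar with the programming model behind the Hadoop-compatible interface, the following Python sketch shows the generic map/reduce contract on a word-count example. It is an in-process illustration of the model only, not Cogset's or Hadoop's actual API.

```python
# Generic map/reduce illustration: map each record to (key, value) pairs,
# group by key, then reduce each group.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                                      # map phase
        for key, value in map_fn(record):
            groups[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}    # reduce phase

# Word count, the canonical MapReduce example.
lines = ["parallel data processing", "parallel databases"]
counts = run_mapreduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda word, ones: sum(ones))
print(counts)    # {'parallel': 2, 'data': 1, 'processing': 1, 'databases': 1}
```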
Parallelizing I/O operations via effective declustering of data is becoming essential to scale up the performance of parallel databases and high-performance systems. Declustering has been shown to be an NP-complete problem in some contexts, and some heuristic methods have been proposed to solve it. However, most methods are not effective in several cases, such as queries with different access frequencies or data items with different sizes. In this paper, we propose a hypergraph model to formulate the declustering problem. Several interesting theoretical results are obtained by analyzing the proposed model, and the approach allows modeling a wide range of declustering problems. Furthermore, the hypergraph declustering model is used as the basis for developing new heuristic methods, including a greedy method and a hybrid declustering method. Experiments show that the proposed methods achieve better performance than several existing declustering methods.
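As an illustration of the kind of greedy heuristic such a model supports, the sketch below places each block on the disk that least increases a frequency-weighted estimate of query response time. The cost function and placement order are illustrative assumptions, not the paper's method.

```python
# Illustrative greedy declustering: queries are (frequency, block set) pairs,
# a query's cost is its most loaded disk, and blocks are placed greedily.
def greedy_decluster(blocks, queries, num_disks):
    """Return a block -> disk placement for the given query workload."""
    placement = {}
    load = [[0] * num_disks for _ in queries]    # per-query blocks on each disk

    def cost(block, disk):
        total = 0.0
        for qi, (freq, qblocks) in enumerate(queries):
            if block in qblocks:
                # Response time of query qi if the block lands on this disk.
                total += freq * max(load[qi][disk] + 1, max(load[qi]))
        return total

    # Place the most frequently queried blocks first (illustrative ordering).
    weight = {b: sum(f for f, qb in queries if b in qb) for b in blocks}
    for block in sorted(blocks, key=lambda b: -weight[b]):
        disk = min(range(num_disks), key=lambda d: cost(block, d))
        placement[block] = disk
        for qi, (_, qblocks) in enumerate(queries):
            if block in qblocks:
                load[qi][disk] += 1
    return placement

queries = [(0.6, {"b0", "b1", "b2"}), (0.4, {"b2", "b3"})]
print(greedy_decluster(["b0", "b1", "b2", "b3"], queries, num_disks=2))
```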