ISBN: 9780769539584 (print)
Distributed query processing has become essential for addressing the changing business needs of users. It aims to arrive at an optimal query processing plan for a given distributed query. This is a complex process, as the number of possible query processing plans grows rapidly with the number of sites used and relations accessed by the query. Therefore, there is a need to determine optimal query processing plans among all possible plans. The approach presented in this paper attempts to generate such optimal query processing plans using a genetic algorithm. Under this approach, query plans whose required data resides close together are considered more efficient and are therefore generated. These generated query plans would result in efficient query processing. Further, experimental results show that the approach is able to generate such optimal query processing plans in fewer generations.
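As a rough illustration of the idea in this abstract (not the paper's actual algorithm), the sketch below runs a small genetic algorithm over query plans. A plan picks one replica site per relation, and the cost function favors plans that touch fewer distinct sites, which is one simple way to read "data residing close to each other". The catalog `REPLICAS`, the cost function, the selection scheme, and all parameters are hypothetical.

```python
import random

# Hypothetical catalog: relation -> sites holding a replica (not from the paper).
REPLICAS = {
    "R1": [0, 1, 2],
    "R2": [1, 3],
    "R3": [1, 2, 4],
    "R4": [0, 1],
}
RELATIONS = sorted(REPLICAS)


def cost(plan):
    """Number of distinct sites a plan touches; fewer means the data is 'closer'."""
    return len(set(plan))


def random_plan(rng):
    """Pick one replica site per relation."""
    return tuple(rng.choice(REPLICAS[r]) for r in RELATIONS)


def crossover(a, b, rng):
    """Single-point crossover; the child stays valid because genes keep their position."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(plan, rng, rate=0.1):
    """With probability `rate`, re-draw a relation's site from its replica list."""
    return tuple(
        rng.choice(REPLICAS[r]) if rng.random() < rate else site
        for r, site in zip(RELATIONS, plan)
    )


def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_plan(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]  # truncation selection keeps the best half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return min(population, key=cost)


best = evolve()
print(best, cost(best))
```

Because the better half of each generation survives unchanged, the best cost never gets worse from one generation to the next. A realistic cost model would also weigh communication and local processing costs, which this sketch ignores.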
ISBN: 9781450342155 (print)
With the deluge of scientific big data affecting a large variety of research institutions, support for large multidimensional arrays has gained traction in the database community in the past decade. Array databases aim to cover the gap left by traditional relational database systems in the domains of large scientific data by enabling researchers to efficiently store and process their data through rich declarative query languages. Such large amounts of data need effective systems that are able to distribute the processing both at the local level, through exploitation of heterogeneous hardware, and at the network level, enabling both intra-cloud and intra-federation distribution of data and processing. In this demonstration, we aim to showcase the capabilities of rasdaman by allowing users to execute queries that combine petabyte datasets stored at two institutions on different continents.
ISBN: 9781450335317 (print)
An in-memory database cluster consists of multiple interconnected nodes with a large capacity of RAM and modern multi-core CPUs. As a conventional query processing strategy, pipelining remains a promising solution for in-memory parallel database systems, as it avoids expensive intermediate result materialization and parallelizes the data processing among nodes. However, to fully unleash the power of pipelining in a cluster with multi-core nodes, it is crucial for the query optimizer to generate good query plans with appropriate intra-node parallelism, in order to maximize CPU and network bandwidth utilization. A suboptimal plan, on the contrary, causes load imbalance in the pipelines and consequently degrades the query performance. Parallelism assignment optimization at compile time is nearly impossible, as the workload in each node is affected by numerous factors and is highly dynamic during query evaluation. To tackle this problem, we propose elastic pipelining, which makes it possible to optimize intra-node parallelism assignments in the pipelines based on the actual workload at runtime. This is achieved by adopting a new elastic iterator model and a fully optimized dynamic scheduler. The elastic iterator model extends the traditional iterator model with the ability to adjust multi-core execution dynamically. The dynamic scheduler efficiently provisions CPU cores to query execution segments in the pipelines based on lightweight measurements on the operators. Extensive experiments on real and synthetic (TPC-H) data show that our proposal achieves almost full CPU utilization on typical decision-making analytical queries, outperforming state-of-the-art open-source systems by a huge margin.
ISBN: 9781450335317 (print)
In the quest for valuable information, modern big data applications continuously monitor streams of data. These applications demand low-latency stream processing even when faced with high volume and velocity of incoming changes and the user's desire to ask complex queries. In this paper, we study low-latency incremental computation of complex SQL queries in both local and distributed streaming environments. We develop a technique for the efficient incrementalization of queries with nested aggregates for batch updates. We identify the cases in which batch processing can boost the performance of incremental view maintenance, but also demonstrate that tuple-at-a-time processing can often achieve better performance in local mode. Batch updates are essential for enabling distributed incremental view maintenance and amortizing the cost of network communication and synchronization. We show how to derive incremental programs optimized for running on large-scale processing platforms. Our implementation of distributed incremental view maintenance can process tens of millions of tuples with few-second latency using hundreds of nodes.
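To make the notion of batch-based incremental view maintenance concrete, here is a minimal sketch that maintains a single grouped sum under batches of inserts and deletes. It is not the paper's technique (which targets nested aggregates and distributed execution); the class `SumView`, the helper `preaggregate`, and the `(key, value, multiplicity)` update format are assumptions made for illustration.

```python
from collections import defaultdict


def preaggregate(batch):
    """Collapse a batch of (key, value, multiplicity) updates into one delta per key.

    multiplicity is +1 for an insert and -1 for a delete.
    """
    delta = defaultdict(int)
    for key, value, mult in batch:
        delta[key] += value * mult
    return delta


class SumView:
    """Materialized view for: SELECT key, SUM(value) FROM stream GROUP BY key."""

    def __init__(self):
        self.sums = defaultdict(int)

    def apply_batch(self, batch):
        # One view update per distinct key in the batch instead of one per tuple.
        for key, d in preaggregate(batch).items():
            self.sums[key] += d


view = SumView()
view.apply_batch([("a", 5, +1), ("b", 2, +1), ("a", 3, +1)])
view.apply_batch([("a", 5, -1), ("b", 4, +1)])
print(dict(view.sums))  # {'a': 3, 'b': 6}
```

Pre-aggregating the batch means the view is touched once per distinct key, which is the kind of amortization that matters when each view update involves network communication.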
ISBN: 9781509002238 (print)
Networked systems, such as telecom networks and cloud infrastructures, generate and hold vast amounts of configuration and operational data. The goal of this work is to make all this data available through a real-time search process named network search, which will enable new real-time management solutions. The thesis contains several contributions towards engineering a network search system. Key elements of our design are a weakly structured information model that includes spatial properties, a query language that supports location- and schema-oblivious search queries, a peer-to-peer architecture, an echo protocol for scalable query processing, and an indexing protocol for efficient routing of spatial queries. The data against which network search is performed is maintained in local real-time databases close to the data sources. The design follows a bottom-up approach in the sense that the topology for query routing is constructed from the underlying network topology. We have built a prototype of the system on a cloud testbed and developed applications that use network search functionality. Testbed measurements suggest that it is feasible to engineer a network search system that processes queries at low latency and low overhead and that can scale to 100,000 nodes. Simulation results for spatial queries show that query processing achieves response times and incurs overhead close to those of an optimal protocol, and that it remains accurate under significant churn.
Central to many applications involving moving objects is the task of processing k-nearest neighbor (k-NN) queries. Most of the existing approaches to this problem are designed for the centralized setting where query processing takes place on a single server; it is difficult, if not impossible, for them to scale to a distributed setting to handle the vast volume of data and concurrent queries that are increasingly common in those applications. To address this problem, we propose a suite of solutions that can support scalable distributed processing of k-NN queries. We first present a new index structure called Dynamic Strip Index (DSI), which can better adapt to different data distributions than existing grid indexes. Moreover, it can be naturally distributed across the cluster, therefore lending itself well to distributed processing. We further propose a distributed k-NN search (DKNN) algorithm based on DSI. DKNN avoids having an uncertain number of potentially expensive iterations, and is thus more efficient and more predictable than existing approaches. DSI and DKNN are implemented on Apache S4, an open-source platform for distributed stream processing. We perform extensive experiments to study the characteristics of DSI and DKNN, and compare them with three baseline methods. Experimental results show that our proposal scales well and significantly outperforms the alternative methods.
In this paper, we propose a new algorithm for fault-tolerant resource allocation for query processing in grid environments. To this end, we propose an initial resource allocation algorithm followed by a fault-tolerance protocol. The proposed fault-tolerance protocol is based on the passive replication of stateful operators in queries. We provide theoretical analyses of the proposed algorithms and support our analyses with simulations.
ISBN: 9781509020218 (print)
Subgraph query (via subgraph isomorphism) is a fundamental and powerful query in various real graph applications. It has recently been actively investigated for performance enhancements. However, due to the high complexity of subgraph query, hosting efficient subgraph query services has been a technically challenging task, because the owners of graph data may not always possess the IT expertise to offer such services and hence may outsource to query service providers (SPs). SPs are often equipped with high-performance computing utilities (e.g., a cloud) that offer better scalability, elasticity, and IT management. Unfortunately, as SPs may not always be trusted, security (such as the confidentiality of messages exchanged) has been recognized as one of the critical attributes of Quality of Service (QoS) [4]. This influences the willingness of both data owners and query clients to use SPs' services. Recently, research on query processing with privacy preservation has flourished, e.g., in the context of relational databases, spatial databases, and graph databases. To date, however, private subgraph query has not been studied.
This work introduces decentralized query processing techniques based on MIDAS, a novel distributed multidimensional index. In particular, MIDAS implements a distributed k-d tree, where leaves correspond to peers, and internal nodes dictate message routing. MIDAS requires that peers maintain little network information, and features mechanisms that support fault tolerance and load balancing. The proposed algorithms process point and range queries over the multidimensional indexed space in only O(log n) hops in expectation, where n is the network size. For nearest neighbor queries, two processing alternatives are discussed. The first, termed eager processing, has low latency (expected value of O(log n) hops) but may involve a large number of peers. The second, termed iterative processing, has higher latency (expected value of O(log^2 n) hops) but involves far fewer peers. A detailed experimental evaluation demonstrates that our query processing techniques outperform existing methods for settings involving real spatial data as well as in the case of high-dimensional synthetic data.
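The structure described in this abstract, a k-d tree whose leaves are peers, can be pictured with a simplified, single-process stand-in. The sketch below is not MIDAS's routing protocol (real peers forward messages to each other using local links rather than walking from a root); it only shows how a balanced k-d partition of the unit square maps a point to a peer in a number of steps logarithmic in the number of peers. The names `Node`, `build`, and `route` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes split on `axis` at `split`; leaves carry a peer id.
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    peer: Optional[int] = None


def build(lo, hi, depth, max_depth, counter):
    """Recursively halve the box [lo, hi), cycling through the axes."""
    if depth == max_depth:
        peer = counter[0]
        counter[0] += 1
        return Node(peer=peer)
    axis = depth % len(lo)
    mid = (lo[axis] + hi[axis]) / 2
    left_hi = hi[:axis] + (mid,) + hi[axis + 1:]
    right_lo = lo[:axis] + (mid,) + lo[axis + 1:]
    return Node(
        axis=axis,
        split=mid,
        left=build(lo, left_hi, depth + 1, max_depth, counter),
        right=build(right_lo, hi, depth + 1, max_depth, counter),
    )


def route(root, point):
    """Follow splits from the root to the leaf (peer) whose zone contains `point`."""
    node, hops = root, 0
    while node.peer is None:
        node = node.left if point[node.axis] < node.split else node.right
        hops += 1
    return node.peer, hops


root = build((0.0, 0.0), (1.0, 1.0), 0, 4, [0])  # 2**4 = 16 peers
print(route(root, (0.1, 0.1)))
print(route(root, (0.9, 0.9)))
```

With 16 peers the tree has depth 4, so every lookup in this toy version takes exactly 4 steps; in an unbalanced or dynamically changing tree the bound would hold only in expectation, as the abstract states.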
ISBN: 9783901882777 (print)
Information in networked systems often has spatial properties: routers, sensors, or virtual machines have coordinates in a geographical or virtual space, for instance. In this paper, we propose a peer-to-peer design for a spatial search system that processes queries, such as range or nearest-neighbor queries, on spatial information cached on nodes inside a networked system. Key to our design is a protocol that creates a distributed index of object locations and adapts to object and node churn. The index builds upon the concept of the minimum bounding rectangle to efficiently encode a large set of locations. We present a search protocol, which is based on an echo protocol and performs query routing. Simulations show the efficiency of the protocol in pruning the search space, thereby reducing the protocol overhead. For many queries, the protocol efficiency increases with the network size and approaches that of an optimal protocol for large systems. The protocol overhead depends on the network topology and is lower if neighboring nodes are spatially close. As a key difference from work on spatial databases, our design is bottom-up, which makes query routing network-aware and thus efficient in networked systems.
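The pruning effect of a minimum-bounding-rectangle index can be sketched as follows. This is an illustrative, single-process approximation rather than the paper's protocol: the spanning tree `TREE`, its node names, and the stored points are hypothetical, and the recursive call stands in for the expansion and echo phases of a real echo protocol. Each node summarizes everything below it by one rectangle, and a range query is forwarded only to children whose rectangle overlaps the query.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a non-empty point set."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True if rectangles a and b overlap (touching edges count as overlap)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(rect, p):
    """True if point p lies inside rect, borders included."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]


# Hypothetical spanning tree: node -> (local object locations, child nodes).
TREE = {
    "root": ([(5, 5)], ["a", "b"]),
    "a": ([(1, 1), (2, 3)], ["c"]),
    "b": ([(8, 8), (9, 7)], []),
    "c": ([(0, 4)], []),
}


def subtree_mbr(node):
    """MBR of all locations stored at `node` or below it (the per-node index entry)."""
    points, children = TREE[node]
    rects = [mbr(points)] + [subtree_mbr(c) for c in children]
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def range_query(node, query, visited):
    """Return matching points; descend only into children whose MBR meets the query."""
    visited.append(node)
    points, children = TREE[node]
    hits = [p for p in points if contains(query, p)]
    for child in children:
        if intersects(subtree_mbr(child), query):
            hits += range_query(child, query, visited)
    return hits


visited = []
print(range_query("root", (7, 6, 10, 9), visited), visited)
```

For this query only `root` and `b` are visited; the subtree under `a` is skipped because its rectangle does not overlap the query. In a real system each node would cache its children's rectangles and update them under churn instead of recomputing them per query.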