Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to study the processing and querying of range data in the context of big data. The...
详细信息
ISBN:
(纸本)9781450337168
Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to study the processing and querying of range data in the context of big data. The main challenge relies in the nature of the traditional index structures e.g., B-Tree and R-Tree, being centralized by nature, and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload aware, i.e., results in partitioning layouts that minimize the queryprocessing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.
We introduce a simple data model to process non-relational data for relational operations, and SHC (Apache Spark - Apache HBase Connector), an implementation of this model in the cluster computing framework, Spark. SH...
详细信息
ISBN:
(纸本)9781538655207
We introduce a simple data model to process non-relational data for relational operations, and SHC (Apache Spark - Apache HBase Connector), an implementation of this model in the cluster computing framework, Spark. SHC leverages optimization techniques of relational data processing over the distributed and column-oriented key-value store (i.e., HBase). Compared to existing systems, SHC makes two major contributions. At first, SHC offers a much tighter integration between optimizations of relational data processing and non-relational data store, through a plug-in implementation that integrates with Spark SQL, a distributed in-memory computing engine for relational data. The design makes the system maintenance relatively easy, and enables users to perform complex data analytics on top of key-value store. Second, SHC leverages the Spark SQL Catalyst engine for high performance queryoptimizations and processing, e.g., data partitions pruning, columns pruning, predicates pushdown and data locality. SHC has been deployed and used in multiple production environments with hundreds of nodes, and provides OLAP queryprocessing on petabytes of data efficiently.
A federated database system (FDBS) is a collection of cooperating database systems that are autonomous and possibly heterogeneous. In this paper, we define a reference architecture for distributed database management ...
详细信息
A federated database system (FDBS) is a collection of cooperating database systems that are autonomous and possibly heterogeneous. In this paper, we define a reference architecture for distributed database management systems from system and schema viewpoints and show how various FDBS architectures can be developed. We then define a methodology for developing one of the popular architectures of an FDBS. Finally, we discuss critical issues related to developing and operating an FDBS.
The purpose of this talk is to provide a comprehensive state of the art concerning the evolution of data management systems from uni-processor systems to large scale distributed systems. We focus our study on the quer...
详细信息
ISBN:
(纸本)9781450313070
The purpose of this talk is to provide a comprehensive state of the art concerning the evolution of data management systems from uni-processor systems to large scale distributed systems. We focus our study on the query processing and optimization methods. For each environment, we recall their motivations and point out main characteristics of proposed methods, especially, the nature of decision-making (centralized or decentralized control for high level of scalability), adaptive level (intra-operator and/or inter-operator), impact of parallelism (partitioned and pipelined parallelism) and dynamicity (e.g. elasticity) of execution models.
暂无评论