SQL query processing for analytics over Hadoop data has recently gained significant traction. Among the many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar storage and promote different file formats: ORC and Parquet. In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H-like benchmark and two TPC-DS-inspired workloads. We also closely study the I/O efficiency of their columnar formats using a set of micro-benchmarks. Our results show that Impala is 3.3X to 4.4X faster than Hive on MapReduce and 2.1X to 2.8X faster than Hive on Tez for the overall TPC-H experiments. Impala is also 8.2X to 10X faster than Hive on MapReduce and about 4.3X faster than Hive on Tez for the TPC-DS-inspired experiments. Through detailed analysis of the experimental results, we identify the reasons for this performance gap and examine the strengths and limitations of each system.
ISBN: 9781424417513 (print)
Due to the characteristics of P2P systems, SQL query processing in these systems is more complex than in traditional distributed DBMSs. In this context, the semantic and structural heterogeneity of local schemas prevents peers from exchanging their data in a comprehensive way. Schema heterogeneity can lead to incorrect answers to the localization query. Furthermore, in the optimization phase, the lack of information and the obsolete statistics found in local catalogs make the execution plan suboptimal. In this paper, we propose a new approach to SQL query processing in P2P environments. The main features of the proposed approach are: (i) avoiding any centralized structure among the peers participating in the system; (ii) integrating a domain ontology, which ensures comprehensive data exchange, into the Chord protocol, which guarantees efficient location of data sources; and (iii) extending the localization phase so that it obtains the information needed in the optimization phase. This information is more reliable than the statistics found in the local catalogs. We describe our approach in detail with examples.