ISBN (print): 9781450396677
Research in atmospheric physics, meteorology, and weather prediction requires the processing of very large multi-dimensional observational or modeled datasets on a daily basis. One of the numerous existing array engines looks like the natural choice for this task. Interestingly, the actual data analysis situation in the community looks surprisingly different: researchers often process their data manually using hand-written Python or Julia scripts that directly operate on the raw data files. This results in poor performance due to a lack of data-driven optimizations, as well as poor scalability due to being restricted to a single physical machine. Reasons for this trend lie in the high complexity and upfront effort associated with any specialized system: distributed large-scale engines must be set up carefully, and data must be converted/transferred into the proprietary representation of the system. The users, who are typically not computer scientists or data management experts, must adopt and use a specialized multi-dimensional query language to formulate their analytical tasks. As a counter-measure, in this work, we present Northlight, a query processing engine for atmospheric datasets that is (a) easy to adopt for the Earth science community while (b) providing domain-specific automatic query optimization. Northlight is built on top of the established Spark SQL dataflow engine and connects to atmospheric datasets stored in multi-dimensional NetCDF files. As a consequence, it becomes possible to process these datasets simply via conventional SQL, which is sufficient for a large variety of analysis tasks in the community. At the same time, Northlight provides automatic query optimization specifically tailored towards the processing of observational datasets. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms a comparable pipeline by up to a factor of 6.
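As a rough illustration of the workflow the abstract targets (not Northlight's actual API, which the abstract does not expose), the following PySpark sketch flattens one NetCDF variable into a table and queries it with conventional SQL. The file name, variable name, and threshold are hypothetical assumptions.

```python
# Minimal sketch, assuming xarray and PySpark are available: load a NetCDF
# variable, flatten it to a (time, lat, lon, value) table, and filter via SQL.
import xarray as xr
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("netcdf-sql-sketch").getOrCreate()

# Hypothetical observational dataset; "temperature" is an assumed variable name.
ds = xr.open_dataset("observations.nc")
df_pd = ds["temperature"].to_dataframe().reset_index()

# Hand the flattened table to Spark and query it with plain SQL.
df = spark.createDataFrame(df_pd)
df.createOrReplaceTempView("temperature")
spark.sql("""
    SELECT time, lat, lon, temperature
    FROM temperature
    WHERE temperature > 300.0   -- selective predicate (Kelvin), assumed
""").show()
```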
ISBN (digital): 9783031334375
ISBN (print): 9783031334368; 9783031334375
A method for query optimization is presented by utilizing Spark SQL, a module of Apache Spark that integrates relational data processing. The goal of this paper is to explore NoSQL databases and their effective usage in conjunction with distributed environments to optimize query execution time, in order to accommodate complex user demands in a cloud computing setting that necessitate the real-time generation of dynamic pages and the provision of dynamic information. In this work, we investigate query optimization using various query execution paths by combining MongoDB and Spark SQL, aiming to reduce the average query execution time. We achieve this goal by improving the query execution time through a sequence of query execution path scenarios that split the initial query into sub-queries between MongoDB and Spark SQL, along with the use of a mediator between Apache Spark and MongoDB. This mediator transfers either the entire database from MongoDB to Spark, or a subset of the results for those sub-queries executed in MongoDB. Our experimental results with eight different query execution path scenarios and six different database sizes demonstrate the clear superiority and scalability of a specific scenario.
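A hedged sketch of one possible execution-path split along the lines the abstract describes (not the paper's exact mediator): a selective sub-query is executed in MongoDB via pymongo, and only its result subset is transferred to Spark, where the remaining aggregation runs as Spark SQL. Database, collection, and field names are hypothetical.

```python
# Minimal sketch, assuming a local MongoDB instance and PySpark.
from pymongo import MongoClient
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("mongo-spark-split").getOrCreate()

# Sub-query executed in MongoDB: only documents matching the predicate
# are fetched and transferred (hypothetical "shop.orders" collection).
client = MongoClient("mongodb://localhost:27017")
subset = list(client["shop"]["orders"].find(
    {"amount": {"$gt": 100}},                    # pushed-down predicate
    {"_id": 0, "customer": 1, "amount": 1}))     # projection

# Remaining sub-query executed in Spark SQL on the transferred subset.
orders = spark.createDataFrame([Row(**doc) for doc in subset])
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").show()
```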
ISBN (print): 9781665495523
Retail analytics helps a company gain a deeper understanding of customer demand, making shopping more relevant, personalized, and convenient, and boosting sales through optimal pricing. This paper aims to demonstrate retail analytics through a prototype that uses big data technologies. Using these technologies, the raw data is stored, analyzed, and visualized to obtain valuable decision-making insights. The project objective is to help companies derive retail analytics from which they can make decisions to anticipate the effects of Covid-19. The design for the system includes the Hadoop Distributed File System (HDFS), Apache Pig, Apache Hive, Spark SQL, Spark MLlib, and Apache Zeppelin. The prototype uses a dataset that contains information on transactions in the United Kingdom. It therefore does not contain Covid-19-specific retail data, but it helps answer the relevant questions. The dataset is used to investigate revenue aggregated by country for the top 5 countries, daily sales activity, hourly sales activity, basket size distribution, the top 20 items sold by frequency, and market basket analysis. This paper can be used to produce a production possibility curve, reduce shortages, avoid surpluses, illustrate demand and supply curves, and detect current economic conditions. All of these would help decision-makers develop strategies to anticipate the impacts of Covid-19.
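To make one of the listed analyses concrete, here is a hedged Spark SQL sketch for "revenue aggregated by country for the top 5 countries". The HDFS path and the column names (Country, Quantity, UnitPrice) follow the common UK online-retail dataset layout and are assumptions, not details confirmed by the paper.

```python
# Minimal sketch, assuming a CSV transactions file on HDFS with
# Country, Quantity, and UnitPrice columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retail-analytics-sketch").getOrCreate()

tx = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("hdfs:///data/retail/transactions.csv"))   # hypothetical path
tx.createOrReplaceTempView("transactions")

# Revenue per country, top 5 countries by revenue.
spark.sql("""
    SELECT Country, ROUND(SUM(Quantity * UnitPrice), 2) AS revenue
    FROM transactions
    GROUP BY Country
    ORDER BY revenue DESC
    LIMIT 5
""").show()
```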
Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of the processing engine and the file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and Spark SQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with Spark SQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.
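As a minimal sketch of how the compared storage variants can be produced from Spark (the benchmark itself, BigBench on Hive and Spark SQL, is not reproduced here), the same DataFrame is written once as ORC with ZLIB compression and once as Parquet with Snappy. The input and output paths are hypothetical.

```python
# Minimal sketch, assuming PySpark and an existing input table on HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fileformat-sketch").getOrCreate()
df = spark.read.parquet("hdfs:///bigbench/store_sales")   # hypothetical input

# Same data, two file-format/compression variants under comparison.
df.write.mode("overwrite").option("compression", "zlib") \
    .orc("hdfs:///bigbench/store_sales_orc_zlib")
df.write.mode("overwrite").option("compression", "snappy") \
    .parquet("hdfs:///bigbench/store_sales_parquet_snappy")
```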