ISBN: 9781538642733 (print)
A recommender system is software that makes suggestions to users by predicting their interests from their previous usage data in the shortest possible time. Current recommender systems are built with complex techniques such as collaborative filtering and content-based filtering, but a similar system can be built by running complex queries with different query tools. The performance of these query tools depends on factors such as data size, the file format of the dataset, and aggregate search. In this paper, we compare four query tools (Hive, Impala, Spark SQL, and MySQL) for designing a fast and efficient recommender system. The tools are analyzed by comparing the execution time of complex queries on data stored in different file formats: text, CSV, Avro, Parquet, RC, and ORC. The results indicate that a fast recommender system can be built using a query tool such as Impala on a dataset saved in the Avro file format.
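As a rough illustration of how a recommendation can be expressed as a single aggregate query, the sketch below uses Python's built-in sqlite3 module as a stand-in engine; it is not one of the four tools compared in the paper. The `ratings` table, its columns, the sample rows, and the rating threshold of 4 are all hypothetical. The query is a simple co-occurrence join ("users who liked this item also liked"), and its execution time is measured with `time.perf_counter`, in the spirit of the execution-time comparison the abstract describes.

```python
import sqlite3
import time

# Hypothetical sample data: (user_id, item, rating). Not taken from the paper.
RATINGS = [
    (1, "A", 5), (1, "B", 4), (1, "C", 5),
    (2, "A", 4), (2, "B", 5),
    (3, "A", 5), (3, "C", 4), (3, "D", 5),
    (4, "B", 4), (4, "D", 3),
]

# Co-occurrence query: items rated >= 4 by users who also rated the seed
# item >= 4, ranked by the number of such users (ties broken by item name).
RECOMMEND_SQL = """
SELECT other.item, COUNT(*) AS supporters
FROM ratings AS seed
JOIN ratings AS other
  ON other.user_id = seed.user_id AND other.item <> seed.item
WHERE seed.item = ? AND seed.rating >= 4 AND other.rating >= 4
GROUP BY other.item
ORDER BY supporters DESC, other.item ASC
LIMIT ?
"""

def build_db(rows):
    """Create an in-memory database and load the ratings rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, item TEXT, rating INTEGER)"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def recommend(conn, seed_item, limit=3):
    """Run the recommendation query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(RECOMMEND_SQL, (seed_item, limit)).fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed

conn = build_db(RATINGS)
recommendations, seconds = recommend(conn, "A")
print(recommendations, f"{seconds:.6f}s")
```

The same SQL shape could in principle be submitted to another engine, with only the connection and storage format changing; timings from this toy in-memory database say nothing about the relative speed of the tools in the paper.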
ISBN: 9781783550500 (digital)
ISBN: 9781785280849 (print)
Harness the power of Scala to program Spark and analyze tonnes of data in the blink of an eye!

About This Book
Learn Scala's sophisticated type system that combines functional programming and object-oriented concepts
Work on a wide array of applications, from simple batch jobs to stream processing and machine learning
Explore the most common as well as some complex use cases to perform large-scale data analysis with Spark

Who This Book Is For
Anyone who wishes to learn how to perform data analysis by harnessing the power of Spark will find this book extremely useful. No knowledge of Spark or Scala is assumed, although prior programming experience (especially with other JVM languages) will help you pick up concepts more quickly.

What You Will Learn
Understand the object-oriented and functional programming concepts of Scala
Gain an in-depth understanding of the Scala collection APIs
Work with RDD and DataFrame to learn Spark's core abstractions
Analyze structured and unstructured data using Spark SQL and GraphX
Develop scalable and fault-tolerant streaming applications using Spark Structured Streaming
Learn machine-learning best practices for classification, regression, dimensionality reduction, and recommendation systems to build predictive models with widely used algorithms in Spark MLlib and ML
Build clustering models to cluster a vast amount of data
Understand tuning, debugging, and monitoring of Spark applications
Deploy Spark applications on real clusters in Standalone, Mesos, and YARN

In Detail
Scala has seen wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. Thus, if you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spa
ISBN: 9781479986460 (print)
Fast querying of big data is important for mining valuable information and improving system performance. To this end, research institutions and internet companies have developed three types of script-based query tools: Hive, based on MapReduce; Spark SQL, based on RDDs; and Impala, based on a distributed query engine. In this paper, we compare these three query tools in several ways. First, we analyze the impact of file format on query time and conclude that compression reduces the amount of data and thereby improves query time. RCFile compressed with Snappy is the best choice for Hive, and Parquet is the best choice for Impala. Furthermore, Impala has the fastest query speed compared with Hive and Spark SQL. Second, we discuss the impact of file format on CPU and memory usage. Impala with Parquet consumes the least CPU and memory and shows good performance, so we evaluate Impala with Parquet further. We then find that Parquet files generated by different query tools perform differently. Finally, we find that Impala is fastest when querying Parquet files created by Spark SQL. Consequently, Impala is the more suitable choice for fast queries.
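The claim that compression reduces the amount of data a query has to read can be illustrated with a minimal sketch. It assumes hypothetical log-like rows and uses zlib from the Python standard library as a stand-in for Snappy, which is not in the standard library. It only compares byte counts for one CSV payload and does not reproduce the paper's query-time measurements.

```python
import csv
import io
import zlib

# Hypothetical, repetitive log-like rows: (id, user, event). Not from the paper.
rows = [
    (i, f"user{i % 50}", "click" if i % 3 else "view")
    for i in range(10_000)
]

# Serialize the rows as CSV text, then encode to bytes.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
raw = buf.getvalue().encode("utf-8")

# Compress with zlib (assumption: stand-in codec, not Snappy).
compressed = zlib.compress(raw)

ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.2f}")
```

Fewer bytes on disk mean less I/O per scan, at the cost of CPU time spent decompressing; how that trade-off plays out depends on the codec, the file format, and the engine.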