Search Results - Inner Mongolia University Library

科学技术创新 (Scientific and Technological Innovation), 2021, No. 15, pp. 82-83

Authors: 芦成刚 (Lu Chenggang), 王桂荣 (Wang Guirong); College of Engineering, Yanbian University, Yanji, Jilin 133002

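Analyzing the web logs generated as users browse pages can reveal their browsing habits, so that a website can be improved in a targeted way and give users a better experience. This paper sets up several virtual machines to analyze web logs offline: Flume collects the logs, the Hadoop file system stores them, and Spark SQL performs the offline analysis, producing the statistics each business requirement calls for.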
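The offline statistics step described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for Spark SQL, an in-memory list stands in for logs collected by Flume and stored in the Hadoop file system, and the log format, the `parse_line` helper and the sample lines are all hypothetical rather than taken from the paper.

```python
import sqlite3

# Hypothetical access-log lines in the form "ip timestamp url status".
SAMPLE_LOG = [
    "10.0.0.1 2021-05-01T10:00:00 /index.html 200",
    "10.0.0.2 2021-05-01T10:00:05 /index.html 200",
    "10.0.0.1 2021-05-01T10:01:00 /about.html 200",
    "10.0.0.3 2021-05-01T10:02:00 /missing 404",
    "malformed line",
]


def parse_line(line):
    """Split one log line into (ip, ts, url, status); return None if malformed."""
    parts = line.split()
    if len(parts) != 4 or not parts[3].isdigit():
        return None
    return parts[0], parts[1], parts[2], int(parts[3])


def page_views(lines):
    """Count successful (status 200) requests per URL, most visited first."""
    rows = [r for r in map(parse_line, lines) if r is not None]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (ip TEXT, ts TEXT, url TEXT, status INTEGER)")
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT url, COUNT(*) AS views FROM logs "
        "WHERE status = 200 GROUP BY url ORDER BY views DESC, url"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for url, views in page_views(SAMPLE_LOG):
        print(url, views)
```

The same `GROUP BY` query could in principle be submitted to a SQL-on-Hadoop engine once the logs are registered as a table; only the storage and execution layers would differ.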

Keywords: web logs; Flume; Hadoop file system; sparksql

Performance Analysis of RDBMS and Hadoop Components with their File Formats for the development of Recommender Systems

3rd International Conference for Convergence in Technology (I2CT)

Authors: Gupta, Anchal; Saxena, Merry; Gill, Rupali (Chitkara Univ, Comp Sci, Rajpura, Punjab, India)

ISBN: (print) 9781538642733

A recommender system is software that makes suggestions to users by prediction from their previous usage data in the shortest possible time. Present recommender systems are designed using complex techniques such as collaborative filtering and content-based filtering, but a similar system can be built by running complex queries with different query tools. The performance of these query tools depends on factors such as data size, the file format of the dataset and aggregate search. In this paper, we compare four query tools, Hive, Impala, Spark SQL and MySQL, in order to design a fast and efficient recommender system. The tools are analyzed by comparing the execution time of complex queries on data stored in different file formats: text, CSV, AVRO, PARQUET, RC and ORC. The results indicate that a fast recommender system can be built using a query tool like Impala on a dataset saved in the AVRO file format.

Keywords: Impala; MySQL; Hive; Hadoop; HDFS; Big Data; Spark; sparksql; ORC; RC; Avro; CSV; Parquet; Recommender System

基于分布式架构的智能学术大数据存储与挖掘 (Intelligent Storage and Mining of Academic Big Data Based on a Distributed Architecture)

Author: 罗希意 (Luo Xiyi), Shanghai Jiao Tong University

Degree: Master's

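Scientific research is a strategic pillar for raising social productivity and overall national strength. Worldwide, millions of scholarly publications are produced every year across computer science, basic science, medicine, economics, sociology and other disciplines, and the volume is growing explosively. With the rapid development and spread of the Internet, disseminating and sharing this literature has become very easy, ushering in the era of academic big data. Storing and mining such vast academic resources intelligently is an especially important task, one that draws on three areas of computer science: database systems, distributed computing and machine learning. This project takes the domestic academic search system AceMap (also known as PaperBook) as its subject. It designs relational tables to store academic entities and their logical relationships, proposes two SQL query optimizations for the system's performance bottlenecks (one for a traditional relational database, one for a distributed environment), and finally explores applying a distributed machine learning framework to the AceMap system. The main contributions of this thesis are:
· Window Functions (Partitioning, Ordering, Framing) are applied to optimize the large number of analytical SQL queries in the AceMap system. Experiments show that the optimization improves system performance to some extent, reducing query execution time by up to 18.6%.
· Part of the academic big data is migrated and synchronized to the Hadoop Distributed File System, and the SQL-on-Hadoop framework Spark SQL is used to run complex queries. The core parameters of the Spark cluster (those related to Spark executors) are tuned to the data size and the structure of the distributed cluster. Experiments show that the optimization greatly improves system performance, reducing query execution time by up to 93.9%.
· The distributed machine learning framework Spark MLlib is applied to mine academic topics, extending and improving the knowledge discovery capability of the AceMap system.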
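The window-function mechanism named in the first contribution (partitioning, ordering, framing) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the AceMap database or Spark SQL, and the `papers` table, its columns and its rows are hypothetical. It also assumes the bundled SQLite is version 3.25 or later, the first release with window functions.

```python
import sqlite3

# Hypothetical data: (author, year, citations). Not taken from AceMap.
ROWS = [
    ("alice", 2015, 10),
    ("alice", 2016, 5),
    ("alice", 2017, 20),
    ("bob", 2016, 7),
    ("bob", 2017, 3),
]


def running_citations(rows):
    """Return (author, year, running citation total) for each author."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (author TEXT, year INTEGER, citations INTEGER)")
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", rows)
    query = """
        SELECT author, year,
               SUM(citations) OVER (
                   PARTITION BY author                               -- partitioning
                   ORDER BY year                                     -- ordering
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- framing
               ) AS running_total
        FROM papers
        ORDER BY author, year
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in running_citations(ROWS):
        print(row)
```

A window computes the running total in one pass over each partition, where an equivalent formulation would need a self-join or a correlated subquery per row. The 18.6% figure above is the thesis's own measurement and is not reproduced by this sketch.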

Keywords: relational database; Structured Query Language (SQL); window functions; SQL-on-Hadoop; sparksql; machine learning

Scala and Spark for Big Data Analytics: Explore the concepts of functional programming, data streaming, and machine learning

2017

Authors: Md. Rezaul Karim, Sridhar Alla

ISBN: (digital) 9781783550500

ISBN: (print) 9781785280849

Harness the power of Scala to program Spark and analyze tonnes of data in the blink of an eye!

About This Book
· Learn Scala's sophisticated type system that combines functional programming and object-oriented concepts
· Work on a wide array of applications, from simple batch jobs to stream processing and machine learning
· Explore the most common as well as some complex use-cases to perform large-scale data analysis with Spark

Who This Book Is For
Anyone who wishes to learn how to perform data analysis by harnessing the power of Spark will find this book extremely useful. No knowledge of Spark or Scala is assumed, although prior programming experience (especially with other JVM languages) will be useful to pick up concepts quicker.

What You Will Learn
· Understand object-oriented & functional programming concepts of Scala
· In-depth understanding of Scala collection APIs
· Work with RDD and DataFrame to learn Spark's core abstractions
· Analysing structured and unstructured data using Spark SQL and GraphX
· Scalable and fault-tolerant streaming application development using Spark structured streaming
· Learn machine-learning best practices for classification, regression, dimensionality reduction, and recommendation system to build predictive models with widely used algorithms in Spark MLlib & ML
· Build clustering models to cluster a vast amount of data
· Understand tuning, debugging, and monitoring Spark applications
· Deploy Spark applications on real clusters in Standalone, Mesos, and YARN

In Detail
Scala has been observing wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is being used widely in production. Thus, if you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spa

Keywords: Scala; Spark; Apache Spark; machine learning; Spark 2.0; MLlib; GraphX; Neo4j; Akka; Bayesian analysis; k-means; sparksql; HDFS; RDD; Spark REPL

基于脚本语言的网络流量分析与优化 (Network Traffic Analysis and Optimization Based on Scripting Languages)

Author: 李晓鹏 (Li Xiaopeng), Beijing University of Posts and Telecommunications

Degree: Master's

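In recent years, with the rapid growth of the Internet, monitoring and analyzing network traffic has become increasingly important. Storing, computing on and analyzing the resulting massive data has become a major problem, and network traffic analysis has gradually moved from single machines to Hadoop distributed systems. To make the work easier for data analysts, Hive and Pig were developed on top of traditional MapReduce jobs. As demand grew for fast and real-time data analysis, scripting languages with architectures different from traditional MapReduce, such as Spark SQL and Impala, appeared one after another. However, there has been little research on optimizing and comparing the performance of these different types of big data scripting languages, so the advantages of distributed systems for network traffic analysis cannot be fully exploited. This thesis therefore optimizes and compares three different types of scripting language: Hive and Pig, Spark SQL, and Impala.

The thesis first introduces the research background and the state of research in related fields. It then describes the current state of network traffic analysis, typical frameworks for network traffic analysis on distributed systems, and the reasons for using distributed systems for data analysis. Next, it presents the MapReduce computing framework and the architectures of Hive and Pig from the source-code perspective, optimizes Hive and Pig in several common respects (merging small files, compressing intermediate output, and join optimization strategies), and analyzes and optimizes network traffic data. It then analyzes the architectures of Spark and Spark SQL and the advantages of the Spark computing model over MapReduce, and optimizes Spark SQL from the memory-management perspective, for example through caching, StorageLevel and data serialization. From the perspective of file storage and file formats, it compares the strengths, weaknesses and suitable scenarios of several common file formats (SequenceFile, RCFile and Parquet) and compression methods (Gzip, Bzip, Snappy, Lzo), and analyzes the similarities and differences among these compression formats. Finally, it builds a CDH5-based distributed system for network traffic, selects 7 common network traffic analysis requirements, constructs a mathematical model, and comprehensively compares these three classic big data tools along three dimensions: analysis tool, file format and compression method.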
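As a rough illustration of the compression trade-off discussed above, the sketch below compares Gzip and Bzip2 on synthetic flow records using only Python's standard library. Snappy and Lzo are left out because they need third-party packages, and the record layout and the `make_records` helper are hypothetical, not the thesis's dataset or method.

```python
import bz2
import gzip
import time


def make_records(n):
    """Build n synthetic CSV flow records (hypothetical layout)."""
    lines = [
        f"10.0.{i % 256}.{(i * 7) % 256},192.168.1.{i % 50},{80 + i % 3},{(i * 37) % 1500}"
        for i in range(n)
    ]
    return "\n".join(lines).encode("utf-8")


def compare(data):
    """Return {codec: (compressed_size_in_bytes, seconds)} for gzip and bz2."""
    results = {}
    for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    raw = make_records(20000)
    print("raw bytes:", len(raw))
    for codec, (size, secs) in compare(raw).items():
        print(f"{codec}: {size} bytes, ratio {size / len(raw):.2f}, {secs:.4f}s")
```

Sizes and timings depend on the data and the machine, so the printed numbers say nothing about the thesis's results; the point is only that size and speed are measured together when choosing a codec.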

Keywords: network traffic analysis; performance optimization; Hadoop; sparksql; Impala; file format; compression

Performance Comparison of Hive Impala and Spark SQL

7th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)

Authors: Li, Xiaopeng; Zhou, Wenli (Beijing Univ Posts & Telecommun, Beijing Key Lab Network Syst Architecture & Conve, Beijing 100088, Peoples R China)

ISBN: (print) 9781479986460

Fast querying of big data is important for mining valuable information and improving system performance. To this end, research institutions and internet companies have developed three types of script query tools: Hive, based on MapReduce; Spark SQL, based on RDDs; and Impala, based on a distributed query engine. In this paper, we compare the three types of query tools in several ways. First, we analyze the impact of file format on query time and conclude that compression reduces the amount of data and thereby improves query time. RCFile compressed with Snappy is the best choice for Hive, and Parquet is the best choice for Impala. Moreover, Impala has the fastest query speed compared with Hive and Spark SQL. Second, we discuss the impact of file format on CPU and memory. Impala with Parquet consumes the least CPU and memory and shows good performance, so we evaluate Impala with Parquet further. We then find that Parquet files generated by different query tools perform differently. Finally, we find that Impala is fastest when querying Parquet files created by Spark SQL. Consequently, Impala is the more suitable tool for fast queries.

Keywords: Hive; Impala; style; sparksql; compression; file format
