Apache Spark provides many high-speed computing modules for cluster systems; among them, Spark SQL is responsible for efficient query algorithms over distributed databases. In a distributed database, most operations involve exchanging data between different nodes of the distributed system, a costly and time-consuming process. Shuffle hash join is a well-known algorithm for evaluating join queries, but we find that it causes unnecessary data exchange between nodes and can lead to an unbalanced computational load. We propose an optimized version of shuffle hash join for evaluating semijoin queries, named RDTS (Reducing Data Transfer for Semijoin). It not only reduces unnecessary data exchange between nodes but also ensures that the computational load is balanced across nodes. We implement RDTS on Spark in Scala and compare it with the original algorithm. Moreover, our algorithm can easily be extended to evaluate multiple semijoin queries.
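The operation RDTS targets, a semijoin, can be expressed directly against Spark SQL's DataFrame API. A minimal PySpark sketch of a left semi join (the abstract's implementation is in Scala; the tables and data here are hypothetical, and this shows the baseline operation, not RDTS itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("semijoin-sketch").getOrCreate()

# Hypothetical tables; in a real cluster these would be large,
# partitioned datasets.
orders = spark.createDataFrame(
    [(1, 100), (2, 200), (3, 100)], ["order_id", "cust_id"])
customers = spark.createDataFrame(
    [(100, "alice"), (200, "bob")], ["cust_id", "name"])

# A left semi join keeps only the rows of `orders` whose cust_id has a
# match in `customers`. Under a shuffle hash join, both sides are
# redistributed across nodes by the join key, which is the data
# exchange the abstract's RDTS algorithm aims to reduce.
result = orders.join(customers, on="cust_id", how="left_semi")
result.show()
```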
Details
ISBN (digital): 9781788834254
ISBN (print): 9781788835367
Combine the power of Apache Spark and Python to build effective big data applications.

Key Features
Perform effective data processing, machine learning, and analytics using PySpark
Overcome challenges in developing and deploying Spark solutions using Python
Explore recipes for efficiently combining Python and Apache Spark to process data

Book Description
Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem. You'll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You'll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you'll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You'll then move on to using ML and MLlib to solve problems related to the machine learning capabilities of PySpark, and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will be able to use the Python API for Apache Spark to solve problems associated with building data-intensive applications.

What you will learn
Configure a local instance of PySpark in a virtual environment
Install and configure Jupyter in local and multi-node environments
Create DataFrames from JSON and a dictionary using ***
Explore regression and clustering models available in the ML module
Use DataFrames to transform data used for modeling
Connect to PubNub and perform aggregations on streams

Who this book is for
The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough…
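As a hedged taste of the recipe topics listed above, here is a minimal PySpark sketch of creating DataFrames from JSON and from a Python dictionary (the file name and data are placeholders, not examples from the book):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cookbook-sketch").getOrCreate()

# From a JSON file; Spark expects one JSON object per line by default.
# "books.json" is a placeholder path.
df_file = spark.read.json("books.json")

# From a Python dictionary, via an in-memory list of rows.
rows = [{"title": "PySpark Cookbook", "year": 2018}]
df_dict = spark.createDataFrame(rows)
df_dict.show()
```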
Details
Customer Relationship Management (CRM) is a systematic way of working with current and prospective customers to manage long-term relationships and interactions between the company and its customers. Recently, Big Data has become a buzzword: it refers to huge data repositories, collecting information from online and offline sources, that are hard to process with traditional data processing tools and techniques. The presented research work explores the potential of Big Data to create, optimise and transform an insightful customer relationship management system by analysing large datasets to enhance customer life cycle profitability. In this research work, the "Book Crossing" dataset is used for Big Data processing and for analysing the execution time of simple and complex SQL queries. The research analyses the impact of data size on query execution time for one of the most widely used Big Data frameworks, Apache Spark, a recently developed in-memory Big Data processing framework whose Spark SQL module supports efficient SQL query execution. It was found that Apache Spark gives better results on large datasets than on small ones, and fares better than Hadoop, another widely used Big Data framework (based on qualitative analysis).
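The execution-time measurements described above can be reproduced in outline with a few lines of PySpark. A minimal sketch, assuming the Book-Crossing ratings table is available as a semicolon-separated CSV (the path, separator, and column names are assumptions, not the paper's actual setup):

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-timing").getOrCreate()

# Assumed CSV export of the Book-Crossing ratings table.
ratings = spark.read.csv("BX-Book-Ratings.csv", sep=";", header=True)
ratings.createOrReplaceTempView("ratings")

start = time.time()
# A simple aggregate query; collect() forces full execution, so the
# elapsed time covers the whole job rather than just plan construction.
spark.sql("SELECT ISBN, COUNT(*) AS n FROM ratings GROUP BY ISBN").collect()
print(f"query took {time.time() - start:.2f}s")
```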
With the ongoing digitization of maritime administration and waterway management, massive volumes of maritime and waterway data have accumulated. How to apply big data technology to process and analyse these data, safeguard the navigational safety of ships in waterways, and improve the efficiency of maritime supervision and waterway maintenance is currently a research hotspot in the navigation field. Addressing the safety of ships navigating in waterways, this thesis proposes a new big-data-based method for evaluating ship driving behavior. Evaluation indicators are computed quantitatively from spatio-temporal tube analysis of AIS big data and combined with a fuzzy comprehensive evaluation model to score the safety of a ship over a voyage, providing a valuable reference for ship driver management and vessel traffic management. The main work of this thesis includes:

(1) A big-data architecture for ship driving behavior evaluation. The evaluation system is built on the Hadoop big data ecosystem. First, GeoMesa serves as spatio-temporal data middleware to build spatio-temporal indexes over ship AIS data and store them in the HBase distributed database. Then, Spark connects to HBase to analyse the AIS spatio-temporal big data and compute the various driving behavior evaluation indicators; the GeoServer map server, together with GeoMesa, visualizes the AIS spatio-temporal big data.

(2) Construction of ship AIS spatio-temporal tubes with GeoMesa. Evaluating the driving behavior of a ship requires the dynamics of the surrounding ships at every moment of the voyage. First, Spark SQL is used to rapidly extract all surrounding-ship dynamics along the voyage from the AIS big data in time order, forming the spatio-temporal tube data on which the evaluation indicators are computed and quantitatively analysed; this lays the data foundation for driving behavior evaluation. Second, a visualization of the tube data is built on the GeoServer Web Processing Service and GeoMesa TubeSelect, to replay the surrounding-ship dynamics over the course of a voyage.

(3) A fuzzy comprehensive evaluation model of driving behavior based on the entropy weight method. Six parameters are selected as evaluation indicators of ship driving behavior: abnormal speed change, abnormal course change, crossing the channel centerline, deviating from the channel, collision risk, and environmental stress. From the statistics of each indicator computed via the AIS spatio-temporal tubes, the entropy weight method dynamically allocates and optimizes the weight of each indicator, and the three elements of fuzzy comprehensive evaluation are combined to score the ship's driving behavior and judge its quality.
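A minimal sketch of the entropy weight step in (3), assuming a nonnegative score matrix with one row per voyage and one column per each of the six indicators; this illustrates the standard entropy weight method, not the thesis's own code:

```python
import numpy as np

def entropy_weights(X):
    """Return entropy-based weights for the columns (indicators) of X,
    a nonnegative matrix of shape (n_samples, n_indicators)."""
    n = X.shape[0]
    # Normalize each indicator column into a probability distribution.
    P = X / X.sum(axis=0, keepdims=True)
    # Shannon entropy per indicator, with 0 * log(0) treated as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        logP = np.where(P > 0, np.log(P), 0.0)
    E = -(P * logP).sum(axis=0) / np.log(n)
    # Indicators with lower entropy vary more across voyages and thus
    # receive a larger share of the weight.
    d = 1.0 - E
    return d / d.sum()

# Synthetic example: 50 voyages, 6 indicators (abnormal speed change,
# abnormal course change, etc.), values in [0, 1].
scores = np.random.rand(50, 6)
print(entropy_weights(scores))
```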