ISBN (print): 9781479944415
As data volumes in scientific applications have grown exponentially, new scientific methods are required to analyze and organize the data. MapReduce programming drives Internet services, and those services operate in a cloud environment. Hence, resources must be provisioned efficiently for diverse MapReduce applications. In this paper we present a Hadoop application with map and reduce functions for data transformation.
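A minimal sketch of the generic map/reduce data-transformation pattern this abstract refers to, written in Hadoop Streaming style; the tab-separated record layout and the count-by-key transformation are illustrative assumptions, not the paper's actual job.

```python
#!/usr/bin/env python3
"""Hadoop Streaming-style mapper/reducer sketch (illustrative only)."""
import sys
from itertools import groupby

def mapper(lines):
    # Emit (key, value) pairs; here: key each record by its first field.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[0], fields[1]

def reducer(pairs):
    # Aggregate all values sharing a key, e.g. count them.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(1 for _ in group)

if __name__ == "__main__":
    pairs = list(mapper(sys.stdin))
    for key, count in reducer(pairs):
        print(f"{key}\t{count}")
```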
ISBN (print): 9783031105364; 9783031105357
Sequential pattern mining algorithms are unsupervised machine learning algorithms that find sequential patterns in data sequences assembled according to a particular order. These algorithms are mostly optimized for data sequences containing more than one element. Hence, we argue that there is a need for algorithms specifically optimized for data sequences that contain only one element. Within the scope of this research, we design and develop a novel algorithm that is optimized for data sets containing single-element data sequences and that detects sequential patterns with high performance. The time and memory requirements of the proposed algorithm are examined experimentally. The results show that the proposed algorithm has low running times while matching the accuracy of comparable algorithms in the literature. The obtained results are promising.
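To make the problem setting concrete: when every itemset in a sequence holds exactly one element, a pattern is just an ordered list of items. The brute-force support counter below illustrates that setting; it is not the paper's optimized algorithm.

```python
"""Toy support counting for single-element sequential patterns."""

def occurs(pattern, sequence):
    """True if `pattern` appears in `sequence` as an ordered subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` consumes the iterator

def frequent_pairs(sequences, min_support):
    items = {x for seq in sequences for x in seq}
    counts = {}
    for a in items:
        for b in items:
            n = sum(occurs((a, b), seq) for seq in sequences)
            if n >= min_support:
                counts[(a, b)] = n
    return counts

if __name__ == "__main__":
    data = [list("abcab"), list("acb"), list("abb")]
    print(frequent_pairs(data, min_support=2))  # e.g. ('a', 'b') has support 3
```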
The integration and cross-coordination of big data processing and software-defined networking (SDN) are vital for improving the performance of big data applications. Various approaches for combining big data and SDN have been investigated by both industry and academia. However, empirical evaluations of solutions that combine big data processing and SDN are extremely costly and complicated. To address the problem of effective evaluation of such solutions, we present a new, self-contained simulation tool named BigDataSDNSim that enables the modeling and simulation of the big data management system YARN, its related MapReduce programming model, and SDN-enabled networks in a cloud computing environment. BigDataSDNSim supports cost-effective, easy-to-conduct experimentation in a controllable, repeatable, and configurable manner. The article illustrates the simulation accuracy and correctness of BigDataSDNSim by comparing the behavior and results of a real environment that combines big data processing and SDN with an equivalent simulated environment. Finally, the article presents two use cases of BigDataSDNSim, which exhibit its practicality and features, illustrate the impact of the data replication mechanisms of MapReduce in Hadoop YARN, and show the superiority of SDN over traditional networks in improving the performance of MapReduce applications.
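A back-of-the-envelope model of the kind of comparison such a simulator automates: shuffle-transfer time under a statically routed network versus an SDN-rerouted one. This is not BigDataSDNSim's API; the bandwidth figures and congestion factors are illustrative assumptions only.

```python
"""Toy shuffle-transfer model: SDN rerouting vs. static routing."""

def transfer_time_s(data_gb, link_gbps, effective_fraction):
    # Effective bandwidth degrades on congested shared links.
    return data_gb * 8 / (link_gbps * effective_fraction)

shuffle_gb = 50
traditional = transfer_time_s(shuffle_gb, 10, effective_fraction=0.4)  # congested static path
sdn_enabled = transfer_time_s(shuffle_gb, 10, effective_fraction=0.9)  # flows rerouted around hotspot

print(f"traditional: {traditional:.0f}s, SDN-enabled: {sdn_enabled:.0f}s")
```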
To improve data collection in wireless sensor networks, a data collection system based on a symmetric encryption algorithm is designed around health sensors. The received data are uploaded to the host via RS-232 to obtain the working mode and clock activity. The data acquisition circuit is built with an MSP430 module. The MapReduce programming model is used to carry out data collection, a symmetric encryption algorithm is introduced, and a range-query scheme over encrypted data with privacy protection is designed. Applying the scheme to the node data of the wireless sensor network realizes secure data collection. Experimental results show that the system achieves high efficiency, a large volume of collected data, and high residual energy of sensor network nodes.
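A minimal sketch of just the encrypt-before-send step, using the `cryptography` package's Fernet symmetric cipher. The paper's contribution is a range-query scheme on top of symmetric encryption, which this sketch does not reproduce; the record format is an assumption.

```python
"""Symmetric encryption of a sensor reading before upload (sketch)."""
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, pre-shared with the sink node
cipher = Fernet(key)

reading = b"node=17 temp=36.6 ts=1718000000"   # hypothetical record layout
token = cipher.encrypt(reading)    # ciphertext sent over the radio / RS-232 link
print(cipher.decrypt(token))       # sink recovers the reading with the same key
```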
In this paper, we propose a novel parallel method for extracting significant information from spectrograms using the MapReduce programming model for an audio-based surveillance system that effectively recognizes critical acoustic events in the surrounding environment. Extracting reliable features from the spectrograms of a big, noisy audio event dataset demands high computational time. Parallelizing the feature extraction with the MapReduce programming model on Hadoop improves the efficiency of the overall system. Acoustic events with real-time background noise from the Mivia Lab audio event dataset are used for the surveillance application. The proposed approach is time efficient and achieves a high average recognition rate of 96.5% for critical acoustic events under different noisy conditions. (C) 2019 Elsevier Inc. All rights reserved.
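A sketch of a streaming-style mapper that turns one audio clip into spectrogram statistics, illustrating the parallelized feature-extraction step. The feature choice (per-band mean log-energy) stands in for the paper's actual features, and the file-list-on-stdin job layout is an assumption.

```python
"""Mapper sketch: audio file path in, spectrogram features out."""
import sys
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def spectrogram_features(path, n_bands=16):
    rate, audio = wavfile.read(path)
    if audio.ndim > 1:                       # mix stereo down to mono
        audio = audio.mean(axis=1)
    _, _, sxx = spectrogram(audio, fs=rate)  # frequency x time power matrix
    bands = np.array_split(np.log1p(sxx), n_bands, axis=0)
    return [float(b.mean()) for b in bands]  # one mean log-energy per band

if __name__ == "__main__":
    for path in (line.strip() for line in sys.stdin):
        feats = spectrogram_features(path)
        print(path + "\t" + ",".join(f"{x:.4f}" for x in feats))
```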
The traditional K-means clustering algorithm consumes a large amount of memory and computing resources when dealing with massive data. It is easily constrained by factors such as the choice of initial center points and abnormal data, and usually cannot achieve effective clustering of large-scale data. To overcome these limitations, we propose a MapReduce parallel optimization method based on an improved K-means clustering algorithm. First, differential evolution theory is introduced to determine the optimal initial clustering centers. Then, based on the influence of samples on clustering results, a weighted Euclidean distance is designed to differentiate the data effectively, reducing the impact of abnormal samples on the clustering results; lowering the negative effect of abnormal data on cluster analysis improves clustering accuracy. Finally, the MapReduce programming model is used to realize parallel clustering. We verify the parallel optimization method on UCI datasets. The experimental results clearly show that the proposed method delivers relatively stable parallel clustering results, runs faster, and effectively saves computation time.
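A minimal sketch of the two ideas the abstract combines: a weighted Euclidean distance and a MapReduce-style assignment/update split for K-means. The uniform weights and the tiny random dataset are illustrative; the paper's differential-evolution center initialization is omitted.

```python
"""Weighted-distance K-means with a map (assign) / reduce (update) split."""
import numpy as np

def weighted_distance(x, c, w):
    # Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - c_j)^2)
    return np.sqrt(np.sum(w * (x - c) ** 2))

def map_assign(points, centers, w):
    # Map phase: emit (nearest-center-index, point) pairs.
    for x in points:
        j = min(range(len(centers)),
                key=lambda k: weighted_distance(x, centers[k], w))
        yield j, x

def reduce_update(pairs, k, dim):
    # Reduce phase: new center = mean of the points assigned to it.
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for j, x in pairs:
        sums[j] += x
        counts[j] += 1
    return sums / np.maximum(counts, 1)[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(200, 2))
    centers, w = pts[:3].copy(), np.ones(2)
    for _ in range(5):
        centers = reduce_update(map_assign(pts, centers, w), 3, 2)
    print(centers)
```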
MapReduce is a parallel programming model for processing data-intensive applications in a cloud environment. The scheduler greatly influences the performance of the MapReduce model in a heterogeneous cluster environment. The dynamic nature of the cluster environment and of computing workloads affects execution time and computational resource usage in the scheduling process. Further, data locality is essential for reducing total job execution time and cross-rack communication, and for improving throughput. In the present work, a scheduling strategy named efficient locality and replica aware scheduling (ELRAS), integrated with an autonomous replication scheme (ARS), is proposed to enhance data locality and perform consistently in heterogeneous environments. ARS autonomously decides which data object to replicate by considering its popularity, and removes a replica once it becomes idle. The proposed approach is validated in a heterogeneous cluster environment with various realistic applications comprising IO-bound, CPU-bound, and mixed workloads. ELRAS improves throughput by a factor of about 2 compared with the existing FIFO scheduler; it also yields near-optimal data locality, reduces execution time, and utilizes resources effectively. The simplicity of the ELRAS algorithm makes it feasible to adopt for a wide range of applications.
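A toy popularity-driven replication decision in the spirit of ARS: add a replica when a block's access count crosses a threshold, reclaim replicas when it goes idle. The thresholds and the counter-based policy are illustrative assumptions, not the paper's actual scheme.

```python
"""Popularity-based replicate/drop decisions (illustrative sketch)."""
from collections import Counter

class ReplicaManager:
    def __init__(self, replicate_at=10, drop_below=2):
        self.access = Counter()
        self.replicate_at = replicate_at
        self.drop_below = drop_below

    def record_access(self, block):
        self.access[block] += 1

    def decide(self, block):
        hits = self.access[block]
        if hits >= self.replicate_at:
            return "add-replica"      # popular: create another copy
        if hits < self.drop_below:
            return "remove-replica"   # idle: reclaim the space
        return "keep"

mgr = ReplicaManager()
for _ in range(12):
    mgr.record_access("blk_001")
print(mgr.decide("blk_001"), mgr.decide("blk_042"))  # add-replica remove-replica
```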
Traditional parallel algorithms for mining frequent itemsets aim to balance load by partitioning data equally among a group of computing nodes. We start this study by discovering a serious performance problem in existing parallel frequent itemset mining algorithms: given a large dataset, the data partitioning strategies in existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel frequent itemset mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating a similarity metric and the locality-sensitive hashing (LSH) technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by virtue of eliminating redundant transactions on Hadoop nodes, and improves the performance of the existing parallel frequent-pattern scheme by up to 31 percent, with an average of 18 percent.
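A sketch of the locality signal FiDoop-DP exploits: transactions whose MinHash signatures agree on a band are likely similar and can be routed to the same partition. The hash family, signature length, and band size are illustrative; the paper's full Voronoi-diagram partitioning is not reproduced here.

```python
"""MinHash + LSH banding to co-locate similar transactions (sketch)."""
import hashlib

def minhash_signature(transaction, n_hashes=8):
    # One min-hash per seeded hash function over the transaction's items.
    return tuple(
        min(int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) % 10_000
            for item in transaction)
        for i in range(n_hashes)
    )

def partition_by_band(transactions, band=2):
    # LSH banding: transactions agreeing on the first `band` hash values
    # fall into the same bucket, i.e. the same candidate partition.
    buckets = {}
    for t in transactions:
        key = minhash_signature(t)[:band]
        buckets.setdefault(key, []).append(t)
    return buckets

tx = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bolt", "nut"}]
for key, group in partition_by_band(tx).items():
    print(key, group)
```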
ISBN (digital): 9783319644684
ISBN (print): 9783319644684; 9783319644677
Mining frequent itemsets in large datasets has received much attention in recent years, relying on MapReduce programming models. Many well-known FIM algorithms have been parallelized in a MapReduce framework, such as Parallel Apriori, Parallel FP-Growth, and Dist-Eclat. However, most papers focus on work partitioning and/or load balancing, and the resulting algorithms are not extensible because they rely on memory assumptions. A challenge in designing parallel FIM algorithms is thus finding ways to guarantee that the data structures used during mining always fit in the local memory of the processing nodes during all computation steps. In this paper, we propose MapFIM, a two-phase approach for frequent itemset mining in very large datasets that relies on both a MapReduce-based distributed Apriori method and a local in-memory method. In our approach, MapReduce is first used to generate, from the input dataset, prefix-projected databases that fit in local memory, benefiting from the Apriori principle. An optimized local in-memory mining process is then launched to generate all frequent itemsets from each prefix-projected database. Performance evaluation shows that MapFIM is more efficient and more extensible than existing MapReduce-based frequent itemset mining approaches.
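A minimal sketch of prefix projection, the core of the first phase: for a frequent item p, the p-projected database keeps only the suffix of each transaction after p, so each projection is small enough to mine locally in memory. Transactions are assumed to be sorted in a fixed item order; the distributed MapReduce wrapping is omitted.

```python
"""Prefix-projected database generation (sketch of MapFIM phase 1's idea)."""
from collections import Counter

def frequent_items(transactions, min_support):
    counts = Counter(item for t in transactions for item in t)
    return {i for i, c in counts.items() if c >= min_support}

def project(transactions, prefix_item):
    # Keep the part of each transaction strictly after `prefix_item`.
    out = []
    for t in transactions:
        if prefix_item in t:
            out.append(t[t.index(prefix_item) + 1:])
    return [suffix for suffix in out if suffix]

tx = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
for p in sorted(frequent_items(tx, min_support=2)):
    print(p, "->", project(tx, p))   # each projection is mined locally
```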
ISBN (print): 9783319465685; 9783319465678
The evaluation of similarity between textual documents has been regarded as a strongly recommended research subject in various domains. A large corpus contains many documents, and most of them must be checked for similarity for validation. In this paper, we propose a new MapReduce algorithm for document similarity measures. We then survey the state of the art of approaches for computing document similarity in order to choose the approach to use in our MapReduce algorithm. We further present how similarity between terms is used in assessing the similarity between documents. Simulation results on the Hadoop framework show that our MapReduce algorithm outperforms classical ones in terms of running time.
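A sketch of the map/reduce flow for term-based document similarity: the map phase inverts (doc, term) into (term, doc), and the reduce phase turns each term's posting list into partial contributions to pairwise scores. The binary cosine weighting is an illustrative choice, not necessarily the measure the paper adopts.

```python
"""Term-based pairwise document similarity via an inverted index (sketch)."""
from collections import Counter, defaultdict
from itertools import combinations
import math

docs = {"d1": "big data mapreduce cloud", "d2": "mapreduce cloud hadoop"}

# Map: invert (doc, term) into term -> set of docs (the posting list).
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)

# Reduce: each shared term adds 1 to the dot product of a doc pair.
dot = Counter()
for term, doc_ids in postings.items():
    for a, b in combinations(sorted(doc_ids), 2):
        dot[(a, b)] += 1

# Binary cosine: dot / (|d_a| * |d_b|) with Euclidean norms of 0/1 vectors.
norm = {d: math.sqrt(len(set(t.split()))) for d, t in docs.items()}
for (a, b), num in dot.items():
    print(a, b, round(num / (norm[a] * norm[b]), 3))   # d1 d2 0.577
```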