ISBN (print): 9781479956661
Hadoop is one of the most important implementations of the MapReduce programming model. It is written in Java, and most of the programs that run on Hadoop are also written in this language. Hadoop also provides a utility, known as Hadoop Streaming, to execute applications written in other languages. However, the ease of use provided by Hadoop Streaming comes at the expense of a noticeable degradation in performance. In this work, we introduce Perldoop, a new tool that automatically translates Hadoop-ready Perl scripts into their Java counterparts, which can be executed directly on Hadoop with significantly improved performance. We have tested our tool on several Natural Language Processing (NLP) modules, which consist of hundreds of regular expressions, but Perldoop can be used with any Perl code ready to be executed with Hadoop Streaming. Performance results show that the Java code generated by Perldoop executes up to 12x faster than the original Perl modules run through Hadoop Streaming. In this way, the new NLP modules are able to process the entire Wikipedia in less than 2 hours on a Hadoop cluster with 64 nodes.
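As a hedged illustration only (this is not Perldoop's actual output, and the class name and pattern are hypothetical), the sketch below shows the kind of hand-written Hadoop mapper that a regex-heavy Perl NLP module maps onto in Java; precompiling the regular expression inside the JVM is where most of the speedup over interpreting a Perl script under Hadoop Streaming comes from.

```java
// Illustrative sketch of a Hadoop-ready Java mapper applying a precompiled regex.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RegexTokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Compiled once per mapper, instead of being re-interpreted per line as in a streamed script.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+");
    private final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher m = TOKEN.matcher(line.toString());
        while (m.find()) {
            word.set(m.group().toLowerCase());
            context.write(word, ONE);   // emitted pairs are aggregated by a standard summing reducer
        }
    }
}
```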
ISBN (print): 9781479976836
Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and works with commodity machines, with integrated mechanisms for fault tolerance. In this paper, we propose Knuth-Morris-Pratt-based sequential pattern matching in a distributed environment, with the help of the Hadoop Distributed File System, for efficient mining of sequential patterns. We also investigate the feasibility of partitioning and clustering text document datasets for document comparison, which simplifies the search space and achieves higher mining efficiency. The data mining task is decomposed into many map tasks distributed across many TaskTrackers; the map tasks compute intermediate results and send them to the reduce task, which consolidates the final result. Both theoretical analysis and experimental results, with data and clusters of varying size, show the effectiveness of the MapReduce model, primarily in terms of time requirements.
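A minimal sketch (not the authors' code) of how Knuth-Morris-Pratt matching decomposes into map and reduce tasks: each map task counts occurrences of a fixed pattern in its input split with KMP, and the reducer consolidates the partial counts. The configuration key "kmp.pattern" and class names are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KmpMatch {
    public static class KmpMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private String pattern;
        private int[] failure;

        @Override
        protected void setup(Context context) {
            // The pattern is assumed to be passed through the job configuration.
            pattern = context.getConfiguration().get("kmp.pattern", "pattern");
            failure = buildFailure(pattern);
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            int count = countMatches(line.toString(), pattern, failure);
            if (count > 0) context.write(new Text(pattern), new IntWritable(count));
        }

        // Standard KMP failure function.
        private static int[] buildFailure(String p) {
            int[] f = new int[p.length()];
            for (int i = 1, k = 0; i < p.length(); i++) {
                while (k > 0 && p.charAt(i) != p.charAt(k)) k = f[k - 1];
                if (p.charAt(i) == p.charAt(k)) k++;
                f[i] = k;
            }
            return f;
        }

        // Counts (possibly overlapping) occurrences of p in text using the failure function.
        private static int countMatches(String text, String p, int[] f) {
            int count = 0;
            for (int i = 0, k = 0; i < text.length(); i++) {
                while (k > 0 && text.charAt(i) != p.charAt(k)) k = f[k - 1];
                if (text.charAt(i) == p.charAt(k)) k++;
                if (k == p.length()) { count++; k = f[k - 1]; }
            }
            return count;
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) total += v.get();
            context.write(key, new IntWritable(total));  // consolidated match count
        }
    }
}
```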
ISBN (print): 9781509051465
Document similarity measures between documents and queries have been studied extensively in information retrieval. Measuring the similarity of documents is a crucial component of many text-analysis tasks, including information retrieval, document classification, and document clustering. However, a growing number of tasks require computing the similarity between two very short segments of text. Large corpora contain a large number of documents, most of which must have their similarity computed for validation. In this paper, we propose an approach for measuring similarity between documents in a large corpus. For evaluation, we compare the proposed approach with previously presented approaches, using our new MapReduce algorithm. Simulation results on the Hadoop framework show that our new MapReduce algorithm outperforms the classical ones in terms of running time and increases the value of the similarity.
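A minimal sketch, not the paper's algorithm: one common MapReduce formulation of pairwise document similarity over an inverted index. Input lines are assumed to be postings of the form "term&lt;TAB&gt;docA:wA docB:wB ..."; the mapper emits a partial dot-product contribution for every pair of documents sharing the term, and the reducer sums the contributions into an (unnormalized) similarity.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseSimilarity {
    public static class PairMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length < 2) return;
            String[] postings = parts[1].trim().split("\\s+");
            for (int i = 0; i < postings.length; i++) {
                for (int j = i + 1; j < postings.length; j++) {
                    String[] a = postings[i].split(":");
                    String[] b = postings[j].split(":");
                    double contrib = Double.parseDouble(a[1]) * Double.parseDouble(b[1]);
                    // Key the pair in canonical order so all its contributions meet in one reducer.
                    String pair = a[0].compareTo(b[0]) < 0 ? a[0] + "," + b[0] : b[0] + "," + a[0];
                    context.write(new Text(pair), new DoubleWritable(contrib));
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text pair, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double dot = 0.0;
            for (DoubleWritable v : values) dot += v.get();
            context.write(pair, new DoubleWritable(dot));  // divide by vector norms for cosine similarity
        }
    }
}
```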
ISBN (print): 9783031105364; 9783031105357
Sequential pattern mining algorithms are unsupervised machine learning algorithms that find sequential patterns in data sequences that have been put together in a particular order. These algorithms are mostly optimized for data sequences containing more than one element. Hence, we argue that there is a need for algorithms that are specifically optimized for data sequences that contain only one element. Within the scope of this research, we study the design and development of a novel algorithm that is optimized for datasets containing single-element data sequences and that can detect sequential patterns with high performance. The time and memory requirements of the proposed algorithm are examined experimentally. The results show that the proposed algorithm has low running times while achieving the same accuracy as comparable algorithms in the literature. The obtained results are promising.
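To make the single-element setting concrete, here is a hypothetical sketch (the paper's actual algorithm and data structures are not reproduced here): each sequence is an ordered list of single items rather than itemsets, and the support of consecutive two-item patterns is counted once per sequence.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleElementPatterns {
    /** Counts how many sequences contain each ordered pair of consecutive single items. */
    public static Map<String, Integer> pairSupport(List<List<String>> sequences) {
        Map<String, Integer> support = new HashMap<>();
        for (List<String> seq : sequences) {
            Map<String, Boolean> seen = new HashMap<>();   // count each pattern at most once per sequence
            for (int i = 0; i + 1 < seq.size(); i++) {
                String pattern = seq.get(i) + " -> " + seq.get(i + 1);
                if (seen.putIfAbsent(pattern, Boolean.TRUE) == null) {
                    support.merge(pattern, 1, Integer::sum);
                }
            }
        }
        return support;
    }
}
```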
ISBN (print): 9781479984480
Nowadays, analyzing large amounts of data is of paramount importance for many companies. Big data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructure layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. Capacity allocation in such systems is a key challenge in providing performance guarantees for MapReduce jobs while minimizing cloud resource costs. The contribution of this paper is twofold: (i) we formulate a linear programming model able to minimize cloud resource costs and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees, and (ii) we provide new upper and lower bounds on MapReduce job execution time in shared Hadoop clusters. Our solutions are validated by a large set of experiments. We demonstrate that our method determines the globally optimal solution for systems with up to 1000 user classes in less than 0.5 seconds, and that the execution times of MapReduce jobs are within 19% of our upper bounds on average.
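As a hedged illustration only (the symbols and constraints below are simplified stand-ins, not the paper's actual formulation), a capacity-allocation LP of this flavor minimizes VM cost plus rejection penalties over job classes i, with an execution-time bound of the form A_i h_i / r_i + B_i rewritten as a linear deadline constraint:

```latex
% r_i: resources allocated to class i, h_i: accepted jobs (H_i submitted),
% c: unit resource cost, p_i: rejection penalty, D_i: soft deadline,
% A_i, B_i: coefficients of an execution-time upper bound. All illustrative.
\begin{align}
\min_{r_i,\, h_i} \quad & \sum_{i=1}^{n} \Big( c\, r_i + p_i \,(H_i - h_i) \Big) \\
\text{s.t.} \quad & A_i\, h_i \le (D_i - B_i)\, r_i, && i = 1,\dots,n \\
                  & 0 \le h_i \le H_i, \quad r_i \ge 0, && i = 1,\dots,n
\end{align}
```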
ISBN (print): 9783319465685; 9783319465678
The evaluation of similarity between textual documents has long been regarded as a strongly recommended research subject in various domains. Large corpora contain many documents, most of which must be checked for similarity for validation. In this paper, we propose a new MapReduce algorithm for document similarity measures. We then study the state of the art of the different approaches for computing the similarity of large numbers of documents, in order to choose the approach used in our MapReduce algorithm. We also present how the similarity between terms is used in the assessment of the similarity between documents. Simulation results on the Hadoop framework show that our MapReduce algorithm outperforms classical ones in terms of running time.
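A minimal sketch of one way term-level similarity can be lifted to the document level (not necessarily the paper's formula): each term of one document is matched with its most similar term in the other document, and the matches are averaged and symmetrized. The termSim function is assumed to be supplied (e.g. from a lexical resource or embeddings).

```java
import java.util.List;
import java.util.function.BiFunction;

public class TermBasedDocSimilarity {
    /** Symmetrized document similarity built from a term-to-term similarity function. */
    public static double similarity(List<String> docA, List<String> docB,
                                    BiFunction<String, String, Double> termSim) {
        return 0.5 * (directedSim(docA, docB, termSim) + directedSim(docB, docA, termSim));
    }

    // Average, over the terms of 'from', of the best-matching term found in 'to'.
    private static double directedSim(List<String> from, List<String> to,
                                      BiFunction<String, String, Double> termSim) {
        if (from.isEmpty() || to.isEmpty()) return 0.0;
        double sum = 0.0;
        for (String t : from) {
            double best = 0.0;
            for (String u : to) best = Math.max(best, termSim.apply(t, u));
            sum += best;
        }
        return sum / from.size();
    }
}
```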
The integration and cross-coordination of big data processing and software-defined networking (SDN) are vital for improving the performance of big data applications. Various approaches for combining big data and SDN have been investigated by both industry and academia. However, empirical evaluations of solutions that combine big data processing and SDN are extremely costly and complicated. To address the problem of effectively evaluating such solutions, we present a new, self-contained simulation tool named BigDataSDNSim that enables the modeling and simulation of the big data management system YARN, its related programming model MapReduce, and SDN-enabled networks in a cloud computing environment. BigDataSDNSim supports cost-effective, easy-to-conduct experimentation in a controllable, repeatable, and configurable manner. The article illustrates the simulation accuracy and correctness of BigDataSDNSim by comparing the behavior and results of a real environment that combines big data processing and SDN with an equivalent simulated environment. Finally, the article presents two use cases of BigDataSDNSim, which exhibit its practicality and features, illustrate the impact of the data replication mechanisms of MapReduce in Hadoop YARN, and show the superiority of SDN over traditional networks in improving the performance of MapReduce applications.
For over a decade, MapReduce has been the leading programming model for the parallel, massive processing of large volumes of data. This has been driven by the development of many frameworks, such as Spark, Pig and Hive, that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance problems. These can have a devastating effect on the performance and scalability of such systems, particularly when treating GroupBy-Join queries over large datasets. In this paper, we present a new GroupBy-Join algorithm that reduces communication costs considerably while avoiding the effects of data skew. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect balancing properties during all stages of the GroupBy-Join computation, even for highly skewed data. These performance properties have been confirmed by a series of experiments.
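For context, here is a baseline sketch only: the standard repartition (reduce-side) GroupBy-Join that skew-resilient algorithms such as the one in this paper improve upon. Records are assumed to arrive as lines "R&lt;TAB&gt;joinKey&lt;TAB&gt;groupKey&lt;TAB&gt;value" and "S&lt;TAB&gt;joinKey&lt;TAB&gt;attrs" (a hypothetical layout); the mapper repartitions by join key, and the reducer buffers one side, which is exactly where heavy join keys cause imbalance.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupByJoin {
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t", 2);      // origin tag vs. rest of record
            String[] rest = f[1].split("\t", 2);               // join key vs. payload
            context.write(new Text(rest[0]), new Text(f[0] + "\t" + rest[1]));
        }
    }

    public static class JoinGroupReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text joinKey, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String[]> rSide = new ArrayList<>();   // (groupKey, value) pairs from R: the skew-sensitive buffer
            int sCount = 0;                             // number of matching S tuples
            for (Text v : values) {
                String[] f = v.toString().split("\t");
                if (f[0].equals("R")) rSide.add(new String[]{f[1], f[2]});
                else sCount++;
            }
            // SUM(R.value) per group key over the join R |x| S on joinKey.
            Map<String, Double> agg = new HashMap<>();
            for (String[] r : rSide) {
                agg.merge(r[0], Double.parseDouble(r[1]) * sCount, Double::sum);
            }
            // These are per-join-key partial sums; a follow-up aggregation merges partials sharing a group key.
            for (Map.Entry<String, Double> e : agg.entrySet()) {
                context.write(new Text(e.getKey()), new DoubleWritable(e.getValue()));
            }
        }
    }
}
```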
The scalability of the cloud infrastructure is essential for performing large-scale data processing with the MapReduce programming model by automatically provisioning and de-provisioning resources on demand. The existing MapReduce model shows performance degradation when adapted to heterogeneous environments, since sufficient techniques are not available to scale resources on demand and the scheduling algorithms do not cooperate when resources are configured dynamically. An Auto-Scaling Framework (ASF) is presented in this article to configure resources automatically, based on the current system load, in a heterogeneous Hadoop environment. Data and task scheduling is done in a data-local manner that adapts as new resources are configured or existing resources are removed. A monitoring module is integrated with the JobTracker to observe the status of the physical machines, compute the system load, and provide automated provisioning of resources. A Replica Tracker is then utilized to track replica objects for efficient scheduling of tasks on the physical machines. Experiments are conducted in a commercial cloud environment using diverse workload characteristics, and the observations show that the proposed framework outperforms existing scheduling mechanisms on performance metrics such as average completion time, scheduling time, data locality, resource utilization and throughput.
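An illustrative sketch only, not the ASF implementation: a monitoring step that compares the observed cluster load against utilization thresholds and decides how many worker nodes to provision or release. The thresholds, the Cluster interface and all method names are hypothetical.

```java
public class AutoScaler {
    /** Minimal view of the cluster the scaler needs; assumed to be provided by a monitoring module. */
    public interface Cluster {
        double averageLoad();        // e.g. ratio of pending + running tasks to total task slots
        int activeNodes();
        void provisionNodes(int n);  // request n additional workers from the cloud provider
        void releaseNodes(int n);    // decommission n idle workers
    }

    private static final double SCALE_UP_THRESHOLD = 0.85;
    private static final double SCALE_DOWN_THRESHOLD = 0.30;
    private static final int MIN_NODES = 2;

    public static void rebalance(Cluster cluster) {
        double load = cluster.averageLoad();
        if (load > SCALE_UP_THRESHOLD) {
            // Grow proportionally to the overload instead of one node at a time.
            int extra = (int) Math.ceil(cluster.activeNodes() * (load - SCALE_UP_THRESHOLD));
            cluster.provisionNodes(Math.max(1, extra));
        } else if (load < SCALE_DOWN_THRESHOLD && cluster.activeNodes() > MIN_NODES) {
            cluster.releaseNodes(1);  // shrink conservatively to avoid oscillation
        }
    }
}
```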
The traditional K-means clustering algorithm occupies a large amount of memory and computing resources when dealing with massive data. It is easily affected by factors such as the initial center points and abnormal data, and usually cannot achieve effective clustering of large-scale data. To address these limitations, we propose a MapReduce parallel optimization method based on an improved K-means clustering algorithm. First, differential evolution theory is introduced to determine the optimal initial clustering centers. Then, based on the influence of individual samples on the clustering results, a corresponding weighted Euclidean distance is designed to differentiate the data effectively, reducing the impact of abnormal samples on the clustering results; lessening the negative effect of abnormal data on the clustering analysis improves the accuracy of clustering. Finally, the MapReduce programming model is used to realize parallel clustering. We use UCI datasets to verify the parallel optimization method. The experimental results clearly show that the proposed method yields relatively stable parallel clustering results, runs faster, and effectively saves computation time.
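A minimal sketch, not the paper's exact formulation: the map-side step of a MapReduce K-means iteration using a weighted Euclidean distance, where the per-dimension weights are assumed to have been derived beforehand (the paper designs them from the samples' influence on the clustering results). Centers and weights are hard-coded here only to keep the sketch self-contained; a real job would read them from the distributed cache or the configuration.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WeightedKMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centers;   // current cluster centers, loaded in setup()
    private double[] weights;     // per-dimension weights of the distance

    @Override
    protected void setup(Context context) {
        centers = new double[][] { {0.0, 0.0}, {5.0, 5.0} };   // illustrative values
        weights = new double[] {1.0, 2.0};
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");
        double[] point = new double[parts.length];
        for (int d = 0; d < parts.length; d++) point[d] = Double.parseDouble(parts[d]);

        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centers[c][d];
                dist += weights[d] * diff * diff;   // weighted squared Euclidean distance
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        // The reducer averages the points assigned to each center to produce the updated center.
        context.write(new IntWritable(best), line);
    }
}
```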