The scalability of the cloud infrastructure is essential for performing large-scale data processing with the MapReduce programming model, since resources must be provisioned and de-provisioned automatically on demand. The existing MapReduce model shows performance degradation when adapted to heterogeneous environments, because sufficient techniques are not available to scale resources on demand and the scheduling algorithms do not cooperate when resources are configured dynamically. An Auto-Scaling Framework (ASF) is presented in this article to configure resources automatically based on the current system load in a heterogeneous Hadoop environment. Data and task scheduling is performed in a data-local manner that adapts as new resources are added or existing resources are removed. A monitoring module is integrated with the JobTracker to observe the status of the physical machines, compute the system load and automate resource provisioning. A Replica Tracker is then used to track replica objects for efficient task scheduling on the physical machines. Experiments are conducted in a commercial cloud environment using diverse workload characteristics, and the observations show that the proposed framework outperforms existing scheduling mechanisms on performance metrics such as average completion time, scheduling time, data locality, resource utilization and throughput.
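A minimal sketch of the kind of load-based provisioning decision such a monitoring module might make; the class names, thresholds and node-status record below are illustrative assumptions, not the paper's actual interfaces:

```java
// Illustrative sketch of a load-driven scaling decision; names and thresholds are assumptions.
import java.util.List;

class AutoScalerSketch {
    // Hypothetical view of one physical machine as reported to the monitoring module.
    record NodeStatus(String host, double cpuUtil, double memUtil, int runningTasks) {}

    private static final double SCALE_UP_THRESHOLD = 0.80;   // assumed utilisation ceiling
    private static final double SCALE_DOWN_THRESHOLD = 0.30; // assumed utilisation floor

    /** Collapse per-node load figures into a single cluster load estimate. */
    static double clusterLoad(List<NodeStatus> nodes) {
        return nodes.stream()
                .mapToDouble(n -> Math.max(n.cpuUtil(), n.memUtil()))
                .average()
                .orElse(0.0);
    }

    /** Decide how many nodes to add (positive) or release (negative). */
    static int scalingDecision(List<NodeStatus> nodes) {
        double load = clusterLoad(nodes);
        if (load > SCALE_UP_THRESHOLD) return 1;                         // provision one more node
        if (load < SCALE_DOWN_THRESHOLD && nodes.size() > 1) return -1;  // release one node
        return 0;                                                        // keep the current configuration
    }
}
```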
ISBN (print): 9781509051465
Similarity measures between documents and between documents and queries have been studied extensively in information retrieval. Measuring the similarity of documents is a crucial component of many text-analysis tasks, including information retrieval, document classification, and document clustering. However, a growing number of tasks require computing the similarity between two very short segments of text. Large corpora contain a large number of composed documents, and most of them require similarity computation for validation. In this paper, we propose an approach for measuring similarity between documents in a large corpus. For evaluation, we compare the proposed approach with previously presented approaches using our new MapReduce algorithm. Simulation results on the Hadoop framework show that our new MapReduce algorithm outperforms the classical ones in terms of running time and yields higher similarity values.
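A minimal Hadoop sketch of the standard building block behind such pairwise similarity jobs: the mapper inverts documents into term postings and the reducer emits co-occurring document pairs, whose counts can later be normalised into a cosine-style score. The input format and class names are assumptions, not the paper's exact algorithm:

```java
// Sketch: term -> document inversion, then pairwise co-occurrence counts per term.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TermPairSimilarity {

    // Input value is assumed to be "docId<TAB>document text".
    public static class InvertMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            String docId = parts[0];
            for (String term : parts[1].toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) ctx.write(new Text(term), new Text(docId));
            }
        }
    }

    // For each term, emit one count per unordered pair of documents containing it.
    public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text term, Iterable<Text> docs, Context ctx)
                throws IOException, InterruptedException {
            List<String> ids = new ArrayList<>();
            for (Text d : docs) ids.add(d.toString());
            for (int i = 0; i < ids.size(); i++)
                for (int j = i + 1; j < ids.size(); j++)
                    ctx.write(new Text(ids.get(i) + "," + ids.get(j)), new IntWritable(1));
        }
    }
}
```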
ISBN (print): 9781479999255
Classification of microarray data has always been a challenging task due to the enormous number of genes. Finding a small, closely related gene set that accurately classifies disease cells is an important research problem. Integrating biological knowledge into genomic analysis is an effective way to improve the interpretation of results. In this paper, the affinity propagation (AP) clustering algorithm is chosen to analyze the impact of biological similarity on the results. We integrate GO semantic similarity into AP clustering for granule construction. Using the MapReduce programming model, a parallel information fusion method is proposed: both similarity matrix construction and message passing in the AP algorithm are parallelized with MapReduce. A parallel randomly directed hill climbing ensemble pruning (RandomDHCEP) method based on MapReduce is introduced for ensemble pruning. An instance analysis illustrates the process of affinity propagation and ensemble pruning using an iterative MapReduce program. The proposed method offers good scalability on large data as the number of nodes increases and also provides higher classification accuracy than using the whole gene set for classification.
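For reference, the per-row responsibility update of standard affinity propagation (Frey and Dueck) depends only on one row of the similarity and availability matrices, which is what makes the message passing amenable to row-wise parallelization across map tasks. The sketch below is a plain-Java illustration of that update, not the paper's MapReduce implementation:

```java
// Sketch: one responsibility update of affinity propagation for a single row i.
// Each row depends only on row i of the similarity and availability matrices,
// so rows can be distributed across map tasks independently.
class APResponsibilityRow {
    /**
     * r(i,k) = s(i,k) - max_{k' != k} ( a(i,k') + s(i,k') )
     *
     * @param s row i of the similarity matrix
     * @param a row i of the availability matrix
     * @return  row i of the updated responsibility matrix
     */
    static double[] update(double[] s, double[] a) {
        int n = s.length;
        double[] r = new double[n];
        for (int k = 0; k < n; k++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int kp = 0; kp < n; kp++) {
                if (kp == k) continue;
                best = Math.max(best, a[kp] + s[kp]);
            }
            r[k] = s[k] - best;
        }
        return r;
    }
}
```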
For over a decade, MapReduce has been the leading programming model for parallel, massive processing of large volumes of data, driven by the development of frameworks such as Spark, Pig and Hive that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance, which can have a devastating effect on their performance and scalability, particularly when processing GroupBy-Join queries over large datasets. In this paper, we present a new GroupBy-Join algorithm that considerably reduces communication costs while avoiding data skew effects. A cost analysis of this algorithm shows that our approach is insensitive to data skew and guarantees balanced load during all stages of GroupBy-Join computation, even for highly skewed data. These results have been confirmed by a series of experiments.
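As background, the baseline repartition approach to a GroupBy-Join in MapReduce tags each record with its source relation, shuffles on the join key, and joins plus aggregates in the reducer; the skew sensitivity of this pattern is exactly what the paper's algorithm is designed to avoid, so the sketch below (with assumed record formats) only illustrates the problem setting, not the proposed method:

```java
// Baseline repartition GroupBy-Join sketch (skew-sensitive): tag records by relation,
// shuffle on the join key, then join and aggregate in the reducer.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RepartitionGroupByJoin {

    // Assumed input lines: "R,key,payload" for relation R and "S,key,amount" for relation S.
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            if (f.length < 3) return;
            // key = join attribute, value = "<relationTag>:<payload>"
            ctx.write(new Text(f[1]), new Text(f[0] + ":" + f[2]));
        }
    }

    // Sum S.amount per join key, but only for keys that also appear in R.
    public static class JoinAggregateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> rSide = new ArrayList<>();
            double sum = 0.0;
            long sCount = 0;
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("R:")) rSide.add(s.substring(2));
                else { sum += Double.parseDouble(s.substring(2)); sCount++; }
            }
            if (!rSide.isEmpty() && sCount > 0) ctx.write(key, new DoubleWritable(sum));
        }
    }
}
```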
ISBN (print): 9781479984480
Nowadays, analyzing large amounts of data is of paramount importance for many companies. Big data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructure layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. Capacity allocation in such systems is a key challenge for providing performance guarantees to MapReduce jobs while minimizing cloud resource cost. The contribution of this paper is twofold: (i) we formulate a linear programming model that minimizes cloud resource cost and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees, and (ii) we provide new upper and lower bounds for MapReduce job execution time in shared Hadoop clusters. Our solutions are validated by a large set of experiments. We demonstrate that our method can determine the globally optimal solution for systems with up to 1000 user classes in less than 0.5 seconds, and that the execution times of MapReduce jobs are within 19% of our upper bounds on average.
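The paper's bounds are new, but the classical per-phase bounds they refine are easy to state: for a phase of n tasks with average duration mu and maximum duration lambda running greedily on k slots, T_low = n*mu/k and T_up = (n-1)*mu/k + lambda (the ARIA-style makespan bounds of Verma et al.). The helper below only illustrates that arithmetic, not the paper's tighter bounds:

```java
// Classical per-phase makespan bounds for greedy task scheduling (ARIA-style),
// shown only as an illustration of the kind of quantity being bounded.
class PhaseBounds {
    /** Lower bound on phase completion time: n * avgDuration / slots. */
    static double lowerBound(int nTasks, int slots, double avgDuration) {
        return nTasks * avgDuration / slots;
    }

    /** Upper bound on phase completion time: (n - 1) * avgDuration / slots + maxDuration. */
    static double upperBound(int nTasks, int slots, double avgDuration, double maxDuration) {
        return (nTasks - 1) * avgDuration / slots + maxDuration;
    }

    public static void main(String[] args) {
        // Example: 64 map tasks, 16 map slots, 30 s average and 55 s maximum task duration.
        System.out.printf("map phase: %.1f s <= T <= %.1f s%n",
                lowerBound(64, 16, 30.0), upperBound(64, 16, 30.0, 55.0));
    }
}
```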
ISBN (print): 9781479956661
Hadoop is one of the most important implementations of the MapReduce programming model. It is written in Java, and most programs that run on Hadoop are also written in this language. Hadoop also provides a utility, known as Hadoop Streaming, to execute applications written in other languages. However, the ease of use provided by Hadoop Streaming comes at the expense of a noticeable degradation in performance. In this work, we introduce Perldoop, a new tool that automatically translates Hadoop-ready Perl scripts into their Java counterparts, which can be executed directly on Hadoop while improving their performance significantly. We have tested our tool using several Natural Language Processing (NLP) modules consisting of hundreds of regular expressions, but Perldoop can be used with any Perl code ready to be executed with Hadoop Streaming. Performance results show that the Java code generated by Perldoop executes up to 12x faster than the original Perl modules running under Hadoop Streaming. In this way, the new NLP modules are able to process the whole Wikipedia in less than 2 hours using a Hadoop cluster with 64 nodes.
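For context, a native Java mapper applying a regular expression to each record looks like the sketch below; this is the general kind of code such a Perl-to-Java translation targets, not Perldoop's actual output, and the pattern shown is an assumption:

```java
// Illustrative native Java mapper applying a regular expression to each input line;
// the pattern and output key are assumptions, not generated Perldoop code.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RegexTokenMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Compile the expression once per task, then scan every record with it.
    private static final Pattern DATE = Pattern.compile("\\b(\\d{4})-(\\d{2})-(\\d{2})\\b");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        Matcher m = DATE.matcher(line.toString());
        while (m.find()) {
            ctx.write(new Text(m.group(1)), ONE); // count matches per year
        }
    }
}
```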
ISBN (print): 9781479976836
Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and works on commodity machines with integrated mechanisms for fault tolerance. In this paper, we propose Knuth-Morris-Pratt based sequential pattern matching in a distributed environment, using the Hadoop Distributed File System for efficient mining of sequential patterns. We also investigate the feasibility of partitioning and clustering text document datasets for document comparison. The approach simplifies the search space and achieves higher mining efficiency. Data mining tasks are decomposed into many map tasks and distributed to many TaskTrackers; the map tasks produce intermediate results and send them to reduce tasks, which consolidate the final result. Both theoretical analysis and experimental results, with data and clusters of varying size, show the effectiveness of the MapReduce model, primarily in terms of time requirements.
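For reference, the sequential Knuth-Morris-Pratt matcher at the core of such an approach is shown below; in a MapReduce setting each map task would run it over its own input split (split handling itself is not shown and is left as an assumption):

```java
// Standard Knuth-Morris-Pratt string search; each map task could run this over its split.
import java.util.ArrayList;
import java.util.List;

class KMP {
    /** Failure function: length of the longest proper prefix of pattern[0..i] that is also a suffix. */
    static int[] failure(String pattern) {
        int[] f = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(k) != pattern.charAt(i)) k = f[k - 1];
            if (pattern.charAt(k) == pattern.charAt(i)) k++;
            f[i] = k;
        }
        return f;
    }

    /** Return the start indices of every occurrence of pattern in text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> matches = new ArrayList<>();
        if (pattern.isEmpty()) return matches;
        int[] f = failure(pattern);
        int k = 0;
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && pattern.charAt(k) != text.charAt(i)) k = f[k - 1];
            if (pattern.charAt(k) == text.charAt(i)) k++;
            if (k == pattern.length()) {
                matches.add(i - k + 1);
                k = f[k - 1];
            }
        }
        return matches;
    }
}
```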
ISBN (print): 9781467324229
Infrastructure-as-a-Service clouds are becoming ubiquitous for provisioning virtual machines on demand. Cloud service providers aim to use the fewest resources to deliver the best services. As users frequently request virtual machines to build virtual clusters and run MapReduce-like jobs for big data processing, cloud service providers intend to place virtual machines close together to minimize network latency and thereby reduce data movement cost. In this paper we focus on virtual machine placement for provisioning virtual clusters with minimum network latency in clouds. We define distance as the latency between virtual machines and use it to measure the affinity of a virtual cluster; this metric captures both virtual machine placement and the topology of physical nodes in the cloud. We then formulate our problem as the classical shortest distance problem and solve it by modeling it as an integer programming problem. A greedy virtual machine placement algorithm is designed to obtain a compact virtual cluster, and an improved heuristic algorithm is also presented to achieve global resource optimization. Simulation results verify our algorithms, and experimental results validate the improvement achieved by our approaches.
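A minimal sketch of a greedy placement of this flavor, assigning each virtual machine to the host with the smallest total latency to the hosts already chosen; the latency matrix, capacity model and tie-breaking below are assumptions, not the paper's algorithm:

```java
// Greedy sketch: place VMs one by one on the host minimizing total latency to already-used hosts.
import java.util.ArrayList;
import java.util.List;

class GreedyVmPlacement {
    /**
     * @param latency  symmetric host-to-host latency matrix (latency[h][h] == 0)
     * @param capacity remaining VM slots per host (mutated as VMs are placed)
     * @param nVms     number of virtual machines to place
     * @return         chosen host index for each VM
     */
    static int[] place(double[][] latency, int[] capacity, int nVms) {
        int[] assignment = new int[nVms];
        List<Integer> used = new ArrayList<>();
        for (int v = 0; v < nVms; v++) {
            int bestHost = -1;
            double bestCost = Double.POSITIVE_INFINITY;
            for (int h = 0; h < latency.length; h++) {
                if (capacity[h] == 0) continue;
                double cost = 0.0;
                for (int u : used) cost += latency[h][u]; // affinity to the partial cluster
                if (cost < bestCost) { bestCost = cost; bestHost = h; }
            }
            if (bestHost < 0) throw new IllegalStateException("insufficient capacity");
            assignment[v] = bestHost;
            capacity[bestHost]--;
            used.add(bestHost);
        }
        return assignment;
    }
}
```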