The scalability of the cloud infrastructure is essential for performing large-scale data processing with the MapReduce programming model, since resources must be provisioned and de-provisioned automatically on demand. The existing MapReduce model shows performance degradation when adapted to heterogeneous environments, because sufficient techniques are not available to scale resources on demand and the scheduling algorithms do not cooperate when resources are configured dynamically. An Auto-Scaling Framework (ASF) is presented in this article to configure resources automatically based on the current system load in a heterogeneous Hadoop environment. Data and task scheduling is performed in a data-local manner that adapts as new resources are added or existing resources are removed. A monitoring module is integrated with the JobTracker to observe the status of the physical machines, compute the system load and automate resource provisioning. A Replica Tracker is then used to track replica objects for efficient task scheduling on the physical machines. Experiments are conducted in a commercial cloud environment using diverse workload characteristics, and the observations show that the proposed framework outperforms existing scheduling mechanisms on performance metrics such as average completion time, scheduling time, data locality, resource utilization and throughput.
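A minimal sketch of the kind of load-based provisioning decision such a monitoring module might make; the class names, thresholds and node-status record below are illustrative assumptions, not the paper's actual interfaces:

```java
// Illustrative sketch of a load-driven scaling decision; names and thresholds are assumptions.
import java.util.List;

class AutoScalerSketch {
    // Hypothetical view of one physical machine as reported to the monitoring module.
    record NodeStatus(String host, double cpuUtil, double memUtil, int runningTasks) {}

    private static final double SCALE_UP_THRESHOLD = 0.80;   // assumed utilisation ceiling
    private static final double SCALE_DOWN_THRESHOLD = 0.30; // assumed utilisation floor

    /** Collapse per-node load figures into a single cluster load estimate. */
    static double clusterLoad(List<NodeStatus> nodes) {
        return nodes.stream()
                .mapToDouble(n -> Math.max(n.cpuUtil(), n.memUtil()))
                .average()
                .orElse(0.0);
    }

    /** Decide how many nodes to add (positive) or release (negative). */
    static int scalingDecision(List<NodeStatus> nodes) {
        double load = clusterLoad(nodes);
        if (load > SCALE_UP_THRESHOLD) return 1;                         // provision one more node
        if (load < SCALE_DOWN_THRESHOLD && nodes.size() > 1) return -1;  // release one node
        return 0;                                                        // keep the current configuration
    }
}
```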
ISBN (print): 9781509051465
Similarity measures between documents and between documents and queries have been studied extensively in information retrieval. Measuring the similarity of documents is a crucial component of many text-analysis tasks, including information retrieval, document classification, and document clustering. However, a growing number of tasks require computing the similarity between two very short segments of text. Large corpora contain a large number of composed documents, and most of them require similarity computation for validation. In this paper, we propose an approach for measuring similarity between documents in a large corpus. For evaluation, we compare the proposed approach with previously presented approaches using our new MapReduce algorithm. Simulation results on the Hadoop framework show that our new MapReduce algorithm outperforms the classical ones in terms of running time and yields higher similarity values.
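A minimal Hadoop sketch of the standard building block behind such pairwise similarity jobs: the mapper inverts documents into term postings and the reducer emits co-occurring document pairs, whose counts can later be normalised into a cosine-style score. The input format and class names are assumptions, not the paper's exact algorithm:

```java
// Sketch: term -> document inversion, then pairwise co-occurrence counts per term.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TermPairSimilarity {

    // Input value is assumed to be "docId<TAB>document text".
    public static class InvertMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            String docId = parts[0];
            for (String term : parts[1].toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) ctx.write(new Text(term), new Text(docId));
            }
        }
    }

    // For each term, emit one count per unordered pair of documents containing it.
    public static class PairReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text term, Iterable<Text> docs, Context ctx)
                throws IOException, InterruptedException {
            List<String> ids = new ArrayList<>();
            for (Text d : docs) ids.add(d.toString());
            for (int i = 0; i < ids.size(); i++)
                for (int j = i + 1; j < ids.size(); j++)
                    ctx.write(new Text(ids.get(i) + "," + ids.get(j)), new IntWritable(1));
        }
    }
}
```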
ISBN (print): 9781479999255
Classification of microarray data has always been a challenging task due to the enormous number of genes. Finding a small, closely related gene set that accurately classifies disease cells is an important research problem. Integrating biological knowledge into genomic analysis is an effective way to improve the interpretation of results. In this paper, the affinity propagation (AP) clustering algorithm is chosen to analyze the impact of biological similarity on the results. We integrate GO semantic similarity into AP clustering for granule construction. Using the MapReduce programming model, a parallel information fusion method is proposed: both similarity matrix construction and message passing in the AP algorithm are parallelized with MapReduce. A parallel randomly directed hill climbing ensemble pruning (RandomDHCEP) method based on MapReduce is introduced for ensemble pruning. An instance analysis illustrates the process of affinity propagation and ensemble pruning using an iterative MapReduce program. The proposed method offers good scalability on large data as the number of nodes increases and also provides higher classification accuracy than using the whole gene set for classification.
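For reference, the per-row responsibility update of standard affinity propagation (Frey and Dueck) depends only on one row of the similarity and availability matrices, which is what makes the message passing amenable to row-wise parallelization across map tasks. The sketch below is a plain-Java illustration of that update, not the paper's MapReduce implementation:

```java
// Sketch: one responsibility update of affinity propagation for a single row i.
// Each row depends only on row i of the similarity and availability matrices,
// so rows can be distributed across map tasks independently.
class APResponsibilityRow {
    /**
     * r(i,k) = s(i,k) - max_{k' != k} ( a(i,k') + s(i,k') )
     *
     * @param s row i of the similarity matrix
     * @param a row i of the availability matrix
     * @return  row i of the updated responsibility matrix
     */
    static double[] update(double[] s, double[] a) {
        int n = s.length;
        double[] r = new double[n];
        for (int k = 0; k < n; k++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int kp = 0; kp < n; kp++) {
                if (kp == k) continue;
                best = Math.max(best, a[kp] + s[kp]);
            }
            r[k] = s[k] - best;
        }
        return r;
    }
}
```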
For over a decade, MapReduce has been the leading programming model for parallel, massive processing of large volumes of data, driven by the development of frameworks such as Spark, Pig and Hive that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance, which can have a devastating effect on their performance and scalability, particularly when processing GroupBy-Join queries over large datasets. In this paper, we present a new GroupBy-Join algorithm that considerably reduces communication costs while avoiding data skew effects. A cost analysis of this algorithm shows that our approach is insensitive to data skew and guarantees balanced load during all stages of GroupBy-Join computation, even for highly skewed data. These results have been confirmed by a series of experiments.
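As background, the baseline repartition approach to a GroupBy-Join in MapReduce tags each record with its source relation, shuffles on the join key, and joins plus aggregates in the reducer; the skew sensitivity of this pattern is exactly what the paper's algorithm is designed to avoid, so the sketch below (with assumed record formats) only illustrates the problem setting, not the proposed method:

```java
// Baseline repartition GroupBy-Join sketch (skew-sensitive): tag records by relation,
// shuffle on the join key, then join and aggregate in the reducer.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RepartitionGroupByJoin {

    // Assumed input lines: "R,key,payload" for relation R and "S,key,amount" for relation S.
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            if (f.length < 3) return;
            // key = join attribute, value = "<relationTag>:<payload>"
            ctx.write(new Text(f[1]), new Text(f[0] + ":" + f[2]));
        }
    }

    // Sum S.amount per join key, but only for keys that also appear in R.
    public static class JoinAggregateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> rSide = new ArrayList<>();
            double sum = 0.0;
            long sCount = 0;
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("R:")) rSide.add(s.substring(2));
                else { sum += Double.parseDouble(s.substring(2)); sCount++; }
            }
            if (!rSide.isEmpty() && sCount > 0) ctx.write(key, new DoubleWritable(sum));
        }
    }
}
```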
ISBN (print): 9781479984480
Nowadays, analyzing large amounts of data is of paramount importance for many companies. Big data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructure layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. Capacity allocation in such systems is a key challenge for providing performance guarantees to MapReduce jobs while minimizing cloud resource cost. The contribution of this paper is twofold: (i) we formulate a linear programming model that minimizes cloud resource cost and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees, and (ii) we provide new upper and lower bounds for MapReduce job execution time in shared Hadoop clusters. Our solutions are validated by a large set of experiments. We demonstrate that our method can determine the globally optimal solution for systems with up to 1000 user classes in less than 0.5 seconds, and that the execution times of MapReduce jobs are within 19% of our upper bounds on average.
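The paper's bounds are new, but the classical per-phase bounds they refine are easy to state: for a phase of n tasks with average duration mu and maximum duration lambda running greedily on k slots, T_low = n*mu/k and T_up = (n-1)*mu/k + lambda (the ARIA-style makespan bounds of Verma et al.). The helper below only illustrates that arithmetic, not the paper's tighter bounds:

```java
// Classical per-phase makespan bounds for greedy task scheduling (ARIA-style),
// shown only as an illustration of the kind of quantity being bounded.
class PhaseBounds {
    /** Lower bound on phase completion time: n * avgDuration / slots. */
    static double lowerBound(int nTasks, int slots, double avgDuration) {
        return nTasks * avgDuration / slots;
    }

    /** Upper bound on phase completion time: (n - 1) * avgDuration / slots + maxDuration. */
    static double upperBound(int nTasks, int slots, double avgDuration, double maxDuration) {
        return (nTasks - 1) * avgDuration / slots + maxDuration;
    }

    public static void main(String[] args) {
        // Example: 64 map tasks, 16 map slots, 30 s average and 55 s maximum task duration.
        System.out.printf("map phase: %.1f s <= T <= %.1f s%n",
                lowerBound(64, 16, 30.0), upperBound(64, 16, 30.0, 55.0));
    }
}
```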
ISBN (print): 9781479956661
Hadoop is one of the most important implementations of the MapReduce programming model. It is written in Java, and most programs that run on Hadoop are also written in this language. Hadoop also provides a utility, known as Hadoop Streaming, to execute applications written in other languages. However, the ease of use provided by Hadoop Streaming comes at the expense of a noticeable degradation in performance. In this work, we introduce Perldoop, a new tool that automatically translates Hadoop-ready Perl scripts into their Java counterparts, which can be executed directly on Hadoop while improving their performance significantly. We have tested our tool using several Natural Language Processing (NLP) modules consisting of hundreds of regular expressions, but Perldoop can be used with any Perl code ready to be executed with Hadoop Streaming. Performance results show that the Java code generated by Perldoop executes up to 12x faster than the original Perl modules running under Hadoop Streaming. In this way, the new NLP modules are able to process the whole Wikipedia in less than 2 hours using a Hadoop cluster with 64 nodes.
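For context, a native Java mapper applying a regular expression to each record looks like the sketch below; this is the general kind of code such a Perl-to-Java translation targets, not Perldoop's actual output, and the pattern shown is an assumption:

```java
// Illustrative native Java mapper applying a regular expression to each input line;
// the pattern and output key are assumptions, not generated Perldoop code.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RegexTokenMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Compile the expression once per task, then scan every record with it.
    private static final Pattern DATE = Pattern.compile("\\b(\\d{4})-(\\d{2})-(\\d{2})\\b");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        Matcher m = DATE.matcher(line.toString());
        while (m.find()) {
            ctx.write(new Text(m.group(1)), ONE); // count matches per year
        }
    }
}
```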
ISBN (print): 9781479976836
Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and works on commodity machines with integrated mechanisms for fault tolerance. In this paper, we propose Knuth-Morris-Pratt based sequential pattern matching in a distributed environment, using the Hadoop Distributed File System for efficient mining of sequential patterns. We also investigate the feasibility of partitioning and clustering text document datasets for document comparison. The approach simplifies the search space and achieves higher mining efficiency. Data mining tasks are decomposed into many map tasks and distributed to many TaskTrackers; the map tasks produce intermediate results and send them to reduce tasks, which consolidate the final result. Both theoretical analysis and experimental results, with data and clusters of varying size, show the effectiveness of the MapReduce model, primarily in terms of time requirements.
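For reference, the sequential Knuth-Morris-Pratt matcher at the core of such an approach is shown below; in a MapReduce setting each map task would run it over its own input split (split handling itself is not shown and is left as an assumption):

```java
// Standard Knuth-Morris-Pratt string search; each map task could run this over its split.
import java.util.ArrayList;
import java.util.List;

class KMP {
    /** Failure function: length of the longest proper prefix of pattern[0..i] that is also a suffix. */
    static int[] failure(String pattern) {
        int[] f = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(k) != pattern.charAt(i)) k = f[k - 1];
            if (pattern.charAt(k) == pattern.charAt(i)) k++;
            f[i] = k;
        }
        return f;
    }

    /** Return the start indices of every occurrence of pattern in text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> matches = new ArrayList<>();
        if (pattern.isEmpty()) return matches;
        int[] f = failure(pattern);
        int k = 0;
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && pattern.charAt(k) != text.charAt(i)) k = f[k - 1];
            if (pattern.charAt(k) == text.charAt(i)) k++;
            if (k == pattern.length()) {
                matches.add(i - k + 1);
                k = f[k - 1];
            }
        }
        return matches;
    }
}
```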
ISBN (print): 9781467324229
Infrastructure-as-a-Service clouds are becoming ubiquitous for provisioning virtual machines on demand. Cloud service providers aim to use the fewest resources to deliver the best services. As users frequently request virtual machines to build virtual clusters and run MapReduce-like jobs for big data processing, cloud service providers intend to place virtual machines close together to minimize network latency and thereby reduce data movement cost. In this paper we focus on virtual machine placement for provisioning virtual clusters with minimum network latency in clouds. We define distance as the latency between virtual machines and use it to measure the affinity of a virtual cluster; this metric captures both virtual machine placement and the topology of physical nodes in the cloud. We then formulate our problem as the classical shortest distance problem and solve it by modeling it as an integer programming problem. A greedy virtual machine placement algorithm is designed to obtain a compact virtual cluster, and an improved heuristic algorithm is also presented to achieve global resource optimization. Simulation results verify our algorithms, and experimental results validate the improvement achieved by our approaches.
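A minimal sketch of a greedy placement of this flavor, assigning each virtual machine to the host with the smallest total latency to the hosts already chosen; the latency matrix, capacity model and tie-breaking below are assumptions, not the paper's algorithm:

```java
// Greedy sketch: place VMs one by one on the host minimizing total latency to already-used hosts.
import java.util.ArrayList;
import java.util.List;

class GreedyVmPlacement {
    /**
     * @param latency  symmetric host-to-host latency matrix (latency[h][h] == 0)
     * @param capacity remaining VM slots per host (mutated as VMs are placed)
     * @param nVms     number of virtual machines to place
     * @return         chosen host index for each VM
     */
    static int[] place(double[][] latency, int[] capacity, int nVms) {
        int[] assignment = new int[nVms];
        List<Integer> used = new ArrayList<>();
        for (int v = 0; v < nVms; v++) {
            int bestHost = -1;
            double bestCost = Double.POSITIVE_INFINITY;
            for (int h = 0; h < latency.length; h++) {
                if (capacity[h] == 0) continue;
                double cost = 0.0;
                for (int u : used) cost += latency[h][u]; // affinity to the partial cluster
                if (cost < bestCost) { bestCost = cost; bestHost = h; }
            }
            if (bestHost < 0) throw new IllegalStateException("insufficient capacity");
            assignment[v] = bestHost;
            capacity[bestHost]--;
            used.add(bestHost);
        }
        return assignment;
    }
}
```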