ISBN (print): 9781479956661
Hadoop is one of the most important implementations of the MapReduce programming model. It is written in Java, and most of the programs that run on Hadoop are also written in this language. Hadoop also provides a utility, known as Hadoop Streaming, to execute applications written in other languages. However, the ease of use provided by Hadoop Streaming comes at the expense of a noticeable degradation in performance. In this work, we introduce Perldoop, a new tool that automatically translates Hadoop-ready Perl scripts into their Java counterparts, which can be executed directly on Hadoop with significantly improved performance. We have tested our tool on several Natural Language Processing (NLP) modules, which consist of hundreds of regular expressions, but Perldoop can be used with any Perl code ready to be executed with Hadoop Streaming. Performance results show that the Java code generated by Perldoop executes up to 12x faster than the original Perl modules run through Hadoop Streaming. In this way, the new NLP modules are able to process the entire Wikipedia in less than 2 hours on a Hadoop cluster with 64 nodes.
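As a hedged illustration only (this is not Perldoop's actual output, and the class name and pattern are hypothetical), the sketch below shows the kind of hand-written Hadoop mapper that a regex-heavy Perl NLP module maps onto in Java; precompiling the regular expression inside the JVM is where most of the speedup over interpreting a Perl script under Hadoop Streaming comes from.

```java
// Illustrative sketch of a Hadoop-ready Java mapper applying a precompiled regex.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RegexTokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Compiled once per mapper, instead of being re-interpreted per line as in a streamed script.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+");
    private final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher m = TOKEN.matcher(line.toString());
        while (m.find()) {
            word.set(m.group().toLowerCase());
            context.write(word, ONE);   // emitted pairs are aggregated by a standard summing reducer
        }
    }
}
```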
ISBN (print): 9781479976836
Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and works with commodity machines, with integrated mechanisms for fault tolerance. In this paper, we propose Knuth-Morris-Pratt-based sequential pattern matching in a distributed environment, with the help of the Hadoop Distributed File System, for efficient mining of sequential patterns. We also investigate the feasibility of partitioning and clustering text document datasets for document comparison, which simplifies the search space and achieves higher mining efficiency. The data mining task is decomposed into many map tasks distributed across many TaskTrackers; the map tasks compute intermediate results and send them to the reduce task, which consolidates the final result. Both theoretical analysis and experimental results, with data and clusters of varying size, show the effectiveness of the MapReduce model, primarily in terms of time requirements.
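A minimal sketch (not the authors' code) of how Knuth-Morris-Pratt matching decomposes into map and reduce tasks: each map task counts occurrences of a fixed pattern in its input split with KMP, and the reducer consolidates the partial counts. The configuration key "kmp.pattern" and class names are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KmpMatch {
    public static class KmpMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private String pattern;
        private int[] failure;

        @Override
        protected void setup(Context context) {
            // The pattern is assumed to be passed through the job configuration.
            pattern = context.getConfiguration().get("kmp.pattern", "pattern");
            failure = buildFailure(pattern);
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            int count = countMatches(line.toString(), pattern, failure);
            if (count > 0) context.write(new Text(pattern), new IntWritable(count));
        }

        // Standard KMP failure function.
        private static int[] buildFailure(String p) {
            int[] f = new int[p.length()];
            for (int i = 1, k = 0; i < p.length(); i++) {
                while (k > 0 && p.charAt(i) != p.charAt(k)) k = f[k - 1];
                if (p.charAt(i) == p.charAt(k)) k++;
                f[i] = k;
            }
            return f;
        }

        // Counts (possibly overlapping) occurrences of p in text using the failure function.
        private static int countMatches(String text, String p, int[] f) {
            int count = 0;
            for (int i = 0, k = 0; i < text.length(); i++) {
                while (k > 0 && text.charAt(i) != p.charAt(k)) k = f[k - 1];
                if (text.charAt(i) == p.charAt(k)) k++;
                if (k == p.length()) { count++; k = f[k - 1]; }
            }
            return count;
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) total += v.get();
            context.write(key, new IntWritable(total));  // consolidated match count
        }
    }
}
```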
ISBN (print): 9781509051465
Document similarity measures between documents and queries have been studied extensively in information retrieval. Measuring the similarity of documents is a crucial component of many text-analysis tasks, including information retrieval, document classification, and document clustering. However, a growing number of tasks require computing the similarity between two very short segments of text. Large corpora contain a large number of documents, most of which must have their similarity computed for validation. In this paper, we propose an approach for measuring similarity between documents in a large corpus. For evaluation, we compare the proposed approach with previously presented approaches, using our new MapReduce algorithm. Simulation results on the Hadoop framework show that our new MapReduce algorithm outperforms the classical ones in terms of running time and increases the value of the similarity.
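A minimal sketch, not the paper's algorithm: one common MapReduce formulation of pairwise document similarity over an inverted index. Input lines are assumed to be postings of the form "term&lt;TAB&gt;docA:wA docB:wB ..."; the mapper emits a partial dot-product contribution for every pair of documents sharing the term, and the reducer sums the contributions into an (unnormalized) similarity.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseSimilarity {
    public static class PairMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length < 2) return;
            String[] postings = parts[1].trim().split("\\s+");
            for (int i = 0; i < postings.length; i++) {
                for (int j = i + 1; j < postings.length; j++) {
                    String[] a = postings[i].split(":");
                    String[] b = postings[j].split(":");
                    double contrib = Double.parseDouble(a[1]) * Double.parseDouble(b[1]);
                    // Key the pair in canonical order so all its contributions meet in one reducer.
                    String pair = a[0].compareTo(b[0]) < 0 ? a[0] + "," + b[0] : b[0] + "," + a[0];
                    context.write(new Text(pair), new DoubleWritable(contrib));
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text pair, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double dot = 0.0;
            for (DoubleWritable v : values) dot += v.get();
            context.write(pair, new DoubleWritable(dot));  // divide by vector norms for cosine similarity
        }
    }
}
```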
ISBN (print): 9783031105364; 9783031105357
Sequential pattern mining algorithms are unsupervised machine learning algorithms that find sequential patterns in data sequences that have been put together in a particular order. These algorithms are mostly optimized for data sequences containing more than one element. Hence, we argue that there is a need for algorithms that are specifically optimized for data sequences that contain only one element. Within the scope of this research, we study the design and development of a novel algorithm that is optimized for datasets containing single-element data sequences and that can detect sequential patterns with high performance. The time and memory requirements of the proposed algorithm are examined experimentally. The results show that the proposed algorithm has low running times while achieving the same accuracy as comparable algorithms in the literature. The obtained results are promising.
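To make the single-element setting concrete, here is a hypothetical sketch (the paper's actual algorithm and data structures are not reproduced here): each sequence is an ordered list of single items rather than itemsets, and the support of consecutive two-item patterns is counted once per sequence.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleElementPatterns {
    /** Counts how many sequences contain each ordered pair of consecutive single items. */
    public static Map<String, Integer> pairSupport(List<List<String>> sequences) {
        Map<String, Integer> support = new HashMap<>();
        for (List<String> seq : sequences) {
            Map<String, Boolean> seen = new HashMap<>();   // count each pattern at most once per sequence
            for (int i = 0; i + 1 < seq.size(); i++) {
                String pattern = seq.get(i) + " -> " + seq.get(i + 1);
                if (seen.putIfAbsent(pattern, Boolean.TRUE) == null) {
                    support.merge(pattern, 1, Integer::sum);
                }
            }
        }
        return support;
    }
}
```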
ISBN (print): 9781479984480
Nowadays, analyzing large amounts of data is of paramount importance for many companies. Big data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructure layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. Capacity allocation in such systems is a key challenge in providing performance guarantees for MapReduce jobs while minimizing cloud resource costs. The contribution of this paper is twofold: (i) we formulate a linear programming model able to minimize cloud resource costs and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees, and (ii) we provide new upper and lower bounds on MapReduce job execution time in shared Hadoop clusters. Our solutions are validated by a large set of experiments. We demonstrate that our method determines the globally optimal solution for systems with up to 1000 user classes in less than 0.5 seconds, and that the execution times of MapReduce jobs are within 19% of our upper bounds on average.
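As a hedged illustration only (the symbols and constraints below are simplified stand-ins, not the paper's actual formulation), a capacity-allocation LP of this flavor minimizes VM cost plus rejection penalties over job classes i, with an execution-time bound of the form A_i h_i / r_i + B_i rewritten as a linear deadline constraint:

```latex
% r_i: resources allocated to class i, h_i: accepted jobs (H_i submitted),
% c: unit resource cost, p_i: rejection penalty, D_i: soft deadline,
% A_i, B_i: coefficients of an execution-time upper bound. All illustrative.
\begin{align}
\min_{r_i,\, h_i} \quad & \sum_{i=1}^{n} \Big( c\, r_i + p_i \,(H_i - h_i) \Big) \\
\text{s.t.} \quad & A_i\, h_i \le (D_i - B_i)\, r_i, && i = 1,\dots,n \\
                  & 0 \le h_i \le H_i, \quad r_i \ge 0, && i = 1,\dots,n
\end{align}
```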
ISBN (print): 9783319465685; 9783319465678
The evaluation of similarity between textual documents has long been regarded as a strongly recommended research subject in various domains. Large corpora contain many documents, most of which must be checked for similarity for validation. In this paper, we propose a new MapReduce algorithm for document similarity measures. We then study the state of the art of the different approaches for computing the similarity of large numbers of documents, in order to choose the approach used in our MapReduce algorithm. We also present how the similarity between terms is used in the assessment of the similarity between documents. Simulation results on the Hadoop framework show that our MapReduce algorithm outperforms classical ones in terms of running time.
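A minimal sketch of one way term-level similarity can be lifted to the document level (not necessarily the paper's formula): each term of one document is matched with its most similar term in the other document, and the matches are averaged and symmetrized. The termSim function is assumed to be supplied (e.g. from a lexical resource or embeddings).

```java
import java.util.List;
import java.util.function.BiFunction;

public class TermBasedDocSimilarity {
    /** Symmetrized document similarity built from a term-to-term similarity function. */
    public static double similarity(List<String> docA, List<String> docB,
                                    BiFunction<String, String, Double> termSim) {
        return 0.5 * (directedSim(docA, docB, termSim) + directedSim(docB, docA, termSim));
    }

    // Average, over the terms of 'from', of the best-matching term found in 'to'.
    private static double directedSim(List<String> from, List<String> to,
                                      BiFunction<String, String, Double> termSim) {
        if (from.isEmpty() || to.isEmpty()) return 0.0;
        double sum = 0.0;
        for (String t : from) {
            double best = 0.0;
            for (String u : to) best = Math.max(best, termSim.apply(t, u));
            sum += best;
        }
        return sum / from.size();
    }
}
```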
The integration and cross-coordination of big data processing and software-defined networking (SDN) are vital for improving the performance of big data applications. Various approaches for combining big data and SDN have been investigated by both industry and academia. However, empirical evaluations of solutions that combine big data processing and SDN are extremely costly and complicated. To address the problem of effectively evaluating such solutions, we present a new, self-contained simulation tool named BigDataSDNSim that enables the modeling and simulation of the big data management system YARN, its related programming model MapReduce, and SDN-enabled networks in a cloud computing environment. BigDataSDNSim supports cost-effective, easy-to-conduct experimentation in a controllable, repeatable, and configurable manner. The article illustrates the simulation accuracy and correctness of BigDataSDNSim by comparing the behavior and results of a real environment that combines big data processing and SDN with an equivalent simulated environment. Finally, the article presents two use cases of BigDataSDNSim, which exhibit its practicality and features, illustrate the impact of the data replication mechanisms of MapReduce in Hadoop YARN, and show the superiority of SDN over traditional networks in improving the performance of MapReduce applications.
For over a decade, MapReduce has been the leading programming model for the parallel, massive processing of large volumes of data. This has been driven by the development of many frameworks, such as Spark, Pig and Hive, that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance problems. These can have a devastating effect on the performance and scalability of such systems, particularly when treating GroupBy-Join queries over large datasets. In this paper, we present a new GroupBy-Join algorithm that reduces communication costs considerably while avoiding the effects of data skew. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect balancing properties during all stages of the GroupBy-Join computation, even for highly skewed data. These performance properties have been confirmed by a series of experiments.
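For context, here is a baseline sketch only: the standard repartition (reduce-side) GroupBy-Join that skew-resilient algorithms such as the one in this paper improve upon. Records are assumed to arrive as lines "R&lt;TAB&gt;joinKey&lt;TAB&gt;groupKey&lt;TAB&gt;value" and "S&lt;TAB&gt;joinKey&lt;TAB&gt;attrs" (a hypothetical layout); the mapper repartitions by join key, and the reducer buffers one side, which is exactly where heavy join keys cause imbalance.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupByJoin {
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t", 2);      // origin tag vs. rest of record
            String[] rest = f[1].split("\t", 2);               // join key vs. payload
            context.write(new Text(rest[0]), new Text(f[0] + "\t" + rest[1]));
        }
    }

    public static class JoinGroupReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text joinKey, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String[]> rSide = new ArrayList<>();   // (groupKey, value) pairs from R: the skew-sensitive buffer
            int sCount = 0;                             // number of matching S tuples
            for (Text v : values) {
                String[] f = v.toString().split("\t");
                if (f[0].equals("R")) rSide.add(new String[]{f[1], f[2]});
                else sCount++;
            }
            // SUM(R.value) per group key over the join R |x| S on joinKey.
            Map<String, Double> agg = new HashMap<>();
            for (String[] r : rSide) {
                agg.merge(r[0], Double.parseDouble(r[1]) * sCount, Double::sum);
            }
            // These are per-join-key partial sums; a follow-up aggregation merges partials sharing a group key.
            for (Map.Entry<String, Double> e : agg.entrySet()) {
                context.write(new Text(e.getKey()), new DoubleWritable(e.getValue()));
            }
        }
    }
}
```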
The scalability of the cloud infrastructure is essential for performing large-scale data processing with the MapReduce programming model by automatically provisioning and de-provisioning resources on demand. The existing MapReduce model shows performance degradation when adapted to heterogeneous environments, since sufficient techniques are not available to scale resources on demand and the scheduling algorithms do not cooperate when resources are configured dynamically. An Auto-Scaling Framework (ASF) is presented in this article to configure resources automatically, based on the current system load, in a heterogeneous Hadoop environment. Data and task scheduling is done in a data-local manner that adapts as new resources are configured or existing resources are removed. A monitoring module is integrated with the JobTracker to observe the status of the physical machines, compute the system load, and provide automated provisioning of resources. A Replica Tracker is then utilized to track replica objects for efficient scheduling of tasks on the physical machines. Experiments are conducted in a commercial cloud environment using diverse workload characteristics, and the observations show that the proposed framework outperforms existing scheduling mechanisms on performance metrics such as average completion time, scheduling time, data locality, resource utilization and throughput.
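An illustrative sketch only, not the ASF implementation: a monitoring step that compares the observed cluster load against utilization thresholds and decides how many worker nodes to provision or release. The thresholds, the Cluster interface and all method names are hypothetical.

```java
public class AutoScaler {
    /** Minimal view of the cluster the scaler needs; assumed to be provided by a monitoring module. */
    public interface Cluster {
        double averageLoad();        // e.g. ratio of pending + running tasks to total task slots
        int activeNodes();
        void provisionNodes(int n);  // request n additional workers from the cloud provider
        void releaseNodes(int n);    // decommission n idle workers
    }

    private static final double SCALE_UP_THRESHOLD = 0.85;
    private static final double SCALE_DOWN_THRESHOLD = 0.30;
    private static final int MIN_NODES = 2;

    public static void rebalance(Cluster cluster) {
        double load = cluster.averageLoad();
        if (load > SCALE_UP_THRESHOLD) {
            // Grow proportionally to the overload instead of one node at a time.
            int extra = (int) Math.ceil(cluster.activeNodes() * (load - SCALE_UP_THRESHOLD));
            cluster.provisionNodes(Math.max(1, extra));
        } else if (load < SCALE_DOWN_THRESHOLD && cluster.activeNodes() > MIN_NODES) {
            cluster.releaseNodes(1);  // shrink conservatively to avoid oscillation
        }
    }
}
```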
The traditional K-means clustering algorithm occupies a large amount of memory and computing resources when dealing with massive data. It is easily affected by factors such as the initial center points and abnormal data, and usually cannot achieve effective clustering of large-scale data. To address these limitations, we propose a MapReduce parallel optimization method based on an improved K-means clustering algorithm. First, differential evolution theory is introduced to determine the optimal initial clustering centers. Then, based on the influence of individual samples on the clustering results, a corresponding weighted Euclidean distance is designed to differentiate the data effectively, reducing the impact of abnormal samples on the clustering results; lessening the negative effect of abnormal data on the clustering analysis improves the accuracy of clustering. Finally, the MapReduce programming model is used to realize parallel clustering. We use UCI datasets to verify the parallel optimization method. The experimental results clearly show that the proposed method yields relatively stable parallel clustering results, runs faster, and effectively saves computation time.
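A minimal sketch, not the paper's exact formulation: the map-side step of a MapReduce K-means iteration using a weighted Euclidean distance, where the per-dimension weights are assumed to have been derived beforehand (the paper designs them from the samples' influence on the clustering results). Centers and weights are hard-coded here only to keep the sketch self-contained; a real job would read them from the distributed cache or the configuration.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WeightedKMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centers;   // current cluster centers, loaded in setup()
    private double[] weights;     // per-dimension weights of the distance

    @Override
    protected void setup(Context context) {
        centers = new double[][] { {0.0, 0.0}, {5.0, 5.0} };   // illustrative values
        weights = new double[] {1.0, 2.0};
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");
        double[] point = new double[parts.length];
        for (int d = 0; d < parts.length; d++) point[d] = Double.parseDouble(parts[d]);

        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centers[c][d];
                dist += weights[d] * diff * diff;   // weighted squared Euclidean distance
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        // The reducer averages the points assigned to each center to produce the updated center.
        context.write(new IntWritable(best), line);
    }
}
```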