Data has grown rapidly in recent years, leading to the emergence of big data science. Distributed file systems (DFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), are commonly used to handle big data. A DFS should keep data available and the system reliable in case of failure, and it achieves this by replicating files in different locations. These replicas consume storage space and other resources. Files differ in importance depending on how frequently they are used in the system, so rarely used files do not warrant many replicas. This paper introduces a Dynamic Replication Policy using Machine Learning Clustering (DRPMLC) on HDFS, which uses machine learning to cluster the files into groups and applies a different replication policy to each group in order to reduce storage consumption, improve read and write times, and maintain the availability and reliability of HDFS as a high-performance distributed computing (HPDC) system.
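The idea of clustering files by usage and giving each group its own replication factor can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the features, or the replication factors used by DRPMLC, so everything below is an assumption made for illustration: a plain one-dimensional k-means over a hypothetical per-file access count, three groups, and the factors 1, 2 and 3. The file paths and counts are invented sample data.

```python
from typing import Dict, List


def kmeans_1d(values: List[float], k: int, iterations: int = 50) -> List[int]:
    """Cluster scalar values with plain k-means and return one label per value.

    Labels are ranked by centroid, so label 0 is the least-accessed group.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic initialisation: take centroids at evenly spaced quantiles.
    centroids = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    # Re-rank labels so that they follow increasing centroid value.
    order = sorted(range(k), key=lambda c: centroids[c])
    rank = {c: r for r, c in enumerate(order)}
    return [rank[lab] for lab in labels]


def assign_replication(access_counts: Dict[str, int], factors: List[int]) -> Dict[str, int]:
    """Map each file to a replication factor chosen by its access-frequency cluster.

    `factors` lists one replication factor per cluster, coldest group first
    (an assumed policy, not the one defined in the paper).
    """
    paths = list(access_counts)
    labels = kmeans_1d([float(access_counts[p]) for p in paths], k=len(factors))
    return {p: factors[lab] for p, lab in zip(paths, labels)}


# Hypothetical access counts per file (invented sample data).
access_counts = {
    "/logs/2019.tar": 2,
    "/logs/2020.tar": 3,
    "/data/users.parquet": 40,
    "/data/orders.parquet": 45,
    "/models/current.bin": 400,
    "/models/index.bin": 420,
}
plan = assign_replication(access_counts, factors=[1, 2, 3])
for path, factor in plan.items():
    print(f"{path}: replication factor {factor}")
```

In this sketch the rarely read archives end up with a single replica while the frequently read files keep three. On a real cluster, a plan like this could be applied per path with the standard `hdfs dfs -setrep` shell command.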
ISBN: (print) 9781479980031
Intelligent algorithms such as genetic algorithms and simulated annealing have been widely applied to large-scale data analysis and data processing. High-performance distributed computing technologies and platforms have the potential to further increase the execution efficiency of these traditional intelligent algorithms. Against this background, we propose a novel MapReduce-enabled simulated annealing genetic algorithm with two distinctive characteristics. First, the algorithm is a synthesis of the conventional genetic algorithm and the simulated annealing algorithm. Whereas most genetic algorithms easily fall into a locally optimal solution, simulated annealing accepts a non-optimal solution with a certain probability, which lets the search jump out of local optima. This gives the proposed algorithm a higher probability of reaching the global optimum than traditional genetic algorithms. Second, it is a parallel algorithm running on the high-performance parallel platform Phoenix++ rather than a conventional serial genetic algorithm. Phoenix++ implements the MapReduce programming model, which processes and generates large data sets with a parallel, distributed algorithm on a cluster. Experiments on Phoenix++ indicate that the proposed algorithm converges significantly faster than its traditional genetic rivals.
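The way simulated annealing lets a genetic algorithm escape local optima can be sketched in a few lines. This is not the authors' implementation: it is a serial Python illustration that leaves out Phoenix++ and MapReduce entirely, and the operators (uniform crossover, Gaussian mutation), the Metropolis acceptance rule, the geometric cooling schedule and all parameter values are assumptions chosen for the example. The names `metropolis_accept` and `sa_ga_minimize` are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple

Genome = List[float]


def metropolis_accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Always accept an improvement; accept a worse candidate with
    probability exp(-delta / temperature)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def sa_ga_minimize(
    cost: Callable[[Genome], float],
    dim: int,
    bounds: Tuple[float, float],
    pop_size: int = 30,
    generations: int = 200,
    t0: float = 1.0,        # assumed initial temperature
    cooling: float = 0.98,  # assumed geometric cooling rate
    seed: int = 0,
) -> Genome:
    """Minimise `cost` with a genetic algorithm whose replacement step
    uses simulated-annealing acceptance."""
    rng = random.Random(seed)
    lo, hi = bounds
    population = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    best = min(population, key=cost)
    temperature = t0
    for _ in range(generations):
        next_population = []
        for parent in population:
            mate = rng.choice(population)
            # Uniform crossover, then Gaussian mutation clipped to the bounds.
            child = [p if rng.random() < 0.5 else m for p, m in zip(parent, mate)]
            child = [min(hi, max(lo, g + rng.gauss(0.0, 0.1))) for g in child]
            # Annealing step: a worse child may still replace its parent,
            # which is what allows the search to leave a local optimum.
            delta = cost(child) - cost(parent)
            accepted = metropolis_accept(delta, temperature, rng)
            next_population.append(child if accepted else parent)
        population = next_population
        # Track the best genome seen so far, since accepted moves can be worse.
        best = min(population + [best], key=cost)
        temperature *= cooling
    return best


def sphere(x: Genome) -> float:
    """Simple test objective with its global minimum at the origin."""
    return sum(g * g for g in x)


best = sa_ga_minimize(sphere, dim=2, bounds=(-5.0, 5.0), generations=50)
print("best genome:", best, "cost:", sphere(best))
```

The per-individual work inside the inner loop is independent across the population, which is the part a MapReduce runtime could distribute; how the paper actually partitions the work on Phoenix++ is not described in the abstract.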