A key challenge in big data processing frameworks such as the Hadoop Distributed File System (HDFS) is optimizing the throughput of read operations. Toward this goal, several studies have been conducted to enhance read performance on heterogeneous storage. Although HDFS has recently added several storage policies for placing data blocks on heterogeneous storage, it fails to fully exploit the potential of fast storage devices (e.g., SSDs). The primary reason for its suboptimal read performance is that, when distributing read requests, the existing HDFS considers only the network distance between the client and DataNodes, thereby directing more read requests to the slower devices (e.g., HDDs) that hold more data. In this paper, we propose a new data retrieval policy for distributing read requests across heterogeneous storage in HDFS. Specifically, the proposed policy considers both the characteristics of the storage devices in the DataNodes and the network environment to distribute read requests efficiently. We develop and compare several policies for balancing these two factors, including random selection, storage-type selection, weighted round-robin selection, and dynamic round-robin selection. Our experimental results on extensive benchmark datasets show that the throughput of the proposed method outperforms that of the existing policies by up to six times.
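The abstract does not give the selection algorithm's details, but the general idea of weighted round-robin replica selection by storage type can be sketched as follows. This is a minimal illustration, not the authors' implementation: the storage-type weights, the `Replica` record, and the `select` method are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

/** Minimal sketch of weighted round-robin replica selection by storage type (illustrative only). */
public class WeightedReplicaSelector {

    /** A replica of a block, located on a DataNode with a given storage type. */
    public record Replica(String dataNode, String storageType) {}

    // Assumed relative read-throughput weights; real values would be measured or configured.
    private static final Map<String, Integer> WEIGHTS = Map.of("SSD", 4, "DISK", 1);

    private final int[] credits;          // remaining credits per replica in the current round
    private final List<Replica> replicas; // replica locations returned by the NameNode

    public WeightedReplicaSelector(List<Replica> replicas) {
        this.replicas = replicas;
        this.credits = new int[replicas.size()];
    }

    /** Pick the next replica, favouring faster storage in proportion to its weight. */
    public synchronized Replica select() {
        boolean allEmpty = true;
        for (int c : credits) if (c > 0) { allEmpty = false; break; }
        if (allEmpty) {                   // start a new round: refill credits from the weights
            for (int i = 0; i < replicas.size(); i++) {
                credits[i] = WEIGHTS.getOrDefault(replicas.get(i).storageType(), 1);
            }
        }
        int best = -1;
        for (int i = 0; i < replicas.size(); i++) {
            if (credits[i] > 0 && (best < 0 || credits[i] > credits[best])) best = i;
        }
        credits[best]--;
        return replicas.get(best);
    }

    public static void main(String[] args) {
        var selector = new WeightedReplicaSelector(List.of(
                new Replica("dn1", "SSD"), new Replica("dn2", "DISK"), new Replica("dn3", "DISK")));
        for (int i = 0; i < 6; i++) System.out.println(selector.select());
    }
}
```

With the assumed weights, an SSD-backed replica receives four reads for every one sent to a HDD-backed replica in each round.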
The Hadoop Distributed File System (HDFS) was developed to efficiently store and handle vast quantities of files in a distributed environment over a cluster of computers. The Hadoop cluster is built from commodity hardware, which is inexpensive and readily available. Storing a large number of small files in HDFS consumes more memory and degrades performance, because small files place a heavy load on the NameNode. The efficiency of indexing and accessing small files on HDFS is therefore improved by several techniques, such as archive files, the New Hadoop Archive (New HAR), CombineFileInputFormat (CFIF), and sequence file generation. The archive file combines small files into single blocks; the New HAR file combines smaller files into one large file; the CFIF module merges multiple files into a single split using the NameNode; and the sequence file combines all the small files into a single sequence. Indexing and accessing small files in HDFS are evaluated using performance metrics such as processing time and memory usage. The experiments show that the sequence file generation approach is the most efficient of these: file access time is 1.5 s, memory usage is 20 KB in the multi-node setup, and processing time is 0.1 s.
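As an illustration of the sequence-file approach described above, the sketch below packs a directory of small local files into a single HDFS SequenceFile, using file names as keys and file contents as values. It is a minimal example using the standard Hadoop client API; the paths, key/value choices, and configuration are assumptions, not the paper's code.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Packs many small files into one SequenceFile so HDFS stores a single large file. */
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
        Path target = new Path("/user/demo/smallfiles.seq");   // assumed output path on HDFS

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),            // key: original file name
                SequenceFile.Writer.valueClass(BytesWritable.class)  // value: raw file contents
        );
        try {
            File[] smallFiles = new File("local-small-files").listFiles();  // assumed local source dir
            if (smallFiles == null) throw new IllegalStateException("source directory not found");
            for (File f : smallFiles) {
                byte[] data = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

Reading a particular small file back then becomes a key lookup in the SequenceFile rather than a separate NameNode entry per file.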
Hadoop, a distributed processing framework for big data, is now widely used for multimedia processing. However, when processing video data from the Hadoop Distributed File System (HDFS), unnecessary network traffic is generated due to an inefficient HDFS block slicing policy for the picture frames in video files. We propose a new block replication policy to solve this problem and compare the newly proposed HDFS with the original HDFS via extensive experiments. The proposed HDFS reduces network traffic and increases locality between processing cores and file locations.
ISBN (print): 9781450356299
The massive growth in the volume of data and the demand for big data utilisation have led to an increasing prevalence of Hadoop Distributed File System (HDFS) solutions. However, the performance of Hadoop, and indeed of HDFS, has some limitations and remains an open problem in the research community. The ultimate goal of our research is to develop an adaptive replication system; this paper presents the first phase of the work: an investigation into the replication factor used in HDFS to determine whether increasing the replication factor for in-demand data can improve the performance of the system. We constructed a physical Hadoop cluster for our experimental environment, using TestDFSIO and both real-world and synthetic data sets (NOAA and TPC-H) with Hive to validate our proposal. The results show that increasing the replication factor of the 'hot' data increases the availability and locality of the data and thus decreases the job execution time.
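Raising the replication factor of a specific 'hot' file can be done through the standard HDFS client API, as sketched below. The file path and the target factor are placeholders; the paper's own mechanism for detecting hot data is not shown here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Increases the replication factor of an in-demand file (illustrative sketch). */
public class HotFileReplicator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path hotFile = new Path("/warehouse/tpch/lineitem");  // assumed 'hot' data set
            short current = fs.getFileStatus(hotFile).getReplication();
            short target = 5;                              // assumed new factor; the default is usually 3
            if (current < target) {
                // Schedules extra replicas; the NameNode creates them in the background.
                boolean accepted = fs.setReplication(hotFile, target);
                System.out.println("Replication change accepted: " + accepted);
            }
        }
    }
}
```

The same change can be made from the command line with `hdfs dfs -setrep`.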
ISBN (print): 9781538672327
The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has examined the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS copes well with increasing the replication factor, it experiences problems when decreasing it. This leads to unbalanced data, hot spots, and performance degradation. To address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to a 48% improvement in execution time compared to HDFS, while creating an overhead of only 1.69% on average.
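The paper's exact selection criteria are not given in the abstract, but the core idea of balanced replica deletion can be illustrated with a simple heuristic: when the replication factor is lowered, drop the replicas held by the most utilised DataNodes. The `DataNodeInfo` type and the utilisation figures below are assumptions for the sketch, not the authors' workload-aware algorithm.

```java
import java.util.Comparator;
import java.util.List;

/** Simplified heuristic for choosing which replicas to delete when lowering the replication factor. */
public class BalancedReplicaDeletion {

    /** Assumed view of a DataNode holding one replica of the block. */
    public record DataNodeInfo(String host, long usedBytes, long capacityBytes) {
        double utilisation() { return (double) usedBytes / capacityBytes; }
    }

    /**
     * Pick replicas to remove so that the most utilised nodes are relieved first,
     * keeping disk usage across the cluster balanced after the deletion.
     */
    public static List<DataNodeInfo> selectForDeletion(List<DataNodeInfo> replicaHolders, int toRemove) {
        return replicaHolders.stream()
                .sorted(Comparator.comparingDouble(DataNodeInfo::utilisation).reversed())
                .limit(toRemove)
                .toList();
    }

    public static void main(String[] args) {
        List<DataNodeInfo> holders = List.of(
                new DataNodeInfo("dn1", 800, 1000),
                new DataNodeInfo("dn2", 300, 1000),
                new DataNodeInfo("dn3", 550, 1000),
                new DataNodeInfo("dn4", 900, 1000));
        // Replication factor drops from 4 to 2, so two replicas must go: dn4 and dn1 are chosen.
        System.out.println(selectForDeletion(holders, 2));
    }
}
```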
ISBN (print): 9781538619599
All the machines are required to be under a common administrator and to be able to communicate securely. To communicate securely, the Advanced Encryption Standard (AES) algorithm is used to protect the data in each cluster: encryption is performed before write operations and decryption after read operations. Key-based encryption and decryption thus secure the Hadoop Distributed File System. The existing system depends on a single NameNode to manage almost all operations on every data block in the file system; as a result, it can become a bottleneck resource and a single point of failure. To overcome this, a load rebalancing algorithm based on a distributed hash table (DHT) is used for the Hadoop Distributed File System. The proposed load rebalancing algorithm is compared against a centralized approach used in a production system and a competing distributed solution presented in the literature. The storage nodes are structured as a network based on a distributed hash table, so discovering a file chunk simply amounts to a rapid key lookup in the DHT, given that a unique handle (or identifier) is assigned to each file chunk.
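As a sketch of the encrypt-before-write step, the client-side write path could look like the following. The paper does not specify its cipher mode or key management, so AES/GCM, the throwaway key, and the paths here are assumptions made only for illustration.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Encrypts data with AES-GCM on the client before it is written to HDFS (illustrative sketch). */
public class EncryptedHdfsWriter {
    public static void main(String[] args) throws Exception {
        // Assumed key handling: a fresh key per run; a real deployment would use a key management service.
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream hdfsOut = fs.create(new Path("/secure/data.enc"));  // assumed path
             CipherOutputStream encrypted = new CipherOutputStream(hdfsOut, cipher)) {
            hdfsOut.write(iv);                                   // store the IV in clear ahead of the ciphertext
            encrypted.write("sensitive record".getBytes(StandardCharsets.UTF_8));
        }
        // The read path reverses these steps: read the IV, init the cipher in DECRYPT_MODE,
        // and wrap the FSDataInputStream in a CipherInputStream.
    }
}
```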
ISBN (print): 9781509024032
Today, the Hadoop Distributed File System (HDFS) is widely used to provide scalable and fault-tolerant storage of large volumes of data. One of the key issues that affect the performance of HDFS is the placement of data replicas. Although the current HDFS replica placement policy can achieve both fault tolerance and read/write efficiency, it cannot evenly distribute replicas across the cluster nodes and has to rely on a load balancing utility to balance replica distributions. In this paper, we present a new replica placement policy for HDFS, which generates replica distributions that are not only perfectly even but also meet all HDFS replica placement requirements.
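The abstract does not describe the placement rules themselves, so the sketch below shows only the basic idea of even placement: assign each new replica to the least-loaded DataNode that does not already hold a copy of the block. Rack awareness and the other HDFS placement requirements are deliberately left out, and all names here are assumptions rather than the paper's policy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy placement policy: keeps the per-node replica counts as even as possible. */
public class EvenReplicaPlacement {
    private final Map<String, Integer> replicasPerNode = new HashMap<>();

    public EvenReplicaPlacement(List<String> dataNodes) {
        dataNodes.forEach(n -> replicasPerNode.put(n, 0));
    }

    /** Choose target nodes for one block, never placing two replicas on the same node. */
    public List<String> place(int replicationFactor) {
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            String target = replicasPerNode.entrySet().stream()
                    .filter(e -> !chosen.contains(e.getKey()))   // one replica per node
                    .min(Map.Entry.comparingByValue())           // least-loaded node first
                    .orElseThrow()
                    .getKey();
            chosen.add(target);
            replicasPerNode.merge(target, 1, Integer::sum);
        }
        return chosen;
    }

    public static void main(String[] args) {
        EvenReplicaPlacement policy = new EvenReplicaPlacement(List.of("dn1", "dn2", "dn3", "dn4"));
        for (int block = 0; block < 4; block++) {
            System.out.println("block " + block + " -> " + policy.place(3));
        }
    }
}
```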
ISBN (print): 9781479999422
The Hadoop Distributed File System (HDFS) was developed to store huge volumes of data. Files are divided into blocks, and the replicated blocks are then stored on many DataNodes in a distributed manner. Although this makes HDFS fault tolerant, the random nature of the default block placement strategy may lead to load imbalance among the DataNodes. Moreover, the built-in load-balancing tool, Balancer, may reduce performance and consume a lot of network resources. Therefore, in this paper we consider all the situations that may influence the load-balancing state and propose a new load-balancing algorithm. In the proposed algorithm, a new role named BalanceNode is introduced to help match heavily loaded and lightly loaded DataNodes, so that the lightly loaded nodes can take over part of the load from the heavily loaded ones. The simulation results show that our algorithm achieves a better load-balancing state in HDFS than two existing algorithms.
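The matching step performed by the BalanceNode is not detailed in the abstract; one simple way to sketch it is to pair the DataNodes furthest above the mean utilisation with those furthest below it. Everything below (the `Node` type, the 10% threshold, the example figures) is an assumption of the sketch, not the paper's algorithm.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

/** Toy matcher: pairs heavily loaded DataNodes with lightly loaded ones around the mean utilisation. */
public class BalanceNodeMatcher {

    public record Node(String host, double utilisation) {}
    public record Pair(Node source, Node target) {}

    /** Pair nodes above the mean by more than the threshold with nodes below it by more than the threshold. */
    public static List<Pair> match(List<Node> nodes, double threshold) {
        double mean = nodes.stream().mapToDouble(Node::utilisation).average().orElse(0);

        Deque<Node> heavy = new ArrayDeque<>(nodes.stream()
                .filter(n -> n.utilisation() > mean + threshold)
                .sorted(Comparator.comparingDouble(Node::utilisation).reversed())
                .toList());
        Deque<Node> light = new ArrayDeque<>(nodes.stream()
                .filter(n -> n.utilisation() < mean - threshold)
                .sorted(Comparator.comparingDouble(Node::utilisation))
                .toList());

        List<Pair> pairs = new ArrayList<>();
        while (!heavy.isEmpty() && !light.isEmpty()) {
            pairs.add(new Pair(heavy.pollFirst(), light.pollFirst()));  // move blocks heavy -> light
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(new Node("dn1", 0.92), new Node("dn2", 0.35),
                                     new Node("dn3", 0.60), new Node("dn4", 0.15));
        match(cluster, 0.10).forEach(System.out::println);   // pairs dn1 with dn4; dn2 and dn3 stay put
    }
}
```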
Cloud computing is composed of a large number of distributed computation and storage resources to facilitate the management of distributed and shared data resources. It is a great challenge to ensure efficient access, via data replication, to such huge and widely distributed data in the cloud. To address this need, we propose an Efficient Data Access Scheme (EDAS) of data replication for the Hadoop Distributed File System (HDFS), which adaptively selects the replica of a data file from among the service nodes. HDFS is an open-source, cloud-based storage platform designed to be deployed on low-cost commodity hardware; in HDFS, data are distributed and replicated across a cluster of commodity machines. EDAS supports the access-node decision for replica data, so that users obtain quick access from adaptively chosen service nodes according to the load of those nodes. To provide high performance of replication access and achieve load balance among the service nodes, the proposed EDAS algorithm is implemented based on historical data access records from the metadata of HDFS and an anti-blocking probability selection method.
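The abstract does not define the anti-blocking probability selection method, so the sketch below only captures the general idea of load-aware replica access: a service node is chosen with probability inversely proportional to its current load. The `ServiceNode` record and the load values are assumptions made for the illustration.

```java
import java.util.List;
import java.util.Random;

/** Toy load-aware replica selection: a node is picked with probability inversely proportional to its load. */
public class LoadAwareReplicaSelector {

    public record ServiceNode(String host, double load) {}   // load in (0, 1], e.g. recent request rate

    private final Random random = new Random();

    public ServiceNode select(List<ServiceNode> candidates) {
        // Weight each candidate by the inverse of its load, so lightly loaded nodes are preferred.
        double[] weights = candidates.stream().mapToDouble(n -> 1.0 / Math.max(n.load(), 0.01)).toArray();
        double total = 0;
        for (double w : weights) total += w;

        double r = random.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0) return candidates.get(i);
        }
        return candidates.get(candidates.size() - 1);   // guard against floating-point rounding
    }

    public static void main(String[] args) {
        LoadAwareReplicaSelector selector = new LoadAwareReplicaSelector();
        List<ServiceNode> replicas = List.of(new ServiceNode("dn1", 0.9),
                                             new ServiceNode("dn2", 0.3),
                                             new ServiceNode("dn3", 0.5));
        for (int i = 0; i < 5; i++) System.out.println(selector.select(replicas));
    }
}
```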
The MapReduce programming model, in which the data nodes perform both the data storing and the computation, was introduced for big data processing. We therefore need to understand the different resource requirements of data storing and computation tasks and schedule them efficiently over multi-core processors. In particular, providing high-performance data storing has become more critical because of the continuously increasing volume of data uploaded to distributed file systems and database servers. However, analyzing the performance characteristics of the processes that store upstream data is very intricate, because both network and disk input/output (I/O) are heavily involved in their operations. In this paper, we analyze the impact of core affinity on both network and disk I/O performance and propose a novel approach to dynamic core affinity for high-throughput file upload. We consider the dynamic changes in processor load and the intensiveness of the file upload at run time, and accordingly decide the core affinity for service threads, with the objective of maximizing parallelism, data locality, and resource efficiency. We apply the dynamic core affinity to the Hadoop Distributed File System (HDFS). Measurement results show that our implementation improves the file upload throughput of end applications by more than 30% compared with the default HDFS, and provides better scalability. (C) 2014 Elsevier B.V. All rights reserved.
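The affinity decision itself can be sketched separately from the platform-specific pinning mechanism: Java offers no portable thread-pinning API, so the `pin` step below is left as a stub (on Linux it would go through something like sched_setaffinity via JNI/JNA). The load accounting and the 128 MB upload size are assumptions for the illustration, not the paper's design.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Sketch of a dynamic core-affinity decision for upload service threads. */
public class DynamicCoreAffinity {
    private final int cores = Runtime.getRuntime().availableProcessors();
    // Rough per-core load counters, updated as work is assigned (an assumption of the sketch;
    // a real system would read utilisation from the OS).
    private final AtomicLongArray coreLoad = new AtomicLongArray(cores);

    /** Pick the least-loaded core for a new upload service thread and record the assignment. */
    public int chooseCore(long expectedBytes) {
        int best = 0;
        for (int c = 1; c < cores; c++) {
            if (coreLoad.get(c) < coreLoad.get(best)) best = c;
        }
        coreLoad.addAndGet(best, expectedBytes);
        return best;
    }

    /**
     * Placeholder: actually binding a thread to {@code core} needs a platform-specific call
     * (e.g. sched_setaffinity through JNI/JNA on Linux); it is intentionally left unimplemented here.
     */
    public void pin(Thread t, int core) {
        System.out.printf("would pin %s to core %d%n", t.getName(), core);
    }

    public static void main(String[] args) {
        DynamicCoreAffinity affinity = new DynamicCoreAffinity();
        for (int i = 0; i < 4; i++) {
            Thread worker = new Thread(() -> { /* receive data and write blocks */ }, "upload-" + i);
            affinity.pin(worker, affinity.chooseCore(128L * 1024 * 1024));  // assume a 128 MB upload
            worker.start();
        }
    }
}
```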