ISBN (Print): 9781509017072
A cloud storage system is generally regarded as a very large-scale storage system built from independent storage servers. The service that cloud storage provides is remote storage of users' data over the network, which other authenticated users can then access easily. The Hadoop Distributed File System (HDFS) is used to store large files reliably and to deliver them to user applications at very high bandwidth. Hadoop splits files into large blocks and distributes them among the nodes of the cluster. When data is retrieved from the cloud, it is important to keep the computation and communication overhead low. To reduce communication overhead, the server should return only the top-n files matching the keyword when a user requests data files. Since the owner need not keep a copy of the files, it is all the more necessary to periodically verify the availability and integrity of the files stored on the server. In HDFS the computation is done in parallel, so the execution time is drastically reduced. The proposed system retrieves the top-n files using the Hadoop Distributed File System, so that both the search time and the communication overhead are greatly reduced.
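As a toy illustration of the top-n idea, the sketch below ranks files by keyword frequency and returns the n best matches. In the proposed system this ranking would run in parallel over HDFS blocks; here it is a plain in-memory sketch, and the file names and the term-frequency scoring are assumptions, not the paper's actual relevance measure.

```python
import heapq
from collections import Counter

def top_n_files(files, keyword, n):
    """Rank files by the frequency of `keyword` and return the n best.

    `files` maps a file name to its tokenized contents; term frequency
    is a stand-in for whatever relevance score the server computes.
    """
    scores = {name: Counter(tokens)[keyword] for name, tokens in files.items()}
    # heapq.nlargest avoids sorting the whole corpus when n << len(files)
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical corpus for demonstration only
corpus = {
    "a.txt": ["cloud", "storage", "cloud"],
    "b.txt": ["hadoop", "cloud"],
    "c.txt": ["network"],
}
print(top_n_files(corpus, "cloud", 2))  # → ['a.txt', 'b.txt']
```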
ISBN (Print): 9781467399548
Typically called big data processing, processing large volumes of data from geographically distributed regions with machine learning algorithms has emerged as an important analytical tool for governments and multinational corporations. The traditional wisdom calls for the collection of all the data across the world to a central datacenter location, to be processed using data-parallel applications. This is neither efficient nor practical as the volume of data grows exponentially. Rather than transferring data, we believe that computation tasks should be scheduled where the data is, while data should be processed with a minimum amount of transfers across datacenters. In this paper, we design and implement Flutter, a new task scheduling algorithm that improves the completion times of big data processing jobs across geographically distributed datacenters. To cater to the specific characteristics of data-parallel applications, we first formulate our problem as a lexicographical min-max integer linear programming (ILP) problem, and then transform it into a nonlinear program with a separable convex objective function and a totally unimodular constraint matrix, which can be solved using a standard linear programming solver efficiently in an online fashion. Our implementation of Flutter is based on Apache Spark, a modern framework popular for big data processing. Our experimental results have shown that we can reduce the job completion time by up to 25%, and the amount of traffic transferred among datacenters by up to 75%.
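Flutter solves the placement problem exactly with the ILP described above; as a rough intuition for the min-max objective, the hedged sketch below greedily assigns each task to the datacenter that keeps the largest per-datacenter finish time lowest. All names and cost figures are illustrative assumptions, not Flutter's actual formulation.

```python
def schedule_tasks(tasks, datacenters, transfer_time):
    """Greedy stand-in for a min-max completion-time objective: place each
    task on the datacenter that keeps the busiest datacenter least busy.

    `tasks` maps task -> compute time; `transfer_time[(task, dc)]`
    estimates moving the task's remote input data to `dc` (all
    hypothetical).
    """
    load = {dc: 0.0 for dc in datacenters}
    placement = {}
    for task, compute in tasks.items():
        # finish time on dc = current queue + data transfer + compute
        best = min(datacenters,
                   key=lambda dc: load[dc] + transfer_time[(task, dc)] + compute)
        load[best] += transfer_time[(task, best)] + compute
        placement[task] = best
    return placement, max(load.values())
```

A real lexicographic min-max solution also minimizes the second-largest finish time, then the third, and so on, which the greedy pass above does not guarantee.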
This paper proposes a series of novel routing algorithms for the problem of accessing large-scale heterogeneous information over the Internet. Unlike traditional approaches, the perspective on routing is updated in terms of the quality of information access, the types of resources, the dominant driver, and the expected consequences. Human capital is connected to form social capital in order to solve the routing problem. Although human capital dominates a particular information system, it is difficult to capture it directly to fulfill users' requirements. For that reason, social capital, which takes shape in three graphs (the data graph, the channel graph, and the human graph), is leveraged to perform the primary tasks. These graphs represent comprehensive perspectives on the environment and are constructed according to users' information-access behaviors. In the early phase, seed graphs are constructed to initialize the system. The routing does not attempt to find exactly the information resources users require; instead, it aims to provide users with relevant results by exploring nearby human capital.
ISBN (Print): 9781509032068
With the rapid development of big data applications, more and more data analytics jobs run on geographically distributed data centers. Recent works mainly focus on task and data placement to reduce data transmission among these geo-distributed data centers. In this paper, we argue that task execution delay may also affect response time, especially in hot-spot data centers. We define the geo-distributed workload-aware scheduling problem, which aims to minimize the overall delay of data transmission and task execution. We then prove it NP-complete and propose an online heuristic that effectively redistributes datasets and tasks, balancing the workload among data centers and optimizing the overall response time. Experiments show that our algorithm yields a significant performance improvement over a wide range of data distributions and can reduce job response time by up to 55% on average.
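The abstract does not spell out the online heuristic itself; as a hedged illustration of workload re-distribution, the sketch below repeatedly moves one task from the hottest datacenter to the coolest while the queueing delay avoided exceeds the added transfer delay. The cost model and all parameters are assumptions for demonstration.

```python
def rebalance(load, exec_delay, transfer_delay):
    """One-step-at-a-time workload rebalancing sketch.

    `load` maps datacenter -> number of queued tasks; each task takes
    `exec_delay` seconds to run, and migrating its input data costs
    `transfer_delay` seconds.  Returns the number of tasks moved.
    """
    moved = 0
    while True:
        hot = max(load, key=load.get)
        cool = min(load, key=load.get)
        # A task leaving `hot` skips (load[hot] - load[cool] - 1) queue slots
        saved = (load[hot] - load[cool] - 1) * exec_delay
        if saved <= transfer_delay:
            break  # migration no longer pays for itself
        load[hot] -= 1
        load[cool] += 1
        moved += 1
    return moved
```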
ISBN (Print): 9781509011322
The Internet of Things (IoT) is foreseen as one of the next imminent Internet revolutions, as many devices will seamlessly communicate with each other to provide new and exciting services to end users. One of the challenges the IoT has to face concerns both the heterogeneity of the available data and the heterogeneity of the communication. In this paper we focus on the former, presenting an architecture able to integrate data coming from different sources, including custom-made deployments and government data. New services can be deployed directly by end users, using reliable or unreliable data sources, and new processed data can be gathered by these services and used by others.
ISBN (Print): 9781467390064
The data warehouse system Hive has emerged as an important facility for supporting data computing and storage. In particular, RCFile is a tailor-made data placement structure implemented in Hive and designed for data processing efficiency. In this paper, we propose several optimized schemes based on RCFile and introduce EStore, an optimized data placement structure that improves the query rate and reduces storage space for Hive. Specifically, it adopts both row-store and column-store within blocks, and further classifies columns by the access frequency of each table column. Moreover, we employ the classic RDP code to store the files of the data table. We conduct experiments on a real cluster, and the results show that EStore outperforms RCFile in terms of data query rate and storage space.
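EStore's exact classification rule is not given in the abstract; the toy sketch below illustrates only the general idea of splitting columns into a hot group (stored together for fast scans) and a cold group by access frequency. The 50% cutoff is a pure assumption.

```python
def group_columns(access_counts, hot_fraction=0.5):
    """Split table columns into frequently-read and rarely-read groups.

    `access_counts` maps column name -> how often queries touch it.
    Returns (hot_columns, cold_columns), hottest first.
    """
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    cut = max(1, int(len(ranked) * hot_fraction))  # keep at least one hot column
    return ranked[:cut], ranked[cut:]
```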
The linkage between healthcare services and cloud computing techniques has drawn much attention lately. Up to the present, most works focus on IT system migration and the management of distributed healthcare data rather than taking advantage of the information hidden in the data. In this paper, we propose to explore healthcare data via cloud-based healthcare data mining services. Specifically, we propose a cloud-based healthcare data mining framework for developing healthcare data mining services. Under this framework, we further develop a cloud-based service that predicts a patient's future length of stay in hospital.
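The abstract does not specify the mining model used; as one hypothetical illustration of a length-of-stay predictor, the sketch below averages the stays of the k most similar past patients by Euclidean distance over numeric features. Both the model choice and the features are assumptions, not the paper's method.

```python
import math

def predict_los(history, patient, k=3):
    """Predict length of stay as the mean LoS of the k nearest past patients.

    `history` is a list of (feature_vector, los_days) pairs; `patient`
    is a feature vector of the same dimension.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort past patients by similarity to the new one, keep the k closest
    nearest = sorted(history, key=lambda rec: dist(rec[0], patient))[:k]
    return sum(los for _, los in nearest) / len(nearest)
```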
ISBN (Print): 9781509018079
Regenerating codes are efficient methods for distributed storage in practical networks where node failures are common. They guarantee low-cost data reconstruction and repair by accessing only a predefined number of arbitrarily chosen storage nodes in the network. In this work we study the fundamental limits of the required total repair bandwidth and the storage capacity of these codes under the assumptions that i) both data reconstruction and repair are resilient to the presence of a certain number of erroneous nodes in the network, and ii) the number of helper nodes in every repair is not fixed, but is a flexible parameter that can be selected at run-time. We focus on the minimum-repair-bandwidth point, propose an associated coding scheme that possesses both of these extra properties, and prove its optimality.
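For reference, in the classic error-free regenerating-code framework (without the error resilience and flexible helper count studied here), a file of size $\mathcal{M}$ stored with parameters $(n, k, d)$, per-node storage $\alpha$, and $\beta$ symbols downloaded from each of $d$ helpers has repair bandwidth $\gamma = d\beta$, and the minimum-bandwidth-regenerating (MBR) point satisfies:

```latex
\alpha_{\mathrm{MBR}} \;=\; \gamma_{\mathrm{MBR}} \;=\; \frac{2\mathcal{M}\,d}{k\,(2d - k + 1)}
```

The error-resilient, flexible-$d$ setting of this paper tightens these limits; the expression above is only the well-known baseline the work builds on.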
ISBN (Print): 9781509021864
We present an adaptive data aggregative window function (A-DAWF) for a distributed sensor network model in which nodes store data in their attribute window functions and forward non-correlated data to the base station (BS). Unlike previous works on data collection and data gathering management systems, we propose a novel approach that applies temporal redundancy suppression in sensor nodes as well as spatial redundancy filtering in cluster-head (CH) nodes. Preliminary results show that, compared with either periodic or continuous data transmission, A-DAWF can suppress up to 90% of temporally redundant data at the sensor nodes through an optimal window-size threshold, and filter their spatial correlations at the CH node through a maximum error threshold.
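As a minimal sketch of temporal redundancy suppression (not the paper's exact A-DAWF algorithm), a sensor node can forward a reading only when it deviates from the last transmitted value by more than an error threshold; everything in between is treated as redundant:

```python
def filter_readings(readings, threshold):
    """Suppress temporally redundant sensor readings.

    Forwards a reading only when it differs from the last transmitted
    value by more than `threshold`; returns the transmitted subset.
    """
    sent = []
    last = None
    for value in readings:
        if last is None or abs(value - last) > threshold:
            sent.append(value)   # transmit and remember this value
            last = value
    return sent
```

With a slowly drifting signal, most samples are dropped, which is the effect behind the reported up-to-90% suppression.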
This paper presents a detailed comparative study of centralized mobility management and distributed mobility management (DMM). It proposes a modified PMIPv6-based partially distributed mobility management operation procedure based on the DMM scenario defined in the Internet Engineering Task Force (IETF). A binding update list (BUL) and binding cache entry (BCE) are created by the Mobility Anchor and Access Router (MAAR) during the proxy binding update (PBU) and proxy binding acknowledgement (PBA) message exchange. Experimental results illustrate that MAARs in DMM can distribute and manage IP data packet flows, and that the new modification concept and simulation models are logically correct.
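The BCE bookkeeping described above can be sketched schematically: the snippet below shows a MAAR installing a binding cache entry when a PBU arrives and returning a PBA payload. All field names are illustrative, not taken from the PMIPv6 specification or the paper's implementation.

```python
def handle_pbu(binding_cache, mn_id, prefix, maar_addr):
    """Install a binding cache entry (BCE) for a mobile node on receiving
    a Proxy Binding Update, and build the Proxy Binding Acknowledgement.

    `binding_cache` maps mobile-node identifier -> its assigned prefix
    and anchoring MAAR address (field names are hypothetical).
    """
    binding_cache[mn_id] = {"prefix": prefix, "anchor": maar_addr}
    return {"type": "PBA", "mn_id": mn_id, "status": "accepted"}
```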