This paper proposes a two-level caching strategy for Web search queries devised to operate on P2P networks. The aim is to significantly reduce query traffic going from a large community of users to commercial search engines by placing between them a P2P caching service capable of storing and efficiently distributing frequent queries among users. The proposed design takes into consideration the highly dynamic nature of user queries, both in traffic intensity and in drastic shifts of user interest, both usually driven by unpredictable world-wide events. Each peer maintains an LRU result cache (RCache) used to keep the answers for queries originated in the peer itself and for queries the peer is responsible for, contacting a Web search engine on demand to get the query answers. When query traffic is predominantly routed to a few responsible peers, our strategy replicates the role of "being responsible" to neighboring peers so that they can absorb part of the traffic and restore load balance. This is a fairly slow and adaptive process that we call mid-term load balancing. To achieve a short-term fair distribution of queries, we introduce in each peer a location cache (LCache), which keeps pointers to peers that have already requested the same queries in the very recent past. This lets those peers share their query answers with newly requesting peers. The process is fast, as these popular queries are usually cached within the first DHT hop of a requesting peer, which quickly tends to redistribute load among more and more peers. A comparative study shows that the proposed strategy achieves better load balance, significantly smaller communication volume among peers, and larger cache hit ratios than previous strategies.
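To make the lookup flow concrete, here is a minimal Python sketch of the two caches named above. The RCache/LCache names come from the abstract; the hash-based DHT stand-in, the cache size, and the Peer/network structures are illustrative assumptions, not the paper's implementation:

```python
from collections import OrderedDict
from hashlib import sha1

class Peer:
    """Illustrative peer with the two caches from the abstract: an LRU
    result cache (RCache) and a location cache (LCache)."""

    def __init__(self, peer_id, network, cache_size=1000):
        self.peer_id = peer_id
        self.network = network        # peer_id -> Peer; stand-in for a DHT
        self.cache_size = cache_size
        self.rcache = OrderedDict()   # query -> results, in LRU order
        self.lcache = OrderedDict()   # query -> id of a recent requester

    def responsible_peer(self, query):
        # Hash the query onto the id space (consistent-hashing stand-in).
        h = int(sha1(query.encode()).hexdigest(), 16)
        ids = sorted(self.network)
        return ids[h % len(ids)]

    def lookup(self, query):
        # 1. Local hit in this peer's own RCache.
        if query in self.rcache:
            self.rcache.move_to_end(query)
            return self.rcache[query]
        # 2. Location hit: a peer recorded in the LCache fetched this
        #    query recently, so get the answer from it instead.
        if query in self.lcache:
            nearby = self.network[self.lcache[query]]
            if query in nearby.rcache:
                return self._store(query, nearby.rcache[query])
        # 3. Route to the responsible peer, which serves from its RCache
        #    or contacts the search engine on demand, then records the
        #    requester so later queries can be redirected to it.
        owner = self.network[self.responsible_peer(query)]
        results = owner.rcache.get(query) or owner.fetch_from_engine(query)
        owner._store(query, results)
        owner.lcache[query] = self.peer_id
        return self._store(query, results)

    def _store(self, query, results):
        self.rcache[query] = results
        if len(self.rcache) > self.cache_size:
            self.rcache.popitem(last=False)   # evict least recently used
        return results

    def fetch_from_engine(self, query):
        # Placeholder for the on-demand call to a commercial search engine.
        return "results for " + query
```

Step 2 is what redistributes load quickly: after one popular query passes through the responsible peer, later requesters find a pointer in their own first-hop LCache and fetch the answer from each other instead.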
In a broadcasting task, a source node wants to send the same message to all the other nodes in the network. Existing solutions address specific mobility scenarios, e.g. connected dominating set (CDS) based approaches for static networks, blind flooding for moderate mobility, and hyper flooding for highly mobile and frequently partitioned networks. We are interested in designing a single protocol that seamlessly (without using any parameter) adjusts itself to any mobility scenario, with the capability to address various model assumptions and optimality criteria. Existing approaches for all scenarios rely on threshold parameters (e.g. speed) to locally select among different algorithms, so different nodes may run different algorithms. Here we describe a novel general BSM (Broadcasting from Static to Mobile) framework, built over several recent algorithms that handle special cases. It aims at a high delivery rate with low message cost, and addresses intermittent connectivity and delay minimization. Each node activates with respect to the broadcast message whenever it identifies one or more neighbors in need of the message. Upon activation, it selects a waiting time (dynamically adjusted upon reception of any message) depending on the number of such neighbors, the distance to their centroid, and its CDS membership. It competes (at the MAC layer) to retransmit at timeout expiry if it still believes that a neighbor needs the message. We map this algorithm to a variety of multi-hop wireless network scenarios, with and without positional information, acknowledgments, and time-criticality goals. Some existing solutions are derived as special cases, and we also show how to deliver warnings in a timely manner in vehicular networks with arbitrary road structure, without using road maps.
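As a rough illustration of the timeout selection described above, the Python sketch below computes a waiting time from the three quantities the abstract names: the number of neighbors needing the message, the distance to their centroid, and CDS membership. The weights and the combining formula are assumptions for the sketch, not the paper's definition:

```python
import math

def waiting_time(my_pos, needy_neighbors, in_cds,
                 t_max=0.1, radio_range=100.0):
    """Pick a retransmission timeout: shorter for nodes that cover more
    uninformed neighbors, sit close to their centroid, and belong to the
    CDS, so the best-placed relay tends to win the MAC-layer race."""
    if not needy_neighbors:
        return None  # no neighbor needs the message; stay silent
    # Centroid of the neighbors that still need the message.
    cx = sum(x for x, y in needy_neighbors) / len(needy_neighbors)
    cy = sum(y for x, y in needy_neighbors) / len(needy_neighbors)
    dist = math.hypot(my_pos[0] - cx, my_pos[1] - cy)
    # More needy neighbors and a smaller centroid distance both shrink
    # the waiting time; CDS members get extra priority (assumed weights).
    coverage = 1.0 / (1.0 + len(needy_neighbors))
    closeness = dist / radio_range
    cds_factor = 0.5 if in_cds else 1.0
    return t_max * cds_factor * (coverage + closeness) / 2.0
```

A node would recompute this whenever it receives a message (since the set of needy neighbors may shrink) and retransmit at expiry only if some neighbor still needs the message, matching the behavior described above.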
Fault tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as the increasing number of cores decreases the mean time to failure of these systems. While checkpointing, including checkpointing of parallel programs such as MPI applications, provides a general solution, the overhead of this approach is becoming increasingly unacceptable. Algorithm-based fault tolerance therefore provides a practical alternative, though it is less general. Although this approach has been studied for many applications, there is no existing work on algorithm-based fault tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault tolerance solution that handles fail-stop failures for a class of data-intensive algorithms. We divide the dataset into smaller data blocks and, in the replication step, distribute the replicated blocks with the aim of minimizing the maximum data intersection between any two processors. This minimizes data loss when multiple failures occur. In addition, our approach enables better load balance after a failure and decreases the amount of re-processing of lost data. We have evaluated our approach using two popular parallel data mining algorithms, k-means and Apriori. We show that our approach has negligible overhead when there are no failures, and allows us to gracefully handle different numbers of failures and failures at different points of processing. We also compare our approach with a MapReduce-based solution for fault tolerance, and show that we outperform Hadoop both in the absence and in the presence of failures.
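The replication goal, keeping the maximum data intersection between any two processors small, can be illustrated with a simple greedy placement in Python. The greedy rule and tie-breaking below are assumptions for the sketch, not the paper's actual scheme:

```python
from itertools import combinations

def place_replicas(num_blocks, num_procs, r=2):
    """Assign r replicas of each block to processors so that the worst
    pairwise overlap (blocks shared by any two processors) grows slowly,
    with processor load as a tie-breaker."""
    overlap = [[0] * num_procs for _ in range(num_procs)]
    load = [0] * num_procs
    placement = []
    for _ in range(num_blocks):
        chosen = []
        for _ in range(r):
            best, best_key = None, None
            for p in range(num_procs):
                if p in chosen:
                    continue
                # Cost of adding p: worst overlap it would create with the
                # replicas already chosen for this block, then current load.
                worst = max((overlap[p][q] for q in chosen), default=0)
                key = (worst, load[p])
                if best_key is None or key < best_key:
                    best, best_key = p, key
            chosen.append(best)
        for p, q in combinations(chosen, 2):
            overlap[p][q] += 1
            overlap[q][p] += 1
        for p in chosen:
            load[p] += 1
        placement.append(chosen)
    return placement
```

With r=2, for example, this spreads the block pairs across distinct processor pairs for as long as possible, so losing any two processors destroys at most a few blocks entirely and the survivors share the re-processing evenly.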
A recent trend for producing performance reports is feedback mining over a set of results to form an accurate picture of performance. Here the focus is mining examination results to produce a feedback report on a student's performance. The main challenge is that, as the number of students increases, running feedback mining on a single node becomes very slow. Our proposed solution to this challenge is to use e-learning systems together with grid computing facilities. The proposed system combines a Hyper Grid Learning System (HGLS) with an HPC resource kit (high-performance computing resource kit). This system effectively reduces the server-side operation time of feedback mining because distributed computing is used. The proposed system is more efficient than the existing system, performing the computational work with lower complexity and higher efficiency.
ISBN (print): 9781450313070
The purpose of this talk is to provide a comprehensive state of the art concerning the evolution of data management systems from uni-processor systems to large-scale distributed systems. We focus our study on query processing and optimization methods. For each environment, we recall their motivations and point out the main characteristics of the proposed methods, especially the nature of decision-making (centralized or decentralized control for a high level of scalability), the adaptivity level (intra-operator and/or inter-operator), the impact of parallelism (partitioned and pipelined parallelism), and the dynamicity (e.g. elasticity) of execution models.
Every day, we create 2.5 quintillion bytes of data - so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. The IDC sizing of the digital universe - information that is either created or captured in digital form and then replicated - was 161 exabytes in 2006, growing to 988 exabytes in 2010, a compound annual growth rate (CAGR) of 57%. A variety of system architectures have been implemented for data-intensive computing and large-scale data analysis applications, including parallel and distributed relational database management systems, which have been available to run on shared-nothing clusters of processing nodes for more than two decades. However, most data growth is in unstructured data, and new processing paradigms with more flexible data models were needed. Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in an open-source implementation called Hadoop, used by Yahoo, Facebook, and others. 20% of the world's servers go into the huge data centers of the "Big 5" - Google, Microsoft, Yahoo, Amazon, and eBay [1].
The current trend in distributed energy system control is moving toward quasi-decentralized control strategies, where control signals are exchanged over shared dedicated control or communication networks, which reduces wiring complexity and enhances reliability. Transmitting control signals through shared networks induces time delays and data losses that may destabilize the system, so it is important to identify the maximum allowable delay bound (MADB) that can be tolerated while preserving system stability. This paper proposes a new method for estimating the allowable time delay under which system stability can be maintained. The proposed method originates from the analysis of network-controlled parallel DC/DC buck converters. The influences of the system parameters and controllers on the boundary of the MADB are studied. Simulation using the Matlab SimPowerSystems Toolbox is performed to verify the stability bound determined by the MADB. The results are compared with methods previously reported in the literature, and the new method proves simpler in its estimation procedure and easier to apply to practical systems.
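The abstract does not give the estimation procedure, but the idea of an MADB can be illustrated with a standard discrete-time check: augment the closed-loop state with the delayed inputs and find the largest delay whose augmented system matrix still has spectral radius below one. The NumPy sketch below assumes a discretized plant (A, B) and a state-feedback gain K, e.g. from a linearized buck-converter model; it is an illustrative check, not the paper's method:

```python
import numpy as np

def stable_with_delay(A, B, K, d):
    """Spectral-radius stability test for x[k+1] = A x[k] + B u[k-d]
    with state feedback u[k] = -K x[k], via delay augmentation."""
    n, m = B.shape
    if d == 0:
        return max(abs(np.linalg.eigvals(A - B @ K))) < 1.0
    size = n + d * m
    Phi = np.zeros((size, size))
    Phi[:n, :n] = A                    # plant dynamics
    Phi[:n, size - m:] = B             # plant sees the d-steps-old input
    Phi[n:n + m, :n] = -K              # fresh input enters the delay buffer
    for i in range(d - 1):             # shift the buffered inputs one step
        r = n + (i + 1) * m
        Phi[r:r + m, n + i * m:n + (i + 1) * m] = np.eye(m)
    return max(abs(np.linalg.eigvals(Phi))) < 1.0

def madb_steps(A, B, K, d_max=200):
    """Largest delay, in sampling periods, that keeps the loop stable."""
    best = None
    for d in range(d_max + 1):
        if not stable_with_delay(A, B, K, d):
            break
        best = d
    return best
```

Sweeping the delay upward and recording the last stable value gives the MADB in sampling periods; multiplying by the sampling time converts it to seconds for comparison against measured network delays.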
In this paper we describe the use of a cloud computing platform to support the distributed creation of conceptual models based on the FCA (Formal Concept Analysis) framework. FCA is one of the approaches that can be applied in the process of conceptual data analysis. An extension of classical FCA (binary table data) is the (one-sided) fuzzy version, which works with different types of lattice-based attributes (binary, ordinal, interval-based, etc.) in the object-attribute table. This extension, so-called generalized one-sided concept lattices, gives a researcher or data analyst the possibility to use fuzzy FCA on object-attribute tables without the specific unified pre-processing that is usually expected in practical data mining or online analytical tools. The computational complexity of creating concept lattices from large contexts (data tables) is considerable, and the interpretability of huge concept lattices is also problematic. Therefore, we also propose a solution for creating a simple hierarchy of smaller FCA models. The starting data table is decomposed into smaller sets of objects, and one concept lattice is built for every subset using the generalized one-sided construction. Such small FCA-based models are easier to interpret, and can also be combined into one hierarchy of models using simple hierarchical clustering based on the descriptions of the particular models (as weighted vectors of attributes), which a data analyst can search in an analytical tool. Cloud infrastructure is then used to increase computational effectiveness, because the particular models are built in a parallel/distributed way. This cloud module can be part of a more complex data analytical system, which is also presented at the end of the paper.
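For readers unfamiliar with FCA, the classical binary case that the paper generalizes can be sketched in a few lines of Python: a formal concept is a pair (extent, intent) closed under the two derivation operators. The naive enumeration below works only for tiny binary contexts and does not implement the generalized one-sided fuzzy variant used in the paper:

```python
from itertools import chain, combinations

def extent_of(attrs, context):
    """Objects that have every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

def intent_of(objs, context, all_attrs):
    """Attributes shared by every object in objs."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(all_attrs)

def concepts(context):
    """All formal concepts (extent, intent) of a small binary context,
    found naively by closing every attribute subset."""
    all_attrs = sorted(set().union(*context.values()))
    seen, result = set(), []
    for subset in chain.from_iterable(
            combinations(all_attrs, k) for k in range(len(all_attrs) + 1)):
        ext = extent_of(set(subset), context)
        if frozenset(ext) not in seen:       # each extent yields one concept
            seen.add(frozenset(ext))
            result.append((ext, intent_of(ext, context, all_attrs)))
    return result

# Tiny object-attribute context and its concepts:
ctx = {"o1": {"a", "b"}, "o2": {"b", "c"}, "o3": {"b"}}
for ext, inte in concepts(ctx):
    print(sorted(ext), sorted(inte))
```

The exponential enumeration here is exactly the scaling problem the paper sidesteps: decomposing the object set, building one small lattice per subset in parallel on the cloud, and clustering the resulting models into a searchable hierarchy.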
MapReduce is a highly efficient model for processing large data sets (greater than 1 TB) in parallel, widely used in cloud computing environments. The current mainstream cloud computing service providers have adopted Hadoop, the open-source MapReduce implementation, to build their cloud computing platforms. However, like all open distributed computing frameworks, MapReduce suffers from a service integrity assurance vulnerability: it takes merely one malicious worker to render the overall computation result useless. It is therefore very important to efficiently detect malicious workers in a cloud computing environment. Existing solutions are not effective enough at defeating the malicious behaviour of non-collusive and collusive workers. In this paper, we focus on the mappers, which typically constitute the majority of workers. On the basis of existing frameworks, we make the master manage the computing workers based on security levels, and we introduce a trusted verifier worker and a caching mechanism. According to our system analysis, the service integrity assurance framework suggested in this paper detects malicious workers in a MapReduce-based cloud computing environment more efficiently and accurately.
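One common way to realize a trusted verifier with caching, in the spirit of the framework described above, is probabilistic replay: re-execute a random sample of map tasks on a trusted node and cache the verified outputs so repeated inputs are checked cheaply. The sampling rate, cache policy, and function names in this Python sketch are assumptions, not the paper's exact mechanism:

```python
import random

def run_with_verification(tasks, mapper, trusted_mapper,
                          sample_rate=0.2, verified_cache=None):
    """Run map tasks on untrusted workers; a trusted verifier replays a
    random sample, and verified outputs are cached per input (inputs are
    assumed hashable) so repeats are checked without re-execution."""
    if verified_cache is None:
        verified_cache = {}
    results, suspects = {}, []
    for task_id, data in tasks.items():
        out = mapper(data)                      # possibly malicious worker
        if data in verified_cache:              # cached verdict: compare only
            if out != verified_cache[data]:
                suspects.append(task_id)
                out = verified_cache[data]
        elif random.random() < sample_rate:     # sampled: trusted replay
            truth = trusted_mapper(data)
            verified_cache[data] = truth
            if out != truth:
                suspects.append(task_id)
                out = truth
        results[task_id] = out
    return results, suspects
```

A non-collusive cheater is caught with probability roughly the sampling rate per bad task, so its detection probability approaches one as it keeps cheating; catching collusive workers requires the additional cross-checks across security levels that the paper's master-side management is meant to provide.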
ISBN (print): 9781467300421
The continuous growth of social web applications, along with the development of sensor capabilities in electronic devices, is creating countless opportunities to analyze the enormous amounts of data continuously streaming from these applications and devices. To process large-scale data on large computing clusters, MapReduce was introduced as a framework for parallel computing. However, most current implementations of the MapReduce framework support only the execution of fixed-input jobs. This restriction makes these implementations inapplicable to most streaming applications, in which queries are continuous in nature and input data streams are received continuously at high arrival rates. In this demonstration, we showcase M3, a prototype implementation of the MapReduce framework in which continuous queries over streams of data can be answered efficiently. M3 extends Hadoop, the open-source implementation of MapReduce, bypassing the Hadoop Distributed File System (HDFS) to support main-memory-only processing. Moreover, M3 supports continuous execution of the Map and Reduce phases, where individual Mappers and Reducers never terminate.
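A toy version of this execution model, never-terminating map and reduce stages over in-memory queues, can be sketched with Python threads; this illustrates the idea only and says nothing about how M3 actually wires into Hadoop:

```python
import threading, queue

def continuous_wordcount(source_q, snapshot_every=100):
    """Continuous word count: map and reduce loops run forever on
    in-memory queues, mirroring main-memory-only processing with no
    filesystem in the pipeline."""
    pairs_q = queue.Queue()
    counts, lock = {}, threading.Lock()

    def mapper():
        while True:                         # the Mapper never terminates
            line = source_q.get()           # blocks on the live input stream
            for word in line.split():
                pairs_q.put((word, 1))

    def reducer():
        seen = 0
        while True:                         # the Reducer never terminates
            word, n = pairs_q.get()
            with lock:
                counts[word] = counts.get(word, 0) + n
            seen += 1
            if seen % snapshot_every == 0:  # periodically emit current answer
                with lock:
                    print(dict(counts))

    threading.Thread(target=mapper, daemon=True).start()
    threading.Thread(target=reducer, daemon=True).start()
    return counts                           # live view of the continuous query

# Usage: stream = queue.Queue(); counts = continuous_wordcount(stream)
#        stream.put("to be or not to be")
```

The key contrast with fixed-input jobs is visible in the two `while True` loops: there is no end-of-input barrier between phases, so answers are refreshed as data arrives rather than computed once after a job completes.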