Distributed vertex-centric graph processing systems have recently been proposed to perform different types of analytics on large graphs. These systems exploit the parallelism of shared-nothing clusters. In this work we propose a novel model for the performance cost of such systems. We also define novel metrics related to the workload balance and network communication cost of clusters processing massive real graph datasets. We empirically investigate the effects of different graph partitioning mechanisms and their trade-offs for two different categories of graph processing algorithms.
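As a rough illustration of the kind of metrics involved, the sketch below computes a workload-imbalance ratio and an edge-cut communication proxy for a given vertex partitioning. The function name and exact formulas are our own, not the paper's:

```python
# A minimal sketch (not the paper's actual metrics): workload balance and
# edge-cut communication cost for a vertex partitioning of an undirected graph.
from collections import Counter

def balance_and_edge_cut(edges, partition):
    """edges: iterable of (u, v); partition: dict vertex -> partition id."""
    sizes = Counter(partition.values())
    # Workload imbalance: largest partition relative to the average size.
    imbalance = max(sizes.values()) / (sum(sizes.values()) / len(sizes))
    # Communication proxy: edges whose endpoints lie in different partitions.
    cut = sum(1 for u, v in edges if partition[u] != partition[v])
    return imbalance, cut

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
partition = {0: 0, 1: 0, 2: 1, 3: 1}          # two partitions of four vertices
print(balance_and_edge_cut(edges, partition))  # (1.0, 3)
```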
Storing highly skewed data in a distributed system has become a very frequent issue, in particular with the emergence of the Semantic Web and Big Data. This skew often leads to biased data dissemination among nodes. Addressing load imbalance is necessary, especially to minimize response time and to avoid the workload being handled by only one or a few nodes. Our contribution aims at dynamically managing load imbalance by allowing multiple hash functions on different peers, while maintaining the consistency of the overlay. Our experiments, on highly skewed data sets from the Semantic Web, show that we can distribute data over at least 300 times more peers than when not using any load balancing strategy.
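The following sketch is only a generic illustration of why several hash functions help with skew; it is not the paper's overlay protocol. Each item gets a few candidate peers derived from independently salted hashes and is placed on the least-loaded one, so items sharing a hot key can spread across that key's candidates:

```python
# Generic multi-hash load spreading (an illustration, not the paper's scheme).
import hashlib

NUM_PEERS, NUM_HASHES = 8, 3

def candidate_peers(key):
    # NUM_HASHES independent candidate peers, derived by salting the key.
    return [int(hashlib.sha256(f"{salt}:{key}".encode()).hexdigest(), 16) % NUM_PEERS
            for salt in range(NUM_HASHES)]

load = [0] * NUM_PEERS
# Highly skewed input: a few hot keys occur very often (as in RDF subjects).
items = [f"subject{i % 5}" for i in range(1000)]
for key in items:
    peer = min(candidate_peers(key), key=lambda p: load[p])  # least-loaded choice
    load[peer] += 1
print(load)  # far flatter than sending every item to a single hashed peer
```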
Cloud databases are a rapidly growing trend in the cloud computing market. They enable clients to run their computations on outsourced databases or to access distributed database services in the cloud. At ...
The emergence of Big Data applications provides new challenges in data management, such as the processing and movement of masses of data. Volunteer computing has proven itself as a distributed paradigm that can fully support Big Data generation. This paradigm uses a large number of heterogeneous and unreliable Internet-connected hosts to provide peta-scale computing power for scientific projects. With the increase in data size and in the number of devices that can potentially join a volunteer computing project, host bandwidth can become a main hindrance to the analysis of the data generated by these projects, especially if the analysis is a concurrent approach based on either in-situ or in-transit processing. In this paper, we propose a bandwidth model for volunteer computing projects based on real trace data taken from the Docking@Home project, with more than 280,000 hosts over a 5-year period. We validate the proposed statistical model using model-based and simulation-based techniques. Our modeling provides us with valuable insights on the concurrent integration of data generation with in-situ and in-transit analysis in the volunteer computing paradigm.
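The abstract does not give the model itself, but the general workflow it describes, fitting a statistical distribution to traced host bandwidths and then sampling from the fit for simulation-based validation, looks roughly like this. The log-normal form and all parameters below are purely our assumption, not the paper's result:

```python
# A toy sketch of the fit-then-validate workflow (not the paper's model).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for real trace data: per-host bandwidth in Mbit/s.
observed = rng.lognormal(mean=1.5, sigma=0.8, size=5000)

shape, loc, scale = stats.lognorm.fit(observed, floc=0)   # model-based fit
synthetic = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                              size=5000, random_state=rng)  # simulated hosts

# Validation check: compare the empirical and fitted distributions.
ks = stats.ks_2samp(observed, synthetic)
print(f"fitted sigma={shape:.2f}, median={scale:.2f} Mbit/s, KS p={ks.pvalue:.3f}")
```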
ISBN (print): 9781479967162
Data-intensive scientific applications are posing many challenges in distributed computing systems. In the scientific field, application data are expected to double every year over the next decade and beyond. With this continuing data explosion, high performance computing systems are needed to store and process data efficiently, and workflow technologies are used to automate these scientific applications. Scientific workflows are typically very complex. They usually have a large number of tasks and need a long time for execution. Running scientific workflow applications usually requires not only high performance computing resources but also massive storage. The emergence of cloud computing technologies offers a new way to develop scientific workflow systems. Scientists can upload their data and launch their applications on scientific cloud workflow systems from anywhere in the world via the Internet, and they only need to pay for the resources their applications use. As all the data are managed in the cloud, it is easy to share data among scientists. This kind of model is very convenient for users, but remains a big challenge to the system. This paper proposes several research topics of data management in scientific cloud workflow systems, and discusses their research methodologies and state-of-the-art solutions.
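At its core, a scientific workflow is a DAG of tasks with data dependencies, and the engine dispatches each task once its inputs are ready. A minimal sketch of that abstraction (the task names are invented for illustration and are not from the paper):

```python
# Minimal workflow-as-DAG sketch: run tasks in dependency order.
from graphlib import TopologicalSorter

workflow = {
    "clean":     {"ingest"},   # clean depends on ingest
    "analyze":   {"clean"},
    "visualize": {"analyze"},
    "archive":   {"analyze"},
}

for task in TopologicalSorter(workflow).static_order():
    print(f"running {task}")   # a real engine would dispatch to cloud resources
```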
The performance of sparse matrix-vector multiplication (SMVM) on a parallel system is strongly affected by the distribution of data among its components. Two costs arise from the data mapping method used: arithmetic and communication. The communication cost often dominates the arithmetic cost, and the gap between the two tends to increase. Therefore, finding a mapping method that reduces the communication cost is of high importance. On the other hand, the load distribution among the processing units must not be sacrificed. In this paper, a data mapping method is proposed for SMVM on a Network-on-Chip which achieves a balanced workload and reduces the communication cost. Afterwards, an FPGA-based architecture is introduced which is designed to fit with the proposed data mapping method.
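To illustrate the balance-versus-communication trade-off in the simplest possible setting (this is not the paper's NoC mapping), the sketch below splits the rows of a CSR matrix into contiguous blocks with roughly equal numbers of nonzeros: each processing element then does similar arithmetic work, while contiguity keeps vector accesses local.

```python
# Contiguous row blocks with balanced nonzero counts (illustrative only).
import numpy as np
from scipy.sparse import random as sprandom

A = sprandom(1000, 1000, density=0.01, format="csr", random_state=0)

def balanced_row_blocks(indptr, num_units):
    nnz = indptr[-1]
    bounds, target = [0], nnz / num_units
    for k in range(1, num_units):
        # First row index whose cumulative nnz reaches k * target.
        bounds.append(int(np.searchsorted(indptr, k * target)))
    bounds.append(len(indptr) - 1)
    return bounds

bounds = balanced_row_blocks(A.indptr, num_units=4)
x = np.ones(A.shape[1])
y = np.concatenate([A[bounds[i]:bounds[i + 1]] @ x for i in range(4)])
assert np.allclose(y, A @ x)   # blockwise SMVM matches the full product
print([A.indptr[bounds[i + 1]] - A.indptr[bounds[i]] for i in range(4)])  # nnz/unit
```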
The MapReduce paradigm is one of the best solutions for implementing distributed applications which perform intensive data processing. In terms of performance for this type of application, MapReduce can be improved by adding GPU capabilities. In this context, GPU clusters for large-scale computing can bring a considerable increase in the efficiency and speedup of data-intensive applications. In this article we present a framework for executing MapReduce using GPU programming. We describe several improvements to the concept of GPU MapReduce and we compare our solution with others.
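For reference, the MapReduce pattern itself reduces to three phases: map, shuffle-by-key, reduce. The sketch below is a pure-Python stand-in, not the article's framework; a GPU implementation would run the map phase as a kernel over the input elements in parallel:

```python
# Minimal MapReduce skeleton: map, group by key, reduce.
from collections import defaultdict
from functools import reduce

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:                 # map phase: data-parallel on a GPU
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    return {k: reduce(reducer, vs) for k, vs in groups.items()}  # reduce phase

lines = ["big data on gpu", "gpu map reduce", "big gpu"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda a, b: a + b)
print(counts)  # {'big': 2, 'data': 1, 'on': 1, 'gpu': 3, 'map': 1, 'reduce': 1}
```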
K-means, a simple but effective clustering algorithm, is widely used in the data mining, machine learning, and computer vision communities. The k-means algorithm consists of the initialization of cluster centers and iteration. The initial cluster centers have a great impact on the clustering result and on algorithm efficiency: more appropriate initial centers let k-means get closer to the optimum solution and converge much more quickly. In this paper, we propose a novel clustering algorithm, Kmms, an abbreviation of k-Means and Mean Shift. It is a density-based algorithm. Experiments show our algorithm not only costs less initialization time than other density-based algorithms, but also achieves better clustering quality and higher efficiency. Compared with the popular k-means++ algorithm, our method gets comparable accuracy, and mostly even better. Furthermore, we parallelize the Kmms algorithm based on OpenMP in both the initialization and iteration steps, and prove the convergence of the algorithm.
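To make the two-stage structure concrete, here is a generic sketch of density-aware seeding followed by standard Lloyd iterations. It is not the exact Kmms procedure (which couples k-means with mean shift); the bandwidth-based density estimate and the seeding score are our own simplifications:

```python
# Density-based seeding + Lloyd-style k-means refinement (generic sketch).
import numpy as np

def density_seeds(X, k, bandwidth):
    # Local density: number of neighbors within `bandwidth` of each point.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    density = (d2 < bandwidth ** 2).sum(1)
    seeds = [int(np.argmax(density))]
    for _ in range(k - 1):
        # Next seed: a dense point far from the seeds already chosen.
        score = density * d2[:, seeds].min(1)
        seeds.append(int(np.argmax(score)))
    return X[seeds].copy()

def kmeans(X, centers, iters=50):
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        centers = np.stack([X[labels == j].mean(0) for j in range(len(centers))])
    return centers, labels

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.3, (100, 2)) for c in (0, 3, 6)])
centers, labels = kmeans(X, density_seeds(X, k=3, bandwidth=0.5))
print(np.round(centers, 2))  # close to the three generating centers
```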
TOUGH2 is a general-purpose numerical simulation program for multi-dimensional, multiphase, multicomponent fluid flows, heat transfer, and contaminant transport in porous and fractured media. It has been used worldwide for geothermal reservoir engineering, nuclear waste isolation, environmental assessment and remediation, and modeling flow and transport in variably saturated media. TOUGH2 is very computationally intensive, and the accuracy and scope of a simulation are limited by the amount of processing power available on a single computer. This makes it an ideal candidate for parallel computing, as more CPU power and memory become available. Furthermore, TOUGH2's main computational unit is a linear equation solver, and in parallel computing much effort has been spent on developing highly efficient parallel linear equation solvers. In this paper, we present TOUGH2-PETSc, a parallel implementation of TOUGH2 that uses PETSc to solve the linear systems in TOUGH2. PETSc is a library of high-performance linear and non-linear equation solvers that has been thoroughly tested at scale. Based on TOUGH2 and PETSc, TOUGH2-PETSc gives TOUGH2 users the potential to perform larger-scale and higher-resolution simulations. Experimental results demonstrate that the parallel TOUGH2-PETSc shows improved performance over the sequential version.
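The underlying pattern is to delegate the simulator's inner linear solves to a scalable Krylov solver. The sketch below uses SciPy's conjugate gradient purely as a stand-in for brevity; TOUGH2-PETSc itself calls PETSc's solvers, which run the same kind of iteration in parallel across processes:

```python
# Krylov solve of a sparse SPD system (SciPy stand-in for a PETSc KSP solve).
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 10000
# 1-D Laplacian as a simple symmetric positive-definite test system.
A = diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = cg(A, b, rtol=1e-8)   # rtol keyword requires SciPy >= 1.12
assert info == 0                # 0 means the iteration converged
print("residual:", np.linalg.norm(A @ x - b))
```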
Due to their inherently parallel and non-deterministic nature, P system implementations require vast computing and storage resources. This significantly limits their applications, even more so when the calculation of al...