Iterative computation is pervasive in applications such as data mining, web ranking, graph analysis, and online social network analysis. These applications typically involve massive data sets containing millions or billions of records, which creates demand for distributed computing frameworks that can process such data on a cluster of machines. MapReduce is one such framework. However, MapReduce lacks built-in support for iterative processing, which requires passing over the data set repeatedly. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with separate map and reduce functions, and provides automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of repeatedly creating new MapReduce jobs, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop and show that iMapReduce can achieve up to a 5-fold speedup over Hadoop for iterative algorithms.
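The driver-side pattern this abstract describes (resubmitting a map/reduce round and testing convergence at the client) can be sketched in a few lines. This is a single-process illustration only, not iMapReduce or Hadoop code: the PageRank-style update, the function names `map_rank`, `reduce_rank`, and `run_iterative`, and the damping and tolerance values are all assumptions chosen for the example.

```python
from collections import defaultdict


def map_rank(node, rank, neighbors):
    """Map step: spread a node's rank evenly across its out-links."""
    share = rank / len(neighbors) if neighbors else 0.0
    return [(dst, share) for dst in neighbors]


def reduce_rank(contributions, n_nodes, damping=0.85):
    """Reduce step: combine incoming shares into a new rank (assumed damping)."""
    return (1.0 - damping) / n_nodes + damping * sum(contributions)


def run_iterative(graph, max_iters=100, tol=1e-6):
    """Driver loop: one map/reduce round per iteration, convergence test at the client."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        # "Shuffle": group the mapped shares by destination node.
        grouped = defaultdict(list)
        for node, neighbors in graph.items():  # the graph is the static data
            for dst, share in map_rank(node, ranks[node], neighbors):
                grouped[dst].append(share)
        new_ranks = {node: reduce_rank(grouped[node], n) for node in graph}
        # Convergence test: total absolute change in rank across all nodes.
        delta = sum(abs(new_ranks[k] - ranks[k]) for k in graph)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks, iterations


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    final_ranks, rounds = run_iterative(toy_graph)
    print(rounds, final_ranks)
```

In this sketch the graph structure never changes between rounds while the ranks do, which is the static versus dynamic data distinction behind point (2) of the abstract.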
Data privacy is one of the most critical concerns in cloud computing, and many privacy-preserving distributed computing systems based on a trusted execution environment (e.g., Intel SGX) have been proposed to protect users' privacy during cloud-outsourced computation. However, these SGX-based solutions are vulnerable to some traffic analyses, and loading all tasks into the enclave introduces considerable overhead from frequent EPC paging. In this article, we propose the T-SGX framework, which preserves the confidentiality of a distributed job and maintains system efficiency by dynamically loading an enclave shared object for the task being processed. In T-SGX, all these objects are secret-shared and stored in a verifiable distributed share management system (SMS) outside the TCB. To mitigate the exposure of sensitive information, we present an efficient oblivious transfer (OT) protocol under the Decisional Diffie-Hellman (DDH) assumption for obliviously transmitting the desired shares. A detailed security analysis demonstrates that T-SGX achieves secure distributed computing without privacy leakage to unauthorized parties. Finally, we benchmark the framework on six real-world applications, and the experimental results show that T-SGX significantly outperforms a state-of-the-art solution, with 11.9%-29.7% less overhead when running an SGX-based application.
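The idea of storing an object as secret shares can be illustrated with a minimal sketch. The abstract does not say which sharing scheme T-SGX uses, so the n-of-n additive scheme below is an assumption made purely for illustration; the modulus and the names `split_secret` and `reconstruct` are likewise hypothetical.

```python
import secrets

# Assumed modulus for the example (the Mersenne prime 2**127 - 1).
MODULUS = 2**127 - 1


def split_secret(secret, n_shares, modulus=MODULUS):
    """Split an integer into n additive shares; all n are needed to recover it."""
    if not 0 <= secret < modulus:
        raise ValueError("secret must lie in [0, modulus)")
    if n_shares < 2:
        raise ValueError("need at least two shares")
    # The first n-1 shares are uniformly random; the last one fixes the sum.
    shares = [secrets.randbelow(modulus) for _ in range(n_shares - 1)]
    shares.append((secret - sum(shares)) % modulus)
    return shares


def reconstruct(shares, modulus=MODULUS):
    """Recover the secret by summing all shares modulo the modulus."""
    return sum(shares) % modulus


if __name__ == "__main__":
    parts = split_secret(123456789, 5)
    print(reconstruct(parts) == 123456789)
```

With additive sharing, any subset smaller than the full set of shares is uniformly distributed and reveals nothing about the secret. A threshold scheme would be needed if the system had to tolerate missing shares.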
With the increase in smart devices, spatiotemporal data has grown exponentially. Dealing with the challenges caused by this growth requires a scalable and efficient architecture that can store, query, analyze, and visualize spatiotemporal big data. This paper describes a cloud-terminal integrated GIS platform architecture designed to meet the requirements of processing and analyzing spatiotemporal big data. A cloud-terminal integrated GIS is developed according to this architecture. Extensive experiments deployed on an internal organization cluster using real-time datasets showed that the SuperMap GIS spatiotemporal big data engine achieved excellent performance. (C) 2018 Elsevier B.V. All rights reserved.
ISBN: (print) 9781467387767
Performing Process Mining by analyzing event logs generated by various systems is a highly computation- and I/O-intensive task. Distributed computing and Big Data processing frameworks make it possible to distribute all kinds of computation tasks across multiple computers instead of performing the whole task on a single computer. This paper assesses whether contemporary Big Data processing frameworks that support structured query language (SQL) are mature enough to be used efficiently to distribute the computation of two central Process Mining tasks across two dissimilar clusters of computers providing BPM as a service in the cloud. Tests are performed using a novel automatic testing framework detailed in this paper and its supporting materials. As a result, an assessment is made of how well the selected Big Data processing frameworks manage to process and parallelize the analysis work required by Process Mining tasks.
ISBN: (print) 9781424473021
This work aims to improve the efficiency of spatial operations on massive data in a distributed environment, and to solve the interaction design problems between the service-agreement-based spatial analysis processing module and the underlying database, spatial data models, map display, and so on. Given that no GIS software currently offers practical distributed computing analysis, we carried out an in-depth study that takes into account the distributed characteristics of spatial data and information. This paper designs a distributed geospatial information operation framework and analyzes the basic characteristics of distributed computing. It discusses the distributed computing spatial information technology system from the following aspects: spatial computing task decomposition, a distributed spatial data classification method, a shared data replication strategy, a load-based data partitioning strategy, and the caching mechanism of the spatial computing framework. Based on this framework, the author has developed a system for resolving practical problems. The proposed distributed computing framework, which is suited to distributed spatial analysis, solves the key technical problems of a distributed spatial analysis computing framework. It is consistent with "service-oriented" thinking and takes into account the heterogeneity of spatial data sources and distributed spatial computing among different systems on different platforms. Dynamic load scheduling improves on the static data partitioning method and avoids the load imbalance problem of the static data partitioning phase. It addresses the efficiency of large-scale spatial data operations in complex distributed environments in practical applications. Finally, based on the software, we run a distributed clipping computing environment test of the classic spatial experiments; a detailed result is given at the end of the article, and it shows that the fram
The emergence of Internet information technology has led to the development of MOOC-based online teaching methods. To improve teaching quality, the study applies the traditional C4.5 algorithm to data mining, and simplifies and quantifies it with the Taylor series and the Gini index. The study also considers the uncertainty of data changes and the characteristics of MOOC teaching to design a parallel processing system for the HD-TG-C4.5 algorithm on the Hadoop platform. The experimental results show that the minimum data classification error of the algorithm is 2%, and the maximum recommendation accuracy for teaching resources is 92.6%. Moreover, the response time and resource search time of the system are significantly better than those of traditional algorithms in terms of system debugging. The average login response time is less than 0.87 s, and the success rate of system debugging reaches 90%. The probability of students mastering the knowledge points of the teaching resources is also above 0.7. The MOOC teaching system based on the TG-C4.5 algorithm can effectively mine learner behavior data and reduce the complexity and resource consumption of the C4.5 algorithm. The MOOC teaching system based on the TG algorithm can provide technical support for the decision-making information of teaching participants and provide early warning information for predicting learning behavior.
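For readers unfamiliar with the quantities named in this abstract, the sketch below computes the textbook Gini index and the gain ratio that standard C4.5 uses to choose a split. It does not reproduce the Taylor-series simplification or the HD-TG-C4.5 variant, whose details the abstract does not give; the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """C4.5 split criterion: information gain divided by split information."""
    n = len(labels)
    # Partition the labels by the value each row takes for the attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    split_info = -sum((len(p) / n) * log2(len(p) / n) for p in partitions.values())
    gain = entropy(labels) - remainder
    # An attribute with a single value cannot split the data.
    return gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical learner records: did the learner pass, given video completion?
    rows = [{"watched": 1}, {"watched": 1}, {"watched": 0}, {"watched": 0}]
    labels = ["pass", "pass", "fail", "fail"]
    print(gini(labels), gain_ratio(rows, labels, "watched"))
```

The Gini index avoids the logarithm that entropy requires, which is one common reason it is substituted into C4.5-style criteria when computational cost matters.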