An analysis of existing techniques for approximate query processing of Big Data, based on sampling, histograms and wavelets, demonstrates that wavelet-based methods can be used effectively for OLAP because they handle multidimensional data well and support querying both single cells and aggregate values from a data warehouse. At the same time, current wavelet-based methods for approximate query processing have deficiencies that make them difficult to implement in practice. In particular, most of the techniques struggle with arbitrarily sized data: they either require each dimension length to be a multiple of a power of two, or complicate the decomposition algorithms, which increases construction time and makes error estimation difficult. There is also a lack of wavelet-based methods for approximate processing that provide a bounded error and a confidence interval for both single and aggregate values. Our contribution in this paper is a new wavelet method for approximate query processing that handles arbitrarily sized multidimensional datasets with minor extra calculations and provides a bounded error for the reconstruction of single or aggregate values. We demonstrate that the new method allows evaluating a confidence interval of the query error for a given compression ratio of a data warehouse, or performing the inverse task, i.e. evaluating the data warehouse compression ratio required for a given allowable error. The method was applied and verified on real epidemiological datasets to support research into correlations and patterns in disease spread and clinical signs. The accuracy of the estimated error was shown to be acceptable for retrieving single and aggregate values, while the query processing time advantage depends on the compression ratio and the volume of processed data.
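To make the idea of wavelet compression with a known reconstruction error concrete, the following is a minimal sketch of a textbook one-dimensional orthonormal Haar transform with coefficient thresholding. It is not the method proposed in the paper: it assumes a power-of-two input length (the kind of restriction the abstract criticizes), and the function names `haar_forward`, `haar_inverse`, `compress` and `l2_error` are hypothetical. Because the transform is orthonormal, the L2 reconstruction error equals the L2 norm of the discarded coefficients, so the error is known at compression time.

```python
import math


def haar_forward(values):
    """Orthonormal Haar decomposition (assumes a power-of-two length)."""
    coeffs = [float(v) for v in values]
    n = len(coeffs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = math.sqrt(2.0)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / s for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / s for i in range(half)]
        coeffs[:n] = averages + details  # coarse part first, details after it
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    s = math.sqrt(2.0)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / s)
            merged.append((a - d) / s)
        out[:2 * n] = merged
        n *= 2
    return out


def compress(coeffs, keep):
    """Keep the `keep` largest-magnitude coefficients and zero the rest."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


def l2_error(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


data = [2, 4, 6, 8, 10, 12, 14, 16]
coeffs = haar_forward(data)
compressed = compress(coeffs, keep=3)
approx = haar_inverse(compressed)

# Error predicted from the discarded coefficients alone (Parseval's identity).
predicted = l2_error(coeffs, compressed)
actual = l2_error(data, approx)
```

Here `predicted` and `actual` agree up to floating-point rounding, which illustrates how a compression ratio (the number of kept coefficients) can be tied to a reconstruction error without touching the original data again.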
Data outsourcing allows data owners to keep their data in public clouds, which do not ensure the privacy of data and computations. One fundamental and useful framework for processing data in a distributed fashion is M...
ISBN: 9781509036547 (print)
The #SAT problem, that is, counting the number of solutions of a propositional formula, extends the well-known SAT problem into the realm of probabilistic reasoning. However, its higher computational complexity and the lack of fast solvers still limit its applicability to real-world problems. In this work we present our distributed parallel #SAT solver dCountAntom, which uses both local shared-memory parallelism and distributed (cluster computing) parallelism. Although highly parallel solvers are known in SAT solving, such techniques have never been applied to the #SAT problem. Furthermore, we introduce a solve progress indicator that helps the user assess whether the presented problem is likely to be solvable within a reasonable time. Our analysis shows that the estimated progress is highly accurate. Our experiments with up to 256 CPU cores working in parallel yield large speedups across different benchmarks derived from real-world problems: with the maximum number of available cores, dCountAntom solved problems on average 141 times faster than a single-core implementation.
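For readers unfamiliar with #SAT, the following is a minimal sequential sketch of model counting by branching on one variable at a time. It is not dCountAntom's algorithm and includes none of its parallel or distributed machinery; the function name `count_models` and the DIMACS-style clause encoding (a clause is a list of non-zero integers, negative meaning negated) are assumptions made for illustration. The two branches of each split are independent subproblems, which is the kind of structure a parallel search could in principle distribute.

```python
def count_models(clauses, num_vars):
    """Count satisfying assignments of a CNF formula over variables 1..num_vars."""

    def assign(cls, lit):
        """Simplify the clause list under the assumption that `lit` is true."""
        remaining = []
        for clause in cls:
            if lit in clause:
                continue  # clause already satisfied
            reduced = [l for l in clause if l != -lit]
            if not reduced:
                return None  # empty clause: this branch has no models
            remaining.append(reduced)
        return remaining

    def count(cls, free_vars):
        if cls is None:
            return 0
        if not cls:
            # No constraints left: every unassigned variable is free.
            return 2 ** free_vars
        var = abs(cls[0][0])
        return (count(assign(cls, var), free_vars - 1)
                + count(assign(cls, -var), free_vars - 1))

    if any(len(c) == 0 for c in clauses):
        return 0  # an empty clause in the input is unsatisfiable
    return count([list(c) for c in clauses], num_vars)


# (x1 OR x2) AND (NOT x1 OR x3) over three variables
example = [[1, 2], [-1, 3]]
models = count_models(example, 3)
```

This brute-force recursion is exponential in the worst case; practical counters add techniques such as component decomposition and caching, which are omitted here.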
ISBN: 9781467390064 (print)
Implementing trajectory data stream analysis in parallel raises technical issues of data partitioning and of improving the analysis operations. In this paper, we define the trajectory analysis problem as discovering trajectory companies of moving objects. We develop a discovery workflow for parallel batch processing, and we address data partitioning and data locality in the steps of the analysis operations. Our techniques focus on different partitioning methods, and we observe their effects on execution time and data locality by varying the operators of the workflow. We demonstrate our parallel implementation techniques using Apache Spark to process real GPS trajectory data on an Amazon Web Services cluster.
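The trade-off between partitioning methods can be illustrated with a small sketch in plain Python (no Spark dependency). It contrasts two generic strategies that are assumptions for illustration, not the partitioning methods evaluated in the paper: a spatial grid, which keeps nearby points together, and a hash of the object identifier, which keeps each object's trajectory together. The names `grid_key`, `partition_by_grid` and `partition_by_object`, and the `(object_id, lat, lon)` record layout, are hypothetical.

```python
import math
import zlib
from collections import defaultdict


def grid_key(lat, lon, cell_deg):
    """Map a coordinate to the index of its grid cell of size cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def partition_by_grid(points, cell_deg):
    """Group (object_id, lat, lon) records by spatial grid cell."""
    parts = defaultdict(list)
    for obj_id, lat, lon in points:
        parts[grid_key(lat, lon, cell_deg)].append((obj_id, lat, lon))
    return dict(parts)


def partition_by_object(points, num_partitions):
    """Group records by a stable hash of the object id."""
    parts = defaultdict(list)
    for obj_id, lat, lon in points:
        # crc32 is used because Python's built-in str hash varies between runs.
        index = zlib.crc32(obj_id.encode("utf-8")) % num_partitions
        parts[index].append((obj_id, lat, lon))
    return dict(parts)


points = [
    ("taxi-1", 48.1371, 11.5752),
    ("taxi-2", 48.1379, 11.5758),  # close to taxi-1
    ("taxi-1", 48.2001, 11.6003),  # taxi-1 has moved to another cell
]
by_grid = partition_by_grid(points, cell_deg=0.01)
by_object = partition_by_object(points, num_partitions=4)
```

With the grid, the two nearby taxis share a partition, so a proximity test needs no data exchange, but the trajectory of `taxi-1` is split across cells. With object hashing the opposite holds.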
The aim of this paper is to present a new distributed computing middleware for High Performance Computing (HPC) based on cloud micro-services. The main challenge is to maintain the scalability and efficiency of a massively parallel and distributed computational system as the volume of big data processed by its applications grows substantially. In addition, the proposed middleware implements a new cooperative micro-services teamwork model for massively parallel and distributed computing. This model consists of distributed micro-services acting as Micro-service Virtual Processing Units (MsVPUs), with an integrated load balancing service and an AMQP communication protocol that enable HPC. The paper presents the proposed distributed computational scheme and its integrated middleware, accompanied by some experimental results.
A parallel module for applications based on overlapping grids has been devised and implemented in JASMIN (J Parallel Adaptive Structured Mesh Applications INfrastructure). In this module, a patch-based data structure, a grid mapping method and a unified communication schedule have been designed and adopted to overcome the communication bottleneck that commonly arises in parallel computing on overlapping grids. A grid mapping method library has been designed to make the module adaptable to all kinds of structured grids, and an interpolator library has been designed to collect interpolators. By encapsulating parallel computing strategies such as distributed storage and data communication, and by providing standard interfaces, the module helps users implement parallel computing on overlapping grids conveniently. According to our test results, applications based on this module run efficiently on thousands of processors, which demonstrates the module's satisfactory parallel performance.
For optical remote sensing images, an effective method to reduce or eliminate the impact of clouds is important. With big data input and real-time processing demands, efficient parallelization strategies are essential for high performance computing on multi-core systems. This paper proposes an efficient high performance parallel computing framework for cloud filtering and smoothing. We implement, compare and benchmark two parallel algorithms for cloud filtering that incorporate spatial smoothing solved by two-dimensional dynamic programming. The experiments were carried out on an NVIDIA GPU accelerator and evaluate approximation, parallelism and performance. The test results show significant performance improvements with high accuracy compared with a sequential CPU implementation, and the approach can be applied to other multi-core systems.
ISBN: 9781509053827 (print)
Applications typically exhibit extremely different performance characteristics depending on the accelerator. The back propagation neural network (BPNN) has been parallelized on different platforms. However, it has not yet been explored thoroughly on speculative multicore architectures. This paper presents a study of parallelizing BPNN on a speculative multicore architecture, including its speculative execution model, hardware design and programming model. The implementation was analyzed with seven well-known benchmark data sets. Furthermore, the study examines trade-offs among several important design factors for upcoming speculative multicore architectures. The experimental results show that: (1) the BPNN performs well on the speculative multicore platform, achieving a speedup (17.7x to 57.4x) similar to that of graphics processors (GPUs) while providing friendlier programmability; (2) the computing resources of 64 cores can be used efficiently, and 4k is the appropriate speculative buffer capacity in the model.
ISBN: 9781479984909 (print)
Despite the increasing popularity of shared-memory systems, there is a lack of tools that provide fault tolerance support for shared-memory applications. Checkpointing is one of the most popular fault tolerance techniques. However, its cost in terms of computing time, network utilization or storage resources can limit its practical use. This work proposes different techniques for optimizing the I/O cost of checkpointing shared-memory parallel applications. The proposals are extensively evaluated using the OpenMP NAS Parallel Benchmarks. Results show a significant decrease in checkpointing overhead.
Graphs are increasingly being used as the data structure of choice to represent interactions between heterogeneous entities. Graph path querying is a primary operation in the network graph space, for both real-time querying and inferential analysis. The rate and volume of interconnected data being generated warrant efficient distributed solutions to manage and query network graphs in a scalable fashion. Existing distributed solutions have proposed several optimization techniques, including intelligent joins and partial evaluations, to process path queries. However, the former rely on comprehensive indices while the latter involve extensive driver-side processing to combine the partial results, neither of which is efficient for processing large graphs. In this paper, we propose a novel distributed graph path query processing system using the Apache Spark framework.