ISBN:
(print) 9781450393850
There is no shortage of outlier detection (OD) algorithms in the literature, yet a vast body of them are designed for a single machine. With the increasing reality of already cloud-resident datasets comes the need for distributed OD techniques. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design SPARX, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billion points and millions of features, we show that existing open-source solutions fail to scale up, either with a large number of points or with high dimensionality, whereas SPARX yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source SPARX under the Apache license.
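As a minimal sketch of the shared-nothing, data-parallel pattern the abstract describes (not the SPARX algorithm itself, whose internals are not given here), a simple z-score-based outlier scorer in PySpark might look as follows; the file paths, column name, and threshold are illustrative assumptions:

```python
# Hypothetical sketch of data-parallel outlier scoring in PySpark.
# NOT the SPARX algorithm; it only shows the shared-nothing pattern:
# per-partition work plus one small global aggregate.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("od-sketch").getOrCreate()
df = spark.read.parquet("points.parquet")  # hypothetical input path

# One distributed aggregation yields the global mean/stddev of feature "x".
stats = df.select(F.mean("x").alias("mu"), F.stddev("x").alias("sigma")).first()

# Score every point in parallel; flag |z| above a threshold as an outlier.
scored = (df.withColumn("z", (F.col("x") - stats.mu) / stats.sigma)
            .withColumn("is_outlier", F.abs(F.col("z")) > 3.0))
scored.write.parquet("outliers.parquet")
```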
Functional dependencies (FDs) play a very important role in many data management tasks such as schema normalization, data cleaning, and query optimization. Meanwhile, there are ever-increasing application demands for efficient FD discovery on large-scale datasets. Unfortunately, due to huge runtime and memory overhead, the existing single-machine FD discovery algorithms are inefficient for large-scale datasets. Recently, distributed data-parallel computing has become the de facto standard for large-scale data processing. However, it is challenging to design an efficient distributed FD discovery algorithm. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute-sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling and propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning and generation of candidate FDs. Furthermore, we propose an adaptive switching strategy between distributed sampling and distributed validation based on a unified time-based efficiency metric, and we employ a distributed-probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark reveal that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD with 3.2×-44.9× and 2.5×-455.7× speedup, respectively. Moreover, SmartFD achieves good row scalability and column scalability. Additionally, SmartFD has sub-linear node scalability.
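To make the validation step concrete: an FD X → Y holds exactly when no X-group contains more than one distinct Y value, which is checkable with a single distributed aggregation. The PySpark sketch below illustrates only that check; it is not SmartFD's index-based validation, and the input path and column names are hypothetical:

```python
# Hedged sketch of distributed FD validation (not SmartFD's method):
# X -> Y holds on a relation iff grouping by X never yields more than
# one distinct Y value.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fd-check").getOrCreate()
df = spark.read.csv("relation.csv", header=True)  # hypothetical input

def fd_holds(df, lhs, rhs):
    # The FD is violated iff some lhs-group has >1 distinct rhs value.
    violations = (df.groupBy(*lhs)
                    .agg(F.countDistinct(rhs).alias("n_rhs"))
                    .where(F.col("n_rhs") > 1))
    return violations.limit(1).count() == 0

print(fd_holds(df, ["zip_code"], "city"))  # hypothetical columns
```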
Matrix multiplication is a dominant but very time-consuming operation in many big data analytic applications, so its performance optimization is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. For existing matrix multiplication execution strategies, when the execution concurrency scales up above a threshold, execution performance deteriorates quickly because the increase in IO cost outweighs the decrease in computation cost. This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms. The CRMM strategy exploits higher execution concurrency for sub-block matrix multiplication at the same IO cost. To further improve the performance of Marlin, we also propose a number of novel system-level optimizations, including increasing the concurrency of local data exchange by calling the native library in batch, reducing the overhead of block matrix transformation, and reducing disk-heavy shuffle operations by exploiting the semantics of matrix computation. We have implemented Marlin as a library along with a set of related matrix operations on Spark and contributed Marlin to the open-source community. For large-sized matrix multiplication, Marlin outperforms existing systems including Spark MLlib, SystemML and SciDB, with about 1.29x, 3.53x and 2.21x speedup on average, respectively. An evaluation on a real-world DNN workload also indicates that Marlin outperforms the above systems by about 12.8x, 5.1x and 27.2x, respectively.
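For context, Spark's MLlib already exposes block-partitioned distributed multiplication, which is the style of baseline Marlin is compared against. A toy usage sketch follows (CRMM itself is not reproduced here; block contents are illustrative):

```python
# Illustrative use of Spark MLlib's BlockMatrix: each matrix is an RDD of
# ((blockRow, blockCol), sub-matrix) pairs, and multiply() performs the
# distributed sub-block multiplication (with a shuffle) that strategies
# like CRMM aim to make cheaper.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

spark = SparkSession.builder.appName("blockmm").getOrCreate()
sc = spark.sparkContext

# Two 4x4 matrices stored as 2x2 grids of 2x2 dense blocks (column-major).
blocks_a = sc.parallelize([((i, j), Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0]))
                           for i in range(2) for j in range(2)])
blocks_b = sc.parallelize([((i, j), Matrices.dense(2, 2, [5.0, 6.0, 7.0, 8.0]))
                           for i in range(2) for j in range(2)])

A = BlockMatrix(blocks_a, 2, 2)
B = BlockMatrix(blocks_b, 2, 2)
C = A.multiply(B)            # distributed block matrix multiplication
print(C.toLocalMatrix())     # collect the small result for inspection
```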
This work presents a general approach, coined the "Spy Element Method" (SEM), for parallelising workloads that run on manycore processors and feature dynamic dependencies between different data items, such as graph traversals, remeshing methods and particle simulations. In the SEM, appropriately defined entities, denominated "spy elements", inspect their neighbourhood and, without atomic memory operations, update in parallel all the dynamic data dependencies, including those between themselves, between the original objects of the current problem, and among entities of different kinds. Applying the SEM to meshless simulation models obviates the use of binning algorithms that rely on sorting or atomics, and concomitantly renders their implementation fully particle-centric. On the NVIDIA GeForce GTX 480, an optimised particle-based fluid simulator runs 1.4-2.3 times faster when binning aided by a state-of-the-art parallel sorter is eliminated and data updating is delegated to the SEM, and is at least 40 times faster than a highly optimised single-threaded version running on a single core of the Intel Xeon X5650 2.66 GHz. (c) 2012 Elsevier B.V. All rights reserved.
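For contrast, here is a minimal NumPy sketch of the conventional sort-based cell binning that the SEM eliminates; the grid resolution and particle count are illustrative, and the spy-element update itself is not reproduced:

```python
# Conventional sort-based cell binning for particle neighbour search --
# the baseline the SEM avoids. Particles are hashed to grid cells, then
# sorted by cell id so each cell's particles are contiguous; this sort is
# exactly the cost the spy-element update removes.
import numpy as np

positions = np.random.rand(10_000, 3).astype(np.float32)  # toy particles
cell_size = 0.05
grid_dim = int(np.ceil(1.0 / cell_size))

# Hash each particle to a linear cell index.
ijk = np.floor(positions / cell_size).astype(np.int64)
cell_id = (ijk[:, 0] * grid_dim + ijk[:, 1]) * grid_dim + ijk[:, 2]

# Sort particles by cell so neighbours can be found by range lookup.
order = np.argsort(cell_id, kind="stable")
sorted_ids = cell_id[order]
cell_start = np.searchsorted(sorted_ids, np.arange(grid_dim**3))
```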
The direct simulation Monte Carlo (DSMC) method is a computational technique for simulating fluid mechanics in the rarefied gas flow regime. It solves the Boltzmann equation numerically on an individual-particle basis. Accurate simulations typically require particle numbers in the range of hundreds of thousands to millions, and such large simulations take an inordinate amount of time when processed serially on central processing units (CPUs). In this paper we investigate data-parallel techniques on graphics processing units (GPUs) to execute very large scale DSMC simulations. We have designed and implemented Bird's method on a three-dimensional simulation domain that includes complex geometry interactions, and we have tested and verified the statistical and theoretical accuracy of our implementation. Our results show substantial performance improvements (nearly two orders of magnitude) over Bird's serial implementation without loss of accuracy.
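As a hedged toy illustration of why DSMC suits data-parallel hardware: the dominant per-cell collision step treats candidate pairs independently. The sketch below is a simplified, single-cell version of Bird-style (no-time-counter) pair selection with illustrative constants; it is not the paper's implementation:

```python
# Toy single-cell DSMC collision step (simplified no-time-counter scheme;
# constants are illustrative, not from the paper). Candidate pairs are
# independent, which is what makes the step map well onto GPUs.
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(0.0, 300.0, size=(200, 3))   # velocities in one cell (m/s)
n_candidates = 50                            # candidate-pair count
sigma = 1e-16                                # assumed collision cross-section
sigma_vr_max = sigma * 1200.0                # running (sigma * v_rel)_max

i = rng.integers(0, len(v), n_candidates)
j = rng.integers(0, len(v), n_candidates)
v_rel = np.linalg.norm(v[i] - v[j], axis=1)

# Accept each pair with probability sigma*v_rel / (sigma*v_rel)_max.
accept = rng.random(n_candidates) < (sigma * v_rel) / sigma_vr_max
for a, b in zip(i[accept], j[accept]):
    # Isotropic post-collision velocities about the pair's centre of mass.
    g = np.linalg.norm(v[a] - v[b])
    cm = 0.5 * (v[a] + v[b])
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    v[a], v[b] = cm + 0.5 * g * d, cm - 0.5 * g * d
```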
In this paper we present a set of graphics hardware accelerated algorithms to interactively evaluate the machinability of complex free-form surfaces. These algorithms work in image space and easily interface with all common formats available on CAD systems. The running time of these algorithms is independent of the complexity of the surface to be analyzed and depends only on the size of the projected image of the surface and the largest available tool head. Interactive speed is achieved through clever use of data-parallel techniques that map nicely onto the programming model of modern programmable graphics processing units. We demonstrate a method for pre-calculating and storing the machinability of a surface within a texture to further reduce rendering costs. The algorithms are implemented and tested on a complex set of parts and their performance has been analyzed. (C) 2010 Elsevier Ltd. All rights reserved.
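To suggest the flavour of such an image-space test (the paper's GPU shaders are not reproduced here), the following NumPy sketch rasterizes a surface to a height map and checks where a flat-end tool of radius r could touch it without gouging the neighbourhood; the surface data, radius, and tolerance are assumptions:

```python
# Hedged image-space sketch: a pixel of a rasterized height map is deemed
# reachable by a flat-end tool iff nothing under the circular tool
# footprint rises above the surface point itself. A grey-scale dilation
# (moving maximum) plays the role of the tool footprint test.
import numpy as np
from scipy.ndimage import grey_dilation

height = np.random.rand(256, 256).astype(np.float32)  # toy rasterized surface

r = 5  # assumed tool radius in pixels
yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
footprint = (xx**2 + yy**2) <= r**2

# Highest surface point under the tool footprint, per pixel.
local_max = grey_dilation(height, footprint=footprint)

# Reachable iff the footprint maximum does not exceed the surface point
# (within a small tolerance).
machinable = local_max <= height + 1e-4
print(f"machinable fraction: {machinable.mean():.2%}")
```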
ISBN:
(print) 9781424453696
As computational performance continues to scale in chip multiprocessors with increasing numbers of cores, the gap between the available off-chip bandwidth and that which is required to appropriately feed the processors continues to widen under current architectures. For many high-performance computing applications, the bandwidth available for both on- and off-chip communications can play a vital role in efficient execution due to the use of data-parallel or data-centric algorithms. Electronic interconnected systems are increasingly bound by their communications infrastructure and the associated power dissipation of high-bandwidth data movement. Recent advances in chip-scale silicon photonic technologies have created the potential for developing optical interconnection networks that can offer highly energy efficient communications and significantly improve computing performance-per-Watt. This paper examines the design and performance of photonic networks-on-chip architectures that support both on-chip communication and off-chip memory access in an energy efficient manner.
ISBN:
(print) 9781605581156
Modern microprocessors are becoming increasingly parallel devices, and GPUs are at the leading edge of this trend. Designing parallel algorithms for manycore chips like the GPU can present interesting challenges, particularly for computations on sparse data structures. One particularly common example is the collection of sparse matrix solvers and combinatorial graph algorithms that form the core of many physical simulation techniques. Although seemingly irregular, these operations can often be implemented with data-parallel operations that map very well to massively parallel processors.
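The classic instance of such a "seemingly irregular but data-parallel" operation is sparse matrix-vector multiply in CSR format. A small Python sketch, written row-parallel the way a GPU kernel would assign one thread per row (the data is illustrative):

```python
# CSR sparse matrix-vector multiply, structured so every row is an
# independent unit of work -- the mapping a GPU kernel would exploit by
# assigning one thread (or warp) per row.
import numpy as np

# CSR storage: nonzero values, their column indices, and row pointers.
data    = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
indices = np.array([0, 2, 1, 0, 2])
indptr  = np.array([0, 2, 3, 5])   # 3 rows
x = np.array([1.0, 2.0, 3.0])

y = np.zeros(len(indptr) - 1)
for row in range(len(y)):                    # each iteration is independent
    lo, hi = indptr[row], indptr[row + 1]
    y[row] = data[lo:hi] @ x[indices[lo:hi]]
print(y)  # [ 70.  60. 190.]
```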