In this paper we present and evaluate Inhambu, a distributed object-oriented system that supports the execution of data mining applications on clusters of PCs and workstations. The system provides a resource management layer, built on top of Java/RMI, that supports the execution of the data mining tool Weka. We evaluate the performance of Inhambu through several experiments on homogeneous, heterogeneous, and non-dedicated clusters. The results are compared with those achieved by a similar system named Weka-parallel. Inhambu outperforms its counterpart for coarse-grained applications, particularly on heterogeneous and non-dedicated clusters. In addition, our system provides advantages such as application checkpointing, support for dynamically adding hosts to the cluster, automatic restarting of failed tasks, and more effective usage of the cluster. Inhambu is therefore a promising tool for efficiently executing real-world data mining applications. The software is available at the project's web site at http://***/projects/inhambu/. (c) 2006 Elsevier Inc. All rights reserved.
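The abstract names Java/RMI as the transport under Inhambu's resource management layer but gives no code, so the following is only a minimal sketch of how a task-dispatch layer over RMI typically looks. The `TaskWorker` interface, its method, and the registry binding name are illustrative assumptions, not Inhambu's actual API.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical remote interface: a worker that runs one mining task
// (e.g., a Weka cross-validation fold) and returns a serializable result.
interface TaskWorker extends Remote {
    String runTask(String taskSpec) throws RemoteException;
}

// Worker-side implementation; each cluster node would export one of these.
class TaskWorkerImpl extends UnicastRemoteObject implements TaskWorker {
    protected TaskWorkerImpl() throws RemoteException { super(); }

    @Override
    public String runTask(String taskSpec) throws RemoteException {
        // A real system would invoke Weka here; this sketch just echoes.
        return "result-of:" + taskSpec;
    }

    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("worker", new TaskWorkerImpl());
        System.out.println("Worker registered; waiting for tasks...");
    }
}
```

A master process would look up such workers through the registry, submit task specifications, and, on a `RemoteException`, reschedule the task on another host; catching and resubmitting in this way is the usual mechanism behind features like the automatic restarting of failed tasks that the abstract mentions.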
Interest in parallel and distributed data mining in grid environments has grown over the past decade. As an important branch of spatial data mining, spatial outlier mining can be used to find interesting and unexpected spatial patterns in many applications. In this paper, a new parallel and distributed spatial outlier mining algorithm (PD-SOM) is proposed to detect global and local outliers simultaneously in a grid environment. PD-SOM is a Delaunay triangulation (D-TIN) based approach, encapsulated and deployed on a distributed platform to provide a parallel and distributed spatial outlier mining service. A distributed system framework for PD-SOM is designed on top of a geographical knowledge service grid (GeoKSGrid) developed by our research group; a two-step strategy for spatial outlier detection is put forward to support the encapsulation and distributed deployment of the geographical knowledge service; and two key techniques of the service are discussed: parallel and distributed computation of the Delaunay triangulation and the implementation of the PD-SOM algorithm. Finally, the efficiency of the spatial outlier mining service is analyzed theoretically; its practicality is confirmed by a demonstrative application to abnormality analysis of soil geochemical survey samples from the eastern coastal zone of Fujian, China; and the effectiveness and superiority of PD-SOM in a balanced, scalable grid environment, owing to the large number of computing cores involved, are verified through comparison with the popular spatial outlier mining algorithm SLOM.
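The abstract does not include code, but the Delaunay-based detection step can be illustrated with a small sketch. Assuming the D-TIN has already been computed (by any triangulation library) and is given as an adjacency list, a common global-outlier criterion flags points whose mean incident-edge length is unusually long; the class name and the mean-plus-k-standard-deviations threshold below are illustrative assumptions, not the exact PD-SOM rule.

```java
import java.util.*;

// Hedged sketch: flags "global" spatial outliers from a precomputed
// neighbor graph (standing in for the Delaunay triangulation). A point
// is flagged when its mean edge length to its neighbors exceeds the
// global mean of that statistic by k standard deviations.
public class EdgeLengthOutliers {
    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static Set<Integer> flag(double[][] pts, List<List<Integer>> nbrs, double k) {
        int n = pts.length;
        double[] meanEdge = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (int j : nbrs.get(i)) sum += dist(pts[i], pts[j]);
            meanEdge[i] = nbrs.get(i).isEmpty() ? 0 : sum / nbrs.get(i).size();
        }
        double mu = Arrays.stream(meanEdge).average().orElse(0);
        double var = Arrays.stream(meanEdge).map(x -> (x - mu) * (x - mu)).average().orElse(0);
        double cut = mu + k * Math.sqrt(var);
        Set<Integer> outliers = new HashSet<>();
        for (int i = 0; i < n; i++) if (meanEdge[i] > cut) outliers.add(i);
        return outliers;
    }
}
```

In a grid deployment such as the one the paper describes, the per-point edge statistics are independent, so the loop over points is the natural unit to partition across nodes.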
ISBN: (Print) 9783319076171; 9783319076164
Nowadays, society confronts a huge volume of information that has to be transformed into knowledge. One of the most relevant aspects of knowledge extraction is the detection of outliers, and numerous algorithms have been proposed for this purpose. However, not all of them are suitable for very large data sets. In this work, a new approach aimed at detecting outliers in very large data sets within a limited execution time is presented. The algorithm treats tuples as N-dimensional particles, each creating a potential well around itself. The potential created by all the particles is then used to discriminate outliers from objects belonging to clusters. Moreover, the capacity to be parallelized was a key point in the design of the algorithm. In this proof of concept, the algorithm is tested using sequential and parallel implementations. The results demonstrate that the algorithm can process large data sets in an affordable execution time, thereby overcoming the curse of dimensionality.
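Since the abstract describes the method only at a high level, the following is a hedged sketch of the potential-well idea: each tuple contributes a Gaussian well, and points whose total potential stays shallow (few nearby particles) are flagged. The kernel width `sigma`, the quantile threshold, and the class name are assumptions for illustration; Java parallel streams stand in for whatever parallel backend the paper uses.

```java
import java.util.*;
import java.util.stream.IntStream;

// Hedged sketch: each tuple is an N-dimensional particle creating a
// Gaussian potential well; points sitting in shallow total potential
// (few nearby particles) are flagged as outliers.
public class PotentialOutliers {
    public static boolean[] detect(double[][] data, double sigma, double quantile) {
        int n = data.length;
        double[] potential = new double[n];
        // Embarrassingly parallel over points, matching the paper's
        // emphasis on parallelizability (here via parallel streams).
        IntStream.range(0, n).parallel().forEach(i -> {
            double sum = 0;
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double d2 = 0;
                for (int k = 0; k < data[i].length; k++) {
                    double diff = data[i][k] - data[j][k];
                    d2 += diff * diff;
                }
                sum -= Math.exp(-d2 / (2 * sigma * sigma)); // deeper = denser
            }
            potential[i] = sum;
        });
        // Flag the shallowest (least negative) fraction as outliers,
        // e.g. quantile = 0.95 flags roughly the top 5%.
        double[] sorted = potential.clone();
        Arrays.sort(sorted);
        double cut = sorted[(int) (quantile * (n - 1))];
        boolean[] outlier = new boolean[n];
        for (int i = 0; i < n; i++) outlier[i] = potential[i] > cut;
        return outlier;
    }
}
```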
ISBN: (Print) 9781728182063
Several data mining and machine learning problems can be reduced to the computational geometry problem of finding intersections among a set of geometric objects, such as line segments or rectangles/boxes. Currently, the state-of-the-art approach for addressing such intersection problems in Euclidean space is collectively known as the sweep-line or plane-sweep algorithm, and has been utilized in a variety of application domains, including databases, gaming, and transportation, to name a few. The idea behind the sweep line is to employ a conceptual line that is swept across the plane, stopping at intersection points. However, to report all K intersections among N objects, the standard sweep-line algorithm (based on the Bentley-Ottmann algorithm) has a time complexity of O((N + K) log N) and therefore cannot scale to a very large number of objects or to cases with many intersections. In this paper, we propose MRSWEEP and MRSWEEP-D, two sophisticated and highly scalable algorithms that parallelize the sweep line and its variants. We provide algorithmic details of fully distributed in-memory versions of the proposed algorithms using the MapReduce programming paradigm in the Apache Spark cluster environment. A theoretical analysis of the proposed algorithms is presented, along with a thorough experimental evaluation that demonstrates their scalability at varying levels of problem complexity. We make the source code and datasets available to support the reproducibility of the results.
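MRSWEEP's distributed Spark implementation is too large to reproduce here, but the sequential kernel it partitions, an active-set sweep, fits in a short sketch. The rectangle-intersection variant below is an assumption chosen for brevity (the paper also treats segment intersection via Bentley-Ottmann); the class name and data layout are illustrative.

```java
import java.util.*;

// Minimal single-machine sweep-line sketch over axis-aligned rectangles:
// sort by left edge, keep an "active set" of rectangles whose x-range
// still overlaps the sweep position, and test y-overlap against it.
// This illustrates the sequential kernel that a MapReduce scheme would
// partition across workers; it is not the paper's Spark implementation.
public class RectSweep {
    // Each rect is {xmin, ymin, xmax, ymax}.
    public static List<int[]> intersectingPairs(double[][] rects) {
        Integer[] order = new Integer[rects.length];
        for (int i = 0; i < rects.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> rects[i][0]));

        List<int[]> pairs = new ArrayList<>();
        List<Integer> active = new ArrayList<>();
        for (int idx : order) {
            double[] r = rects[idx];
            // Drop rectangles the sweep line has already passed.
            active.removeIf(a -> rects[a][2] < r[0]);
            for (int a : active) {
                // x-ranges overlap by construction; check y-ranges.
                if (rects[a][1] <= r[3] && r[1] <= rects[a][3]) {
                    pairs.add(new int[]{a, idx});
                }
            }
            active.add(idx);
        }
        return pairs;
    }
}
```

Because pairs can only intersect when their x-ranges overlap, a distributed version can split the x-axis into strips, run this kernel per strip, and deduplicate pairs found in more than one strip, which is the general shape of plane-sweep parallelization the paper builds on.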