This paper presents a method for distributed multivariate regression using wavelet-based collective data mining (CDM). The method blends machine learning and the theory of communication with the statistical methods employed in parametric multivariate regression to provide an effective data mining technique for use in a distributed data and computation environment. The technique is applied to two benchmark data sets, producing results that are consistent with those obtained by applying standard parametric regression techniques to centralized data sets. Evaluation of the method in terms of model accuracy as a function of the appropriateness of the selected wavelet function, the relative number of nonlinear cross-terms, and the sample size demonstrates that accurate parametric multivariate regression models can be generated from distributed, heterogeneous data sets with minimal data communication overhead compared to that required to aggregate a distributed data set. Application of this method to linear discriminant analysis, which is related to parametric multivariate regression, produced classification results on the Iris data set that are comparable to those obtained with centralized data analysis. © 2001 Academic Press.
In this paper, we propose a new algorithm, named Grid-based distributed Max-Miner (GridDMM), for mining maximal frequent itemsets from databases on a data Grid. A frequent itemset is maximal if none of its supersets is frequent. GridDMM is specifically suited to Grid environments because of its low communication and synchronization overhead. GridDMM consists of a local mining phase and a global mining phase. During the local mining phase, each node mines its local database to discover the local maximal frequent itemsets, which together form the set of maximal candidate itemsets for the top-down search in the subsequent global mining phase. A new prefix-tree data structure is developed to facilitate the storage and counting of global candidate itemsets of different sizes. We built a data Grid system on a cluster of workstations using the open-source Globus Toolkit, and evaluated the GridDMM algorithm in terms of performance, scalability, and the overhead of communication and synchronization. GridDMM demonstrates better performance than other sequential and parallel algorithms, and its performance scales with the database size and the number of nodes.
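The definition of a maximal frequent itemset and the local/global split described in this abstract can be illustrated with a small brute-force sketch. This is not GridDMM itself: it does not use the paper's prefix tree or its top-down search, and the function names (`frequent_itemsets`, `maximal`, `two_phase_maximal`) and the relative support threshold `min_fraction` are assumptions made for illustration. It relies only on the standard observation that an itemset frequent over the whole database must be frequent in at least one partition, so it is a subset of some local maximal itemset.

```python
import math
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Brute-force enumeration of itemsets contained in >= min_count transactions."""
    items = sorted({i for t in transactions for i in t})
    frequent = set()
    for k in range(1, len(items) + 1):
        level = {
            frozenset(c)
            for c in combinations(items, k)
            if sum(1 for t in transactions if set(c) <= t) >= min_count
        }
        if not level:
            break  # no frequent k-itemset, so no larger itemset can be frequent
        frequent |= level
    return frequent


def maximal(itemsets):
    """Keep only itemsets with no proper superset in the same collection."""
    return {s for s in itemsets if not any(s < other for other in itemsets)}


def subsets(itemset):
    """All non-empty subsets of an itemset."""
    items = sorted(itemset)
    return {
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
    }


def two_phase_maximal(partitions, min_fraction):
    """Hypothetical two-phase scheme: local maximal itemsets become global candidates."""
    # Local phase: each partition (node) mines its own maximal frequent itemsets.
    local_max = set()
    for part in partitions:
        local_min = math.ceil(min_fraction * len(part))
        local_max |= maximal(frequent_itemsets(part, local_min))

    # Global phase: count every subset of a local maximal itemset over all data.
    all_tx = [t for part in partitions for t in part]
    global_min = math.ceil(min_fraction * len(all_tx))
    candidates = set().union(*(subsets(m) for m in local_max))
    globally_frequent = {
        c for c in candidates if sum(1 for t in all_tx if c <= t) >= global_min
    }
    return maximal(globally_frequent)
```

For example, with partitions `[[{"a","b"}, {"a","b","c"}], [{"a","b"}, {"c"}]]` and `min_fraction=0.5`, the local phase yields the candidates `{a,b,c}`, `{a,b}` and `{c}`, and the global phase keeps `{a,b}` and `{c}` as the maximal frequent itemsets.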
We present a collective approach to learning a Bayesian network from distributed heterogeneous data. In this approach, we first learn a local Bayesian network at each site using the local data. Each site then identifies the observations that are most likely to be evidence of coupling between local and non-local variables and transmits a subset of these observations to a central site. Another Bayesian network is learnt at the central site using the data transmitted from the local sites. The local and central Bayesian networks are combined to obtain a collective Bayesian network, which models the entire data. Experimental results and a theoretical justification that demonstrate the feasibility of our approach are presented.
ISBN: 9789896740061 (print)
In this paper, we propose an agent-based approach to mining association rules from data sets that are distributed across multiple locations while preserving the privacy of local data. The approach relies on the local systems to find frequent itemsets, which are encrypted, and the partial results are carried from site to site. In this way, the privacy of local data is preserved. We present a structural model that includes several types of mobile agents with specific functionalities and a communication scheme to accomplish the task. These agents implement the privacy-preserving algorithms for distributed association rule mining.
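The abstract does not specify the encryption scheme or the agent protocol. As a loosely related illustration of carrying a masked partial result from site to site, the following sketch uses the well-known secure-sum technique, which is a substitute chosen here and not necessarily what the paper uses. The function name `secure_sum` and the `modulus` parameter are assumptions.

```python
import secrets


def secure_sum(local_counts, modulus=2**32):
    """Sum per-site support counts without exposing any running total in the clear.

    Assumes every count is non-negative and the true total is below `modulus`.
    """
    # The initiating site hides the running total behind a random mask.
    mask = secrets.randbelow(modulus)
    running = mask
    # Each loop iteration stands for one site visited by the (hypothetical) agent:
    # the site sees only a masked value and adds its own local count to it.
    for count in local_counts:
        running = (running + count) % modulus
    # Back at the initiating site, the mask is removed to reveal the global count.
    return (running - mask) % modulus
```

With local support counts of 3, 5 and 7 at three sites, `secure_sum([3, 5, 7])` recovers the global count 15, while each intermediate value seen in transit is offset by the random mask. This protects against an honest-but-curious site only if sites do not collude.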
ISBN: 9783319219097; 9783319219080 (print)
The article describes an extension of the lambda calculus for creating parallel data mining algorithms. The proposed approach represents an algorithm as a sequence of pure functions with unified interfaces. For parallel execution we use a special function that makes it possible to change the structure of the algorithm and to implement various strategies for processing the data set and the model.
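The idea of an algorithm as a sequence of pure functions with one shared interface, plus a higher-order function that parallelizes a step, might be sketched as follows. The interface `(data, model) -> model`, the helper names (`sequence`, `parallel`, `count_items`, `drop_rare`, `merge_counts`) and the use of a thread pool are all assumptions for illustration; the article's own formalism is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_items(data, model):
    """Pure step: return a new model with item counts added; inputs are not mutated."""
    out = dict(model)
    for x in data:
        out[x] = out.get(x, 0) + 1
    return out


def drop_rare(min_count):
    """Build a pure step that keeps only entries with count >= min_count."""
    def step(data, model):
        return {k: v for k, v in model.items() if v >= min_count}
    return step


def sequence(*steps):
    """Compose steps that share the (data, model) -> model interface."""
    def run(data, model):
        return reduce(lambda m, s: s(data, m), steps, model)
    return run


def merge_counts(a, b):
    """Combine two partial count models into a new one."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out


def parallel(step, merge, n_parts):
    """Wrap a step so it runs on n_parts data partitions and merges the partial models."""
    def run(data, model):
        parts = [data[i::n_parts] for i in range(n_parts)]
        with ThreadPoolExecutor(max_workers=n_parts) as pool:
            partials = list(pool.map(lambda p: step(p, {}), parts))
        return reduce(merge, partials, dict(model))
    return run


# The wrapped step has the same interface, so the overall structure is unchanged.
algorithm = sequence(parallel(count_items, merge_counts, 3), drop_rare(2))
```

Because `parallel` returns a function with the same interface as the step it wraps, switching between sequential and data-parallel execution changes only how the pipeline is assembled, not the steps themselves.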
ISBN: 9788890372407 (print)
Genetic programming is a powerful search method that can be applied to the typical data mining task of finding hidden relations in datasets. We describe the architecture of a distributed data mining system in which genetic programming agents create a large number of structurally different models, which are stored in a model database. A search engine for models that is connected to this database allows interactive exploration and analysis of models, and the composition of simple models into hierarchical models. The search engine is the crucial component of the system in the sense that it supports knowledge discovery and paves the way for the goal of finding interesting hidden causal relations.
ISBN: 9781509020805 (print)
In today's world, a number of transactions can be performed on social media. In such a distributed environment, where timely access to data is important, it becomes difficult to generate strong association rules. It is therefore necessary to reduce these rules in order to increase the rule reduction rate. This paper uses the w-Tabular algorithm, which combines a weight assignment method with the Quine-McCluskey method, which increases data processing time in a distributed system.
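The Quine-McCluskey method named in this abstract repeatedly merges terms that differ in exactly one bit until only prime implicants remain. The sketch below shows that merging step alone. It does not include the weight assignment part of w-Tabular, which the abstract does not describe, and the encoding of a rule as a fixed-width bit string is an assumption made for illustration.

```python
def combine(a, b):
    """Merge two terms that differ in exactly one fixed bit; return None otherwise."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]  # '-' marks a bit that no longer matters
    return None


def prime_implicants(minterms, n_bits):
    """Tabular (Quine-McCluskey) reduction of minterms to prime implicants."""
    terms = {format(m, f"0{n_bits}b") for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a in terms:
            for b in terms:
                c = combine(a, b)
                if c is not None:
                    merged.add(c)
                    used.update((a, b))
        primes |= terms - used  # terms that could not be merged are prime
        terms = merged
    return primes
```

For instance, the four two-bit terms `00`, `01`, `10` and `11` collapse into the single implicant `--`, which is the sense in which the method reduces a set of rules to a smaller equivalent set. Selecting a minimal cover from the prime implicants is a separate step that is not shown.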
ISBN: 9781467369619 (print)
This paper describes an approach to data preparation for the application of data mining algorithms. The approach integrates ETL tools for the extraction and transformation of distributed heterogeneous data with the DXelopes library for applying data mining algorithms. The paper also describes the implementation of this approach.
ISBN: 9783540855644 (print)
Privacy is one of the most important properties that an information system must satisfy. In systems that need to share information among different, untrusted entities, the protection of sensitive information plays an important role. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used in a malicious way. Privacy-preserving data mining algorithms have recently been introduced with the aim of preventing the discovery of sensitive information. In this paper we propose a modification to a privacy-preserving association rule mining algorithm for distributed homogeneous databases. Our algorithm is faster than the original one while preserving privacy and producing accurate results. The modified algorithm is based on a semi-honest model with negligible collision probability. It can be extended to any number of sites without any change in implementation. Moreover, adding sites does not add to the mining time, because all client sites perform the mining at the same time, so the only overhead is in communication time. The total bit-communication cost of our algorithm is a function of the number of sites, N.
ISBN: 9783540725299 (print)
Data mining is often a compute-intensive and time-consuming process. For this reason, several data mining systems have been implemented on parallel computing platforms to achieve high performance in the analysis of large data sets. Moreover, when large data repositories are coupled with the geographical distribution of data, users, and systems, more sophisticated technologies are needed to implement high-performance distributed KDD systems. Recently, computational Grids have emerged as privileged platforms for distributed computing, and a growing number of Grid-based KDD systems have been designed. In this paper we first outline different ways to exploit parallelism in the main data mining techniques and algorithms, and then discuss Grid-based KDD systems.