Clustering, as an unsupervised learning method and an important process in data mining, is an aspect of large-scale and distributed data analysis. In many applications, such as peer-to-peer (P2P) systems, huge volumes of data are distributed among multiple sources. Analyzing these data and identifying appropriate clusters is challenging because of transmission, processing, and storage costs. In this paper, a gossip-based distributed clustering algorithm for P2P networks, called Efficient GBDC-P2P, is proposed. It is based on an improved gossip communication approach that combines the peer sampling and CYCLON protocols with the idea of partitioning-based data clustering. The algorithm is suited to data clustering in unstructured P2P networks and adapts to the dynamic conditions of these networks. In Efficient GBDC-P2P, peers perform clustering in a distributed way, solely through local communication with their neighbors. The approach does not rely on a central server to carry out the clustering task and does not need to synchronize operations. Evaluation results verify the efficiency of the proposed algorithm for data clustering in unstructured P2P networks. Furthermore, comparative analyses with other well-established distributed clustering approaches demonstrate the superior accuracy of the proposed method.
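The abstract above does not spell out the gossip exchange, so the following is only a minimal sketch of generic pairwise gossip averaging, not the Efficient GBDC-P2P protocol itself. The function name `gossip_average`, the uniform random partner selection (a stand-in for peer sampling or CYCLON), and the scalar values are all assumptions made for illustration. In a partitioning-based setting, the same exchange could plausibly be applied to per-cluster centroid sums and counts instead of a single number.

```python
import random


def gossip_average(values, rounds=50, seed=0):
    """Pairwise gossip averaging (illustrative, hypothetical helper).

    In every round each peer contacts one other peer chosen uniformly at
    random, and both replace their estimates with the mean of the pair.
    The sum of all estimates is preserved, so every estimate drifts
    towards the global mean without any central server.
    """
    rng = random.Random(seed)
    estimates = list(values)
    n = len(estimates)
    if n < 2:
        return estimates
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # skip i itself so the partner is always another peer
            mean = (estimates[i] + estimates[j]) / 2.0
            estimates[i] = estimates[j] = mean
    return estimates


# Assumed toy data: one local value per peer.
local_values = [1.0, 5.0, 9.0, 13.0]
estimates = gossip_average(local_values)
```

Each exchange touches only two peers, which is why this style of protocol needs neither global synchronization nor knowledge of the full network.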
Spam appears in various forms, and the current trend in spamming is moving towards multimedia spam objects. Image spam is a new type of spam attack that attempts to bypass spam filters, which are mostly text-based. Spam attacks users in many ways, and it is usually countered by having a server filter out the spammers. This paper provides a fully distributed pattern recognition system within P2P networks that uses the distributed associative memory tree (DASMET) algorithm to detect spam; it is cost-efficient and, unlike server-based systems, not prone to a single point of failure. The algorithm is scalable for large and frequently updated data sets, and is specifically designed for data sets that consist of similar occurring *** have evaluated our system against centralised state-of-the-art algorithms (NN, k-NN, naive Bayes, BPNN and RBFN) and distributed P2P-based algorithms (Ivote-DPV, ensemble k-NN, ensemble naive Bayes, and P2P-GN). The experimental results show that our method is highly accurate, with a 98 to 99% accuracy rate, and incurs a small number of messages: in the best case, it requires only two messages per recall test. In summary, our experimental results show that DASMET performs best, with a relatively small amount of resources for spam detection, compared to other distributed methods.
The computational complexity, large memory requirements, and time-consuming nature of frequent pattern mining are the main motivations for distributing and parallelizing the mining process. Moreover, the emergence of distributed computational and operational environments, in which data are produced and maintained on different distributed data sources, makes parallelization and distribution of the knowledge discovery process inevitable. In this paper, a gossip-based distributed itemset mining (GDIM) algorithm is proposed to extract frequent itemsets, which are a special type of frequent pattern, in a wireless sensor network environment. In this algorithm, the local frequent itemsets of each sensor are extracted using a bit-wise horizontal approach (LHPM), with the nodes clustered using a LEACH-based protocol. Cluster heads use a gossip-based protocol to communicate with each other and find the patterns whose global support is equal to or greater than the specified support threshold. Experimental results show that the proposed algorithm outperforms the best existing gossip-based algorithm in terms of execution time.
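The abstract names a bit-wise horizontal approach but gives no details, so the sketch below shows only the general idea under stated assumptions: each transaction (a row, hence "horizontal") is encoded as an integer bitmask, and the support of an itemset is the number of rows that contain all of its bits. The helper names `encode` and `support` and the toy transactions are hypothetical and are not taken from LHPM or GDIM.

```python
def encode(items, item_index):
    """Encode a collection of items as an integer bitmask (hypothetical helper)."""
    mask = 0
    for item in items:
        mask |= 1 << item_index[item]
    return mask


def support(itemset, encoded_rows, item_index):
    """Count the rows whose bitmask contains every bit of the itemset."""
    mask = encode(itemset, item_index)
    return sum(1 for row in encoded_rows if (row & mask) == mask)


# Assumed toy data standing in for the readings of one sensor.
transactions = [
    {"a", "b", "c"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]
all_items = sorted(set().union(*transactions))
item_index = {item: pos for pos, item in enumerate(all_items)}
encoded_rows = [encode(t, item_index) for t in transactions]

support_ac = support({"a", "c"}, encoded_rows, item_index)
support_abc = support({"a", "b", "c"}, encoded_rows, item_index)
support_d = support({"d"}, encoded_rows, item_index)
```

With a minimum support of 2, `{"a", "c"}` and `{"a", "b", "c"}` would count as locally frequent here while `{"d"}` would not. The containment check is a single AND plus a comparison per row, which suits memory-constrained nodes.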
Model-based co-clustering divides the data along two main axes and simultaneously trains a supervised model for each co-cluster using all other input features. For example, in the rating prediction task of a recommender system, the two main axes are items and users. In each co-cluster, we train a regression model that predicts the rating from other features such as the user's characteristics (e.g., gender), the item's characteristics (e.g., genre), contextual features (e.g., location), and so on. In reality, users and items do not necessarily belong to a single co-cluster, but can be associated with several co-clusters. We extend model-based co-clustering to support fuzzy co-clustering. In this setting, each item-user pair is associated with every co-cluster with some membership grade, which indicates how relevant the pair is to that co-cluster. Furthermore, we propose a distributed algorithm, based on a map-reduce approach, to handle big datasets. Evaluating the fuzzy co-clustering algorithm on three datasets shows a significant improvement over a regular co-clustering algorithm. In addition, the map-reduce version of the fuzzy co-clustering algorithm significantly reduces the runtime.
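The abstract does not say how the membership grades enter the prediction. One natural reading, shown here purely as an assumption, is that the predicted rating for an item-user pair is the membership-weighted average of the per-co-cluster regression models. The function `fuzzy_predict`, the two linear models, and the grades are all hypothetical.

```python
def fuzzy_predict(features, memberships, models):
    """Membership-weighted average of per-co-cluster model outputs.

    `memberships[k]` is the assumed grade of this item-user pair in
    co-cluster k; grades are normalized so they need not sum to one.
    """
    total = sum(memberships)
    weighted = sum(w * model(features) for w, model in zip(memberships, models))
    return weighted / total


# Two assumed per-co-cluster regression models over a single feature.
models = [
    lambda x: 1.0 + 0.5 * x[0],
    lambda x: 4.0 - 0.5 * x[0],
]

# An item-user pair that belongs mostly to the first co-cluster.
prediction = fuzzy_predict([2.0], [0.75, 0.25], models)
```

A regular (hard) co-clustering is the special case in which exactly one grade is 1 and the rest are 0, so the prediction comes from a single model.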
Clustering is one of the important data mining issues, especially for large and distributed data analysis. Distributed computing environments such as peer-to-peer (P2P) networks involve separated, scattered data sources distributed among the peers. Owing to the unpredictable growth and dynamic nature of P2P networks, the data held by peers are constantly changing. Because of the high volume of computation and communication, and because of privacy concerns, these types of data should be processed in a distributed way, without central management. Today, most applications of P2P systems focus on unstructured P2P systems. In unstructured P2P networks, gossiping is a simple and efficient method of communication that can adapt to the dynamic conditions of these networks. Recently, several algorithms with different pros and cons have been proposed for data clustering in P2P networks. In this paper, a Gossip-Based Distributed Clustering algorithm for P2P networks, called GBDC-P2P, is proposed by combining a novel method for extracting representative data, a gossip-based protocol, and a new centralized clustering method. The GBDC-P2P algorithm is suitable for data clustering in unstructured P2P networks and adapts to the dynamic conditions of these networks. In the GBDC-P2P algorithm, peers perform data clustering in a distributed manner, solely through communication with their neighbours. GBDC-P2P does not need to rely on a central server, and it operates asynchronously. Evaluation results demonstrate the superior performance of the GBDC-P2P algorithm. In addition, a comparative analysis with other well-established methods illustrates the efficiency of the proposed method. (C) 2016 Elsevier B. V. All rights reserved.
ISBN: 9781509020805 (print)
In today's world, a large number of transactions can be performed on social media. In such a distributed environment, where timely access to data is important, it becomes difficult to generate strong association rules. It is therefore necessary to reduce these rules, increasing the rule reduction rate. This paper uses the w-Tabular algorithm, which combines a weight assignment method with the Quine-McCluskey method and which increases data processing time in a distributed system.
ISBN: 9781509004454 (print)
The article describes the mapping of an algorithm decomposed into functional blocks onto a distributed execution environment. In addition, it describes the architecture and implementation of a service for running data mining algorithms in that environment. As an example, it describes the implementation of, and experiments with, the 1R classification algorithm.
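For readers unfamiliar with 1R, the sketch below shows the standard single-machine form of the algorithm on categorical attributes: for each attribute, map every value to its majority class, then keep the attribute whose rule makes the fewest training errors. It is not the article's distributed implementation; the function name `one_r` and the toy data are assumptions. The per-attribute loop body is independent of the other attributes, which is one plausible way such an algorithm could be split into functional blocks.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute_index, rule, errors) for the best single-attribute rule."""
    best = None
    for attr in range(len(rows[0])):
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # Each value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the training rows that fall outside the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Assumed toy data: (outlook, temperature) -> play?
rows = [
    ("sunny", "hot"),
    ("sunny", "mild"),
    ("rainy", "mild"),
    ("rainy", "hot"),
    ("overcast", "hot"),
]
labels = ["no", "no", "yes", "yes", "yes"]

best_attr, best_rule, best_errors = one_r(rows, labels)
```

On this toy data the outlook attribute (index 0) separates the classes perfectly, so it is selected with zero training errors.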
In the internet-based e-business environment, most business data are distributed, heterogeneous, and private. To achieve true business intelligence, mining large amounts of distributed data is necessary. Through a thorough literature review, this paper identifies four main issues in distributed data mining (DDM) systems for e-business and classifies modern DDM systems into three classes with representative samples. To address the identified issues, this paper proposes a novel DDM model named DRHPDM (Data source Relevance-based Hierarchical Parallel Distributed data Mining model). In addition, to improve the quality of the final result, the data sources are divided into a centralized mining layer and a distributed mining layer according to their relevance. To improve the openness, cross-platform ability, and intelligence of the DDM system, web service and multi-agent technologies are adopted. The feasibility of DRHPDM was verified by building a prototype system and applying it to a web usage mining scenario.
ISBN: 9789811302923; 9789811302916 (print)
The analysis of big data requires powerful, scalable, and accurate data analytics techniques, which traditional data mining and machine learning do not provide as a whole. Therefore, new data analytics frameworks are needed to deal with big data challenges such as the volume, velocity, veracity, and variety of the data. Distributed data mining is a promising approach for big data sets, as they are usually produced in distributed locations, and processing them at their local sites significantly reduces response times, communication, and so on. In this paper, we study the performance of a distributed clustering technique called Dynamic Distributed Clustering (DDC). DDC can generate clusters remotely and then aggregate them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated DDC using two types of communication (synchronous and asynchronous) and tested it under various load distributions. The experimental results show that the approach has superlinear speed-up, scales up very well, and can take advantage of recent programming models, such as MapReduce, as its results are not affected by the type of communication.
ISBN: 9781538643402 (print)
The paper describes necessary and sufficient conditions for the parallel execution of functions in data mining algorithms. These conditions take into account the data dependencies between functions, based on the sets of mining model elements that the functions use and modify. We determine the conditions for parallel execution in computing environments with distributed memory and with shared memory. As an example, we describe how the conditions are determined for the parallel execution of the functions of a Naive Bayes classifier.
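The paper's formal conditions are not reproduced in the abstract, so the sketch below only illustrates why Naive Bayes training is a convenient example: its model elements are plain counts, and counting over disjoint data partitions reads and writes separate partial models that can be merged by addition afterwards. The names `count_partition` and `merge`, and the toy data, are assumptions and not the paper's notation.

```python
from collections import Counter


def count_partition(rows):
    """Build partial Naive Bayes counts from one data partition."""
    class_counts = Counter()
    feature_counts = Counter()
    for features, label in rows:
        class_counts[label] += 1
        for pos, value in enumerate(features):
            feature_counts[(label, pos, value)] += 1
    return class_counts, feature_counts


def merge(left, right):
    """Combine two partial models by adding their counts element-wise."""
    return left[0] + right[0], left[1] + right[1]


# Assumed toy data: ((colour, shape), fruit).
data = [
    (("red", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
]

# The two partitions could be counted on different workers in any order.
merged = merge(count_partition(data[:2]), count_partition(data[2:]))
sequential = count_partition(data)
```

Because the merged counts equal the counts from a single sequential pass, the partition-level counting functions have no data dependency on one another in this sketch.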