Clustering plays an important role in data mining and machine learning, and intuitionistic fuzzy sets (IFSs) are flexible and practical for dealing with vagueness and uncertainty. To cluster information expressed as intuitionistic fuzzy data, this paper proposes a joint-training auto-encoder based intuitionistic fuzzy clustering algorithm. First, we propose auto-encoder based intuitionistic fuzzy clustering, which combines a similarity measure of IFSs, an auto-encoder, and the k-means algorithm. Then, we propose the joint-training variant, which builds on that algorithm and uses two kinds of similarity measures for the clustering analysis of intuitionistic fuzzy data. Lastly, several experiments verify the effectiveness of the proposed intuitionistic fuzzy clustering algorithms.
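As a rough illustration of the first ingredient named above, a similarity measure of IFSs, here is a minimal Python sketch. The abstract does not say which measure the authors use; this sketch assumes the common "one minus normalized Hamming distance" form with the hesitancy term included, and the names `ifs_similarity` and `similarity_matrix` are hypothetical. The rows of the resulting matrix are the kind of input one might pass to an auto-encoder and then to k-means.

```python
from typing import List, Tuple

# One IFS is a list of (membership, non_membership) pairs, one per attribute.
IFS = List[Tuple[float, float]]


def ifs_similarity(a: IFS, b: IFS) -> float:
    """Similarity = 1 - normalized Hamming distance (ASSUMED measure).

    The hesitancy degree pi = 1 - mu - nu is included in the distance.
    """
    if len(a) != len(b) or not a:
        raise ValueError("IFSs must be non-empty and of equal length")
    total = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(a, b):
        pi_a = 1.0 - mu_a - nu_a
        pi_b = 1.0 - mu_b - nu_b
        total += abs(mu_a - mu_b) + abs(nu_a - nu_b) + abs(pi_a - pi_b)
    return 1.0 - total / (2.0 * len(a))


def similarity_matrix(items: List[IFS]) -> List[List[float]]:
    """Pairwise similarities; row i describes item i relative to all items."""
    return [[ifs_similarity(x, y) for y in items] for x in items]
```

Identical IFSs score 1 and fully opposed ones (membership 1 versus non-membership 1) score 0 under this assumed measure.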
This study examined the electrocardiographic data set recorded at Boston's Beth Israel Hospital, which captures the functioning of the cardiac and neurotransmitter system and the electrical activity of the heart for normal and various irregular heartbeat patterns. The seven types of arrhythmia available in this data set were classified with four widely used classifiers (Fuzzy C-Means, Naive Bayes, Extreme Learning Machine, and K-Means) using multi-class classification methods. Classifier performance was evaluated using the accuracy, sensitivity, and selectivity measures. The results showed that all four classifiers achieved their highest success rate, 99%, on the "Normal" beat type compared with the other arrhythmia types. When the classifiers were compared with one another, the average classification performance of the Naive Bayes and Extreme Learning Machine classifiers was found to be higher. Averaged over all arrhythmia types, the most successful classifier was Naive Bayes, with 92% accuracy and 95% selectivity.
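The three performance measures mentioned above can be sketched per beat type in a one-vs-rest fashion. This is a minimal Python sketch, assuming that "selectivity" denotes what is more often called specificity (true negative rate); the function name `one_vs_rest_metrics` and the example labels are hypothetical and not taken from the study.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                        positive: str) -> Dict[str, float]:
    """Accuracy, sensitivity and selectivity for one class against the rest.

    ASSUMPTION: selectivity is computed as specificity, TN / (TN + FP).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "selectivity": ratio(tn, tn + fp),
    }
```

Averaging these per-class values over all beat types gives one summary figure per classifier, which is one plausible reading of how the reported averages were formed.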
Data clustering techniques are used to aid knowledge discovery when no additional information is available. Several clustering techniques produce reasonable results, although they often produce qualitatively distinct clusterings. In this paper, we study how different clustering algorithms produce different kinds of clusters and how those clusters relate to one another. We also evaluate the possibility of merging differently generated clusterings into a new clustering that none of the original algorithms can produce. The main contribution of this paper is a new algorithm that merges previously generated clusterings based on must-link constraint rules built from agreements among elements observed in those clusterings. This novel approach employs the entropy of agreements to decide which cluster an element should belong to. Experimental results indicate that: 1) our approach can merge characteristics from the original clusterings; 2) in some situations, it captures new information from the data and improves results, mainly from an external perspective; and 3) in no situation has it produced significantly worse results.
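The idea of building must-link constraints from agreements among clusterings can be sketched as follows. This is a simplified Python sketch, not the authors' algorithm: it assumes a pair of elements is must-linked when a chosen fraction of the input clusterings place them together, then groups elements with union-find. The entropy-based assignment step described in the abstract is not reproduced, and the names `must_link_pairs`, `merge_by_must_link` and `min_agreement` are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def must_link_pairs(clusterings: Sequence[Sequence[int]],
                    min_agreement: float = 1.0) -> List[Tuple[int, int]]:
    """Pairs (i, j) that enough clusterings put in the same cluster."""
    n = len(clusterings[0])
    pairs = []
    for i, j in combinations(range(n), 2):
        together = sum(labels[i] == labels[j] for labels in clusterings)
        if together / len(clusterings) >= min_agreement:
            pairs.append((i, j))
    return pairs


def merge_by_must_link(n: int, pairs: Sequence[Tuple[int, int]]) -> List[int]:
    """Group elements connected by must-link pairs (union-find)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots: dict = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

With `min_agreement=1.0` only unanimous agreements become constraints; lowering it admits pairs on which the clusterings partly disagree, which is where an entropy criterion like the one in the abstract would come into play.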
We give the motivation for scoring clustering algorithms and a metric M: A → N from the set of clustering algorithms to the natural numbers which we realize as (Equation Presented) where αi, βi, wi are parameters u...
ISBN: (print) 9781509060306
Big Data has become commonplace in most Internet-based applications, which generate very large data sets by delivering services to planetary-scale numbers of users. Such data sets are considered a valuable source of analytics information and knowledge for many purposes and domains. It is increasingly claimed that Big Data and machine learning, especially data mining, are the basis for developing advanced analytics platforms that turn data into valuable assets, gain competitive advantage, and support better decisions. At the same time, however, Big Data applications are proving to be killer applications for state-of-the-art machine learning and data mining algorithms. Indeed, traditional data mining frameworks such as WEKA and R, and those from big companies such as IBM SPSS Modeler, SAS Enterprise Miner, and Oracle Data Mining, face the challenges of mining large data sets 1) within short times and 2) under high rates of data generation. The envisaged way to deal effectively with these challenges is to move to Cloud-based versions of such frameworks and to develop new frameworks implemented on Cloud platforms. In either case, data mining and machine learning algorithms are being fully implemented on Cloud platforms under the new efficiency and performance requirements of Big Data. Among the newly developed frameworks is Apache Mahout, whose goal is "to build an environment for quickly creating scalable performant machine learning applications". In this paper we analyse the performance of some clustering algorithms of Apache Mahout using a Twitter streaming dataset on a Hadoop MapReduce cluster infrastructure according to various evaluation criteria.
ISBN: (print) 9781509049189
Estimating an appropriate number of clusters has long been a central and difficult issue in clustering research. Different methods exist for this in non-hierarchical clustering; a typical approach is to run the clustering for different numbers of clusters and compare the results using a measure to estimate the cluster number. On the other hand, there is no such method for automatically estimating the number of clusters in agglomerative hierarchical clustering (AHC), since AHC produces, in the form of a dendrogram, a family of clusterings with different cluster numbers at the same time. An exception is the Newman method in network clustering, but this method does not have a useful dendrogram output. The aim of the present paper is to propose new methods for automatically estimating the number of clusters in AHC. We show two approaches for this purpose: one uses a variation of a cluster validity measure, and the other uses a statistical model selection method such as BIC.
ISBN: (print) 9781538642061
K-means is the basic algorithm used for discovering clusters within a dataset, and studies show that it is a widely used clustering technique. This paper discusses methods to enhance the k-means clustering algorithm. These methods improve efficiency, accuracy, performance, and computational time. In essentially all of them, the main aim is to reduce the number of iterations, which in turn decreases the computational time. Various enhancements made to k-means are collected here, so that by using them one can build a new hybrid algorithm that is more efficient, more accurate, and less time-consuming than previous work.
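For reference, the baseline that such enhancements start from can be sketched in a few lines of Python. This is a minimal version of standard (Lloyd's) k-means that also reports the number of iterations, since that is the quantity the surveyed methods aim to reduce. The function name `kmeans` and the choice to pass initial centroids explicitly are assumptions made for illustration; none of the surveyed enhancements is implemented here.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point], int]:
    """Plain Lloyd's k-means; returns (labels, centroids, iterations used)."""
    centroids = [tuple(c) for c in centroids]
    labels: List[int] = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:  # converged: nothing moved
            return labels, centroids, iteration
        centroids = new_centroids
    return labels, centroids, max_iter
```

Because the iteration count depends strongly on the initial centroids, better initialization is one natural place for an enhancement to act.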
ISBN: (print) 9781538624883
In distributed storage systems, documents are shared among multiple Cloud providers and stored within their respective storage servers. In social secret sharing-based distributed storage systems, shares of the documents are allocated according to the trustworthiness of the storage servers. This paper proposes a trust mechanism using machine learning techniques to compute evidence-based trust values. Our mechanism mitigates the effect of colluding storage servers. More precisely, it becomes possible to detect unreliable evidence and establish countermeasures in order to discourage the collusion of storage servers. Furthermore, this trust mechanism is applied to the social secret sharing protocol AS 3 , showing that this new evidence-based trust mechanism enhances the protection of the stored documents.
In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms which perform clustering on a number of small subgraphs and finally patches ...
ISBN: (print) 9781509035663
The main goal of clustering algorithms is to organize a given set of data patterns into groups (clusters), and their main strategy is to group patterns based on their similarity. However, some clustering algorithms also require, as an input parameter, either the number of clusters the induced clustering should have or a threshold value used to limit the number of induced clusters. Both the number of clusters and the threshold value are often unknown; however, it is well known that the results of clustering tasks can be very sensitive to them. This work presents a method for empirically estimating both values. The method is based on multiple runs of sequential clustering algorithms using increasing threshold values. Results from experiments conducted on several data domains from two repositories, UCI and Keel, as well as on a few artificially created data sets, are presented, and a comparative analysis is carried out as evidence of the good estimates of both values given by the method.
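One way to picture "multiple runs of sequential clustering algorithms using increasing threshold values" is the Python sketch below. It assumes a basic sequential scheme (a pattern joins the nearest cluster if its centroid is within the threshold, otherwise it opens a new cluster) and picks the cluster count that persists over the longest run of thresholds. Both the scheme and the plateau rule are textbook devices assumed here for illustration; the abstract does not confirm them as the authors' procedure, and the names `sequential_cluster_count` and `estimate_clusters_and_threshold` are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sequential_cluster_count(points: Sequence[Point], threshold: float) -> int:
    """One pass of an ASSUMED basic sequential scheme; returns cluster count."""
    centroids: List[Point] = []
    sizes: List[int] = []
    for p in points:
        if centroids:
            k = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            if dist(p, centroids[k]) <= threshold:
                # Join cluster k and update its centroid incrementally.
                sizes[k] += 1
                centroids[k] = tuple(c + (x - c) / sizes[k]
                                     for c, x in zip(centroids[k], p))
                continue
        # No cluster is close enough: open a new one.
        centroids.append(tuple(float(x) for x in p))
        sizes.append(1)
    return len(centroids)


def estimate_clusters_and_threshold(points: Sequence[Point],
                                    thresholds: Sequence[float]
                                    ) -> Tuple[int, float]:
    """Pick the count that stays stable over the longest run of thresholds."""
    counts = [sequential_cluster_count(points, t) for t in thresholds]
    best_start, best_len, start = 0, 0, 0
    for i in range(1, len(counts) + 1):
        if i == len(counts) or counts[i] != counts[start]:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i
    # Report the count and the smallest threshold that produced it.
    return counts[best_start], thresholds[best_start]
```

The outcome depends on the range of thresholds scanned and on the order in which patterns are presented, so in practice one would repeat the scan over several orderings.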