Data clustering is a very useful data mining technique to find groups of similar objects present in the dataset. Scalability to handle immense volumes, robustness to intrinsic outlier data and validity of clustering r...
详细信息
Data clustering is a very useful data mining technique to find groups of similar objects present in the dataset. Scalability to handle immense volumes, robustness to intrinsic outlier data and validity of clustering results are the main challenges of any data clustering approach. In order to address these challenges, a fuzzy weighted clustering approach which is comprehensibly parallel and distributed in every phase, is proposed in this research. Although the proposed method can be used for various data clustering purposes, it has been applied in gene expression clustering to reveal functional relationships of genes in a biological process. Conforming to MapReduce, the proposed method also presents a novel similarity measure which benefits from combining ordered weighted averaging and Spearman correlation coefficient. In the proposed method, density reachable genes were joined to establish subclusters. Afterwards, final cluster results were obtained by merging these subclusters. A voting system detects the best weights and consequently the most valid clusters among all possible results for each distinct dataset. The whole algorithm is implemented on a distributed processing platform and it is scalable to process any size of data stored in cloud infrastructures. Precision of resulting clusters were evaluated using some of the well-known cluster validity indexes in the literature. Also, the efficiency of the proposed method in scalability and robustness was compared with recently published similar researches. In all the mentioned comparisons, the proposed method outperformed recent works on the same datasets. (C) 2017 Elsevier Ltd. All rights reserved.
暂无评论