ISBN: (print) 9781450363525
Applying clustering algorithms effectively to large-scale data is an important research topic in data mining. Based on an in-depth analysis of the Hadoop platform architecture and the canopy-kmeans clustering algorithm, this work optimizes and parallelizes canopy-kmeans. The data packets are grouped and sampled using statistical methods before clustering, which facilitates parallelization and reduces time complexity. The selection of the initial canopy center points is optimized using the minimum-maximum principle, an outlier-average sampling method is used to ensure that samples are drawn uniformly from the original data, and the k-means iterative calculation is optimized. The improved algorithm is designed and implemented in parallel on the MapReduce framework of the Hadoop platform. Experiments show that the improved parallel canopy-kmeans algorithm is effective and converges when clustering massive amounts of numerical data, and that it improves both clustering accuracy and timeliness to some degree.
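The abstract does not give the exact selection rule, so the following is only a minimal single-machine sketch of one common reading of the minimum-maximum principle: each new initial center is the point whose distance to its nearest already-chosen center is largest, and the resulting centers then seed an ordinary k-means loop. The function names `max_min_centers` and `kmeans`, the fixed `k`, and the choice of the first point as the first center are assumptions made for illustration; the sampling step and the MapReduce parallelization described above are not shown.

```python
import math


def max_min_centers(points, k):
    """Pick k initial centers with the max-min (farthest-point) rule.

    Assumption: the first point seeds the selection; the paper may
    choose the first center differently.
    """
    centers = [points[0]]
    # min_dist[i] is the distance from points[i] to its nearest chosen center.
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        min_dist = [min(d, math.dist(p, points[idx]))
                    for p, d in zip(points, min_dist)]
    return centers


def kmeans(points, centers, iters=20):
    """Plain Lloyd-style k-means seeded with the given centers."""
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster ends up empty.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Seeding k-means with well-separated centers tends to reduce the number of iterations needed, which is consistent with the timeliness improvement the abstract reports, although the sketch itself makes no performance claim.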
Existing methods for detecting abnormal user electricity consumption have difficulty classifying users with similar consumption patterns. To address this, this paper proposes an unsupervised isolation forest model for abnormal electricity consumption detection, based on a canopy-kmeans algorithm improved with weighted density. First, we propose a composite parameter analysis method for user electricity consumption patterns, volatility, trends, and correlations, using Irish smart meter data. This method combines data cleaning, interpolation, and feature construction. Principal component analysis is then introduced to fuse features across layers and reduce the dimensionality of the user electricity consumption data. Next, we introduce the weighted-density-improved canopy-kmeans clustering algorithm, which determines the K value and the clustering centers using the maximum weight product method, based on definitions of sample density, average intra-class sample distance, and inter-class distance in the multilayer fusion feature data. Finally, we propose a mechanism that fuses the weighted-density-improved canopy-kmeans algorithm with the isolation forest algorithm to construct a model for detecting abnormal power usage from the multilayer fusion feature data. The results show that the multilayer fusion feature parameters vary in size and discretization among different user types, which enables the classification of users with diverse electricity consumption patterns. Moreover, the anomaly detection model based on multilayer fusion feature data analysis improves accuracy, recall, and F1 score compared with other algorithms.
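The abstract names the ingredients of the maximum weight product method (sample density, average intra-class sample distance, inter-class distance) but not their exact definitions, so the sketch below is only one plausible, simplified reading. It assumes a neighborhood radius equal to the mean pairwise distance, defines density as the neighbor count, and scores each candidate by density times separation from the chosen centers divided by the mean distance to its neighbors. The function name `density_canopy_centers`, the radius choice, and the weight formula are all assumptions for illustration, not the paper's definitions.

```python
import math


def density_canopy_centers(points):
    """Choose cluster centers and K by a density-weighted product rule.

    Assumed simplification: K is however many centers are picked before
    every point has been covered by some chosen center's neighborhood.
    Requires at least two points.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Assumed radius: mean pairwise distance over all distinct pairs.
    radius = sum(sum(row) for row in dist) / (n * (n - 1))
    # Each point's neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if dist[i][j] <= radius]
                 for i in range(n)]
    density = [len(nb) for nb in neighbors]
    # Mean distance to neighbors: smaller means a tighter neighborhood.
    spread = [sum(dist[i][j] for j in neighbors[i]) / len(neighbors[i])
              for i in range(n)]

    candidates = set(range(n))
    chosen = []
    while candidates:
        def weight(i):
            # Separation from the nearest chosen center (1.0 before any exist).
            sep = min((dist[i][c] for c in chosen), default=1.0)
            return density[i] * sep / (spread[i] + 1e-9)

        best = max(sorted(candidates), key=weight)
        chosen.append(best)
        # Points covered by the new center are no longer candidates.
        candidates -= set(neighbors[best])
    return [points[i] for i in chosen]
```

In a pipeline like the one described, the returned centers would seed k-means on the PCA-reduced features, and an isolation forest (for example scikit-learn's `IsolationForest`) could then score anomalies on the clustered data; how the paper actually fuses the two steps is not specified in the abstract.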