Educational Data Mining (EDM) is the application of data mining methods in the educational domain. EDM data are often mixed (i.e., they combine text and numeric data types). Grouping or clustering such data is challenging because similarity between mixed-type records is poorly defined. Existing partition clustering algorithms for handling such data are based on two approaches: data type conversion, where all variables are converted to a single data type, and a mixed approach, where the similarity measures of the different data types are merged into a single similarity measure, either through a weighted sum, as in Gower's distance, or through a mixed dissimilarity function, as used in the k-Medoids algorithm. Data type conversion causes information loss, and this aspect is not discussed in the existing research literature. This study systematically reviews fifty-three years (1971 to 2024) of research on partition clustering algorithms applied to mixed data in EDM. A review of 104 research articles found that most partitional clustering algorithms handle continuous or categorical variables but not mixed variables. Researchers and practitioners often cite the lack of methods for analyzing continuous and categorical variables together. Therefore, developing machine learning algorithms that can handle the mixed data inherently present in the educational domain is becoming increasingly important. In addition to a comparative analysis and an analysis based on several factors, this article identifies research gaps and outlines future directions.
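The weighted-sum idea behind Gower's distance can be illustrated with a minimal sketch. It assumes that each numeric feature is scaled by its observed range and that each categorical feature contributes 0 on a match and 1 on a mismatch. The function name, the record layout, and the student data are hypothetical and are not taken from the reviewed articles.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str]


def gower_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Sequence[Optional[float]],
    weights: Optional[List[float]] = None,
) -> float:
    """Weighted Gower dissimilarity between two mixed-type records.

    numeric_ranges[i] holds the range (max - min) of numeric feature i,
    or None when feature i is categorical.
    """
    if weights is None:
        weights = [1.0] * len(a)
    total = 0.0
    for x, y, value_range, weight in zip(a, b, numeric_ranges, weights):
        if value_range is None:
            # Categorical feature: simple matching (0 if equal, 1 otherwise).
            partial = 0.0 if x == y else 1.0
        else:
            # Numeric feature: absolute difference scaled into [0, 1].
            partial = abs(x - y) / value_range if value_range > 0 else 0.0
        total += weight * partial
    return total / sum(weights)


# Hypothetical student records: (exam score, weekly study hours, program).
student_a = (80.0, 10.0, "math")
student_b = (60.0, 10.0, "physics")
ranges = [100.0, 40.0, None]  # assumed feature ranges; None marks a categorical feature
distance = gower_distance(student_a, student_b, ranges)
```

With equal weights, the two records differ by 0.2 on the score, 0 on the study hours, and 1 on the program, so the result is their average, about 0.4.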
Clustering is one of the most important unsupervised machine learning tasks and is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, classical clustering algorithms cannot meet the requirements of clustering big data. Spark is one of the most popular parallel processing platforms for big data, and researchers have proposed many parallel clustering algorithms based on it. In this paper, the existing Spark-based parallel clustering algorithms are classified and summarized, the parallel design framework of each kind of algorithm is discussed, and, after the different kinds of algorithms are compared, directions for future research are discussed.
Both the research and the industrial communities have recently shown great interest in upgrading the existing power grid into a smart grid (SG). More specifically, smart metering and communication methods have recently been studied extensively in SG. However, the design and development of an efficient routing protocol in a Radio Frequency (RF) mesh network that connects the advanced metering infrastructure (AMI) to collectors, and vice versa, depends heavily on the positions of the routers. In this spirit, this paper focuses on optimizing the positions of the available routers to achieve the highest possible connectivity between smart meters and collectors. To do so, we have used two well-known clustering algorithms, the maximum distance to average vector (MDAV) and the Lloyd algorithm, to place routers at optimized positions in a smart grid scenario. Extensive simulations have been carried out with the proposed algorithms and show significant improvement with respect to the initial distribution of routers. (C) 2019 Elsevier B.V. All rights reserved.
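As a rough illustration of the Lloyd algorithm mentioned above, the following sketch alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. It is a generic textbook version, not the paper's router-placement procedure. Treating points as smart meter coordinates and centers as router positions is an assumption made only for this example, and MDAV is not shown.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def lloyd(points: Sequence[Point], centers: Sequence[Point], iterations: int = 100) -> List[Point]:
    """Run Lloyd iterations from the given initial centers and return the final centers."""
    current = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in current]
        for p in points:
            nearest = min(range(len(current)), key=lambda i: math.dist(p, current[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        updated = []
        for center, members in zip(current, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(center)  # keep a center that attracted no points
        if updated == current:
            break  # converged: no center moved
        current = updated
    return current


# Hypothetical meter coordinates and initial router positions.
meters = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
initial_routers = [(1.0, 1.0), (2.0, 1.0)]
routers = lloyd(meters, initial_routers)
```

In this toy layout the two routers start close together and end up at the centers of the two meter groups, (0, 1) and (10, 1).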
The development of the phase-imaging ion-cyclotron resonance (PI-ICR) technique for use in Penning trap mass spectrometry (PTMS) increased the speed and precision with which PTMS experiments can be carried out. In PI-ICR, data sets of the locations of individual ion hits on a detector are created, showing how ions cluster together into spots according to their cyclotron frequency. Ideal data sets would consist of a single, 2D-spherical spot with no noise, but in practice data sets typically contain multiple spots, non-spherical spots, or significant noise, all of which can make determining the locations of spot centers non-trivial. A method for assigning groups of ions to their respective spots and determining the spot centers is therefore essential for further improving precision and confidence in PI-ICR experiments. We present the class of Gaussian mixture model (GMM) clustering algorithms as an optimal solution. We show that on simulated PI-ICR data, several types of GMM clustering algorithms perform better than other clustering algorithms over a variety of typical scenarios encountered in PI-ICR. The mass spectra of 163Gd, 163mGd, 162Tb, and 162mTb measured using PI-ICR at the Canadian Penning trap mass spectrometer were checked using GMMs, producing results that were in close agreement with the previously published values.
Induction heating (IH) devices transfer electric power to cookware without contact via an electromagnetic field. The temperature of the cookware is therefore measured remotely, and early detection of cookware overheating ensures the user's safety and extends the remaining useful life of electronic components. This work presents a model for outlier detection in IH systems based on clustering algorithms and on data measured using two thermal sensors. First, a healthy dataset is collected for the temperatures of inverters and cookware under different sizes and materials of cookware items, different amounts of water in the cookware, and different amounts of electrical power. After that, K-means and fuzzy c-means are used to cluster this normal dataset, and the maximum distance between the cluster centers and the data points is selected as a threshold. Finally, the clustered model is evaluated using a testing dataset that includes outliers. According to the results, the K-means algorithm detected around 96% of the produced outliers, whereas the fuzzy c-means algorithm detected around 68%. In conclusion, deploying the clustering model for outlier detection is simple and uses only the threshold and the cluster centers.
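The thresholding step described above can be sketched as follows. The sketch assumes that cluster centers have already been fitted on the healthy data (for example by K-means), that the threshold is the largest distance from any healthy point to its nearest center, and that a test point is flagged when its nearest-center distance exceeds that threshold. The function names and temperature values are hypothetical, and the abstract does not state exactly how the distance is computed.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def nearest_center_distance(point: Point, centers: Sequence[Point]) -> float:
    """Euclidean distance from a point to the closest cluster center."""
    return min(math.dist(point, center) for center in centers)


def fit_threshold(normal_points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Largest nearest-center distance observed in the healthy dataset."""
    return max(nearest_center_distance(p, centers) for p in normal_points)


def is_outlier(point: Point, centers: Sequence[Point], threshold: float) -> bool:
    """Flag a point lying farther from every center than any healthy point did."""
    return nearest_center_distance(point, centers) > threshold


# Hypothetical (inverter temperature, cookware temperature) readings in degrees Celsius.
centers = [(40.0, 60.0), (55.0, 100.0)]  # assumed to come from a prior clustering run
healthy = [(41.0, 62.0), (39.0, 58.0), (56.0, 103.0), (54.0, 98.0)]
threshold = fit_threshold(healthy, centers)

overheated = is_outlier((90.0, 180.0), centers, threshold)
normal = is_outlier((40.0, 61.0), centers, threshold)
```

Only the centers and the single threshold value are needed at detection time, which matches the simplicity claimed in the abstract.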
In many disciplines, the evaluation of algorithms for processing massive data is a challenging research issue. Different algorithms can produce different or even conflicting evaluation performance, and this phenomenon has not been fully investigated. This paper aims to propose a solution scheme for the evaluation of clustering algorithms that reconciles different or even conflicting evaluation performance. The goal of this research is to propose and develop a model, called decision-making support for evaluation of clustering algorithms (DMSECA), that evaluates clustering algorithms by merging expert wisdom in order to reconcile differences in their evaluation performance for information fusion during a complex decision-making process. The proposed model is tested and verified by an experimental study using six clustering algorithms, nine external measures, and four MCDM methods on 20 UCI data sets, including a total of 18,310 instances and 313 attributes. The proposed model can generate a list of algorithm priorities to produce an optimal ranking scheme that can satisfy the decision preferences of all the participants. The results indicate that the developed model is an effective tool for selecting the most appropriate clustering algorithms for given data sets. Furthermore, the proposed model can reconcile different or even conflicting evaluation performance to reach a group agreement in a complex decision-making environment.
The density peak clustering (DPC) algorithm has become a well-known clustering method during the last decade. Because research communities regard DPC as a powerful tool that has been applied to various fields, each with distinct contemporary issues and future prospects, it is time to summarize the research progress of DPC and help researchers quickly learn which issues have been resolved, which remain open, and what to do in the future. In this survey, we first describe several frequently used synthetic, UCI, and image datasets, and then review all the DPC-related works, categorized into: finding clusters with different densities, optimizing parameter values, preventing domino effects, clustering large datasets, implementing parameter-less DPC, clustering mixed data, and clustering imbalanced data. Then, we compare the recent and widely used extensions of DPC on 26 synthetic and UCI datasets. Finally, based on the above analysis, the survey concludes with future directions: improving DPC on synthetic and UCI datasets; revisiting challenges such as large-scale data clustering, parameter-less clustering, and privacy-protecting clustering; proposing solutions for deploying DPC on Spark; introducing deep clustering to DPC; and, finally, federating DPC clustering. To the best of our knowledge, this is the first review that summarizes the progress of DPC in the last decade.
The standard driving cycles (DCs) used to evaluate spark-ignition engine-based two-wheelers are inadequate for electric two-wheelers (E2Ws). They also fail to accurately represent the actual driving conditions in specific areas, resulting in inaccurate performance evaluation. The current research focuses on constructing an electric two-wheeler urban driving cycle (E2WUDC) that reflects the driving conditions of a smart city in India. The denoised speed data is used to extract micro-trips and compute their driving parameters. The dimensionality of the data is then reduced using principal component analysis. Subsequently, the data is grouped using various clustering methods, including k-means, X-means, hierarchical clustering, and density-based spatial clustering of applications with noise (DBSCAN). Then, the Calinski-Harabasz index (CHI), the Davies-Bouldin index (DBI), and the silhouette score are used to assess the homogeneity and completeness of the clusters produced by the selected clustering algorithms. Overall, the E2WUDC is developed using X-means, which is selected as a suitable clustering algorithm based on the performance indices. The key driving features of the E2WUDC, total distance and time duration, are 14.49 km and 1914 seconds, with average and maximum driving speeds of 8 and 13.88 m/s, respectively. Ultimately, the work establishes a foundation for assessing the energy economy, driving range, and energy demand for the widespread deployment of electric two-wheelers in urban commuting.
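Of the indices named above, the silhouette score is the simplest to write out. The sketch below follows the standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's score is (b - a) / max(a, b). The toy points are hypothetical and unrelated to the paper's micro-trip data. Treating single-member clusters as scoring 0 is a common convention assumed here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def silhouette_score(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (requires at least two clusters)."""
    cluster_ids = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in cluster_ids
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical 2D points forming two well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_score(toy_points, [0, 0, 1, 1])  # labels follow the groups
bad_score = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across the groups
```

A labeling that follows the two groups scores close to 1, while one that cuts across them scores below 0, which is the kind of contrast such indices use to rank candidate clusterings.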
In this paper we investigate the effect of using clustering algorithms in the reverse engineering field to identify pages that are similar either at the structural level or at the content level. To this end, we have used two instances of a general process that differ only in the measure used to compare web pages. In particular, two web pages are compared at the structural level using the Levenshtein edit distance and at the content level using Latent Semantic Indexing. The static pages of two web applications and one static web site have been used to compare the results achieved by the considered clustering algorithms at both the structural and the content level. On these applications we generally achieved comparable results. However, the investigation has also suggested some heuristics to quickly identify the best partition of web pages into clusters among the possible partitions at both the structural and the content level.
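The structural comparison relies on the Levenshtein edit distance, which counts the minimum number of insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is the standard dynamic-programming formulation. Representing a page's structure as a list of tag names is an assumption made for illustration, since the abstract does not say how pages are encoded.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of single-element edits that transform sequence a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            substitution_cost = 0 if x == y else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete x
                    current[j - 1] + 1,                   # insert y
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical tag sequences for two pages that differ by one inserted element.
page_a = ["html", "body", "p"]
page_b = ["html", "body", "div", "p"]
structural_distance = levenshtein(page_a, page_b)
```

The same function works on plain strings, for example `levenshtein("kitten", "sitting")` evaluates to 3.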
This survey rigorously explores contemporary clustering algorithms within the machine learning paradigm, focusing on five primary methodologies: centroid-based, hierarchical, density-based, distribution-based, and graph-based clustering. Through the lens of recent innovations such as deep embedded clustering and spectral clustering, we analyze the strengths, limitations, and the breadth of application domains, ranging from bioinformatics to social network analysis. Notably, the survey introduces novel contributions by integrating clustering techniques with dimensionality reduction and proposing advanced ensemble methods to enhance stability and accuracy across varied data structures. This work uniquely synthesizes the latest advancements and offers new perspectives on overcoming traditional challenges like scalability and noise sensitivity, thus providing a comprehensive roadmap for future research and practical applications in data-intensive environments.