Educational Data Mining (EDM) is the application of data mining methods in the educational domain. EDM data are often mixed, combining text and numeric data types. Grouping or clustering such data is challenging because similarity between mixed-type records is poorly defined. Existing partition clustering algorithms for such data follow two approaches: data type conversion, where all variables are converted to a single data type, and a mixed approach, where the similarity measures of the different data types are merged into a single similarity measure, either by a weighted sum, as in Gower's distance, or by a mixed dissimilarity function, as used in the k-Medoids algorithm. Data type conversion causes information loss, an aspect not discussed in the existing research literature. This study systematically reviews fifty-three years of research (1971 to 2024) on partition clustering algorithms applied to mixed data in EDM. A review of 104 research articles found that most partitional clustering algorithms handle continuous or categorical variables but not mixed variables. Researchers and practitioners often cite the lack of methods for analyzing continuous and categorical variables together. Developing machine learning algorithms that can handle the mixed data inherently present in the educational domain is therefore becoming increasingly important. In addition to a comparative analysis and an analysis based on several factors, this article identifies research gaps and outlines future directions.
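The weighted-sum idea behind Gower's distance can be illustrated with a minimal sketch. It assumes equal weights for all fields, range-normalized absolute differences for numeric fields, and simple mismatch for categorical fields. The function name `gower_distance`, the `numeric_ranges` parameter, and the student records are hypothetical and are not taken from the reviewed article.

```python
def gower_distance(a, b, numeric_ranges):
    """Equal-weight Gower distance between two mixed-type records.

    numeric_ranges maps the index of each numeric field to its range
    (max - min) over the dataset; every other field is treated as
    categorical. This layout is an assumption made for the sketch.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            value_range = numeric_ranges[i]
            # Numeric field: absolute difference scaled into [0, 1].
            total += abs(x - y) / value_range if value_range > 0 else 0.0
        else:
            # Categorical field: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)


# Hypothetical student records: (exam score, major, scholarship).
students = [
    (85.0, "math", "yes"),
    (65.0, "math", "no"),
    (45.0, "art", "no"),
]
score_ranges = {0: 40.0}  # exam scores span 45..85 in this toy dataset

d01 = gower_distance(students[0], students[1], score_ranges)
d02 = gower_distance(students[0], students[2], score_ranges)
```

Because every field contributes a value in [0, 1], no field is converted to another data type, which is the property that separates this approach from data type conversion.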
Clustering algorithms are becoming popular and are widely applied in many academic fields, such as machine learning, pattern recognition, and artificial intelligence. The explosive growth in data scale and the wide variety of applications pose significant challenges for accelerating these algorithms. However, previous studies focus mainly on raw speedup and pay insufficient attention to the flexibility of the accelerator to support various applications. To accelerate different clustering algorithms in one accelerator, in this article we design an FPGA-based acceleration framework for four state-of-the-art clustering methods: K-means, PAM, SLINK, and DBSCAN. We provide both Euclidean and Manhattan distances as similarity metrics in the accelerator design paradigm, along with a custom instruction set to operate the accelerators within each application. To evaluate the performance and hardware cost of the accelerator, we constructed a hardware prototype on a state-of-the-art Xilinx FPGA platform. Experimental results demonstrate that the accelerator framework achieves up to 23x speedup over an Intel Xeon processor and is 9.46x more energy efficient than NVIDIA GTX 750 GPU accelerators.
Estimation of the aerobic phase end-point is commonly used to improve the operating capacity of a sequencing batch reactor. In this paper, a software tool and a configuration of the dissolved oxygen closed control loop are proposed to detect the aerobic end-point of a sequencing batch reactor in a coke wastewater treatment plant. The proposed software tool consists of a self-organizing map (SOM) and clustering algorithms. Moreover, a validation method for SOM training is outlined, and a predefined criterion for determining the SOM size is tested. (c) 2005 Elsevier Ltd. All rights reserved.
One of the essential aspects of broadcast monitoring is to detect and then extract commercial blocks in telecast news videos. Research to date has relied almost entirely on preconceived characteristics associated with a channel. With advertisers constantly looking to work around existing policies, relying on the nature of a channel during an advertisement does not suffice. The other way to identify a commercial is the frequentist approach. However, sponsored programs and other programs often share similar time in any given hour, which renders the frequentist approach almost useless. This paper therefore uses a machine learning based approach, which is more generic and can exploit the inherent differences between commercials and their non-commercial counterparts to classify and cluster commercials in news videos. Datasets containing 90 hours of recordings from five news channels in the US, England, and India were used to train and test nine classifiers (K Neighbors, Support Vector Machine, Decision Tree, Random Forests, Ada Boost, Gradient Boost, Gaussian NB, Linear Discriminant Analysis, and Quadratic Discriminant Analysis) and five clustering algorithms (K Means, Agglomerative, Birch, Mini-Batch K Means, and Gaussian Mixture). Our results show that Random Forests outperforms all the other classifiers with respect to F1 score and median time to train and test on each of these datasets, each of which consists of features of shots extracted from 18 hours of video. Similarly, Mini-Batch K Means was found to perform best at forming clusters of news and commercials.
The customized bus operating mode based on passenger demand is an effective way to solve the problem of bus services in low travel density areas such as urban fringe areas, ensure the profitability of bus enterprises, and promote the development of customized bus and other emerging bus services. First, this study introduces the concept and operating principle of the customized bus, identifies its advantages and disadvantages, evaluates the relevant theories of customized bus line and station planning, and determines the principles of such planning. Second, according to the characteristics of the customized bus, this study proposes a novel line and station planning method based entirely on passenger travel demand, including travel demand data processing, traffic community division, joint station planning, the establishment of a customized bus line planning model, and the solution of the planning model. Finally, the proposed planning method and the improved ant colony optimization and clustering are verified by simulation experiments. The experimental results show that the station and line planning method proposed in this paper can better realize line planning for demand-responsive customized buses and meet diverse passenger travel needs.
Boundary extraction is a key task in many image analysis operations. This paper describes a class of constrained clustering algorithms for object boundary extraction that includes several well-known algorithms proposed in different fields (deformable models, constrained clustering, data ordering, and traveling salesman problems). The algorithms belonging to this class are obtained by minimizing a cost function with two terms: a quadratic regularization term and an image-dependent term defined by a set of weighting functions. The cost function is minimized by lowpass filtering the previous model shape and by attracting the model units toward the centroids of their attraction regions. To define a new algorithm belonging to this class, the user has to specify a regularization matrix and a set of weighting functions that control the attraction of the model units toward the data. The usefulness of this approach is twofold: it provides a unified framework for many existing algorithms in pattern recognition and deformable models, and it allows the design of new recursive schemes.
Three components of a machine cell formation process (similarity coefficients, clustering algorithms, and performance measures) are studied. A new performance measure is introduced, and a comparative study of three different similarity coefficients (Jaccard's similarity coefficient, the weighted similarity coefficient, and the commonality score) is conducted.
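As a small illustration of the first of these coefficients, the sketch below computes Jaccard's similarity between two machines, assuming each machine is represented by the set of parts it processes. The function name `jaccard_similarity` and the part identifiers are hypothetical. The weighted similarity coefficient and the commonality score are not shown because the abstract does not define them.

```python
def jaccard_similarity(parts_a, parts_b):
    """Jaccard's coefficient: parts visiting both machines divided by
    parts visiting at least one of them."""
    union = parts_a | parts_b
    if not union:
        # Two machines that process no parts: define similarity as 0.
        return 0.0
    return len(parts_a & parts_b) / len(union)


# Hypothetical routing data: the parts processed on each machine.
machine_1 = {"p1", "p2", "p3"}
machine_2 = {"p2", "p3", "p4"}

similarity = jaccard_similarity(machine_1, machine_2)  # 2 shared / 4 total
```

A clustering algorithm for cell formation would typically start from the matrix of such pairwise values over all machines.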
This paper reports the results of a numerical comparison of two versions of the fuzzy c-means (FCM) clustering algorithm. In particular, we propose and exemplify an approximate fuzzy c-means (AFCM) implementation based upon replacing the necessary "exact" variates in the FCM equation with integer-valued or real-valued estimates. This approximation enables AFCM to exploit a lookup table approach for computing Euclidean distances and for exponentiation. The net effect of the proposed implementation is that CPU time during each iteration is reduced to approximately one sixth of the time required for a literal implementation of the algorithm, while apparently preserving the overall quality of the terminal clusters produced. The two implementations are tested numerically on a nine-band digital image, and a pseudocode subroutine is given for the convenience of applications-oriented readers. Our results suggest that AFCM may be used to accelerate FCM processing whenever the feature space consists of tuples having a finite number of integer-valued coordinates.
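For orientation, one iteration of the standard ("literal") FCM algorithm can be sketched as follows. This is the textbook membership and center update with Euclidean distances, not the AFCM lookup-table variant, whose details the abstract does not give. The function names, the fuzzifier default `m=2.0`, and the sample points are assumptions made for the sketch.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update.

    Returns rows u[k][i], the membership of point k in cluster i,
    computed as 1 / sum_j (d_ki / d_kj) ** (2 / (m - 1)).
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for point in points:
        dists = [math.dist(point, center) for center in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: split the
            # full membership among those centers only.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]
        )
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Standard FCM center update: mean of the points weighted by u ** m."""
    dim = len(points[0])
    centers = []
    for i in range(len(memberships[0])):
        weights = [row[i] ** m for row in memberships]
        total = sum(weights)
        centers.append(
            tuple(sum(w * p[d] for w, p in zip(weights, points)) / total
                  for d in range(dim))
        )
    return centers


# Hypothetical one-dimensional-on-a-plane data with two initial centers.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
u = fcm_memberships(points, [(0.0, 0.0), (4.0, 0.0)])
new_centers = fcm_centers(points, u)
```

The distance computation and the exponentiation in `fcm_memberships` are the two operations the abstract says AFCM replaces with table lookups.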
Patient stratification has been studied widely to tackle subtype diagnosis problems for effective treatment. Due to the curse of dimensionality and the poor interpretability of the data, constructing a stratification model with high diagnostic ability and good generalization remains a long-standing challenge. To address these problems, this article proposes two novel evolutionary multiobjective clustering algorithms with ensemble (NSGA-II-ECFE and MOEA/D-ECFE), with four cluster validity indices used as the objective functions. First, an effective ensemble construction method is developed to enrich the ensemble diversity. Next, an ensemble clustering fitness evaluation (ECFE) method is proposed to evaluate the ensembles by measuring the consensus clustering under those four objective functions. To generate the consensus clustering, ECFE exploits the hybrid co-association matrix from the ensembles and then dynamically selects a suitable clustering algorithm for that matrix. Multiple experiments were conducted to demonstrate the effectiveness of the proposed algorithms in comparison with seven clustering algorithms, twelve ensemble clustering approaches, and two multiobjective clustering algorithms on 55 synthetic datasets and 35 real patient stratification datasets. The experimental results demonstrate the competitive edge of the proposed algorithms over the compared methods. Furthermore, the proposed algorithms are applied to identify cancer subtypes from five cancer-related single-cell RNA-seq datasets.
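The role of a co-association matrix in ensemble clustering can be shown with a minimal sketch. It builds the plain co-association matrix, in which each entry is the fraction of base clusterings that place two samples in the same cluster. The "hybrid" matrix used by ECFE is not specified in the abstract and is not reproduced here. The function name `co_association` and the label vectors are hypothetical.

```python
def co_association(labelings):
    """Plain co-association matrix for a clustering ensemble.

    labelings is a list of label vectors, one per base clustering, all
    over the same samples. Entry [i][j] is the fraction of clusterings
    in which samples i and j share a cluster.
    """
    n_samples = len(labelings[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for labels in labelings:
        for i in range(n_samples):
            for j in range(n_samples):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


# Three hypothetical base clusterings of four patients. Label values are
# arbitrary per clustering; only co-membership matters.
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
matrix = co_association(ensemble)
```

A consensus clustering is then usually obtained by running a clustering algorithm on this matrix, treated as a similarity matrix.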
The Fuzzy C-means (FCM) clustering algorithm is an important and popular clustering algorithm used in various application domains such as pattern recognition, machine learning, and data mining. Although this algorithm has shown acceptable performance in diverse problems, the current literature lacks studies on how it can improve the clustering quality of partitions with overlapping classes. The better the clustering quality of a partition, the better the interpretation of the data, which is essential for understanding real problems. This work proposes two robust FCM algorithms to prevent ambiguous membership in clusters. To this end, we compute two types of weights: a weight to avoid the problem of overlapping clusters, and another weight to enable the algorithm to identify clusters of different shapes. We perform a study with synthetic datasets, each containing classes of different shapes and different degrees of overlap. The study also considers real application datasets. Our results indicate that such weights are effective in reducing the ambiguity of membership assignments, thus yielding a better interpretation of the data.