Model-based clustering is a statistical approach to cluster analysis, which has been successfully deployed in a number of domains due to its principled framework, clear assumptions, and adaptability. For these reasons, there has been substantial interest in applying model-based clustering methods to flow cytometry and mass cytometry data. The identification of relevant cell populations is a crucial step in the analysis of cytometry data for immunological research. Technological advances have led to a rapid increase in the dimensionality and complexity of cytometry data, prompting significant interest in the use of clustering algorithms in place of traditional manual data analysis techniques for cell population identification. This article highlights how model-based clustering methods, such as mixture models, have been adapted to meet the many interesting and unusual challenges that present themselves to the researcher when analyzing flow and mass cytometry data. These innovations demonstrate that there is considerable potential for further methodological development and collaboration between the cytometry and model-based clustering research communities.
Quite often real data exhibit non-normal features, such as asymmetry and heavy tails, and present a latent group structure. In this paper, we first propose the multivariate skew shifted exponential normal distribution that can account for these non-normal characteristics. Then, we use this distribution in a finite mixture modeling framework. An EM algorithm is illustrated for maximum-likelihood parameter estimation. We provide a simulation study that compares the fitting performance of our model with those of several alternative models. The comparison is also conducted on a real dataset concerning the log returns of four cryptocurrencies.
Model-based clustering is a popular tool renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also considered in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.
We propose a procedure, called T-funHDDC, for clustering multivariate functional data with outliers, which extends the functional high dimensional data clustering (funHDDC) method (Schmutz et al. in Comput Stat 35:1101-1131, 2020) by considering a mixture of multivariate t distributions. We define a family of latent mixture models following the approach used for the parsimonious models considered in funHDDC, either constraining the degrees of freedom of the multivariate t distributions to be equal across the mixture components or leaving them unconstrained. The parameters of these models are estimated using an expectation maximization algorithm. In addition to proposing the T-funHDDC method, we add a family of parsimonious models to C-funHDDC, which is an alternative method for clustering multivariate functional data with outliers based on a mixture of contaminated normal distributions (Amovin-Assagba et al. in Comput Stat Data Anal 174:107496, 2022). We compare T-funHDDC, C-funHDDC, and other existing methods on simulated functional data with outliers and on real-world data. T-funHDDC outperforms funHDDC when applied to functional data with outliers, and its good performance makes it an alternative to C-funHDDC. We also apply the T-funHDDC method to the analysis of traffic flow in Edmonton, Canada.
Background: Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. There remain issues such as how to cluster molecular sequences accurately and in particular how to evaluate the certainty of clustering results. Results: We presented a model-based clustering method to analyze molecular sequences, described a subset bootstrap scheme to evaluate the certainty of the clusters, and showed an intuitive way to examine clusters using 3D visualization. We applied the above approach to analyze influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for highly pathogenic H5N1 avian influenza, which agrees with previous findings. The certainty that a given sequence can be correctly assigned to a cluster was 1.0 in all cases, and the certainty for a given cluster was also very high (0.92-1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; this variation is likely attributable to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method are all higher than those calculated based upon the standard bootstrap method, suggesting our bootstrap scheme is applicable for the estimation of clustering certainty. Conclusions: We formulated a clustering analysis approach with the estimation of certainties and 3D visualization of sequence data. We analysed 2 sets of influenza A HA sequences and the results indicate our approach was applicable for clustering analysis of influenza viral sequences.
Mixed data refers to a mixture of continuous and categorical variables. The clustering problem with mixed data is a long-standing statistical problem. The latent Gaussian mixture model, a model-based approach for such a problem, has received attention owing to its simplicity and interpretability. However, this approach is prone to dimensionality problems. Specifically, parameters must be estimated for each group, and the number of covariance parameters is quadratic in the number of variables. To address this, we propose "regClustMD," a novel model-based clustering method that can address sparse dependence among variables. We consider a sparse latent Gaussian mixture model, assuming that the precision matrix between variables is sparse, with few nonzero elements. We propose maximizing a penalized complete log-likelihood using the Monte Carlo expectation-maximization (MCEM) algorithm. Our numerical experiments and real data analyses demonstrated that our method outperformed a counterpart algorithm in both accuracy and failure rate under the correlated data structure.
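The quadratic growth mentioned in this abstract can be made concrete with a small parameter count. The sketch below is an illustration only: it counts free parameters for a plain Gaussian mixture with unconstrained covariance matrices, which is an assumption made here for simplicity. The latent model for mixed data described in the abstract may count parameters differently, and the function names are hypothetical.

```python
def covariance_params(p: int) -> int:
    """Free parameters in one unconstrained symmetric p x p covariance matrix."""
    return p * (p + 1) // 2


def full_gmm_params(p: int, g: int) -> int:
    """Free parameters in a g-component Gaussian mixture on p variables,
    assuming every component has its own unconstrained covariance matrix."""
    weights = g - 1                      # mixing proportions sum to one
    means = g * p                        # one mean vector per component
    covariances = g * covariance_params(p)
    return weights + means + covariances


# The covariance term dominates as the number of variables grows.
for p in (5, 10, 50):
    print(p, covariance_params(p), full_gmm_params(p, g=3))
```

With 10 variables and 3 groups, 165 of the 197 free parameters belong to the covariance matrices, which is the burden a sparse precision matrix aims to reduce.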
Improved evaluation of anthropogenic contamination is required to sustainably manage groundwater resources. In this study, we investigated the hydrochemical measurements of 18 parameters from a total of 102 bedrock groundwater samples from two representative rural areas in South Korea. We used model-based clustering with a normal (Gaussian) mixture model to differentiate the contributions of natural versus anthropogenic processes to the observed groundwater quality. Water samples varied in hydrochemistry from a Ca-Na-HCO3 type to a Ca-HCO3-Cl type. The former type reflected derivation of major ions largely from water-rock interactions, while the latter type recorded varying degrees of anthropogenic contamination. Among the major dissolved ions, fluoride and nitrate were shown to be good indicators of the two types, respectively. The results of model-based clustering showed that the bivariate normal mixture model, which was based on the covariance of nitrate and fluoride, was more robust than multivariate analysis, and provided better discrimination between the anthropogenic and natural groundwater groups. Using model-based clustering to measure the degree of cluster membership for each sample also showed a gradual change in groundwater chemistry due to mixing between the two water groups. This study provided an example of the successful application of model-based clustering to evaluate regional groundwater quality and demonstrated that better selection of the dimensional structure (i.e., selection of optimal variables and number of clusters) based on hydrochemistry was crucial in obtaining reasonable clustering results. (C) 2014 Elsevier B.V. All rights reserved.
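The "degree of cluster membership" in this abstract corresponds to the posterior component probabilities of a fitted mixture. As a rough sketch of the idea, the code below fits a two-component univariate Gaussian mixture by EM and reports soft memberships. It is a simplification under stated assumptions: the study used a bivariate model on nitrate and fluoride, whereas this example is univariate, the numbers are invented toy values rather than data from the study, and all function names are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def memberships(data, weights, means, variances):
    """Posterior probability of each component for each observation."""
    result = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        result.append([j / total for j in joint])
    return result


def fit_two_component_mixture(data, n_iter=50):
    """Plain EM for a two-component univariate Gaussian mixture."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    # Crude initialisation (an assumption): extremes as starting means.
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    for _ in range(n_iter):
        resp = memberships(data, weights, means, variances)  # E-step
        for k in range(2):                                   # M-step
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2
                        for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against collapse
    return weights, means, variances


# Invented toy values: two groups plus one intermediate ("mixed") sample.
toy = [0.9, 1.0, 1.1, 1.2, 3.0, 4.8, 5.0, 5.1, 5.3]
weights, means, variances = fit_two_component_mixture(toy)
soft = memberships(toy, weights, means, variances)
for x, row in zip(toy, soft):
    print(x, [round(p, 3) for p in row])
```

Each row of `soft` sums to one, so a sample lying between the two groups can be described by its share of each group rather than by a hard label.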
The use of the multivariate contaminated normal (MCN) distribution in model-based clustering is recommended to cluster data characterized by mild outliers; the model can at the same time detect outliers automatically and produce robust parameter estimates in each cluster. However, one of the limitations of this approach is that it requires complete data, i.e. the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e. data sets with some values missing at random. Parameter estimation is obtained using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization steps are replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model to mixtures of multivariate normal and Student's t distributions for incomplete data. The simulation also includes a study on the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI machine learning repository). The results show that, while the Student's t distribution gives similar classification performance, the MCN works better in detecting outliers, with a lower false positive rate of outlier detection. The performance of all the techniques decreases linearly as the percentage of missing values increases.
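To illustrate how a contaminated normal distribution can flag outliers automatically, the sketch below uses the usual two-part form: a "good" normal component with weight alpha and a "bad" component with the same mean and a variance inflated by a factor eta > 1. It is a univariate, complete-data simplification and not the incomplete-data method of the paper. The parameter values, the function names, and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def contaminated_normal_pdf(x, mu, var, alpha, eta):
    """alpha * N(mu, var) + (1 - alpha) * N(mu, eta * var), with eta > 1."""
    good = alpha * normal_pdf(x, mu, var)
    bad = (1 - alpha) * normal_pdf(x, mu, eta * var)
    return good + bad


def prob_good(x, mu, var, alpha, eta):
    """Posterior probability that x came from the non-inflated component."""
    good = alpha * normal_pdf(x, mu, var)
    return good / contaminated_normal_pdf(x, mu, var, alpha, eta)


def is_outlier(x, mu, var, alpha, eta, threshold=0.5):
    """Flag x when the 'good' component is the less likely explanation."""
    return prob_good(x, mu, var, alpha, eta) < threshold


# Illustrative parameters only (assumed, not estimated from any data set).
params = dict(mu=0.0, var=1.0, alpha=0.9, eta=10.0)
for x in (0.0, 2.0, 6.0):
    print(x, round(prob_good(x, **params), 4), is_outlier(x, **params))
```

Because the flag comes from a posterior probability inside the model, no separate outlier-detection step is needed once the parameters are estimated.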
The first model-based clustering algorithm for multivariate functional data is proposed. After introducing multivariate functional principal components analysis (MFPCA), a parametric mixture model, based on the assumption of normality of the principal component scores, is defined and estimated by an EM-like algorithm. The main advantage of the proposed model is its ability to take into account the dependence among curves. Results on simulated and real datasets show the efficiency of the proposed method. (C) 2012 Elsevier B.V. All rights reserved.
This work develops a general procedure for clustering functional data which adapts the high dimensional data clustering (HDDC) method, originally proposed in the multivariate context. The resulting clustering method, called funHDDC, is based on a functional latent mixture model which fits the functional data in group-specific functional subspaces. By constraining model parameters within and between groups, a family of parsimonious models is obtained which allows the method to fit a variety of situations. An estimation procedure based on the EM algorithm is proposed for determining both the model parameters and the group-specific functional subspaces. Experiments on real-world datasets show that the proposed approach performs as well as or better than classical two-step clustering methods while providing useful interpretations of the groups and avoiding the difficult choice of the discretization technique. In particular, funHDDC appears to always outperform HDDC applied on spline coefficients.