Electronic mail is a primary vehicle for many cyber scams, so identifying the author of an email is essential; it forms significant documentary evidence in digital forensics. This paper presents a model for email author identification (attribution) using deep neural networks and model-based clustering techniques. Stylometric features have gained considerable importance in authorship identification because they improve the accuracy of the attribution task. The experiments were performed on the publicly available Enron benchmark dataset with varying numbers of authors. The proposed deep neural network achieves an accuracy of 94% on five authors, 90% on ten authors, 86% on 25 authors and 75% on the entire dataset, a strong result on highly imbalanced data. The second, cluster-based technique yielded 86% accuracy on the entire dataset, with the number of authors determined by their contribution to the aggregate data.
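This is not the paper's architecture, but a minimal Python sketch of the same idea: character-level features standing in for stylometry, fed to a small feed-forward network. The toy emails and author labels are invented for illustration.

    # Minimal sketch: character n-gram TF-IDF as a rough proxy for stylometric
    # features, classified by a small feed-forward neural network.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy corpus of (email body, author) pairs.
    emails = ["Please see the attached report.", "hey, r u coming tonight??",
              "Kind regards, and thank you for your patience.", "lol ok see ya"]
    authors = ["alice", "bob", "alice", "bob"]

    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # style-like cues
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
    )
    model.fit(emails, authors)
    print(model.predict(["thanks, see attached draft."]))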
An evolutionary algorithm (EA) is developed as an alternative to the EM algorithm for parameter estimation in model-based clustering. This EA facilitates a different search of the fitness landscape, i.e., the likelihood surface, utilizing both crossover and mutation. Furthermore, this EA represents an efficient approach to "hard" model-based clustering and so it can be viewed as a sort of generalization of the k-means algorithm, which is itself equivalent to a restricted Gaussian mixture model. The EA is illustrated on several datasets, and its performance is compared with that of other hard clustering approaches and model-based clustering via the EM algorithm.
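A rough sketch of the idea, not the authors' algorithm: an evolutionary search over hard cluster labels whose fitness is the complete-data log-likelihood of a spherical, unit-variance Gaussian mixture, so the search effectively generalizes k-means. The population size, mutation rate and two-component toy data are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    n, K = len(X), 2

    def fitness(labels):
        # Complete-data log-likelihood of a spherical, unit-variance mixture
        # (up to constants); empty clusters are heavily penalized.
        ll = 0.0
        for k in range(K):
            pts = X[labels == k]
            if len(pts) == 0:
                return -np.inf
            ll += -0.5 * np.sum((pts - pts.mean(axis=0)) ** 2)
        return ll

    pop = [rng.integers(0, K, n) for _ in range(20)]
    for gen in range(100):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:10]
        children = []
        for _ in range(10):
            a, b = rng.choice(10, 2, replace=False)
            cut = rng.integers(1, n)                    # one-point crossover
            child = np.concatenate([parents[a][:cut], parents[b][cut:]])
            flip = rng.random(n) < 0.01                 # mutation
            child[flip] = rng.integers(0, K, flip.sum())
            children.append(child)
        pop = parents + children

    best = max(pop, key=fitness)
    print("best complete-data log-likelihood:", round(fitness(best), 2))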
Functional magnetic resonance imaging (fMRI) data have become increasingly available and are useful for describing functional connectivity (FC), the relatedness of neuronal activity in regions of the brain. This FC of the brain provides insight into certain neurodegenerative diseases and psychiatric disorders, and thus is of clinical importance. To help inform physicians regarding patient diagnoses, unsupervised clustering of subjects based on FC is desired, allowing the data to inform us of groupings of patients based on shared features of connectivity. Since heterogeneity in FC is present even between patients within the same group, it is important to allow subject-level differences in connectivity, while still pooling information across patients within each group to describe group-level FC. To this end, we propose a random covariance clustering model (RCCM) to concurrently cluster subjects based on their FC networks, estimate each subject's unique FC network, and infer shared network features. Although current methods exist for estimating FC or clustering subjects using fMRI data, our novel contribution is to cluster or group subjects based on similar FC of the brain while simultaneously providing group- and subject-level FC network estimates. The competitive performance of RCCM relative to other methods is demonstrated through simulations in various settings, achieving both improved clustering of subjects and estimation of FC networks. Utility of the proposed method is demonstrated with application to a resting-state fMRI data set collected on 43 healthy controls and 61 participants diagnosed with schizophrenia.
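The RCCM itself estimates subject- and group-level networks jointly; the Python below is only a crude two-step baseline on simulated time series: a sparse precision matrix per subject via the graphical lasso, followed by clustering of the vectorized connectivity. The penalty value and data dimensions are arbitrary.

    import numpy as np
    from sklearn.covariance import GraphicalLasso
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    n_subjects, n_time, n_regions = 12, 200, 5
    subjects = [rng.normal(size=(n_time, n_regions)) for _ in range(n_subjects)]

    features = []
    for ts in subjects:
        gl = GraphicalLasso(alpha=0.1).fit(ts)      # sparse precision estimate
        iu = np.triu_indices(n_regions, k=1)
        features.append(gl.precision_[iu])          # off-diagonal entries only

    groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    print("subject group labels:", groups)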
Authors: Dias, Jose G.; Vermunt, Jeroen K.
Affiliations: ISCTE, Department of Quantitative Methods, Higher Institute of Social Sciences & Business Studies, P-1649026 Lisbon, Portugal; ISCTE, UNIDE, P-1649026 Lisbon, Portugal; Tilburg University, Department of Methodology & Statistics, NL-5000 LE Tilburg, Netherlands
In model-based clustering, a situation in which true class labels are unknown and that is therefore also referred to as unsupervised learning, observations are typically classified by the Bayes modal rule. In this study, we assess whether alternative classifiers from the classification (supervised-learning) literature, developed for situations in which class labels are known, can improve the Bayes rule. More specifically, we investigate the performance of bootstrap-based aggregate (bagging) rules after adapting these to the model-based clustering context. It is argued that specific issues, such as the label-switching problem, have to be carefully addressed when using bootstrap methods in model-based clustering. Our two Monte Carlo studies show that classification based on the Bayes rule is rather stable and difficult to improve by bootstrap-based aggregate rules, even for sparse data. An empirical example illustrates the various approaches described in this paper.
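A sketch of what a bagged Bayes modal rule might look like for a Gaussian mixture (not necessarily the rules studied in the paper): each bootstrap fit is label-aligned to a reference fit via maximum-agreement matching, which addresses the label-switching issue, before majority voting.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    K, B = 2, 25

    ref = GaussianMixture(K, random_state=0).fit(X)
    ref_labels = ref.predict(X)                       # single-fit Bayes modal rule

    votes = np.zeros((len(X), K))
    for b in range(B):
        idx = rng.integers(0, len(X), len(X))         # bootstrap resample
        lab = GaussianMixture(K, random_state=b).fit(X[idx]).predict(X)
        # Align bootstrap labels to the reference (label-switching fix).
        cost = -np.array([[np.sum((lab == i) & (ref_labels == j))
                           for j in range(K)] for i in range(K)])
        _, perm = linear_sum_assignment(cost)
        votes[np.arange(len(X)), perm[lab]] += 1

    bagged = votes.argmax(axis=1)
    print("agreement with single-fit Bayes rule:", np.mean(bagged == ref_labels))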
Parsimonious finite mixture models often require the a priori selection of desired model dimensionality. For example, projection-based parsimonious models demand the dimension of the subspace for projection. Other models ask for their own structural restrictions on parameters. The subspace clustering framework is a projection-based parsimonious model for various finite mixtures, including the Gaussian variant. The existing dimension selection methods for subspace clustering are ad hoc or potentially computationally prohibitive, creating a need for a principled, yet computationally lightweight, approach. In light of this problem, a hypothesis test-based intrinsic dimension estimation method called the Anderson Relaxation Test (ART) is introduced, and its performance is examined in both simulated and real data settings.
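The snippet below is not the ART; it is just a common eigenvalue-based heuristic for choosing a projection dimension, included to make the dimension-selection problem concrete. The simulated data, noise level and 95% threshold are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    latent = rng.normal(size=(500, 3))                 # true 3-dimensional signal
    X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    q = int(np.searchsorted(explained, 0.95) + 1)      # smallest q explaining 95%
    print("selected subspace dimension:", q)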
This paper considers the problem of forecasting high-dimensional time series. It employs a robust clustering approach to perform classification of the component series. Each series within a cluster is assumed to follow the same model and the data are then pooled for estimation. The classification is model-based and robust to outlier contamination. The robustness is achieved by using the intrinsic mode functions of the Hilbert-Huang transform at lower frequencies. These functions are found to be robust to outlier contamination. The paper also compares out-of-sample forecast performance of the proposed method with several methods available in the literature. The other forecasting methods considered include vector autoregressive models with/without LASSO, group LASSO, principal component regression, and partial least squares. The proposed method is found to perform well in out-of-sample forecasting of the monthly unemployment rates of 50 US states. Copyright (c) 2013 John Wiley & Sons, Ltd.
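As a hedged sketch only (it does not use the Hilbert-Huang transform), the Python below clusters simulated series on moving-average summaries, a crude stand-in for the lower-frequency intrinsic mode functions, and then pools lag-1 pairs within each cluster to fit a shared AR(1) coefficient.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    T, n_series = 120, 20
    trend = np.sin(np.linspace(0, 4 * np.pi, T))
    series = np.array([(i % 2) * trend + 0.3 * rng.normal(size=T)
                       for i in range(n_series)])

    kernel = np.ones(12) / 12                           # 12-month moving average
    smooth = np.array([np.convolve(s, kernel, mode="valid") for s in series])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(smooth)

    for k in range(2):
        grp = series[labels == k]
        x, y = grp[:, :-1].ravel(), grp[:, 1:].ravel()  # pooled lag-1 pairs
        phi = np.dot(x, y) / np.dot(x, x)               # pooled AR(1) coefficient
        print(f"cluster {k}: {len(grp)} series, pooled AR(1) coefficient {phi:.2f}")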
We report the use of a cluster analysis method based on a multivariate mixture model, known as model-based clustering, for overcoming the limitations of hierarchical clustering and relocation clustering. Unlike traditional clustering methods in which clusters are formed on the basis of intercluster distances, model-based clustering classifies observations on the basis of probability estimated from Gaussian mixture modeling, and its statistical basis allows for inference. Three examples are given in which we demonstrate that model-based clustering gives much better performance for overlapping clusters, a more reliable determination of the number of clusters in data, and better identification of clustering in the presence of outliers than agglomerative hierarchical clustering or iterative relocation clustering using a K-means criterion. We also show that Markov chain Monte Carlo simulation, as implemented via Gibbs sampling coupled with model-based clustering, may be used to assess uncertainty of group memberships. Copyright (c) 2013 John Wiley & Sons, Ltd.
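A minimal Python illustration of the same points, not the original analysis: a Gaussian mixture fits overlapping clusters, BIC suggests the number of components, and posterior membership probabilities give a simple (non-MCMC) measure of assignment uncertainty. The simulated data are arbitrary.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal([0, 0], [1.0, 2.5], (150, 2)),
                   rng.normal([2, 1], [0.5, 0.5], (150, 2))])   # overlapping groups

    fits = {k: GaussianMixture(k, random_state=0).fit(X) for k in range(1, 5)}
    best_k = min(fits, key=lambda k: fits[k].bic(X))
    print("BIC-selected number of clusters:", best_k)

    post = fits[best_k].predict_proba(X)                # posterior memberships
    uncertainty = 1 - post.max(axis=1)                  # 0 means fully confident
    print("largest membership uncertainty:", round(uncertainty.max(), 3))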
Finite Gaussian mixture models provide a powerful and widely employed probabilistic approach for clustering multivariate continuous data. However, the practical usefulness of these models is jeopardized in high-dimensional spaces, where they tend to be over-parameterized. As a consequence, different solutions have been proposed, often relying on matrix decompositions or variable selection strategies. Recently, a methodological link between Gaussian graphical models and finite mixtures has been established, paving the way for penalized model-based clustering in the presence of large precision matrices. Notwithstanding, current methodologies implicitly assume similar levels of sparsity across the classes, not accounting for different degrees of association between the variables across groups. We overcome this limitation by deriving group-wise penalty factors, which automatically enforce under- or over-connectivity in the estimated graphs. The approach is entirely data-driven and does not require additional hyperparameter specification. Analyses on synthetic and real data showcase the validity of our proposal.
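The paper derives its group-wise penalty factors inside the mixture fit; the Python below is only a two-step caricature with hand-picked per-group penalties, meant to show what different sparsity levels across groups look like in practice.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 1, (150, 6)), rng.normal(4, 1, (150, 6))])

    labels = GaussianMixture(2, random_state=0).fit_predict(X)
    alphas = {0: 0.05, 1: 0.3}                   # placeholder per-group penalties
    for k, alpha in alphas.items():
        prec = GraphicalLasso(alpha=alpha).fit(X[labels == k]).precision_
        n_edges = int((np.abs(prec[np.triu_indices(6, k=1)]) > 1e-6).sum())
        print(f"group {k}: penalty {alpha}, {n_edges} nonzero off-diagonal pairs")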
Variable selection in cluster analysis is important yet challenging. It can be achieved by regularization methods, which realize a trade-off between clustering accuracy and the number of selected variables by using a lasso-type penalty. However, the calibration of the penalty term is open to criticism. Model selection methods are an efficient alternative, yet they require a difficult optimization of an information criterion that involves combinatorial problems. First, most of these optimization algorithms are based on a suboptimal procedure (e.g. a stepwise method). Second, the algorithms are often computationally expensive because they need multiple calls of EM algorithms. Here we propose to use a new information criterion based on the integrated complete-data likelihood. It does not require the maximum likelihood estimate, and its maximization is simple and computationally efficient. The original contribution of our approach is to perform the model selection without requiring any parameter estimation; parameter inference is then needed only for the single selected model. This approach is used for variable selection in a Gaussian mixture model with conditional independence assumed. Numerical experiments on simulated and benchmark datasets show that the proposed method often outperforms two classical approaches for variable selection. The proposed approach is implemented in the R package VarSelLCM, available on CRAN.
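For contrast with the estimation-free criterion above, here is a brute-force greedy selection sketch in Python (it refits by EM at every step, exactly what the paper avoids): a candidate variable is kept as a clustering variable only if the BIC of a diagonal-covariance Gaussian mixture including it beats modelling it as an independent single Gaussian. The simulated data with two informative and four noise variables are invented for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(7)
    informative = np.vstack([rng.normal(-2, 1, (150, 2)), rng.normal(2, 1, (150, 2))])
    X = np.hstack([informative, rng.normal(size=(300, 4))])   # last 4 are noise

    def gmm_bic(cols, k):
        data = X[:, cols]
        gm = GaussianMixture(k, covariance_type="diag", random_state=0)
        return gm.fit(data).bic(data)

    selected, remaining, improved = [], list(range(X.shape[1])), True
    while improved and remaining:
        improved = False
        for j in list(remaining):
            with_j = gmm_bic(selected + [j], 2)
            # Alternative: j is irrelevant, i.e. an independent single Gaussian.
            without_j = gmm_bic([j], 1) + (gmm_bic(selected, 2) if selected else 0.0)
            if with_j < without_j:
                selected.append(j)
                remaining.remove(j)
                improved = True
    print("selected clustering variables:", sorted(selected))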
The expectation-maximization (EM) algorithm is a commonly used method for finding the maximum likelihood estimates of the parameters in a mixture model via coordinate ascent. A serious pitfall with the algorithm is that in the case of multimodal likelihood functions, it can get trapped at a local maximum. This problem often occurs when sub-optimal starting values are used to initialize the algorithm. Bayesian initialization averaging (BIA) is proposed as an ensemble method to generate high quality starting values for the EM algorithm. Competing sets of trial starting values are combined as a weighted average, which is then used as the starting position for a full EM run. The method can also be extended to variational Bayes methods, a class of algorithms similar to EM that is based on an approximation of the model posterior. The BIA method is demonstrated on real continuous, categorical and network data sets, and the convergent log-likelihoods and associated clustering solutions are presented. These compare favorably with the output produced using competing initialization methods such as random starts, hierarchical clustering and deterministic annealing, with the highest available maximum likelihood estimates obtained in a higher percentage of cases, at reasonable computational cost. For the stochastic block model for network data, promising results are demonstrated even when the likelihood is unavailable. The implications of the different clustering solutions obtained from local maxima are also discussed.
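A rough Python sketch of likelihood-weighted averaging of trial starting values before a full EM run, in the spirit of BIA but not the paper's exact procedure: this toy version aligns components by sorting on the first coordinate, a much cruder treatment of label switching than the original method, and the data and number of trials are arbitrary.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
    K, n_trials = 2, 10

    trial_means, trial_ll = [], []
    for t in range(n_trials):
        gm = GaussianMixture(K, max_iter=5, init_params="random",
                             random_state=t).fit(X)              # short trial EM run
        trial_means.append(gm.means_[np.argsort(gm.means_[:, 0])])  # crude alignment
        trial_ll.append(gm.score(X))                              # mean log-likelihood

    w = np.exp(np.array(trial_ll) - max(trial_ll))                # likelihood weights
    w /= w.sum()
    start_means = np.tensordot(w, np.array(trial_means), axes=1)  # averaged start

    final = GaussianMixture(K, means_init=start_means, random_state=0).fit(X)
    print("final total log-likelihood:", round(final.score(X) * len(X), 1))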