ISBN: (print) 9783319261232; 9783319261225
Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this paper we announce the BigARTM open-source project (http://***) for regularized multimodal topic modeling of large collections. Several experiments on the Wikipedia corpus show that BigARTM runs faster and gives better perplexity than other popular packages such as Vowpal Wabbit and Gensim. We also demonstrate several unique BigARTM features, such as the additive combination of regularizers, topic sparsing and decorrelation, and multimodal and multi-language modeling, which are not available in other software packages for topic modeling.
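The additive-regularization mechanism behind these features admits a compact illustration. Below is a minimal NumPy sketch (not the BigARTM API; all names and sizes are illustrative) of a PLSA-style EM whose M-step adds an offset tau to the topic-word counts before renormalizing; a negative tau sparses the topic-word matrix, which is the regularized update scheme of ARTM-style models. With tau = 0 it reduces to plain PLSA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(50, 200)).astype(float)  # doc-word counts
D, W, T = n_dw.shape[0], n_dw.shape[1], 10

phi = rng.dirichlet(np.ones(W), size=T)      # p(w|t), shape (T, W)
theta = rng.dirichlet(np.ones(T), size=D)    # p(t|d), shape (D, T)
tau = -0.1                                   # negative tau sparses phi

for _ in range(30):
    # E-step: p(t|d,w) proportional to phi_wt * theta_td
    p = theta[:, :, None] * phi[None, :, :]           # (D, T, W)
    p /= p.sum(axis=1, keepdims=True) + 1e-12
    n_tdw = n_dw[:, None, :] * p                      # expected counts
    n_wt = n_tdw.sum(axis=0)                          # (T, W)
    n_td = n_tdw.sum(axis=2)                          # (D, T)
    # M-step with an additive smoothing/sparsing regularizer on phi:
    # phi_wt proportional to max(n_wt + tau, 0), renormalized per topic
    phi = np.maximum(n_wt + tau, 0.0)
    phi /= phi.sum(axis=1, keepdims=True) + 1e-12
    theta = n_td / (n_td.sum(axis=1, keepdims=True) + 1e-12)
```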
The aim of this work is to present an inference and diagnostic study of an extension of the lifetime distribution family proposed by Birnbaum and Saunders (1969a, b). This extension is obtained by considering a skew-elliptical distribution instead of the normal distribution. Specifically, in this work we develop a Birnbaum-Saunders (BS) distribution type based on scale mixtures of skew-normal distributions (SMSN). The resulting family of lifetime distributions represents a robust extension of the usual BS distribution. Based on this family, we reproduce the usual properties of the BS distribution and present an estimation method based on the EM algorithm. In addition, we present regression models associated with the BS distributions (based on scale mixtures of skew-normals), which are developed as an extension of the sinh-normal distribution (Rieck and Nedelman, 1991). For this model we consider an estimation and diagnostic study for uncensored data.
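For orientation, here is a minimal sketch of the classical (normal-based) BS distribution that this family generalizes; the SMSN extension replaces the standard normal Z below by a scale mixture of skew-normals. Parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def bs_rvs(alpha, beta, size, rng=np.random.default_rng(0)):
    """Draw BS(alpha, beta) lifetimes via the stochastic representation
    T = beta * (alpha*Z/2 + sqrt((alpha*Z/2)**2 + 1))**2, Z ~ N(0,1)."""
    z = rng.standard_normal(size)
    w = alpha * z / 2.0
    return beta * (w + np.sqrt(w**2 + 1.0))**2

def bs_pdf(t, alpha, beta):
    """Density from differentiating F(t) = Phi(a_t), where
    a_t = (sqrt(t/beta) - sqrt(beta/t)) / alpha."""
    a_t = (np.sqrt(t / beta) - np.sqrt(beta / t)) / alpha
    da_dt = (1.0 / np.sqrt(t * beta) + np.sqrt(beta) / t**1.5) / (2.0 * alpha)
    return norm.pdf(a_t) * da_dt

t = bs_rvs(alpha=0.5, beta=2.0, size=5)
print(t, bs_pdf(t, 0.5, 2.0))
```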
The objectives of this dissertation are: to introduce a modified EM algorithm for finding estimates of the parameters of the failure time distribution based on time-censored Wiener degradation data (Part I); and to design an optimal degradation test that minimizes the asymptotic variance of the percentile estimates of the failure time distribution, subject to a budget constraint (Part II). Being the solution to the stochastic linear growth model, the Wiener process has recently been used to model the degradation (or cumulative decay) of certain characteristics of test units in lifetime data analyses. When the failure threshold is constant or linear in time, the failure time, defined as the first-passage time of the Wiener process over the failure threshold, follows an inverse Gaussian (IG) distribution. In this thesis we consider a time-censored degradation test where, in addition to the failure times of the failed units, the degradation values of the censored units at the censoring times are assumed to be available. For Part I, based on these degradation values, we use a modified EM algorithm to predict the failure times of the censored units. The resulting estimator of the mean failure time is shown to be consistent, and it also maximizes the (modified) likelihood function of the available failure times and degradation values. For the scale parameter of the IG distribution, the algorithm produces an inconsistent estimator, for which we introduce two modified estimators to reduce the bias. Analytical as well as numerical comparisons show that our proposed estimators perform well compared to the traditional MLEs and the modified MLEs, for both IG ***. For Part II, we derive Fisher's information and the asymptotic variance of the sample percentile for the time-censored case as the objective function, with the expected total cost as the constraint, for finding the optimal accelerated path. We also give some necessary…
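The degradation setting can be illustrated with a short simulation sketch (all parameter values illustrative): a Wiener path with drift nu and volatility sigma crosses a fixed threshold a, and the resulting first-passage times are inverse Gaussian with mean a/nu and shape a**2/sigma**2.

```python
import numpy as np

rng = np.random.default_rng(1)
nu, sigma, a, dt = 0.5, 0.3, 10.0, 0.01

def first_passage_time(max_t=100.0):
    """Simulate one degradation path; return NaN if censored at max_t."""
    x, t = 0.0, 0.0
    while t < max_t:
        x += nu * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x >= a:
            return t
    return np.nan

times = np.array([first_passage_time() for _ in range(500)])
observed = times[~np.isnan(times)]
print("simulated mean:", observed.mean(), " IG mean a/nu:", a / nu)
```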
This paper presents a model for the interpretation of results of STR typing of DNA mixtures based on a multivariate normal distribution of peak areas. From previous analyses of controlled experiments with mixed DNA samples, we exploit the linear relationship between peak heights and peak areas, and the linear relations of the means and variances of the measurements. Furthermore, the contribution from one individual's allele to the mean area of this allele is assumed proportional to the average of height measurements on alleles where the individual is the only contributor. For shared alleles in mixed DNA samples, only the cumulative peak heights and areas can be observed. Complying with this latent structure, we use the EM algorithm to impute the missing variables based on a compound symmetry model; that is, the measurements are subject to intra- and inter-locus correlations that do not depend on the actual alleles of the DNA profiles. Due to the factorization of the likelihood, properties of the normal distribution, and the use of auxiliary variables, an ordinary implementation of the EM algorithm solves the missing-data problem. We estimate the parameters in the model based on a training data set. In order to assess the weight of evidence provided by the model, we use the model with the estimated parameters on STR data from real crime cases with DNA mixtures.
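The E-step idea for shared alleles admits a toy illustration. Assuming a bivariate normal model in which only the sum S = X1 + X2 of two contributions is observed (all values below are illustrative, not the paper's fitted parameters), the missing values are imputed by the conditional expectation E[Xi | S] = mu_i + Cov(Xi, S)/Var(S) * (S - mu_1 - mu_2).

```python
import numpy as np

mu = np.array([100.0, 60.0])           # mean areas of the two contributors
var = np.array([15.0**2, 10.0**2])
rho = 0.4                              # compound-symmetry correlation
cov12 = rho * np.sqrt(var[0] * var[1])
var_s = var.sum() + 2.0 * cov12        # Var(X1 + X2)

def impute(s):
    """E-step imputation of the two latent areas given their observed sum."""
    resid = s - mu.sum()
    x1 = mu[0] + (var[0] + cov12) / var_s * resid   # Cov(X1, S) = var1 + cov12
    x2 = mu[1] + (var[1] + cov12) / var_s * resid
    return x1, x2

print(impute(175.0))   # imputed contributor areas for a shared peak of 175
```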
ISBN: (print) 0819457213
Landmarking MR images is crucial in registering brain structures from different images. It consists of locating the voxel in the image that corresponds to a well-defined point in the anatomy, called the landmark. Examples of landmarks are the apex of the head of the hippocampus (HoH) and the tail and the tip of the splenium of the corpus callosum (SCC). Hand landmarking is tedious and time-consuming; it requires adequate training, and experimental studies show that the results depend on the landmarker and drift with time. We propose a generic algorithm performing automated detection of landmarks. The first part learns the parameters of a probabilistic model from a training set of landmarked images, using the EM algorithm. The second part takes the estimated parameters and a new image as input and outputs a voxel as the predicted location of the landmark. The algorithm is demonstrated on the HoH and the SCC. In contrast with competing approaches, the algorithm is generic: it can be used to detect any landmark, given a hand-labeled training set of images.
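A deliberately simplified sketch of the two-stage scheme follows (the paper's probabilistic model is richer and fitted with EM; all data and names below are illustrative stand-ins): stage 1 fits a Gaussian to hand-placed landmark coordinates, stage 2 scores candidate voxels of a new image by Mahalanobis distance and returns the best one.

```python
import numpy as np

train = np.array([[31.0, 42.0, 18.0],   # hand-placed landmark voxels
                  [30.0, 44.0, 17.0],
                  [32.0, 41.0, 19.0],
                  [31.0, 43.0, 18.0]])
mu = train.mean(axis=0)
cov = np.cov(train.T) + 1e-3 * np.eye(3)   # regularize the covariance
prec = np.linalg.inv(cov)

def predict(candidates):
    """Return the candidate voxel with the smallest Mahalanobis distance."""
    d = candidates - mu
    scores = np.einsum("ij,jk,ik->i", d, prec, d)
    return candidates[np.argmin(scores)]

candidates = np.array([[29.0, 45.0, 16.0],
                       [31.5, 42.5, 18.5],
                       [35.0, 40.0, 20.0]])
print(predict(candidates))
```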
In this paper, the estimation of parameters based on a progressively type-I interval-censored sample from a Pareto distribution is studied. Different methods of estimation are discussed, including the mid-point approximation estimator, the maximum likelihood estimator, and the moment estimator. The estimation procedures are discussed in detail and compared via Monte Carlo simulations in terms of their biases.
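A minimal sketch of the mid-point approximation estimator under illustrative inspection data (the inputs below are assumptions, not the paper's design): failures recorded in an inspection interval are placed at its mid-point, units withdrawn at an inspection time are right-censored there, and for a Pareto with known scale the shape estimate then has the closed form (#failures) / sum of log(x_i / scale), where censored units contribute log(t_j / scale) through the survival term.

```python
import numpy as np

scale = 1.0
edges = np.array([1.0, 1.5, 2.0, 3.0, 5.0])   # inspection times, t_0 = scale
fails = np.array([12, 9, 7, 4])                # failures per interval
withdrawn = np.array([2, 3, 1, 2])             # removed at t_1..t_4

mids = 0.5 * (edges[:-1] + edges[1:])          # mid-point pseudo-failure times
log_terms = (fails * np.log(mids / scale)).sum() \
          + (withdrawn * np.log(edges[1:] / scale)).sum()
alpha_hat = fails.sum() / log_terms
print("mid-point estimate of alpha:", alpha_hat)
```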
Fractionally-supervised classification (FSC) is a recently proposed classification method that combines the finite mixture model (FMM), weighted likelihood, and the Expectation-Maximization (EM) algorithm to adjust the weight of labeled (unlabeled) data in the training process of a classifier and obtain the best classification result. All results in the literature pertinent to FSC are based on simple random sampling (SRS). In this thesis, we extend the FSC approach to a rank-based sampling design called nominated sampling (NS), which collects more representative data than SRS from the tails of the underlying population. We show that the usual EM algorithm for finite mixture modeling with nominated samples leads to incorrect maximization problems. We therefore propose a set of proper latent variables and modify the usual EM algorithm for the FSC approach based on maxima (minima) nominated samples, and we evaluate the estimation and classification results. We compare the mean squared error (MSE) of estimates obtained by FSC with the two EM algorithms and observe that the EM algorithm with the proper latent variables has higher relative efficiency when applied to NS samples. Moreover, we compute the adjusted Rand index (ARI) to assess the classification performance for different weights of the unlabeled data and determine the best choice of weight for the purpose of FSC.
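The weighted-likelihood EM at the core of FSC can be sketched for a univariate two-component Gaussian mixture under SRS (so this omits the nominated-sampling modification that is the thesis's contribution; data and weights are illustrative). Labeled points keep one-hot responsibilities, and a weight w in [0, 1] scales the influence of the unlabeled points, which is the FSC tuning knob.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x_lab = np.r_[rng.normal(0, 1, 20), rng.normal(4, 1, 20)]
y_lab = np.r_[np.zeros(20, int), np.ones(20, int)]
x_unl = np.r_[rng.normal(0, 1, 100), rng.normal(4, 1, 100)]
w = 0.5                                    # weight on the unlabeled part

mu, sd, pi = np.array([-1.0, 5.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(50):
    # E-step: one-hot responsibilities for labeled, posterior for unlabeled
    r_lab = np.eye(2)[y_lab]
    dens = pi * norm.pdf(x_unl[:, None], mu, sd)
    r_unl = dens / dens.sum(axis=1, keepdims=True)
    # M-step: sufficient statistics weighted by (1 - w) and w
    R = np.vstack([(1 - w) * r_lab, w * r_unl])
    X = np.r_[x_lab, x_unl]
    nk = R.sum(axis=0)
    mu = (R * X[:, None]).sum(axis=0) / nk
    sd = np.sqrt((R * (X[:, None] - mu)**2).sum(axis=0) / nk)
    pi = nk / nk.sum()
print(mu, sd, pi)
```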
In this work we deal with cluster analysis for functional data. Functional data consist of a set of subjects characterized by repeated measurements of a variable. Based on these measurements we want to split the subjects into groups (clusters): the subjects within a single cluster should be similar and should differ from the subjects in the other clusters. The first approach we use reduces the data dimension and then applies the K-means clustering method. The second approach uses a finite mixture of normal linear mixed models, whose parameters we estimate by maximum likelihood using the EM algorithm. Throughout the work we apply all the described procedures to real meteorological data.
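A compact sketch of the first approach on synthetic curves (the basis choice below, a degree-3 polynomial fit per subject, is an illustrative assumption): each subject's repeated measurements are reduced to a few coefficients, and K-means then clusters subjects in that coefficient space.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 24)                       # common measurement grid
curves = np.vstack([                            # two latent groups of subjects
    [np.sin(2 * np.pi * t) + rng.normal(0, .2, t.size) for _ in range(15)],
    [0.5 + 1.5 * t + rng.normal(0, .2, t.size) for _ in range(15)],
])

# dimension reduction: degree-3 polynomial coefficients per subject
coeffs = np.array([np.polyfit(t, y, deg=3) for y in curves])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coeffs)
print(labels)
```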
The main objective of this paper is to build stochastic models that describe the evolution in time of a system and to estimate its characteristics when direct observations of the system state are not available. One important application area arises with the deployment of sensor networks, now ubiquitous for observing and controlling industrial equipment. The model is based on hidden Markov processes in which the observation at a given time depends not only on the current hidden state but also on the previous observations. Some reliability measures are defined in this context, and a sensitivity analysis is presented in order to control for false positive (negative) signals that would lead one to believe erroneously that the system is in failure (working) when it actually is not. System maintenance aspects based on the model are considered, and the concept of signal-runs is introduced. A simulation study is carried out to evaluate the finite-sample performance of the method, and a real application related to a water-pump system monitored by a set of sensors is also discussed.
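The observation-dependent emission structure can be sketched with a scaled forward pass; the Gaussian autoregressive emission form and all parameter values below are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np
from scipy.stats import norm

A = np.array([[0.95, 0.05],     # hidden-state transition matrix
              [0.10, 0.90]])    # state 0 = "working", 1 = "failed"
pi0 = np.array([0.99, 0.01])
phi = np.array([0.6, 0.8])      # AR coefficient of the emission, per state
mu = np.array([1.0, 3.0])       # emission level, per state
sigma = 0.5

def forward_loglik(y):
    """log p(y_1..y_T) with emission y_t ~ N(mu_s + phi_s * y_{t-1}, sigma),
    i.e. the emission depends on the state AND the previous observation."""
    alpha = pi0 * norm.pdf(y[0], mu, sigma)       # no previous observation
    ll = np.log(alpha.sum()); alpha /= alpha.sum()
    for t in range(1, len(y)):
        emis = norm.pdf(y[t], mu + phi * y[t - 1], sigma)
        alpha = (alpha @ A) * emis
        ll += np.log(alpha.sum()); alpha /= alpha.sum()
    return ll

y = np.array([1.1, 1.6, 1.9, 2.1, 4.3, 6.2])
print(forward_loglik(y))
```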
Heterogeneous real datasets require complex probabilistic structures for correct modeling. Moreover, several generalizations of the Kumaraswamy distribution have been developed in the past few decades in an attempt to obtain better fits for data restricted to the interval (0,1). In this paper, we propose a mixture model of Kumaraswamy distributions (MMK) as a probabilistic structure for heterogeneous datasets with support in (0,1) and as an important generalization of the Kumaraswamy distribution. We derive the moments and the moment-generating function, and we analyze the failure rate function. We also prove the identifiability of the class of all finite mixtures of Kumaraswamy distributions. Via the EM algorithm, we find maximum likelihood estimates of the MMK parameters. Finally, we assess the performance of the estimates by Monte Carlo simulation and illustrate an application of the proposed model using a real dataset.
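A compact EM sketch for a two-component Kumaraswamy mixture (data and starting values illustrative; the M-step is a weighted numerical MLE since the Kumaraswamy likelihood has no closed-form maximum in both parameters). The density is f(x; a, b) = a*b*x**(a-1)*(1-x**a)**(b-1) on (0,1), and samples are drawn by inverting the CDF F(x) = 1 - (1 - x**a)**b.

```python
import numpy as np
from scipy.optimize import minimize

def kuma_pdf(x, a, b):
    return a * b * x**(a - 1) * (1 - x**a)**(b - 1)

def m_step(x, r, start):
    """Maximize the r-weighted log-likelihood over (log a, log b) so that
    both parameters stay positive during the numerical search."""
    def nll(p):
        a, b = np.exp(p)
        return -np.sum(r * np.log(kuma_pdf(x, a, b) + 1e-300))
    return np.exp(minimize(nll, np.log(start), method="Nelder-Mead").x)

rng = np.random.default_rng(4)
# inverse-CDF sampling: X = (1 - (1 - U)**(1/b))**(1/a)
x = np.r_[(1 - (1 - rng.uniform(size=150))**(1/5.0))**(1/2.0),
          (1 - (1 - rng.uniform(size=150))**(1/1.5))**(1/6.0)]

params = np.array([[1.5, 4.0], [5.0, 2.0]])   # (a, b) per component
pi = np.array([0.5, 0.5])
for _ in range(40):
    dens = np.array([pi[k] * kuma_pdf(x, *params[k]) for k in range(2)]).T
    r = dens / dens.sum(axis=1, keepdims=True)           # E-step
    pi = r.mean(axis=0)                                   # M-step
    params = np.array([m_step(x, r[:, k], params[k]) for k in range(2)])
print(pi, params)
```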