We consider a discrete latent variable model for two-way data arrays, which allows one to simultaneously produce clusters along one of the data dimensions (e.g.,exchangeable observational units or features) and contig...
详细信息
We consider a discrete latent variable model for two-way data arrays, which allows one to simultaneously produce clusters along one of the data dimensions (e.g.,exchangeable observational units or features) and contiguous groups, or segments, along the other (e.g.,consecutively ordered times or locations). The model relies on a hidden Markov structure but, given its complexity, cannot be estimated by full maximum likelihood. Therefore, we introduce a composite likelihood methodology based on considering different subsets of the data. The proposed approach is illustrated by simulation, and with an application to genomic data.
BackgroundWith the advances in the next-generation sequencing technologies, researchers can now rapidly examine the composition of samples from humans and their surroundings. To enhance the accuracy of taxonomy assign...
详细信息
BackgroundWith the advances in the next-generation sequencing technologies, researchers can now rapidly examine the composition of samples from humans and their surroundings. To enhance the accuracy of taxonomy assignments in metagenomic samples, we developed a method that allows multiple mismatch probabilities from different *** extended the algorithm of taxonomic assignment of metagenomic sequence reads (TAMER) by developing an improved method that can set a different mismatch probability for each genome rather than imposing a single parameter for all genomes, thereby obtaining a greater degree of accuracy. This method, which we call TADIP (Taxonomic Assignment of metagenomics based on DIfferent Probabilities), was comprehensively tested in simulated and real datasets. The results support that TADIP improved the performance of TAMER especially in large sample size datasets with high *** was developed as a statistical model to improve the estimate accuracy of taxonomy assignments. Based on its varying mismatch probability setting and correlated variance matrix setting, its performance was enhanced for high complexity samples when compared with TAMER.
We consider a mixture model with latent Bayesian network (MLBN) for a set of random vectors X-(t), X-(t) is an element of R-dt, t = 1, ..., T. Each X-(t) is associated with a latent state s(t), given which X-(t) is co...
详细信息
We consider a mixture model with latent Bayesian network (MLBN) for a set of random vectors X-(t), X-(t) is an element of R-dt, t = 1, ..., T. Each X-(t) is associated with a latent state s(t), given which X-(t) is conditionally independent from other variables. The joint distribution of the states is governed by a Bayes net. Although specific types of MLBN have been used in diverse areas such as biomedical research and image analysis, the exact expectation-maximization (em) algorithm for estimating the models can involve visiting all the combinations of states, yielding exponential complexity in the network size. A prominent exception is the Baum-Welch algorithm for the hidden Markov model, where the underlying graph topology is a chain. We hereby develop a new Baum-Welch algorithm on directed acyclic graph (BW-DAG) for the general MLBN and prove that it is an exact em algorithm. BW-DAG provides insight on the achievable complexity of em. For a tree graph, the complexity of BW-DAG is much lower than that of the brute-force em. Copyright (c) 2017 John Wiley & Sons, Ltd.
In this paper, we develop a bivariate discrete generalized exponential distribution, whose marginals are discrete generalized exponential distribution as proposed by Nekoukhou, Alamatsaz and Bidram [Discrete generaliz...
详细信息
In this paper, we develop a bivariate discrete generalized exponential distribution, whose marginals are discrete generalized exponential distribution as proposed by Nekoukhou, Alamatsaz and Bidram [Discrete generalized exponential distribution of a second type. Statistics. 2013;47:876-887]. It is observed that the proposed bivariate distribution is a very flexible distribution and the bivariate geometric distribution can be obtained as a special case of this distribution. The proposed distribution can be seen as a natural discrete analogue of the bivariate generalized exponential distribution proposed by Kundu and Gupta [Bivariate generalized exponential distribution. J Multivariate Anal. 2009;100:581-593]. We study different properties of this distribution and explore its dependence structures. We propose a new em algorithm to compute the maximum-likelihood estimators of the unknown parameters which can be implemented very efficiently, and discuss some inferential issues also. The analysis of one data set has been performed to show the effectiveness of the proposed model. Finally, we propose some open problems and conclude the paper.
Non-negative matrix factorization (NMF) is a technique of multivariate analysis used to approximate a given matrix containing non-negative data using two non-negative factor matrices that has been applied to a number ...
详细信息
Non-negative matrix factorization (NMF) is a technique of multivariate analysis used to approximate a given matrix containing non-negative data using two non-negative factor matrices that has been applied to a number of fields. However, when a matrix containing non-negative data has many zeroes, NMF encounters an approximation difficulty. This zero-inflated situation occurs often when a data matrix is given as count data, and becomes more challenging with matrices of increasing size. To solve this problem, we propose a new NMF model for zero-inflated non-negative matrices. Our model is based on the zero-inflated Tweedie distribution. The Tweedie distribution is a generalization of the normal, the Poisson, and the gamma distributions, and differs from each of the other distributions in the degree of robustness of its estimated parameters. In this paper, we show through numerical examples that the proposed model is superior to the basic NMF model in terms of approximation of zero-inflated data. Furthermore, we show the differences between the estimated basis vectors found using the basic and the proposed NMF models for divergence by applying it to real purchasing data.
It is well known that the widely popular mean regression model could be inadequate if the probability distribution of the observed responses do not follow a symmetric distribution. To deal with this situation, the qua...
详细信息
It is well known that the widely popular mean regression model could be inadequate if the probability distribution of the observed responses do not follow a symmetric distribution. To deal with this situation, the quantile regression turns to be a more robust alternative for accommodating outliers and the misspecification of the error distribution because it characterizes the entire conditional distribution of the outcome variable. This paper presents a likelihood-based approach for the estimation of the regression quantiles based on a new family of skewed distributions. This family includes the skewed version of normal, Student-t, Laplace, contaminated normal and slash distribution, all with the zero quantile property for the error term and with a convenient and novel stochastic representation that facilitates the implementation of the expectation-maximization algorithm for maximum likelihood estimation of the pth quantile regression parameters. We evaluate the performance of the proposed expectation-maximization algorithm and the asymptotic properties of the maximum likelihood estimates through empirical experiments and application to a real-life dataset. The algorithm is implemented in the R package lqr, providing full estimation and inference for the parameters as well as simulation envelope plots useful for assessing the goodness of fit. Copyright (C) 2017 John Wiley & Sons, Ltd.
In this paper we show that a recently developed method for the study of ''cultural'' differences, called DBS-em, or Distance Between Strata estimated with the em (Expectation Maximization) algorithm, c...
详细信息
In this paper we show that a recently developed method for the study of ''cultural'' differences, called DBS-em, or Distance Between Strata estimated with the em (Expectation Maximization) algorithm, can also be used to circumvent the difficulties posed by APC (or Age, Period, Cohort) models. The DBS-em method produces an original measure of the distance (dependent variable) between any two subsets of observations (strata) within a sample, where the stratification variables can be interpreted as regressors. When these stratification variables are age, period, and cohort, what results is an APC model which, however, proves immune to the ''intrinsic collinearity problem'' (C = P-A). With a few limitations, to be sure, which are discussed in the article. In our application to Italian data over the years 1993-2013, age and cohort strongly shape cultural consumption, while cohort and period impact, but only up to a point, on political participation.
Cancer studies frequently yield multiple event times that correspond to landmarks in disease progression, including non-terminal events (i.e., cancer recurrence) and an informative terminal event (i.e., cancer-related...
详细信息
Cancer studies frequently yield multiple event times that correspond to landmarks in disease progression, including non-terminal events (i.e., cancer recurrence) and an informative terminal event (i.e., cancer-related death). Hence, we often observe semi-competing risks data. Work on such data has focused on scenarios in which the cause of the terminal event is known. However, in some circumstances, the information on cause for patients who experience the terminal event is missing;consequently, we are not able to differentiate an informative terminal event from a non-informative terminal event. In this article, we propose a method to handle missing data regarding the cause of an informative terminal event when analyzing the semi-competing risks data. We first consider the nonparametric estimation of the survival function for the terminal event time given missing cause-of-failure data via the expectation-maximization algorithm. We then develop an estimation method for semi-competing risks data with missing cause of the terminal event, under a pre-specified semiparametric copula model. We conduct simulation studies to investigate the performance of the proposed method. We illustrate our methodology using data from a study of early-stage breast cancer. Copyright (C) 2016 John Wiley & Sons, Ltd.
The paper investigates whether accounting for unobserved heterogeneity in farms' size transition processes improves the representation of structural change in agriculture. Considering a mixture of two types of far...
详细信息
The paper investigates whether accounting for unobserved heterogeneity in farms' size transition processes improves the representation of structural change in agriculture. Considering a mixture of two types of farm, the mover-stayer model is applied for the first time in an agricultural economics context. The maximum likelihood method and the expectation-maximization algorithm are used to estimate the model's parameters. An empirical application to a panel of French farms from 2000 to 2013 shows that the mover-stayer model outperforms the homogeneous Markov chain model in recovering the transition process and predicting the future distribution of farm sizes.
In this paper, the destructive negative binomial (DNB) cure rate model with a latent activation scheme [V. Cancho, D. Bandyopadhyay, F. Louzada, and B. Yiqi, The DNB cure rate model with a latent activation scheme, St...
详细信息
In this paper, the destructive negative binomial (DNB) cure rate model with a latent activation scheme [V. Cancho, D. Bandyopadhyay, F. Louzada, and B. Yiqi, The DNB cure rate model with a latent activation scheme, Statistical Methodology 13 (2013b), pp. 48-68] is extended to the case where the observations are grouped into clusters. Parameter estimation is performed based on the restricted maximum likelihood approach and on a Bayesian approach based on Dirichlet process priors. An application to a real data set related to a sealant study in a dentistry experiment is considered to illustrate the performance of the proposed model.
暂无评论