Probabilistic record linkage is the task of combining multiple data sources for statistical analysis by identifying records pertaining to the same individual in different databases. The need to perform probabilistic record linkage arises in comparative effectiveness research and other clinical research scenarios when records in different databases do not share an error-free unique patient identifier. This dissertation seeks to develop new methodology for probabilistic record linkage to address two highly practical and recurring challenges: how to implement record linkage in a manner that optimizes downstream statistical analyses of the linked data, and how to efficiently link databases having a clustered or multi-level data *** Chapter 2 we propose a new framework for balancing the tradeoff between false positive and false negative linkage errors when linked data are analyzed in a generalized linear model framework and non-linked records lead to missing data for the study outcome variable. Our method seeks to maximize the probability that the point estimate of the parameter of interest will have the correct sign and that the confidence interval around this estimate will correctly exclude the null value of zero. Using large sample approximations and a model for linkage errors, we derive expressions relating bias and hypothesis testing power to the user's choice of threshold that determines how many records will be linked. We use these results to propose three data-driven threshold selection rules. Under one set of simplifying assumptions we prove that maximizing asymptotic power requires that the threshold be relaxed at least until the point where all pairs with >50% probability of being a true match are *** Chapter 3 we explore the consequences of linkage errors when the study outcome variable is determined by linkage status and so linkage errors may cause outcome misclassification. This scenario arises when the outcome is disease status and those lin
In linear mixed models, the assumption of normally distributed random effects is often inappropriate and unnecessarily restrictive. The proposed approximate Dirichlet process mixture assumes a hierarchical Gaussian mixture that is based on the truncated version of the stick-breaking representation of the Dirichlet process. In addition to weakening the distributional assumptions, the specification allows clusters of observations with a similar random effects structure to be identified. An Expectation-Maximization algorithm is given that solves the estimation problem and that, in certain respects, may exhibit advantages over Markov chain Monte Carlo approaches when modelling with Dirichlet processes. The method is evaluated in a simulation study and applied to the dynamics of unemployment in Germany as well as lung function growth data.
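The truncated stick-breaking construction mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard construction with Beta(1, alpha) stick fractions, and the function name `truncated_stick_breaking`, the concentration `alpha`, and the truncation level `K` are hypothetical names chosen for the example.

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng):
    """Draw K mixture weights from a stick-breaking construction truncated at K.

    Assumption: stick fractions v_k ~ Beta(1, alpha), the standard choice
    for a Dirichlet process; the abstract does not state the exact prior.
    """
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: the last component takes whatever stick is left
    # Length of stick remaining before each break: prod_{l<k} (1 - v_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
weights = truncated_stick_breaking(alpha=2.0, K=10, rng=rng)
```

Setting the last fraction to 1 is what makes the truncated weights sum to exactly one, so they can serve directly as the mixing proportions of a finite Gaussian mixture.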
Existing methods for estimating the parameters of the Growth Curve Model (GCM) rely on the assumption that the underlying distribution for the error terms is multivariate normal. However, we often come across skewed data in practical applications, and estimators developed under the normality assumption may not be valid in such situations. Simulation studies conducted in this paper show that existing methods are sensitive to skewness: normal-based estimators are associated with increased bias and mean squared error (MSE) when the normality assumption is violated. Methods appropriate for skewed distributions are, therefore, required. In this paper, estimators for the mean and covariance matrices of the GCM under the multivariate skew normal (MSN) distribution are proposed. An estimator for the additional skewness parameter of the MSN distribution is also provided. The estimators are derived using the expectation-maximization (EM) algorithm, and extensive simulations are performed to examine their performance. Comparisons show that our estimators perform better than the existing estimators when the underlying distribution is multivariate skew normal. An illustration using a real data set is also provided.
Describing the co-variations between several observed random variables is a delicate problem. Dependency networks are popular tools that depict the relations between variables through the presence or absence of edges between the nodes of a graph. In particular, conditional correlation graphs are used to represent the "direct" correlations between the nodes of the graph. They are often studied under the Gaussian assumption and are therefore called "Gaussian graphical models" (GGM). A single network can be used to represent the overall tendencies identified in a data sample. However, when the observed data are sampled from a heterogeneous population, there exist different sub-populations that must each be described by its own graph. Moreover, if the sub-population (or "class") labels are not available, unsupervised approaches must be implemented in order to correctly identify the classes and describe each of them with its own graph. In this work, we address the relatively new problem of hierarchical GGM estimation for unlabelled heterogeneous populations. We explore several key axes to improve the estimation of the model parameters as well as the unsupervised identification of the sub-populations. Our goal is to ensure that the inferred conditional correlation graphs are as relevant and interpretable as possible. First, in the case of a simple, homogeneous population, we develop a composite method that combines the strengths of the two main state-of-the-art paradigms in order to correct their weaknesses. For the unlabelled heterogeneous case, we propose to estimate a mixture of GGMs with an expectation-maximization (EM) algorithm. In order to improve the solutions of this EM algorithm, and to avoid falling into suboptimal local extrema ...
Drug addiction can lead to many health-related problems and social concerns. Researchers are interested in the association between long-term drug usage and abnormal functional connectivity. Functional connectivity obtained from functional magnetic resonance imaging data promotes a variety of fundamental understandings of this association. Due to the complex correlation structure and large dimensionality, the modeling and analysis of functional connectivity from neuroimaging data are challenging. By proposing a spatio-temporal model for multi-subject neuroimaging data, we incorporate voxel-level spatio-temporal dependencies of whole-brain measurements to improve the accuracy of statistical inference. To tackle large-scale spatio-temporal neuroimaging data, we develop a computationally efficient algorithm to estimate the parameters. Our method is used first to identify functional connectivity, and then to detect the effect of cocaine use disorder (CUD) on functional connectivity between different brain regions. The functional connectivity identified by our spatio-temporal model matches existing studies on brain networks, and further indicates that CUD may alter the functional connectivity in the medial orbitofrontal cortex subregions and the supplementary motor areas. (c) 2021 Elsevier B.V. All rights reserved.
Latent trait models such as item response theory (IRT) hypothesize a functional relationship between an unobservable, or latent, variable and an observable outcome variable. In educational measurement, a discrete item response is usually the observable outcome variable, and the latent variable is associated with an examinee's trait level (e.g., skill, proficiency). The link between the two variables is called an item response function. This function, defined by a set of item parameters, models the probability of observing a given item response, conditional on a specific trait level. Typically in a measurement setting, neither the item parameters nor the trait levels are known, and so must be estimated from the pattern of observed item responses. Although a maximum likelihood approach can be taken in estimating these parameters, it usually cannot be employed directly. Instead, a method of marginal maximum likelihood (MML) is utilized, via the expectation-maximization (EM) algorithm. Alternating between an expectation (E) step and a maximization (M) step, the EM algorithm assures that the marginal log likelihood function will not decrease after each EM cycle, and will converge to a local maximum. Interestingly, the negative of this marginal log likelihood function is equal to the relative entropy, or Kullback-Leibler divergence, between the conditional distribution of the latent variables given the observable variables and the joint likelihood of the latent and observable variables. With an unconstrained optimization for the M-step proposed here, the EM algorithm as minimization of Kullback-Leibler divergence admits the convergence results due to Csiszar and Tusnady (Statistics & Decisions, 1:205-237, 1984), a consequence of the binomial likelihood common to latent trait models with dichotomous response variables. For this unconstrained optimization, the EM algorithm converges to a global maximum of the marginal log likelihood function, yielding an information bound t
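The monotonicity property described in this abstract (the marginal log likelihood does not decrease across EM cycles) can be illustrated with a small sketch. Note the substitution: this example uses a two-component univariate Gaussian mixture rather than an IRT model, purely because it is the shortest self-contained instance of alternating E and M steps. The function name `em_gmm_1d`, the initialization, and the simulated data are assumptions made for the example, not anything taken from the paper.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative stand-in model).

    Returns the fitted parameters and the marginal log likelihood
    evaluated at the start of each EM cycle.
    """
    # Crude initialization (an assumption of this sketch)
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()], dtype=float)
    pi = np.array([0.5, 0.5])
    loglik = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (
            sigma * np.sqrt(2.0 * np.pi)
        )
        total = dens.sum(axis=1)
        loglik.append(np.log(total).sum())
        resp = dens / total[:, None]
        # M-step: closed-form weighted updates of the mixture parameters
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma, np.array(loglik)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
pi, mu, sigma, loglik = em_gmm_1d(x)
```

Inspecting `np.diff(loglik)` shows increments that are non-negative up to floating-point error, which is the behaviour the abstract attributes to each EM cycle. As the abstract also notes, this guarantees convergence only to a local maximum in general.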
In this paper, a new mixture family of multivariate normal distributions, formed by mixing a multivariate normal distribution and a skewed distribution, is constructed. Some properties of this family, such as the characteristic function, the moment generating function, and the first four moments, are derived. The distributions of affine transformations and canonical forms of the model are also derived. An EM-type algorithm is developed for the maximum likelihood estimation of model parameters. Some special cases of the family, using standard gamma and standard exponential mixture distributions, denoted by MMNG and MMNE, respectively, are considered. For the proposed family of distributions, different multivariate measures of skewness are computed. In order to examine the performance of the developed estimation method, some simulation studies are carried out, which show that the maximum likelihood estimates perform well. For different choices of parameters of the MMNE distribution, several multivariate measures of skewness are computed and compared. Because some measures of skewness are scalars and some are vectors, in order to evaluate them properly, a simulation study is carried out to determine the power of tests that use sample versions of the skewness measures as test statistics for testing the fit of the MMNE distribution. Finally, two real data sets are used to illustrate the usefulness of the proposed model and the associated inferential methods. (C) 2020 Elsevier Inc. All rights reserved.
Mendelian randomization (MR) is a powerful instrumental variable (IV) method for estimating the causal effect of an exposure on an outcome of interest even in the presence of unmeasured confounding by using genetic variants as IVs. However, the correlated and idiosyncratic pleiotropy phenomena in the human genome will lead to biased estimation of causal effects if they are not properly accounted for. In this article, we develop a novel MR approach named MRCIP to account for correlated and idiosyncratic pleiotropy simultaneously. We first propose a random-effect model to explicitly model the correlated pleiotropy and then propose a novel weighting scheme to handle the presence of idiosyncratic pleiotropy. The model parameters are estimated by maximizing a weighted likelihood function with our proposed PRW-EM algorithm. Moreover, we can also estimate the degree of the correlated pleiotropy and perform a likelihood ratio test for its presence. Extensive simulation studies show that the proposed MRCIP has improved performance over competing methods. We also illustrate the usefulness of MRCIP on two real datasets. The R package for MRCIP is publicly available at https://***/siqixu/MRCIP.
In this article we develop an estimation method based on the augmented data scheme and EM/SEM (Stochastic EM) algorithms for fitting one-parameter probit (Rasch) IRT (Item Response Theory) models. Instead of using the S steps of the SEM algorithm, that is, instead of simulating values for the unobserved variables (augmented data and the latent traits), we consider the conditional expectations of a set of unobserved variables on the other set of unobserved variables, the current estimates of the parameters and the observed data, based on the full conditional distributions from the Gibbs sampling algorithm. Our method, named the CADEM algorithm (conditional augmented data EM), presents straightforward E steps, which avoid the need to evaluate the usual integrals, also facilitating the M steps, without the need to use numerical methods of optimization. We use the CADEM algorithm to obtain both maximum likelihood estimates and maximum a posteriori estimates of the difficulty parameters for the one-parameter probit (Rasch) model. Also, we obtain estimates for the latent traits, based on conditional expectations. In addition, we show how to calculate the associated standard errors. Some directions are provided to extend our approach to other IRT models. In this respect, we perform a simulation study to compare the estimation methods. The results indicated that our approach is quite comparable to the usual marginal maximum likelihood (MML) and Gibbs sampling methods (GS) in terms of parameter recovery. However, CADEM is as fast as MML and as flexible as GS.
In this paper, we consider the four-parameter bivariate generalized exponential distribution proposed by Kundu and Gupta [Bivariate generalized exponential distribution, J. Multivariate Anal. 100 (2009), pp. 581-593] and propose an expectation-maximization algorithm to find the maximum-likelihood estimators of the four parameters under random left censoring. A numerical experiment is carried out to discuss the properties of the estimators obtained iteratively.