We propose a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well developed nor extensively used in practice due to their reliance on complicated likelihood functions that have high computational complexity. Many existing parametric frameworks tend to instead use pseudo-likelihoods to reduce computational burdens. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood can control the type-I error rate, and gains power and improves the effect size estimation when compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.
The goal of this paper is to address the issue of nonlinear regression with outliers, possibly in high dimension, without specifying the form of the link function and under a parametric approach. Nonlinearity is handled via an underlying mixture of affine regressions. Each regression is encoded in a joint multivariate Student distribution on the responses and covariates. This joint modeling allows the use of an inverse regression strategy to handle the high dimensionality of the data, while the heavy tail of the Student distribution limits the contamination by outlying data. The possibility of adding to the model a number of latent variables similar to factors further reduces its sensitivity to noise or model misspecification. The mixture model setting has the advantage of providing a natural inference procedure using an EM algorithm. The tractability and flexibility of the algorithm are illustrated in simulations and real high-dimensional data with good performance that compares favorably with other existing methods. (C) 2017 Elsevier Inc. All rights reserved.
Deciding the number of clusters k is one of the most difficult problems in cluster analysis. For this purpose, complexity-penalized likelihood approaches have been introduced in model-based clustering, such as the well-known Bayesian information criterion and integrated complete likelihood criteria. However, the classification/mixture likelihoods considered in these approaches are unbounded without any constraint on the cluster scatter matrices. Constraints also prevent traditional EM and CEM algorithms from being trapped in (spurious) local maxima. Controlling the maximal ratio between the eigenvalues of the scatter matrices to be smaller than a fixed constant c >= 1 is a sensible idea for setting such constraints. A new penalized likelihood criterion which takes into account the higher model complexity that a higher value of c entails is proposed. Based on this criterion, a novel and fully automated procedure, leading to a small ranked list of optimal (k, c) couples, is provided. A new plot called "car-bike," which provides a concise summary of the solutions, is introduced. The performance of the procedure is assessed both in empirical examples and through a simulation study as a function of cluster overlap. Supplementary materials for the article are available online.
An efficient and accurate numerical approximation methodology useful for obtaining the observed information matrix and subsequent asymptotic covariance matrix when fitting models with the EM algorithm is presented. The numerical approximation approach is compared to existing algorithms intended for the same purpose, and the computational benefits and accuracy of this new approach are highlighted. Instructive and real-world examples are included to demonstrate the methodology concretely, properties of the estimator are discussed in detail, and a Monte Carlo simulation study is included to investigate the behaviour of a multi-parameter item response theory model using three competing finite-difference algorithms.
The problem of multicollinearity among predictor variables is a frequent issue in longitudinal data analysis. In this context, this paper proposes a mixed ridge regression model via shrinkage methods to analyze such data. Furthermore, with a view to obtaining more efficient estimators, we propose preliminary and Stein-type estimators using prior information for fixed-effects parameters. The model parameters are estimated via the EM algorithm. A simulation study is also presented to assess the performance of the estimators under different estimation methods. An application to the HIV data is also presented.
The analysis of quantitative trait loci (QTLs) aims at mapping and estimating the positions and effects of the genes that may affect the quantitative trait, and evaluating the relationship between the gene variation and the phenotype. Most existing methods focus on the association/linkage between multiple gene loci and one trait, so some useful joint information on multiple traits may be ignored. In this paper, we propose a method of simultaneously estimating all QTL parameters in the framework of multiple-trait multiple-interval mapping. Simulation results show that, in terms of accuracy, the proposed method outperforms an existing method for mapping multiple traits. A real example is also provided to validate the performance of the new method.
Measurement error models constitute a wide class of models that include linear and nonlinear regression models. They are very useful to model many real-life phenomena, particularly in the medical and biological areas. The great advantage of these models is that, in some sense, they can be represented as mixed effects models, allowing us to implement well-known techniques, like the EM algorithm, for parameter estimation. In this paper, we consider a class of multivariate measurement error models where the observed response and/or covariate are not fully observed, i.e., the observations are subject to certain threshold values below or above which the measurements are not quantifiable. Consequently, these observations are considered censored. We assume a Student-t distribution for the unobserved true values of the mismeasured covariate and the error term of the model, providing a robust alternative for parameter estimation. Our approach relies on likelihood-based inference using an EM-type algorithm. The proposed method is illustrated through some simulation studies and the analysis of an AIDS clinical trial dataset.
In this article, we suggest a new statistical approach considering survival heterogeneity as a breakpoint model in an ordered sequence of time-to-event variables. The survival responses need to be ordered according to a numerical covariate. Our estimation method aims at detecting heterogeneity that could arise through the ordering covariate. We formally introduce our model as a constrained Hidden Markov Model, where the hidden states are the unknown segmentation (breakpoint locations) and the observed states are the survival responses. We derive an efficient Expectation-Maximization framework for maximizing the likelihood of this model for a wide range of baseline hazard forms (parametric or nonparametric). The posterior distribution of the breakpoints is also derived, and the selection of the number of segments using a penalized likelihood criterion is discussed. The performance of our survival breakpoint model is finally illustrated on a diabetes dataset where the observed survival times are ordered according to the calendar time of disease onset.
Complex interactions between entities are often represented as edges in a network. In practice, the network is often constructed from noisy measurements and inevitably contains some errors. In this paper we consider the problem of estimating a network from multiple noisy observations where edges of the original network are recorded with both false positives and false negatives. This problem is motivated by neuroimaging applications where brain networks of a group of patients with a particular brain condition could be viewed as noisy versions of an unobserved true network corresponding to the disease. The key to optimally leveraging these multiple observations is to take advantage of network structure, and here we focus on the case where the true network contains communities. Communities are common in real networks in general and in particular are believed to be present in brain networks. Under a community structure assumption on the truth, we derive an efficient method to estimate the noise levels and the original network, with theoretical guarantees on the convergence of our estimates. We show on synthetic networks that the performance of our method is close to an oracle method using the true parameter values, and apply our method to fMRI brain data, demonstrating that it constructs stable and plausible estimates of the population network.
Missing covariate data are a common issue in generalized linear models (GLMs). A model-based procedure arising from properly specifying joint models for both the partially observed covariates and the corresponding missing indicator variables represents a sound and flexible methodology, which lends itself to maximum likelihood estimation as the likelihood function is available in computable form. In this paper, a novel model-based methodology is proposed for the regression analysis of GLMs when the partially observed covariates are categorical. Pair-copula constructions are used as graphical tools in order to facilitate the specification of the high-dimensional probability distributions of the underlying missingness components. The model parameters are estimated by maximizing the weighted log-likelihood function using an EM algorithm. In order to compare the performance of the proposed methodology with other well-established approaches, which include complete-case analysis and multiple imputation, several simulation experiments of Binomial, Poisson and Normal regressions are carried out under both missing at random and non-missing at random mechanisms. The methods are illustrated by modeling data from a stage III melanoma clinical trial. The results show that the methodology is rather robust and flexible, representing a competitive alternative to traditional techniques.