We propose a family of tests to assess the goodness of fit of a high dimensional generalized linear model. Our framework is flexible and may be used to construct an omnibus test or directed against testing specific non-linearities and interaction effects, or for testing the significance of groups of variables. The methodology is based on extracting left-over signal in the residuals from an initial fit of a generalized linear model. This can be achieved by predicting this signal from the residuals by using modern powerful regression or machine learning methods such as random forests or boosted trees. Under the null hypothesis that the generalized linear model is correct, no signal is left in the residuals and our test statistic has a Gaussian limiting distribution, translating to asymptotic control of type I error. Under a local alternative, we establish a guarantee on the power of the test. We illustrate the effectiveness of the methodology on simulated and real data examples by testing goodness of fit in logistic regression models. Software implementing the methodology is available in the R package GRPtests.
Motivated by recent work on high-dimensional logistic regression, we establish that the existence of the maximum likelihood estimate exhibits a phase transition for a wide range of generalized linear models with binary outcome and elliptical covariates. This extends a previous result of Candes and Sur, who proved the phase transition for logistic regression with Gaussian covariates. Our result reveals a rich structure in the phase transition phenomenon, which is overlooked under Gaussianity. The main tools for deriving the result are data separation, convex geometry and stochastic approximation. We also conduct simulation studies to corroborate our theoretical findings, and explore other features of the problem.
Nowadays, it has become increasingly common to store large-scale data sets in a distributed fashion across a great number of clients. The aim of the study is to develop a distributed estimator for generalized linear models (GLMs) in the "large n, diverging p(n)" framework with a weak assumption on the number of clients. When the dimension diverges at the rate of o(√n), the asymptotic efficiency of the global maximum likelihood estimator (MLE), the one-step MLE, and the aggregated estimating equation (AEE) estimator for GLMs is established. A novel distributed estimator is then proposed with two rounds of communication. It has the same asymptotic efficiency as the global MLE under p(n) = o(√n). The assumption on the number of clients is more relaxed than that of the AEE estimator and the proposed method is thus more practical for real-world applications. Simulations and a case study demonstrate the satisfactory finite-sample performance of the proposed estimator. (C) 2020 Elsevier B.V. All rights reserved.
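To make the idea of a two-round distributed estimator concrete, here is a minimal sketch of a generic "average, then take one Newton step" scheme for a one-parameter logistic model. It is not the estimator proposed in the abstract above, whose details are not given there. The function names, the simulated data, the client count, and the fixed iteration count are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score_and_info(beta, data):
    """Score and observed information for y ~ Bernoulli(sigmoid(beta * x))."""
    grad = sum(x * (y - sigmoid(beta * x)) for x, y in data)
    info = sum(x * x * sigmoid(beta * x) * (1.0 - sigmoid(beta * x)) for x, _ in data)
    return grad, info

def newton_mle(data, n_iter=25):
    """Maximum likelihood estimate by Newton's method (hypothetical helper)."""
    beta = 0.0
    for _ in range(n_iter):
        grad, info = score_and_info(beta, data)
        beta += grad / info
    return beta

def simulate(n, beta_true, rng):
    """Draw n (x, y) pairs from the one-parameter logistic model."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(beta_true * x) else 0
        data.append((x, y))
    return data

rng = random.Random(0)
clients = [simulate(400, 1.0, rng) for _ in range(5)]  # assumed: 5 clients, 400 rows each

# Round 1: each client sends its local MLE; the server averages them.
beta_avg = sum(newton_mle(d) for d in clients) / len(clients)

# Round 2: each client sends its score and information evaluated at beta_avg;
# the server sums them and takes a single Newton step.
parts = [score_and_info(beta_avg, d) for d in clients]
beta_one_step = beta_avg + sum(g for g, _ in parts) / sum(h for _, h in parts)

# Reference value: the MLE computed on the pooled data.
beta_global = newton_mle([row for d in clients for row in d])
```

In this toy setting each client transmits only scalars. With p parameters, round two would send a length-p score vector and a p-by-p information matrix per client.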
Ridge estimators regularize the squared Euclidean lengths of parameters. Such estimators are mathematically and computationally attractive but involve tuning parameters that need to be calibrated. It is shown that ridge estimators can be modified such that tuning parameters can be avoided altogether, and the resulting estimator can improve on the prediction accuracies of standard ridge estimators combined with cross-validation. (C) 2021 Elsevier B.V. All rights reserved.
In the analysis of clustered or hierarchical data, a variety of statistical techniques can be applied. Most of these techniques have assumptions that are crucial to the validity of their outcome. Mixed models rely on the correct specification of the random effects structure. Generalized estimating equations (GEE) are most efficient when the working correlation form is chosen correctly and are not feasible when the within-subject variable is non-factorial. Assumptions and limitations of another common approach, ANOVA for repeated measurements, are even more worrisome: listwise deletion when data are missing, the sphericity assumption, inability to model an unevenly spaced time variable and time-varying covariates, and the limitation to normally distributed dependent variables. This paper introduces ClusterBootstrap, an R package for the analysis of hierarchical data using generalized linear models with the cluster bootstrap (GLMCB). Being a bootstrap method, the technique is relatively assumption-free, and it has already been shown to be comparable, if not superior, to GEE in its performance. The paper has three goals. First, GLMCB will be introduced. Second, there will be an empirical example, using the ClusterBootstrap package for a Gaussian and a dichotomous dependent variable. Third, GLMCB will be compared to mixed models in a Monte Carlo experiment. Although GLMCB can be applied to a multitude of hierarchical data forms, this paper discusses it in the context of the analysis of repeated measurements or longitudinal data. It will become clear that the GLMCB is a promising alternative to mixed models and the ClusterBootstrap package an easy-to-use R implementation of the technique.
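The core idea of the cluster bootstrap can be sketched in a few lines: resample whole clusters (subjects) with replacement, refit the model on each resample, and read a percentile interval off the refitted coefficients. The sketch below is an illustration in Python, not the API of the ClusterBootstrap R package. All function names, the toy data, and the choice of a least-squares slope (a Gaussian GLM with identity link) as the fitted model are assumptions.

```python
import random
import statistics

def ols_slope(rows):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    return sxy / sxx

def cluster_bootstrap_ci(clusters, n_boot=2000, level=0.95, seed=0):
    """Percentile interval for the slope, resampling whole clusters."""
    rng = random.Random(seed)
    ids = list(clusters)
    slopes = []
    for _ in range(n_boot):
        # Draw cluster ids with replacement and keep every row of each drawn cluster.
        drawn = rng.choices(ids, k=len(ids))
        rows = [row for cid in drawn for row in clusters[cid]]
        slopes.append(ols_slope(rows))
    slopes.sort()
    alpha = (1.0 - level) / 2.0
    return slopes[int(alpha * n_boot)], slopes[int((1.0 - alpha) * n_boot) - 1]

def make_toy_data(n_subjects=12, seed=1):
    """Hypothetical longitudinal data: 4 time points per subject, random intercepts."""
    rng = random.Random(seed)
    clusters = {}
    for sid in range(n_subjects):
        offset = rng.gauss(0.0, 1.0)  # subject-specific intercept
        clusters[sid] = [(t, 2.0 + 0.5 * t + offset + rng.gauss(0.0, 0.3))
                         for t in range(4)]
    return clusters

clusters = make_toy_data()
slope_hat = ols_slope([row for rows in clusters.values() for row in rows])
lo, hi = cluster_bootstrap_ci(clusters)
print(f"slope = {slope_hat:.3f}, 95% cluster-bootstrap interval = ({lo:.3f}, {hi:.3f})")
```

Resampling subjects rather than individual rows keeps each subject's repeated measurements together, so the within-subject dependence carries over into every bootstrap sample.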
We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial e...
It is known that collinearity among the explanatory variables in generalized linear models (GLMs) inflates the variance of maximum likelihood estimators. To overcome multicollinearity in GLMs, the ordinary ridge estimator and the restricted estimator have been proposed. In this study, a restricted ridge estimator is introduced by unifying the ordinary ridge estimator and the restricted estimator in GLMs, and its mean squared error (MSE) properties are discussed. The MSE comparisons are done in the context of first-order approximated estimators. The results are illustrated by a numerical example, and two simulation studies are conducted with Poisson and binomial responses.
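As a small illustration of how a ridge penalty acts on a GLM coefficient, the sketch below fits a one-parameter logistic regression by Newton's method, once without a penalty (the MLE) and once with a ridge penalty k. It shows only the ordinary ridge estimator, fully iterated; the restricted ridge estimator and the first-order approximations studied in the abstract above are not implemented. The function names, the data, and the value k = 1 are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def ridge_logistic(xs, ys, k=0.0, n_iter=50):
    """Maximise sum_i [y_i*x_i*b - log(1 + exp(x_i*b))] - (k/2)*b**2 by Newton's method."""
    b = 0.0
    for _ in range(n_iter):
        ps = [sigmoid(b * x) for x in xs]
        # Penalised score: likelihood score minus the ridge term k*b.
        grad = sum(x * (y - p) for x, y, p in zip(xs, ys, ps)) - k * b
        # Penalised information: Fisher information plus k, always positive for k >= 0 here.
        curv = sum(x * x * p * (1.0 - p) for x, p in zip(xs, ps)) + k
        b += grad / curv
    return b

# Hypothetical, non-separable toy data (two observations fall on the "wrong" side).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b_mle = ridge_logistic(xs, ys, k=0.0)    # unpenalised maximum likelihood estimate
b_ridge = ridge_logistic(xs, ys, k=1.0)  # ridge estimate, shrunk toward zero
print(f"MLE = {b_mle:.3f}, ridge (k=1) = {b_ridge:.3f}")
```

Adding k to the information term is what stabilises the fit: the variance inflation caused by collinearity comes from a near-singular information matrix, and the ridge term bounds it away from singularity at the cost of some bias.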
Variable selection is currently an important research topic in both the frequentist and Bayesian frameworks. While most developments in the Bayesian model selection literature are based on a local prior on regression parameters, a nonlocal prior for model selection can also be used. In this article, we extend the nonlocal prior approach to logistic regression and to generalized linear models. Laplace approximation is used in implementation to avoid integration in the likelihood. A convergence rate is derived under some regularity conditions. The selection based on a nonlocal prior eliminates unnecessary variables and recommends a simple model. The method is validated by a simulation study and illustrated by a real data example. (C) 2018 Elsevier B.V. All rights reserved.
Structured sparsity has recently been a very popular technique to deal with high-dimensional data. In this paper, we mainly focus on the theoretical problems for the overlapping group structure of generalized linear models (GLMs). Although the overlapping group lasso method for GLMs has been widely applied, its theoretical properties are still unknown. Under some general conditions, we present the oracle inequalities for the estimation and prediction error of the overlapping group Lasso method in the generalized linear model setting. Then, we apply these results to the so-called Logistic and Poisson regression models. It is shown that the results of the Lasso and group Lasso procedures for GLMs can be recovered by specifying the group structures in our proposed method. The effect of overlap and the performance of variable selection of our proposed method are both studied by numerical simulations. Finally, we apply our proposed method to two gene expression data sets: the p53 data and the lung cancer data.
Due to increasing discoveries of biomarkers and observed diversity among patients, there is growing interest in personalized medicine for the purpose of increasing the well-being of patients (ethics) and extending human life. In fact, these biomarkers and observed heterogeneity among patients are useful covariates that can be used to achieve the ethical goals of clinical trials and improve the efficiency of statistical inference. The covariate-adjusted response-adaptive (CARA) design was developed to use information in such covariates in randomization to maximize the well-being of participating patients as well as increase the efficiency of statistical inference at the end of a clinical trial. In this paper, we establish conditions for consistency and asymptotic normality of maximum likelihood (ML) estimators of generalized linear models (GLM) for a general class of adaptive designs. We prove that the ML estimators are consistent and asymptotically follow a multivariate Gaussian distribution. The efficiency of the estimators and the performance of response-adaptive (RA), CARA, and completely randomized (CR) designs are examined based on the well-being of patients under a logit model with categorical covariates. Results from our simulation studies and application to data from a clinical trial on stroke prevention in atrial fibrillation (SPAF) show that RA designs lead to ethically desirable outcomes as well as higher statistical efficiency compared to CARA designs if there is no treatment by covariate interaction in an ideal model. CARA designs were, however, more ethical than RA designs when there was significant interaction.