In this paper, we study the problem of estimating smooth generalized linear models (GLMs) in the Non-interactive Local Differential Privacy (NLDP) model. Unlike its classical setting, our model allows the server to access additional public but unlabeled data. In the first part of the paper, we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. sampled from a zero-mean multivariate Gaussian distribution. Motivated by Stein's lemma, we present an $(\epsilon, \delta)$-NLDP algorithm for GLMs. Moreover, the sample complexity of public and private data for the algorithm to achieve an $\ell_2$-norm estimation error of $\alpha$ (with high probability) is $O(p\alpha^{-2})$ and $\tilde{O}(p^3\alpha^{-2}\epsilon^{-2})$ respectively, where $p$ is the dimension of the feature vector. This is a significant improvement over the previously known sample complexities of GLMs with no public data, which are exponential or quasi-polynomial in $\alpha^{-1}$, or exponential in $p$. Then we consider a more general setting where each data record is i.i.d. sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Based on a variant of Stein's lemma, we propose an $(\epsilon, \delta)$-NLDP algorithm for GLMs whose sample complexity of public and private data to achieve an $\ell_\infty$-norm estimation error of $\alpha$ is $O(p^2\alpha^{-2})$ and $\tilde{O}(p^2\alpha^{-2}\epsilon^{-2})$ respectively, under some mild assumptions and if $\alpha$ is not too small (i.e., $\alpha \geq \Omega(1/\sqrt{p})$). In the second part of the paper, we extend our idea to the problem of estimating non-linear regressions and show similar results as in GLMs for both multivariate Gaussian and sub-Gaussian cases. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets. To the best of our knowledge, this is the first paper showing the existence of efficient and effective algorithms for GLMs and non-linear regressions in the NLDP model with unlabeled public
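The role of Stein's lemma in the Gaussian case can be illustrated with a small, non-private sketch. For zero-mean Gaussian features with covariance $\Sigma$ and a smooth link $h$, Stein's lemma gives $E[xy] = E[h'(x^T\beta)]\,\Sigma\beta$, so $\Sigma^{-1}E[xy]$ points in the direction of $\beta$. The code below only checks this identity numerically. It adds no privacy noise and is not the paper's algorithm; the function name `stein_direction`, the logistic link, and all sample sizes are assumptions made for illustration.

```python
import numpy as np

def stein_direction(X_public, X_labeled, y):
    """Estimate the direction of beta as Sigma^{-1} E[x y] (non-private sketch)."""
    # Covariance from unlabeled "public" features; zero mean is assumed.
    sigma_hat = X_public.T @ X_public / X_public.shape[0]
    # Cross moment E[x y] from the labeled sample.
    cross = X_labeled.T @ y / X_labeled.shape[0]
    direction = np.linalg.solve(sigma_hat, cross)
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
p, n_public, n_labeled = 5, 20000, 20000
beta = np.array([1.0, -0.5, 0.25, 0.0, 0.75])  # hypothetical true parameter

# A random positive-definite covariance for the Gaussian features.
A = rng.normal(size=(p, p))
cov = A @ A.T / p + np.eye(p)
X_public = rng.multivariate_normal(np.zeros(p), cov, size=n_public)
X_labeled = rng.multivariate_normal(np.zeros(p), cov, size=n_labeled)

# Logistic GLM responses (an assumed link, chosen only for the demo).
mean = 1.0 / (1.0 + np.exp(-(X_labeled @ beta)))
y = rng.binomial(1, mean).astype(float)

estimate = stein_direction(X_public, X_labeled, y)
cosine = float(estimate @ beta / np.linalg.norm(beta))
print(f"cosine similarity with true direction: {cosine:.3f}")
```

The sketch recovers only the direction of $\beta$; the scale factor $E[h'(x^T\beta)]$ would have to be estimated separately.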
Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but require iterative estimation procedures that impose data transfer and computational costs that can be problematic under some infrastructure constraints. We propose a doubly-sketched approximation of the iteratively re-weighted least squares algorithm to estimate generalized linear model parameters using a sequence of surrogate datasets. The procedure sketches once to reduce data transfer costs, and sketches again to reduce data computation costs, yielding wall-clock time savings. Regression coefficients and standard errors are produced, with comparison against literature methods. Asymptotic properties of the proposed procedure are shown, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across almost 1.7 billion observations on a personal computer in 25 min.
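For orientation, the unsketched baseline that the abstract approximates, iteratively re-weighted least squares (IRLS) for a Poisson-log GLM, can be written in a few lines. This is a minimal sketch of standard IRLS on the full data, not the doubly-sketched procedure itself; the function name `irls_poisson`, the simulated data, and the stopping rule are assumptions for illustration.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Standard IRLS for a Poisson GLM with log link (no sketching)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        # Working response; for the log link the IRLS weights equal mu.
        z = eta + (y - mu) / mu
        XtW = X.T * mu
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    # Standard errors from the inverse Fisher information X^T W X.
    mu = np.exp(X @ beta)
    cov = np.linalg.inv((X.T * mu) @ X)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 5000
beta_true = np.array([0.5, 0.3, -0.2])  # hypothetical coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta_hat, std_err = irls_poisson(X, y)
print(beta_hat, std_err)
```

Each iteration touches every row of `X`, which is the cost that sketching (working on a smaller surrogate dataset) is meant to reduce.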
In this paper, our goal is to enhance the interpretability of generalized linear models by identifying the most relevant interactions between categorical predictors. Searching for interaction effects can quickly become a highly combinatorial, and thus computationally costly, problem when we have many categorical predictors or even a few of them but with many categories. Moreover, the estimation of coefficients requires large training samples with enough observations for each interaction between categories. To address these bottlenecks, we propose to find a reduced representation for each categorical predictor as a binary predictor, where categories are clustered based on a dissimilarity. We provide a collection of binarized representations for each categorical predictor, where the dissimilarity takes into account information from the main effects and the interactions. The choice of the binarized predictors representing the categorical predictors is made with a novel heuristic procedure that is guided by the accuracy of the so-called binarized model. We test our methodology on both real-world and simulated data, illustrating that, without damaging the out-of-sample accuracy, our approach trains sparse models including only the most relevant interactions between categorical predictors.
Many research fields involve count data with zero inflation. A commonly chosen model for analysing a relationship between predictors and a response variable in these scenarios is a zero-inflated generalized linear model (GLM). This model is a mixture of a count-based GLM and a zero-inflation component, with a mixing proportion that determines the amount of excess zeroes. As the use of zero-inflated count models is rising, it is important to be able to conduct a power analysis to properly design studies with such models. In this paper, we propose a flexible method for power analysis with zero-inflated count models using Monte Carlo simulation. We have created the R package ZIPowerAnalysis, which can be used to easily conduct a power analysis for any designed study that will incorporate a zero-inflated count GLM.
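The general idea of simulation-based power analysis for zero-inflated counts can be sketched as follows: repeatedly simulate data under an assumed effect, run a test, and report the rejection rate. This sketch does not use or reproduce the ZIPowerAnalysis package. It draws zero-inflated Poisson counts for two groups and, as a deliberately simple stand-in for fitting a zero-inflated GLM, applies a large-sample z-test to the group means. All names and parameter values are hypothetical.

```python
import numpy as np

def simulate_zip(rng, n, lam, pi_zero):
    """Draw n zero-inflated Poisson counts (structural zero with prob. pi_zero)."""
    counts = rng.poisson(lam, size=n)
    structural_zero = rng.random(n) < pi_zero
    return np.where(structural_zero, 0, counts)

def estimate_power(n_per_group, lam_control, lam_treated, pi_zero,
                   n_sims=1000, z_crit=1.96, seed=0):
    """Monte Carlo rejection rate of a two-sided z-test on group means."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        y0 = simulate_zip(rng, n_per_group, lam_control, pi_zero)
        y1 = simulate_zip(rng, n_per_group, lam_treated, pi_zero)
        se = np.sqrt(y0.var(ddof=1) / n_per_group + y1.var(ddof=1) / n_per_group)
        z = (y1.mean() - y0.mean()) / se
        rejections += int(abs(z) > z_crit)
    return rejections / n_sims

# Power under an assumed effect, and the rejection rate under no effect.
power = estimate_power(100, lam_control=2.0, lam_treated=3.0, pi_zero=0.3)
size = estimate_power(100, lam_control=2.0, lam_treated=2.0, pi_zero=0.3)
print(f"estimated power: {power:.2f}, rejection rate under the null: {size:.2f}")
```

Replacing the z-test with a fitted zero-inflated model and its Wald or likelihood-ratio test would bring the sketch closer to the setting the abstract describes.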
Bayesian modeling provides a principled approach to quantifying uncertainty and has seen a surge of applications in recent years. Within the context of a Bayesian workflow, we are concerned with model selection for the purpose of finding models that best explain the data or underlying data generating process. Since insight into the true process is rare, what remains is incomplete causal knowledge and model predictions of the data. This leads to the important question of when the use of prediction as a proxy for explanation for the purpose of model selection is valid. We approach this question by means of large-scale simulations of Bayesian generalized linear models where we investigate various causal and statistical misspecifications. Our results indicate that the use of prediction as a proxy for explanation is valid and safe if the models under consideration are sufficiently consistent with the underlying causal structure of the true data generating process.
In this article, we develop a distributed variable screening method for generalized linear models. This method is designed to handle situations where both the sample size and the number of covariates are large. Specifically, the proposed method selects relevant covariates by using a sparsity-restricted surrogate likelihood estimator. It takes into account the joint effects of the covariates rather than just the marginal effect, and this characteristic enhances the reliability of the screening results. We establish the sure screening property of the proposed method, which ensures that with a high probability, the true model is included in the selected model. Simulation studies are conducted to evaluate the finite sample performance of the proposed method, and an application to a real dataset showcases its practical utility.
Consider the following generalized linear model (GLM) $y_i = h(x_i^T \beta) + e_i$, $i = 1, 2, \ldots, n$, where $h(\cdot)$ is a continuously differentiable function and $\{e_i\}$ are independent and identically distributed (i.i.d.) random variables with zero mean and known variance $\sigma^2$. Building on the penalized Lq-likelihood method for linear regression models, we apply the method to the GLM and investigate the oracle properties of the penalized Lq-likelihood estimator (PLqE). To show the robustness of the PLqE, we discuss its influence function. Simulation results support the validity of our approach. Furthermore, it is shown that the PLqE is robust, while the penalized maximum likelihood estimator is not.
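The Lq-likelihood replaces the logarithm in the log-likelihood with a power-type transform. A common definition, assumed here because the abstract does not state it, is $L_q(u) = (u^{1-q} - 1)/(1-q)$ for $q \neq 1$ and $L_q(u) = \log u$ for $q = 1$. The sketch below implements this transform; the function name `lq` is hypothetical. It shows that the transform approaches the logarithm as $q \to 1$ and, for $q < 1$, stays bounded below as the density value goes to zero, which is the intuition behind the robustness claim.

```python
import numpy as np

def lq(u, q):
    """Lq transform: (u^(1-q) - 1) / (1 - q) for q != 1, and log(u) for q == 1."""
    u = np.asarray(u, dtype=float)
    if q == 1.0:
        return np.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

densities = np.array([0.5, 2.0])
near_log = lq(densities, 0.9999)   # close to log(densities)
tiny_log = float(np.log(1e-12))    # the log penalizes a tiny density heavily
tiny_lq = float(lq(1e-12, 0.5))    # for q = 0.5 the value never drops below -2
print(near_log, np.log(densities), tiny_log, tiny_lq)
```

Under the log-likelihood, an outlying observation with a very small density contributes an arbitrarily large negative term, while under the Lq transform with $q < 1$ its contribution is bounded below by $-1/(1-q)$.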
This paper focuses on decorrelated empirical likelihood-based inference for longitudinal data with ultrahigh-dimensional covariates. The primary issues we aim to address involve parameter estimation and hypothesis testing for a low-dimensional parameter of interest. Under the framework of the generalized linear model, we initially consider the within-subject correlation by linearizing the precision matrix with certain known matrices, which retains optimality even if the working correlated structure is misspecified. Coupled with the decorrelated matrix, we then eliminate the influence of nuisance parameters on the estimation procedure. The proposed approach not only yields more efficient estimators compared to generalized decorrelated estimating equations but also shares the same asymptotic variance as quadratic decorrelated inference function based methods. Furthermore, we define the decorrelated empirical log-likelihood ratio test statistic to assess the significance of regression coefficients. Finally, to evaluate the performance of the proposed procedure, we conduct simulation studies and apply it to a real data example.
To overcome the constraints that limited computing power imposes on processing massive datasets, we propose an outcome-dependent subsampling divide-and-conquer strategy in this paper. The proposed strategy can process data on multiple blocks in parallel and concentrate the computing resources of each block on regions with the most information. We develop a distributed statistical inference method and propose a computation-efficient algorithm for generalized linear models with massive data. The proposed method only needs to preserve some summary statistics from each data block and then uses them to directly construct the proposed estimator. The asymptotic properties of the proposed method are established. Simulation studies and real data analysis are conducted to illustrate the merits of the proposed method.
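The idea of building an estimator from per-block summary statistics can be sketched with a standard divide-and-conquer scheme. Each block fits a logistic GLM locally and keeps only its estimate and Fisher information, and the blocks are then combined by an information-weighted average. This is a generic stand-in, assumed for illustration: it omits the outcome-dependent subsampling step and is not the estimator proposed in the abstract. The names `fit_logistic` and `combine_blocks` are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic fit; returns the estimate and Fisher information."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = (X.T * (mu * (1.0 - mu))) @ X
        beta = beta + np.linalg.solve(info, X.T @ (y - mu))
    # Recompute the information at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = (X.T * (mu * (1.0 - mu))) @ X
    return beta, info

def combine_blocks(summaries):
    """Information-weighted average of block estimates (summaries only)."""
    total_info = sum(info for _, info in summaries)
    weighted = sum(info @ beta for beta, info in summaries)
    return np.linalg.solve(total_info, weighted)

rng = np.random.default_rng(1)
n, n_blocks = 20000, 10
beta_true = np.array([-0.5, 1.0, -0.75])  # hypothetical coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

# Each block could be processed in parallel; only (beta, info) is retained.
summaries = [fit_logistic(Xb, yb)
             for Xb, yb in zip(np.array_split(X, n_blocks),
                               np.array_split(y, n_blocks))]
beta_dc = combine_blocks(summaries)
beta_full, _ = fit_logistic(X, y)
print(beta_dc, beta_full)
```

In this simulated setting the combined estimate stays close to the full-data fit while no block's raw data needs to be retained after its summary is computed.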
A generalized case-control (GCC) study, like the standard case-control study, leverages outcome-dependent sampling (ODS) to extend to nonbinary responses. We develop a novel, unifying approach for analyzing GCC study data using the recently developed semiparametric extension of the generalized linear model (GLM), which is substantially more robust to model misspecification than existing approaches based on parametric GLMs. For valid estimation and inference, we use a conditional likelihood to account for the biased sampling design. We describe analysis procedures for estimation and inference for the semiparametric GLM under a conditional likelihood, and we discuss problems with estimation and inference under a conditional likelihood when the response distribution is misspecified. We demonstrate the flexibility of our approach over existing ones through extensive simulation studies, and we apply the methodology to an analysis of the Asset and Health Dynamics Among the Oldest Old study, which motivates our research. The proposed approach yields a simple yet versatile solution for handling ODS in a wide variety of possible response distributions and sampling schemes encountered in practice.