In generalized linear models (GLMs), measures of lack of fit are typically defined as the deviance between two nested models, and a deviance-based R² is commonly used to evaluate the fit. In this paper, we extend deviance measures to mixtures of GLMs, whose parameters are estimated by maximum likelihood (ML) via the EM algorithm. Such measures are defined both locally, i.e., at the cluster level, and globally, i.e., with reference to the whole sample. At the cluster level, we propose a normalized two-term decomposition of the local deviance into explained and unexplained local deviances. At the sample level, we introduce an additive normalized decomposition of the total deviance into three terms, where each evaluates a different aspect of the fitted model: (1) the cluster separation on the dependent variable, (2) the proportion of the total deviance explained by the fitted model, and (3) the proportion of the total deviance which remains unexplained. We use both local and global decompositions to define, respectively, local and overall deviance R² measures for mixtures of GLMs, which we illustrate, for Gaussian, Poisson, and binomial responses, by means of a simulation study. The proposed fit measures are then used to assess and interpret clusters of COVID-19 spread in Italy at two time points.
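As a concrete (and much simplified) illustration of the decomposition, the following sketch computes local and overall deviance R² terms for a two-component mixture of Gaussian regressions, where the deviance reduces to weighted sums of squares. The synthetic data, hard responsibilities, and variable names are ours, not the paper's; a real application would take the responsibilities from an EM fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
lab = rng.integers(0, 2, size=n)                   # latent cluster labels
y = np.where(lab == 0, 1 + 2 * x, -1 - 2 * x) + rng.normal(scale=0.5, size=n)

# Hard responsibilities (a real fit would use EM posteriors).
z = np.column_stack([lab == 0, lab == 1]).astype(float)
X = np.column_stack([np.ones(n), x])

r2_local, dev_expl, dev_unexpl = [], 0.0, 0.0
for k in range(2):
    w = z[:, k]
    sw = np.sqrt(w)
    beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    ybar_k = np.average(y, weights=w)
    d_tot = np.sum(w * (y - ybar_k) ** 2)          # local total deviance
    d_res = np.sum(w * (y - X @ beta) ** 2)        # local unexplained deviance
    r2_local.append(1 - d_res / d_tot)             # local deviance R^2
    dev_expl += d_tot - d_res
    dev_unexpl += d_res

# Sample-level: total deviance = separation + explained + unexplained.
d_total = np.sum((y - y.mean()) ** 2)
sep = d_total - dev_expl - dev_unexpl              # between-cluster term
print("local R^2:", np.round(r2_local, 3))
print("shares -> separation: %.3f, explained: %.3f, unexplained: %.3f"
      % (sep / d_total, dev_expl / d_total, dev_unexpl / d_total))
```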
Predictive classification considered in this paper concerns the problem of identifying subgroups based on a continuous biomarker through estimation of an unknown cutpoint, and assessing whether these subgroups differ in treatment effect relative to some clinical outcome. The problem is considered under a generalized linear model framework for clinical outcomes and formulated as testing the significance of the interaction between the treatment and the subgroup indicator. When the main effect of the subgroup indicator does not exist, the cutpoint is non-identifiable under the null. Existing procedures are not adaptive to the identifiability issue and do not work well when the main effect is small. In this work, we propose profile score-type and Wald-type test statistics, and further m-out-of-n bootstrap techniques to obtain their critical values. The proposed procedures do not rely on knowledge about the model's identifiability, and we establish their asymptotic size validity and study the power under local alternatives in both cases. Further, we show that the standard bootstrap is inconsistent in the non-identifiable case. Simulation results corroborate our theory, and the proposed method is applied to a dataset from a clinical trial on advanced colorectal cancer.
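To make the testing idea concrete, here is a toy sketch of a profile (sup-over-cutpoints) Wald-type statistic with an m-out-of-n bootstrap critical value. It is our simplification of the procedure, using a Gaussian outcome, an ad hoc cutpoint grid, and m = n^0.8; the paper's statistics and bootstrap calibration are more refined.

```python
import numpy as np
import statsmodels.api as sm

def sup_wald(y, trt, x, grid):
    """Profile Wald statistic: maximize over candidate cutpoints."""
    stats = []
    for c in grid:
        g = (x > c).astype(float)
        design = sm.add_constant(np.column_stack([trt, g, trt * g]))
        fit = sm.OLS(y, design).fit()
        stats.append((fit.params[3] / fit.bse[3]) ** 2)   # interaction Wald
    return max(stats)

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(size=n)
trt = rng.integers(0, 2, size=n).astype(float)
y = 0.5 * trt + rng.normal(size=n)   # null: no subgroup, cutpoint non-identifiable

grid = np.quantile(x, np.linspace(0.1, 0.9, 17))
T_obs = sup_wald(y, trt, x, grid)

m = int(n ** 0.8)                    # subsample size m = o(n)
boot = []
for _ in range(200):
    idx = rng.choice(n, size=m, replace=True)             # m-out-of-n resample
    boot.append(sup_wald(y[idx], trt[idx], x[idx], grid))
print("sup-Wald: %.2f, m-out-of-n 95%% critical value: %.2f"
      % (T_obs, np.quantile(boot, 0.95)))
```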
ISBN:
(Print) 9783031686276; 9783031686283
This paper provides a comprehensive analysis of three distinct generalized linear models (GLMs): traditional linear regression, the Poisson GLM, and the Poisson-Inverse Gaussian GLM. The study applies these models to product demand modeling in Supply Chain Management. To evaluate goodness of fit, we compare the models' performance via their associated deviance functions. Our findings indicate that the Poisson-Inverse Gaussian GLM outperforms both the Poisson GLM and the linear regression model in terms of goodness of fit.
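A minimal sketch of deviance-based comparison with statsmodels, on invented demand data. Note that statsmodels has no built-in Poisson-Inverse Gaussian family, so only the Gaussian and Poisson fits are shown; each deviance is measured against its own family's saturated model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))       # e.g., price and promotion
demand = rng.poisson(np.exp(0.5 + X[:, 1] - 0.5 * X[:, 2]))

gauss = sm.GLM(demand, X, family=sm.families.Gaussian()).fit()
pois = sm.GLM(demand, X, family=sm.families.Poisson()).fit()

# Each deviance measures lack of fit against that family's saturated model;
# smaller is better within a family.
print("Gaussian deviance:", round(gauss.deviance, 1))
print("Poisson deviance: ", round(pois.deviance, 1))
# A Poisson-Inverse Gaussian fit would require a dedicated package or a
# custom likelihood; its deviance would be compared in the same way.
```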
Generalized additive models (GAMs) are a leading model class for interpretable machine learning. GAMs were originally defined with smooth shape functions of the predictor variables and trained using smoothing splines. Recently, tree-based GAMs, where shape functions are gradient-boosted ensembles of bagged trees, were proposed, leaving the door open for the estimation of a broader class of shape functions (e.g., the Explainable Boosting Machine (EBM)). In this paper, we introduce a competing three-step GAM learning approach that combines (i) the knowledge of how to split the covariate space brought by an additive tree model (ATM), (ii) an ensemble of predictive linear scores derived from generalized linear models (GLMs) using a binning strategy based on the ATM, and (iii) a final GLM that ensures the prediction model is auto-calibrated. Numerical experiments illustrate the competitive performance of our approach on several datasets compared to GAMs with splines, EBM, and GLMs with binarsity penalization. A case study in trade credit insurance is also provided.
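The following is our rough reading of the three steps, compressed into a runnable sketch: per-covariate tree splits stand in for the ATM, a Poisson GLM on one-hot bins provides the additive scores, and a final one-dimensional GLM on the score approximates the auto-calibration step. Data, depths, and penalties are invented; requires scikit-learn >= 1.2 for the sparse_output flag.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(3)
n = 2000
X = rng.uniform(-2, 2, size=(n, 3))
y = rng.poisson(np.exp(0.3 * X[:, 0] ** 2 - 0.5 * np.abs(X[:, 1])))

# (i) per-covariate split points from shallow trees (ATM stand-in).
leaf_ids = [DecisionTreeRegressor(max_leaf_nodes=5, random_state=0)
            .fit(X[:, [j]], y).apply(X[:, [j]]) for j in range(X.shape[1])]
B = np.column_stack(leaf_ids)

# (ii) GLM on one-hot bins -> additive piecewise-constant shape functions.
Z = OneHotEncoder(sparse_output=False).fit_transform(B)
glm = PoissonRegressor(alpha=1e-4, max_iter=1000).fit(Z, y)
score = glm.predict(Z)

# (iii) final one-dimensional GLM on the log-score for auto-calibration.
cal = PoissonRegressor(alpha=0.0).fit(np.log(score).reshape(-1, 1), y)
pred = cal.predict(np.log(score).reshape(-1, 1))
print("mean prediction %.3f vs mean response %.3f" % (pred.mean(), y.mean()))
```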
For massive data, subsampling techniques are popular to mitigate the computational burden by reducing the data size. In a subsampling approach, subsampling probabilities for each data point are specified to obtain an informative sub-dataset, and estimates based on the sub-data are then used to approximate the full-data estimates. Assigning subsampling probabilities by minimizing the asymptotic mean squared error of the estimator from a general subsample (the A-optimality criterion) is a popular approach; however, calculating the probabilities under this setting is still computationally demanding. To efficiently approximate the A-optimal subsampling probabilities for generalized linear models, randomized algorithms are proposed. To develop the algorithms, the Johnson-Lindenstrauss Transform and the Subsampled Randomized Hadamard Transform are used. Additionally, optimal subsampling probabilities are derived for the Gaussian linear model in the case where both the regression coefficients and the dispersion parameter are of interest, and algorithms are developed to approximate them. Simulation studies indicate that estimators based on the developed algorithms perform excellently for statistical inference and yield substantial savings in computing time compared to direct calculation of the A-optimal subsampling probabilities.
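As a sketch of the main idea, the snippet below approximates A-optimal subsampling probabilities for logistic regression, pi_i proportional to |y_i - p_i| * ||M^{-1} x_i||, and uses a Johnson-Lindenstrauss sketch to avoid computing M^{-1} x_i exactly for every point. The pilot estimate, sketch size, and constants are placeholders, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 100_000, 20, 5
X = rng.normal(size=(n, p))
beta_pilot = rng.normal(scale=0.1, size=p)          # pilot estimate (assumed given)
prob = 1 / (1 + np.exp(-X @ beta_pilot))
y = rng.binomial(1, prob)

w = prob * (1 - prob)
M = (X * w[:, None]).T @ X / n                      # observed information
Minv = np.linalg.inv(M)

# JL step: replace ||Minv @ x_i|| with ||(S @ Minv) @ x_i|| for a small
# r x p Gaussian sketch S, cutting the per-point cost from O(p^2) to O(rp).
S = rng.normal(size=(r, p)) / np.sqrt(r)
G = S @ Minv
norms = np.linalg.norm(X @ G.T, axis=1)             # approx ||Minv x_i||

pi = np.abs(y - prob) * norms                       # A-optimality recipe
pi /= pi.sum()
idx = rng.choice(n, size=2000, replace=True, p=pi)  # informative subsample
print("distinct points in subsample:", len(np.unique(idx)))
```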
ISBN:
(Print) 9798400703836
We consider the sparsification of sums $F : \mathbb{R}^n \to \mathbb{R}_+$ where $F(x) = f_1(\langle a_1, x\rangle) + \cdots + f_m(\langle a_m, x\rangle)$ for vectors $a_1, \ldots, a_m \in \mathbb{R}^n$ and functions $f_1, \ldots, f_m : \mathbb{R} \to \mathbb{R}_+$. We show that $(1+\epsilon)$-approximate sparsifiers of $F$ with support size $\frac{n}{\epsilon^2} (\log \frac{n}{\epsilon})^{O(1)}$ exist whenever the functions $f_1, \ldots, f_m$ are symmetric, monotone, and satisfy natural growth bounds. Additionally, we give efficient algorithms to compute such a sparsifier assuming each $f_i$ can be evaluated efficiently. Our results generalize the classical case of $\ell_p$ sparsification, where $f_i(z) = |z|^p$ for $p \in (0, 2]$, and give the first near-linear size sparsifiers in the well-studied setting of the Huber loss function and its generalizations, e.g., $f_i(z) = \min\{|z|^p, |z|^2\}$ for $0 < p \le 2$. Our sparsification algorithm can be applied to give near-optimal reductions for optimizing a variety of generalized linear models, including $\ell_p$ regression for $p \in (1, 2]$ to high accuracy, via solving $(\log n)^{O(1)}$ sparse regression instances with $m \le n (\log n)^{O(1)}$, plus runtime proportional to the number of nonzero entries in the vectors $a_1, \ldots, a_m$.
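For intuition, here is a toy importance-sampling sparsifier in the classical $\ell_2$ special case $f_i(z) = z^2$, where the right sampling weights are the leverage scores; the paper's contribution is extending such guarantees to far more general $f_i$. The support-size formula used below is a heuristic stand-in for the paper's bound.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, eps = 5000, 30, 0.25
A = rng.normal(size=(m, n))

# Leverage scores tau_i = a_i^T (A^T A)^{-1} a_i bound each term's maximal
# share of F, so sampling proportionally preserves F(x) for all x w.h.p.
Q, _ = np.linalg.qr(A)
tau = np.sum(Q ** 2, axis=1)
s = int(np.ceil(n * np.log(n) / eps ** 2))          # heuristic support size
p = np.minimum(1.0, s * tau / tau.sum())
keep = rng.uniform(size=m) < p
weights = 1.0 / p[keep]                             # inverse-probability reweighting

x = rng.normal(size=n)
F_full = np.sum((A @ x) ** 2)
F_sparse = np.sum(weights * (A[keep] @ x) ** 2)
print("kept %d of %d terms, relative error %.4f"
      % (keep.sum(), m, abs(F_sparse - F_full) / F_full))
```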
We study the problem of recovering an unknown signal $x$ given measurements obtained from a generalized linear model with a Gaussian sensing matrix. Two popular solutions are based on a linear estimator $\hat{x}^{\rm L}$ and a spectral estimator $\hat{x}^{\rm s}$. The former is a data-dependent linear combination of the columns of the measurement matrix, and its analysis is quite simple. The latter is the principal eigenvector of a data-dependent matrix, and a recent line of work has studied its performance. In this paper, we show how to optimally combine $\hat{x}^{\rm L}$ and $\hat{x}^{\rm s}$. At the heart of our analysis is the exact characterization of the empirical joint distribution of $(x, \hat{x}^{\rm L}, \hat{x}^{\rm s})$ in the high-dimensional limit. This allows us to compute the Bayes-optimal combination of $\hat{x}^{\rm L}$ and $\hat{x}^{\rm s}$, given the limiting distribution of the signal $x$. When the distribution of the signal is Gaussian, the Bayes-optimal combination has the form $\theta\,\hat{x}^{\rm L} + \hat{x}^{\rm s}$, and we derive the optimal combination coefficient $\theta$. In order to establish the limiting distribution of $(x, \hat{x}^{\rm L}, \hat{x}^{\rm s})$, we design and analyze an approximate message passing algorithm whose iterates give $\hat{x}^{\rm L}$ and approach $\hat{x}^{\rm s}$. Numerical simulations demonstrate the improvement of the proposed combination with respect to the two methods considered separately.
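The sketch below illustrates the two estimators and their combination on ReLU-type measurements $y_i = \max(\langle a_i, x\rangle, 0)$, a GLM where both estimators carry signal. The paper derives the Bayes-optimal combination coefficient analytically; the oracle grid search here is only to show that a combination can beat either estimator alone.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 200, 1600
x_true = rng.normal(size=n)
x_true /= np.linalg.norm(x_true)
A = rng.normal(size=(m, n))
y = np.maximum(A @ x_true, 0.0)                     # ReLU link, no extra noise

x_lin = A.T @ y / m                                 # linear estimator

D = (A * y[:, None]).T @ A / m                      # spectral matrix, T(y) = y
x_spec = np.linalg.eigh(D)[1][:, -1]                # principal eigenvector
x_spec *= np.sign(x_spec @ x_lin)                   # resolve global sign

def overlap(v):
    return abs(v @ x_true) / np.linalg.norm(v)

x_lin_u = x_lin / np.linalg.norm(x_lin)
best = max((overlap(t * x_lin_u + x_spec), t) for t in np.linspace(0, 3, 61))
print("overlap -> linear: %.3f, spectral: %.3f, combined: %.3f (theta = %.2f)"
      % (overlap(x_lin), overlap(x_spec), best[0], best[1]))
```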
In this paper, we study the problem of estimating smooth generalized linear models (GLMs) in the Non-interactive Local Differential Privacy (NLDP) model. Unlike its classical setting, our model allows the server to access additional public but unlabeled data. In the first part of the paper, we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. sampled from a zero-mean multivariate Gaussian distribution. Motivated by Stein's lemma, we present an $(\epsilon, \delta)$-NLDP algorithm for GLMs. Moreover, the sample complexity of public and private data for the algorithm to achieve an $\ell_2$-norm estimation error of $\alpha$ (with high probability) is $O(p\alpha^{-2})$ and $\tilde{O}(p^3\alpha^{-2}\epsilon^{-2})$, respectively, where $p$ is the dimension of the feature vector. This is a significant improvement over the previously known sample complexities of GLMs with no public data, which are exponential or quasi-polynomial in $\alpha^{-1}$, or exponential in $p$. Then we consider a more general setting where each data record is i.i.d. sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Based on a variant of Stein's lemma, we propose an $(\epsilon, \delta)$-NLDP algorithm for GLMs whose sample complexity of public and private data to achieve an $\ell_\infty$-norm estimation error of $\alpha$ is $O(p^2\alpha^{-2})$ and $\tilde{O}(p^2\alpha^{-2}\epsilon^{-2})$, respectively, under some mild assumptions and provided that $\alpha$ is not too small, i.e., $\alpha \ge \Omega(1/\sqrt{p})$. In the second part of the paper, we extend our idea to the problem of estimating non-linear regressions and show similar results as in GLMs for both the multivariate Gaussian and sub-Gaussian cases. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets. To the best of our knowledge, this is the first paper showing the existence of efficient and effective algorithms for GLMs and non-linear regressions in the NLDP model with unlabeled public data.
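A highly simplified sketch of the Gaussian-case idea: by Stein's lemma, $E[yx]$ is proportional to $w^*$ when $x$ is standard Gaussian, so each user can release a clipped, Gaussian-perturbed $y_i x_i$ once (non-interactively) and the server simply averages. The clipping bound and noise calibration below are ad hoc, and the role of public unlabeled data (covariance estimation and normalization) is omitted.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, eps, delta = 200_000, 10, 1.0, 1e-5
w_star = rng.normal(size=p)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, p))                          # private features
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ w_star)))).astype(float)

# Each user clips y_i * x_i and adds Gaussian-mechanism noise, once.
C = 8.0                                              # ad hoc clip (||x|| ~ sqrt(p))
Z = X * y[:, None]
row_norms = np.maximum(np.linalg.norm(Z, axis=1), 1e-12)
Z *= np.minimum(1.0, C / row_norms)[:, None]
sigma = C * np.sqrt(2 * np.log(1.25 / delta)) / eps
reports = Z + rng.normal(scale=sigma, size=Z.shape)  # one noisy report per user

# Stein's lemma: E[y x] = c * w_star for standard Gaussian x, so the
# average of the reports estimates the direction of w_star.
w_hat = reports.mean(axis=0)
w_hat /= np.linalg.norm(w_hat)
print("cosine(w_hat, w_star) = %.3f" % (w_hat @ w_star))
```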
Post-selection inference has been an active research topic recently, with much work providing ways to solve practical problems in fields such as medicine and finance. In particular, post-selection inference under the linear model is widely discussed. We extend it to generalized linear models and present new approaches to post-selection inference for the penalized least squares method. The core of this framework is the distribution function of the post-selection estimator conditioned on the selection event. Lasso and elastic net are then used to select models and construct valid confidence intervals for the selected coefficients. Theoretical results and numerical comparisons show that our methods outperform existing ones. Finally, the proposed methods are applied to the analysis of real data sets.
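To make "conditioning on the selection event" concrete, here is a deliberately naive Monte Carlo illustration for the lasso on a Gaussian linear model: the null distribution of a coefficient statistic is formed only from resampled datasets in which the lasso selects the same variable again. The papers work with exact conditional distribution functions instead of this crude rejection sampler.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, p, sigma = 100, 10, 1.0
X = rng.normal(size=(n, p))
y = rng.normal(scale=sigma, size=n)                  # global null: beta = 0

def selected(y_vec):
    return set(np.flatnonzero(Lasso(alpha=0.1).fit(X, y_vec).coef_))

sel = selected(y)
if sel:
    j = min(sel)
    xj = X[:, j]
    stat = xj @ y / (xj @ xj)                        # marginal coefficient for X_j
    # Conditional null law given {j selected}: rejection-sample null data
    # sets, keeping only those where the lasso selects j again.
    null_stats = []
    while len(null_stats) < 300:
        y0 = rng.normal(scale=sigma, size=n)
        if j in selected(y0):
            null_stats.append(xj @ y0 / (xj @ xj))
    pval = np.mean(np.abs(null_stats) >= abs(stat))
    print("selected:", sorted(sel), "| conditional p-value for X_%d: %.3f" % (j, pval))
else:
    print("lasso selected no variables; nothing to test")
```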
We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, $y = g(w^* \cdot x) + \xi + \epsilon$, where $\xi$ is the oblivious noise drawn independently of $x$, satisfying $\Pr[\xi = 0] \ge o(1)$, and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Our goal is to accurately recover a function $g(w \cdot x)$ with arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles the problem in its most general distribution-independent setting, where the solution may not be identifiable. The algorithm is designed to return the solution if it is identifiable, and otherwise return a small list of candidates, one of which is close to the true solution. Furthermore, we characterize a necessary and sufficient condition for identifiability, which holds in broad settings. The problem is identifiable when the quantile at which $\xi + \epsilon = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first result for GLM regression which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression with oblivious noise, giving algorithms under more restrictive assumptions.
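A toy sketch of the identifiable case: when the quantile $q$ at which $\xi + \epsilon = 0$ is known, $w^*$ can be recovered by pinball (quantile) regression of $y$ on $g(w \cdot x)$, even with more than half the samples corrupted. The link, noise model, and optimizer below are our choices; the paper's algorithm also handles the unknown-quantile, list-decoding regime.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n, p = 5000, 3
g = np.tanh                                          # known link function
w_star = np.array([1.0, -1.5, 0.5])
X = rng.normal(size=(n, p))
# Oblivious noise: zero with prob 0.4, otherwise a large positive shift,
# so 60% of the samples are corrupted.
xi = np.where(rng.uniform(size=n) < 0.4, 0.0, 5.0 * rng.exponential(size=n))
y = g(X @ w_star) + xi + rng.normal(scale=0.1, size=n)

q = 0.2                                              # known: P(xi + eps <= 0)
def pinball(w):
    r = y - g(X @ w)
    return np.mean(np.maximum(q * r, (q - 1) * r))

w_hat = minimize(pinball, np.zeros(p), method="Nelder-Mead").x
print("w_star:", w_star, "| w_hat:", np.round(w_hat, 2))
```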