Symbolic data analysis has provided several advances in regression models concerning the type of symbolic variable. Owing to the advantages of symbolic polygonal data, this paper introduces a linear regression approach for polygonal data based on generalized linear model theory, which provides a unified method for a broad range of modeling problems with different response types, such as asymmetric continuous and discrete responses. Ordinary polygonal residuals and a way of detecting model inadequacies are presented. Moreover, a goodness-of-fit measure for polygons is also proposed. Experimental results illustrate the usefulness of the proposed approach on synthetic and real polygonal data.
In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aims to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a transfer learning algorithm on GLMs and derive its ℓ1/ℓ2-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that when the target and sources are sufficiently close to each other, these bounds can be improved over those of the classical penalized estimator using only target data, under mild conditions. When it is unknown which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources, and its detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals for each coefficient component, with corresponding theory. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package, glmtrans, which is available on CRAN. Supplementary materials for this article are available online.
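The two-step transfer idea in this abstract can be sketched numerically. The toy below is our own low-dimensional, unpenalized simplification, not the paper's high-dimensional estimator: fit a rough estimate on the pooled source + target data, then estimate a shrunken correction on the target alone; all names, the ridge-shrunken contrast, and the simulated similarity between source and target are assumptions for illustration.

```python
import numpy as np

def lstsq(X, y):
    """Least-squares fit; stands in for a penalized GLM fit."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    """Ridge fit, used to shrink the target-only correction toward zero."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def transfer_fit(X_src, y_src, X_tgt, y_tgt, lam=50.0):
    # Step 1: pooled fit borrows strength from the (similar) source data.
    w = lstsq(np.vstack([X_src, X_tgt]), np.concatenate([y_src, y_tgt]))
    # Step 2: shrink the target-only contrast, reflecting the assumption
    # that source and target coefficients are close.
    delta = ridge(X_tgt, y_tgt - X_tgt @ w, lam)
    return w + delta

rng = np.random.default_rng(0)
beta_tgt = np.array([1.0, -2.0, 0.5])
X_src = rng.normal(size=(500, 3))                      # large, slightly biased source
y_src = X_src @ (beta_tgt + 0.05) + rng.normal(scale=0.1, size=500)
X_tgt = rng.normal(size=(50, 3))                       # small target sample
y_tgt = X_tgt @ beta_tgt + rng.normal(scale=0.1, size=50)

beta_hat = transfer_fit(X_src, y_src, X_tgt, y_tgt)
print(np.round(beta_hat, 2))
```

Without the shrinkage in step 2 the correction would undo the pooling entirely and reduce to a target-only fit; the paper's sparsity penalty on the contrast plays the analogous role in high dimensions.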
This article is concerned with variable selection and estimation for high-dimensional generalized linear models. We introduce a general iteratively reweighted adaptive ridge regression method (GAR) and show that the GAR estimator possesses the oracle property and the grouping effect. A data-driven parameter gamma is introduced in the GAR method to adapt to different structures of the true model, and this adaptive parameter is taken into account to establish a gamma-dependent sufficient condition guaranteeing the oracle property and the grouping effect. Furthermore, to apply the GAR method more efficiently, a coordinate-wise Newton algorithm is employed that avoids both the inverse matrix operation and the numerical instability caused by iteration. Extensive numerical simulation results show that the GAR method outperforms commonly used methods, and the GAR method is further illustrated on a gastric cancer dataset.
Adaptive lasso penalized generalized linear models (GLMs) are a powerful tool for analyzing high-dimensional sparse data where the classical linear or normal assumption is not met. In non-distributed environments, the estimation problem of adaptive lasso penalized GLMs is often solved by the coordinate-descent-based algorithm developed in Friedman, Hastie, and Tibshirani (2010), which is well implemented in the R package glmnet. However, when applied to distributed big data, this algorithm is usually inflexible or even infeasible due to its non-parallel implementation, especially when the communication costs between the central and local machines are expensive, or the storage and computing capabilities of the central machine are insufficient. In this paper, we propose a new method, QAGLM-alasso, for the adaptive lasso penalized GLM problem in distributed big data by applying the quadratic approximation representation of GLMs, and further develop a path-following algorithm for its estimation based on Least Angle Regression (LARS). Theoretical analyses show that, under mild regularity conditions, QAGLM-alasso enjoys the oracle property, and the obtained estimator is asymptotically equivalent to the original adaptive lasso. Simulation studies demonstrate that the new algorithm achieves estimation accuracy similar to that of glmnet but is significantly faster in distributed environments. We further illustrate the practical performance of the proposed method by analyzing a supersymmetric (SUSY) benchmark data set.
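The adaptive lasso mentioned above can be sketched in its simplest (Gaussian-response) form: compute data-driven weights from an initial fit, then run coordinate descent with soft-thresholding. This is a minimal illustration of the general technique, not the paper's distributed QAGLM-alasso algorithm; the simulated design, the penalty level, and the OLS initializer are our own choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the building block of lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adaptive_lasso(X, y, lam, n_iter=200):
    n, p = X.shape
    # Adaptive weights: large penalty on coefficients whose initial estimate is small.
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
    w = 1.0 / (np.abs(beta_init) + 1e-8)
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual
            beta[j] = soft_threshold(X[:, j] @ r_j, n * lam * w[j]) / col_ss[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

beta_hat = adaptive_lasso(X, y, lam=0.05)
print(np.round(beta_hat, 2))
```

The heavily weighted penalty sets the truly zero coefficients exactly to zero while leaving the large coefficients nearly unshrunk, which is the mechanism behind the oracle property the abstract refers to.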
This article provides methods for flexibly capturing unobservable heterogeneity from longitudinal data in the context of an exponential family of distributions. The group memberships of individual units are left unspecified, and their heterogeneity is influenced by group-specific unobservable factor structures. The model includes, as special cases, probit, logit, and Poisson regressions with interactive fixed effects along with unknown group membership. We discuss a computationally efficient estimation method and derive the corresponding asymptotic theory. Uniform consistency of the estimated group membership is established. To test heterogeneous regression coefficients within groups, we propose a Swamy-type test that allows for unobserved heterogeneity. We apply the proposed method to the study of market structure of the taxi industry in New York City. Our method unveils interesting and important insights from large-scale longitudinal data that consist of over 450 million data points.
In this paper, we consider the application of penalized empirical likelihood to high-dimensional generalized linear models with longitudinal data. Under regularity conditions, it is shown that the penalized empirical likelihood has the oracle property: with probability converging to one, it identifies the true model and estimates the nonzero coefficients as efficiently as if the sparsity of the true model were known in advance. We also show that the asymptotic distribution of the penalized empirical likelihood ratio test statistic is chi-squared. Simulations and a real data analysis illustrate the proposed method.
In this paper we develop an online statistical inference approach for high-dimensional generalized linear models with streaming data, for real-time estimation and inference. We propose an online debiased lasso method that aligns with the data collection scheme of streaming data. Online debiased lasso differs from offline debiased lasso in two important aspects. First, it updates component-wise confidence intervals of regression coefficients using only summary statistics of the historical data. Second, it adds an additional term to correct the approximation errors accumulated throughout the online updating procedure. We show that the proposed online debiased estimators in generalized linear models are asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of the proposed online debiased lasso method; they demonstrate the effectiveness of our algorithm and support the theoretical results. Furthermore, we illustrate the application of our method with a high-dimensional text dataset.
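The "summary statistics of the historical data" idea can be illustrated in its simplest setting, least squares, where the sufficient statistics X'X and X'y are accumulated batch by batch so the estimate is updated without revisiting raw historical data. This toy omits the debiasing correction and the lasso penalty that are the paper's actual contributions; the class name and simulated stream are illustrative.

```python
import numpy as np

class OnlineLS:
    """Running least squares from accumulated summary statistics only."""

    def __init__(self, p):
        self.XtX = np.zeros((p, p))
        self.Xty = np.zeros(p)

    def update(self, X, y):
        # Each streaming batch contributes only its summary statistics;
        # the raw (X, y) can then be discarded.
        self.XtX += X.T @ X
        self.Xty += X.T @ y

    def estimate(self):
        return np.linalg.solve(self.XtX, self.Xty)

rng = np.random.default_rng(3)
beta = np.array([1.0, -1.0])
model = OnlineLS(2)
for _ in range(10):                       # ten streaming batches of 100 rows
    X = rng.normal(size=(100, 2))
    model.update(X, X @ beta + rng.normal(scale=0.1, size=100))

print(np.round(model.estimate(), 2))
```

After all ten batches the estimate matches the full-data fit exactly, which is what makes summary-statistic updating attractive when historical data cannot be stored.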
Outcome-dependent sampling (ODS) is a commonly used class of sampling designs to increase estimation efficiency in settings where response information (and possibly adjuster covariates) is available, but the exposure is expensive and/or cumbersome to collect. We focus on ODS within the context of a two-phase study, where in Phase One the response and adjuster covariate information is collected on a large cohort that is representative of the target population, but the expensive exposure variable is not yet measured. In Phase Two, using response information from Phase One, we selectively oversample a subset of informative subjects in whom we collect expensive exposure information. Importantly, the Phase Two sample is no longer representative, and we must use ascertainment-correcting analysis procedures for valid inferences. In this paper, we focus on likelihood-based analysis procedures, particularly a conditional-likelihood approach and a full-likelihood approach. Whereas the full-likelihood retains incomplete Phase One data for subjects not selected into Phase Two, the conditional-likelihood explicitly conditions on Phase Two sample selection (i.e., it is a "complete case" analysis procedure). These designs and analysis procedures are typically implemented assuming a known, parametric model for the response distribution. However, in this paper, we approach analyses implementing a novel semi-parametric extension to generalized linear models (SPGLM) to develop likelihood-based procedures with improved robustness to misspecification of distributional assumptions. We specifically focus on the common setting where standard GLM distributional assumptions are not satisfied (e.g., a misspecified mean/variance relationship). We aim to provide practical design guidance and flexible tools for practitioners in these settings.
Change point detection for high-dimensional data is an important yet challenging problem for many applications. In this article, we consider multiple change point detection in the context of high-dimensional generalized linear models, allowing the covariate dimension p to grow exponentially with the sample size n. The model considered is general and flexible in the sense that it covers various specific models as special cases. It can automatically account for the underlying data generation mechanism without specifying any prior knowledge about the number of change points. Based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points, allowing the number of change points to grow with n. To further improve the computational efficiency, a more efficient algorithm designed for the case of a single change point is proposed. We present theoretical properties of our proposed algorithms, including estimation consistency for the number and locations of change points as well as consistency and asymptotic distributions for the underlying regression coefficients. Finally, extensive simulation studies and application to the Alzheimer's Disease Neuroimaging Initiative data further demonstrate the competitive performance of our proposed methods.
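A single-change-point search of the kind this abstract builds on can be sketched with the classical CUSUM statistic for a shift in mean: scan all candidate split points and keep the one maximizing the scaled difference of segment means. This is a toy stand-in for the paper's GLM-based dynamic-programming and binary-segmentation algorithms; the Gaussian mean-shift setting and all names are our own simplification.

```python
import numpy as np

def cusum_change_point(y):
    """Return the split point maximizing the CUSUM statistic, and the statistic."""
    n = len(y)
    best_t, best_stat = None, -np.inf
    for t in range(1, n):
        left, right = y[:t], y[t:]
        # Scaled difference of segment means; large values indicate a change at t.
        stat = np.sqrt(t * (n - t) / n) * abs(left.mean() - right.mean())
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat

rng = np.random.default_rng(2)
# Mean shifts from 0 to 2 at index 100.
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
t_hat, stat = cusum_change_point(y)
print(t_hat)
```

Binary segmentation applies this search recursively to the two resulting segments, which is how the multiple-change-point algorithms in the abstract extend the single-change-point case.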
In a clinical trial, the responses to the new treatment may vary among patient subsets with different characteristics in a biomarker. It is often necessary to examine whether there is a cutpoint for the biomarker that divides the patients into two subsets of those with more favourable and less favourable responses. More generally, we approach this problem as a test of homogeneity in the effects of a set of covariates in generalized linear regression models. The unknown cutpoint results in a model with nonidentifiability and a nonsmooth likelihood function to which the ordinary likelihood methods do not apply. We first use a smooth continuous function to approximate the indicator function defining the patient subsets. We then propose a penalized likelihood ratio test to overcome the model irregularities. Under the null hypothesis, we prove that the asymptotic distribution of the proposed test statistic is a mixture of chi-squared distributions. Our method is based on established asymptotic theory, is simple to use, and works in a general framework that includes logistic, Poisson, and linear regression models. In extensive simulation studies, we find that the proposed test works well in terms of size and power. We further demonstrate the use of the proposed method by applying it to clinical trial data from the Digitalis Investigation Group (DIG) on heart failure.
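The smoothing step described above can be illustrated directly: replace the indicator 1{x > c} defining the patient subsets with a sigmoid of bandwidth h. Away from the cutpoint the approximation error vanishes as h shrinks; the cutpoint c = 1.0, the grid, and the bandwidths below are illustrative choices, and the sigmoid is one of several smooth approximations one could use.

```python
import numpy as np

def indicator(x, c):
    """Hard subset membership 1{x > c}."""
    return (x > c).astype(float)

def smooth_indicator(x, c, h):
    """Sigmoid approximation to 1{x > c} with bandwidth h."""
    z = np.clip((x - c) / h, -60.0, 60.0)   # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-2.0, 2.0, 401)
c = 1.0
mask = np.abs(x - c) >= 0.25                # measure error away from the cutpoint
errs = [np.max(np.abs(smooth_indicator(x, c, h) - indicator(x, c))[mask])
        for h in (0.5, 0.1, 0.02)]
print([round(float(e), 4) for e in errs])
```

The approximation is smooth in c, which is what restores a differentiable likelihood and lets standard asymptotic machinery apply, at the price of the penalization the abstract introduces to control the remaining irregularity.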