In this paper, we investigate robust parameter estimation and variable selection for binary regression models with grouped data. We study estimation procedures based on the minimum-distance approach. In particular, we employ the minimum Hellinger and minimum symmetric chi-squared distance criteria and propose regularized minimum-distance estimators. These estimators appear to possess a certain degree of automatic robustness against model misspecification and potential outliers. We show that the proposed non-penalized and penalized minimum-distance estimators are efficient under the model and simultaneously have excellent robustness properties. We study their asymptotic properties, including consistency, asymptotic normality and oracle properties. Using Monte Carlo studies, we examine the small-sample and robustness properties of the proposed estimators and compare them with traditional likelihood estimators. We also study two real-data applications to illustrate our methods. The numerical studies indicate satisfactory finite-sample performance of our procedures.
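To make the minimum-distance idea concrete, here is a minimal sketch of unpenalized minimum Hellinger distance fitting for grouped binary data. It is an illustration under stated assumptions, not the authors' procedure: the single-slope logistic model, the weighting of each group by its number of trials, the toy data and the grid search are all hypothetical choices made for this example.

```python
import math

def logistic(eta):
    """Logistic response function."""
    return 1.0 / (1.0 + math.exp(-eta))

def hellinger_sq(p_hat, pi):
    """Squared Hellinger distance between Bernoulli(p_hat) and Bernoulli(pi)."""
    return 0.5 * ((math.sqrt(p_hat) - math.sqrt(pi)) ** 2
                  + (math.sqrt(1.0 - p_hat) - math.sqrt(1.0 - pi)) ** 2)

def objective(beta, xs, successes, trials):
    """Sum over groups of the distance between the observed proportion and
    the model probability. Weighting by group size is an assumption."""
    total = 0.0
    for x, y, n in zip(xs, successes, trials):
        total += n * hellinger_sq(y / n, logistic(beta * x))
    return total

def fit_slope(xs, successes, trials, grid):
    """Crude grid search for the slope that minimizes the objective."""
    return min(grid, key=lambda b: objective(b, xs, successes, trials))

# Hypothetical grouped data: one covariate value per group,
# with counts chosen to be close to a logistic model with slope 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
trials = [50, 50, 50, 50, 50]
successes = [6, 13, 25, 37, 44]

grid = [i / 100.0 for i in range(0, 301)]
beta_hat = fit_slope(xs, successes, trials, grid)
print(f"minimum Hellinger distance slope estimate: {beta_hat:.2f}")
```

A penalized version would add a penalty term on the coefficients to `objective`; the specific penalty used in the paper is not stated in the abstract.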
We consider finite mixtures of generalized linear models with binary output. We prove that cross moments (between the output and the regression variables) up to order three are sufficient to identify all parameters of the model. We propose a least-squares estimation method based on those moments and we prove the consistency and the Gaussian asymptotic behavior of the estimator. We provide simulation results and comparisons with likelihood methods. Numerical experiments were conducted using the R-package morpheus that we developed for our least-squares moment method and with the R-package flexmix for likelihood methods. We then give some possible extensions to finite mixtures of regressions with binary output including both continuous and categorical covariates, and possibly longitudinal data.
For a regression problem with a binary label response, we examine the problem of constructing confidence intervals for the label probability conditional on the features. In a setting where we do not have any information about the underlying distribution, we would ideally like to provide confidence intervals that are distribution-free, that is, valid with no assumptions on the distribution of the data. Our results establish an explicit lower bound on the length of any distribution-free confidence interval, and construct a procedure that can approximately achieve this length. In particular, this lower bound is independent of the sample size and holds for all distributions with no point masses, meaning that it is not possible for any distribution-free procedure to be adaptive with respect to any type of special structure in the distribution.
Background: In the last few decades, accumulating experimental research has verified the important roles of microRNAs (miRNAs) in the development of human complex diseases. Benefitting from the rapid growth both in the availability of miRNA-related data and in the development of analysis methodologies, a number of computational models have recently been developed to predict human disease-related miRNAs, efficiently and *** this work, we proposed a computational model of Random Walk and Binary Regression-based MiRNA-Disease Association prediction (RWBRMDA). RWBRMDA extracted features for each miRNA from a random walk with restart on the integrated miRNA similarity network and used them in a binary logistic regression to predict potential miRNA-disease associations. RWBRMDA obtained an AUC of 0.8076 in leave-one-out cross validation. Additionally, we carried out three different kinds of case studies on four human complex diseases. Specifically, Esophageal cancer and Prostate cancer were examined in one kind of case study based on known miRNA-disease associations in the HMDD v2.0 database. Out of the top 50 predicted miRNAs, 94% and 90%, respectively, were confirmed by recent experimental reports. To simulate a new disease without known related miRNAs, the information on known Breast cancer-related miRNAs was removed. As a result, 98% of the top 50 predicted miRNAs for Breast cancer were confirmed. Lymphoma, for which the verified ratio was 88%, was used to assess the prediction robustness of RWBRMDA based on the association records in HMDD v1.0 ***anticipated that RWBRMDA could benefit future experimental investigations of the relation between human diseases and miRNAs by generating promising and testable top-ranked miRNAs, and by significantly reducing the effort and cost of identification work.
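The random walk with restart step mentioned above can be sketched as follows. This is a generic illustration, not the RWBRMDA implementation: the four-node similarity network, the restart probability of 0.3 and the function names are assumptions made for the example, and the abstract does not say how the resulting scores are turned into regression features.

```python
def column_normalize(adj):
    """Turn a non-negative weight matrix into a column-stochastic matrix."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence."""
    n = len(adj)
    w = column_normalize(adj)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical symmetric miRNA similarity network with four nodes.
similarity = [
    [0.0, 0.9, 0.2, 0.0],
    [0.9, 0.0, 0.5, 0.0],
    [0.2, 0.5, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
]

# Start the walk from node 0, standing in for a miRNA with a known association.
scores = random_walk_with_restart(similarity, seeds={0})
print([round(s, 3) for s in scores])
```

Nodes that are strongly similar to the seed receive higher stationary scores than distant ones, which is what makes the scores usable as per-miRNA features.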
In binary regression, imbalanced data result from the presence of values equal to zero (or one) in a proportion that is significantly greater than the corresponding real values of one (or zero). In this work, we evaluate two methods developed to deal with imbalanced data and compare them to the use of asymmetric links. Simulation results show that the correction methods do not adequately correct bias in the estimation of regression coefficients and that the models with the power and reverse power links considered produce better results for certain types of imbalanced data. Additionally, we present an application to imbalanced data, identifying the best model among those proposed. The parameters are estimated using a Bayesian approach based on the Hamiltonian Monte Carlo method with the No-U-Turn Sampler algorithm, and the models are compared using several criteria for model comparison, predictive evaluation and quantile residuals.
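As an illustration of what an asymmetric link looks like, the sketch below builds power and reverse power versions of the logistic response function. The specific forms, F(eta)^lambda and 1 - F(-eta)^lambda with a logistic baseline F, are one common way to define such links and are an assumption here; the abstract does not give the definitions used in the paper.

```python
import math

def logistic_cdf(eta):
    """Symmetric baseline response function."""
    return 1.0 / (1.0 + math.exp(-eta))

def power_logit(eta, lam):
    """Power response: the baseline CDF raised to the power lam (assumed form)."""
    return logistic_cdf(eta) ** lam

def reverse_power_logit(eta, lam):
    """Reverse power response: mirror image of the power response (assumed form)."""
    return 1.0 - logistic_cdf(-eta) ** lam

# With lam = 1 both reduce to the ordinary logit model. Other values of lam
# shift the probability at eta = 0 away from 0.5, which is the asymmetry.
for lam in (0.5, 1.0, 2.0):
    print(lam,
          round(power_logit(0.0, lam), 4),
          round(reverse_power_logit(0.0, lam), 4))
```

With `lam` greater than 1 the power response pulls probabilities toward zero, and the reverse power response pushes them toward one, which is why such links are natural candidates when one class is rare.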
We construct confidence sets for the regression function in nonparametric binary regression with an unknown design density, a nuisance parameter in the problem. These confidence sets are adaptive in L2 loss over a continuous class of Sobolev-type spaces. Adaptation holds in the smoothness of the regression function, over the maximal parameter spaces where adaptation is possible, provided the design density is smooth enough. We identify two key regimes: one where adaptation is possible, and one where some critical regions must be removed. We address related questions about goodness-of-fit testing and adaptive estimation of relevant infinite-dimensional parameters.
In this paper, we propose a new estimation method for binary quantile regression and variable selection which can be implemented by an iteratively reweighted least squares approach. In contrast to existing approaches, this method is computationally simple, is guaranteed to converge to a unique solution and can be implemented with standard software packages. We demonstrate our method using Monte Carlo experiments and then apply it to the widely used work trip mode choice dataset. The results indicate that the proposed estimators work well in finite samples.
In many practical situations, it is desirable to predict binary ("yes"-"no") decisions made by people. The traditional approach to this prediction assumes that the utility depends linearly on the corresponding parameters, and that the distribution of the difference between predicted and actual utility is symmetric, usually normal or logistic; the corresponding techniques are known as probit and logit, respectively. In real life, utility often depends non-linearly on the parameters, and the corresponding distributions are asymmetric (skewed). There are techniques for dealing with non-linearity; the most widely used such technique, called kink regression, uses piecewise-linear approximations to the utility. There are also techniques that take into account the distribution's asymmetry; usually, they are based on special asymmetric distributions: skew-normal and skew-logistic. In this paper, we show how these two techniques can be combined to take into account both non-linearity and asymmetry. On a real-life example, we show that the new technique indeed leads to a better description of human binary decision-making.
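The traditional ingredients described above can be written down in a few lines. The sketch below shows logit and probit choice probabilities together with a one-kink piecewise-linear utility; the coefficient values and the kink location are hypothetical, and the skewed distributions that the paper combines with kink regression are not implemented here.

```python
import math

def logit_prob(utility):
    """P(yes) when the utility error is logistic."""
    return 1.0 / (1.0 + math.exp(-utility))

def probit_prob(utility):
    """P(yes) when the utility error is standard normal."""
    return 0.5 * (1.0 + math.erf(utility / math.sqrt(2.0)))

def kinked_utility(x, b0, b1, b2, kink):
    """Piecewise-linear utility: slope b1 below the kink, b1 + b2 above it."""
    return b0 + b1 * x + b2 * max(x - kink, 0.0)

# Hypothetical coefficients: the utility rises quickly up to x = 2,
# then more slowly beyond it.
params = dict(b0=-1.0, b1=1.0, b2=-0.7, kink=2.0)
for x in (0.0, 2.0, 4.0):
    u = kinked_utility(x, **params)
    print(x, round(u, 2), round(logit_prob(u), 3), round(probit_prob(u), 3))
```

Both response functions are symmetric about zero utility, which is exactly the restriction that skew-normal and skew-logistic error distributions relax.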
In this paper, we study the detection boundary for minimax hypothesis testing in the context of high-dimensional, sparse binary regression models. Motivated by genetic sequencing association studies for rare variant effects, we investigate the complexity of the hypothesis testing problem when the design matrix is sparse. We observe a new phenomenon in the behavior of the detection boundary which does not occur in the case of Gaussian linear regression. We derive the detection boundary as a function of two components: a design matrix sparsity index and signal strength, each of which is a function of the sparsity of the alternative. For any alternative, if the design matrix sparsity index is too high, any test is asymptotically powerless irrespective of the magnitude of the signal strength. For binary design matrices with a sparsity index that is not too high, our results parallel those in the Gaussian case. In this context, we derive detection boundaries for both dense and sparse regimes. For the dense regime, we show that the generalized likelihood ratio test is rate optimal; for the sparse regime, we propose an extended Higher Criticism Test and show it is rate optimal and sharp. We illustrate the finite-sample properties of the theoretical results using simulation studies.
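The abstract does not spell out the extended test. As background only, the sketch below computes the classical Higher Criticism statistic from a set of p-values, which is the starting point such extensions build on. The function name, the `alpha0` truncation fraction and the toy p-values are assumptions made for this example.

```python
import math

def higher_criticism(p_values, alpha0=0.5):
    """Classical Higher Criticism statistic: the largest standardized gap
    between the empirical and uniform distributions of the sorted p-values,
    taken over the smallest alpha0 fraction of them."""
    n = len(p_values)
    ps = sorted(p_values)
    k = max(1, int(alpha0 * n))
    best = -math.inf
    for i in range(1, k + 1):
        p = ps[i - 1]
        if p <= 0.0 or p >= 1.0:
            continue  # skip values where the standardization is undefined
        stat = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, stat)
    return best

# Evenly spread p-values mimic the null; a few tiny ones mimic sparse signals.
null_ps = [(i + 0.5) / 100 for i in range(100)]
signal_ps = [1e-6] * 5 + null_ps[5:]

hc_null = higher_criticism(null_ps)
hc_signal = higher_criticism(signal_ps)
print(round(hc_null, 3), round(hc_signal, 1))
```

A handful of very small p-values among many unremarkable ones drives the statistic up sharply, which is why it suits the sparse regime.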
Regression by composition is a new and flexible toolkit for building and understanding statistical models. Focusing here on regression models for a binary outcome conditional on a binary treatment and other covariates, we motivate the need for regression by composition. We do this first by exhibiting, using L'Abbé plots, the families of relationships between untreated and treated conditional outcome risks that emerge from generalized linear models for many different link functions. These are compared with the relationships (between untreated and treated risks) that arise from mechanistic sufficient component cause models, which are first-principles causal models for binary outcomes. By considering mechanistic models that allow for non-monotone causal effects and by allowing sufficient causes to be associated, we expand upon similar discussions in the recent literature. We discuss conditions under which commonly used statistical models for binary data, such as logistic regression, arise from mechanistic models where the sufficient causes are associated in a particular way, as well as other situations in which the statistical models arising do not correspond to a generalized linear model but can be naturally expressed as a regression by composition model.