Author affiliations: Univ Connecticut, Dept Stat, Storrs, CT 06269, USA; Cornell Univ, Dept Biol Stat & Computat Biol, Ithaca, NY, USA; Cornell Univ, Dept Stat Sci, Ithaca, NY, USA
Publication: APPLIED STOCHASTIC MODELS IN BUSINESS AND INDUSTRY
Year, volume, issue: 2018, Vol. 34, No. 6
Pages: 949-961
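Subject classification: 1201 [Management: Management Science and Engineering (degrees in management or engineering may be conferred)]; 07 [Science]; 070104 [Science: Applied Mathematics]; 0714 [Science: Statistics (degrees in science or economics may be conferred)]; 0701 [Science: Mathematics]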
Funding: National Science Foundation [DMS 1612625, DMS 1611893]; National Institutes of Health [U19 AI111143]
Keywords: EM algorithm; generalized linear models; random forest; support vector machines; variable selection
Abstract: We present a two-step approach to classification problems in the large P, small N setting, where the number of predictors may exceed the sample size. We assume that the association between the predictors and the class variable has an approximate linear-logistic form, but we allow the class boundaries to be nonlinear. We further assume that the number of true predictors is relatively small. In the first step, we use a binomial generalized linear model to identify which predictors are associated with each class. In the second step, we restrict the data set to these predictors and run a nonlinear classifier, such as a random forest or a support vector machine. We show that, without the variable screening step, the classification performance of both the random forest and the support vector machine degrades when many of the P predictors are unrelated to the class.
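The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not state the screening rule (the keywords suggest an EM-based method), so this sketch substitutes a simpler stand-in that fits a one-predictor binomial GLM per column and keeps the columns with the largest absolute slopes. The function name `screen_predictors`, the `keep` parameter, the simulated data, and the choice of scikit-learn are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def screen_predictors(X, y, keep):
    """Return sorted indices of the `keep` predictors with the largest
    absolute slope in a one-predictor binomial GLM (hypothetical rule)."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        xj = X[:, [j]]
        xj = (xj - xj.mean()) / (xj.std() + 1e-12)  # standardize so slopes are comparable
        # A very large C makes the default L2 penalty negligible.
        glm = LogisticRegression(C=1e6, max_iter=1000).fit(xj, y)
        scores[j] = abs(glm.coef_[0, 0])
    return np.sort(np.argsort(scores)[-keep:])


# Simulated data (assumed): 3 informative predictors out of P = 300, N = 200.
rng = np.random.default_rng(0)
n, p = 200, 300
X = rng.normal(size=(n, p))
# Roughly linear-logistic association with a nonlinear (interaction) boundary.
y = (X[:, 0] + X[:, 1] + X[:, 2] + 0.5 * X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: screen on the training split only, to avoid leaking test labels.
selected = screen_predictors(X_tr, y_tr, keep=10)

# Step 2: fit a nonlinear classifier on the screened predictors,
# and on all predictors for comparison.
rf_all = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rf_screened = RandomForestClassifier(n_estimators=200, random_state=0).fit(
    X_tr[:, selected], y_tr
)

acc_all = rf_all.score(X_te, y_te)
acc_screened = rf_screened.score(X_te[:, selected], y_te)
print("selected predictors:", selected.tolist())
print("accuracy, all predictors:     ", round(acc_all, 3))
print("accuracy, screened predictors:", round(acc_screened, 3))
```

Screening uses only the training split so that the held-out accuracy is not inflated by selection. A support vector machine (for example `sklearn.svm.SVC`) could be swapped in for the random forest in the second step without changing the rest of the sketch.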