Author Affiliations: NYU, Courant Inst Math Sci, New York, NY 10012, USA; Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720, USA; Univ Calif Berkeley, Dept EECS, Berkeley, CA 94720, USA; Univ Waterloo, Dept Stat, Waterloo, ON, Canada; Univ Waterloo, Dept Actuarial Sci & Appl Math, Waterloo, ON, Canada
Publication: Journal of Machine Learning Research
Year/Volume/Issue: 2021, Vol. 22, No. 1
Pages: 1-51
Subject Classification: 08 [Engineering]; 0811 [Engineering - Control Science and Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering or Science)]
Funding: Miller Institute for Basic Research in Science; Natural Sciences and Engineering Research Council of Canada (NSERC) / Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG) [RGPIN-2020-04597, DGECR-2020-00199]
Keywords: stochastic gradient descent; parameter estimation; non-convex optimization; supervised learning; generalized linear models; tensor PCA
Abstract: Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional. We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss, which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension, and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining non-trivial correlation, the descent is rapid and exhibits law-of-large-numbers-type behavior. We illustrate our approach by applying it to a wide set of inference tasks, such as phase retrieval, parameter estimation for generalized linear models, online PCA, and spiked tensor models, as well as to supervised learning for single-layer networks with general activation functions.
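To make the setup concrete, the following is a minimal, self-contained sketch (not taken from the paper) of the kind of online SGD procedure the abstract describes, applied here to noiseless phase retrieval from a uniformly random start on the sphere. The dimension, sample budget, step size, and spherical projection are illustrative assumptions chosen for demonstration, not the paper's exact algorithm or its sample-complexity scaling.

```python
# Illustrative sketch of online (one-pass) SGD for noiseless phase retrieval,
# y = <x, theta*>^2, started from a random point on the unit sphere.
# All constants below (dim, n_samples, step) are assumptions for demonstration,
# not the paper's prescribed choices.
import numpy as np

rng = np.random.default_rng(0)

dim = 500
n_samples = 20 * dim      # assumed sample budget; the paper studies how this must scale with dim
step = 0.5 / dim          # assumed step size

# Ground-truth parameter and random initialization, both on the unit sphere.
theta_star = rng.standard_normal(dim)
theta_star /= np.linalg.norm(theta_star)
theta = rng.standard_normal(dim)
theta /= np.linalg.norm(theta)

for _ in range(n_samples):
    # Online SGD: draw a fresh sample at every step and use it exactly once.
    x = rng.standard_normal(dim)
    y = (x @ theta_star) ** 2                 # phase-retrieval response <x, theta*>^2

    # Per-sample loss 0.25 * (y - <x, theta>^2)^2; its gradient in theta is below.
    pred = x @ theta
    grad = -(y - pred * pred) * pred * x

    theta -= step * grad
    theta /= np.linalg.norm(theta)            # project back onto the sphere

# An overlap |<theta, theta*>| close to 1 indicates consistent estimation
# (the sign of theta* is unidentifiable from squared responses).
print("final overlap:", abs(theta @ theta_star))
```

The overlap printed at the end plays the role of the "non-trivial correlation with the ground truth" discussed in the abstract: most of the sample budget is typically spent moving it away from its near-zero starting value, after which the remaining descent is comparatively fast.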