We investigate the transformer’s capability to simulate the training process of deep models via in-context learning (ICL), i.e., in-context deep learning. Our key contribution is providing a positive example of using...
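The blurb is cut off mid-sentence; for context, "simulating the training process in-context" typically means the transformer's forward pass reproduces explicit gradient-descent updates on the examples placed in its prompt. A background statement of one such update (general background, not this paper's result) is

$$
W_{t+1} = W_t - \eta \, \nabla_W \, \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f_{W_t}(x_i), y_i\bigr),
$$

where the $(x_i, y_i)$ are the in-context examples and $\eta$ is a step size.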
We investigate the approximation and estimation rates of conditional diffusion transformers (DiTs) with classifier-free guidance. We present a comprehensive analysis of "in-context" conditional DiTs under f...
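For background on the guidance mechanism this blurb names (the equation below is standard classifier-free guidance, not this paper's specific analysis): during sampling, the conditional noise estimate is replaced by a weighted combination of conditional and unconditional estimates,

$$
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr),
$$

where $w \ge 0$ is the guidance scale and $\varnothing$ denotes the null (dropped) condition; $w = 1$ recovers the plain conditional estimate, and larger $w$ strengthens conditioning at the cost of sample diversity.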
We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are prompt tuning on single-head transformers with only a single self-attention l...
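Since the blurb is truncated, here is a minimal sketch of the general prompt-tuning setup it describes, assuming a frozen single-head, single-layer self-attention model in which only prepended soft-prompt vectors receive gradients (all dimensions and names below are illustrative, not the paper's construction):

```python
# Minimal prompt-tuning sketch (illustrative assumptions, not the paper's setup):
# freeze a single-head, single-layer self-attention model and train only a small
# matrix of "soft prompt" vectors prepended to the input sequence.
import torch
import torch.nn as nn

d_model, n_prompt = 16, 4

class SingleHeadAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)

    def forward(self, x):  # x: (batch, seq, d)
        att = torch.softmax(
            self.q(x) @ self.k(x).transpose(-2, -1) / d_model**0.5, dim=-1
        )
        return att @ self.v(x)

model = SingleHeadAttention(d_model)
for p in model.parameters():          # freeze the pretrained weights
    p.requires_grad_(False)

prompt = nn.Parameter(torch.randn(1, n_prompt, d_model))  # only trainable params
opt = torch.optim.Adam([prompt], lr=1e-2)

x = torch.randn(8, 10, d_model)                   # batch of input embeddings
target = torch.randn(8, 10 + n_prompt, d_model)   # dummy regression target

opt.zero_grad()
out = model(torch.cat([prompt.expand(8, -1, -1), x], dim=1))  # prepend prompts
loss = ((out - target) ** 2).mean()
loss.backward()                                   # gradients flow only into `prompt`
opt.step()
```

The design point this illustrates: expressivity and trainability are governed entirely by the prompt matrix, since the attention weights are held fixed.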