Transformer-based large language models have displayed impressive in-context learning (ICL) capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite the empirical success, the mechanics of how to train a Transformer to achieve ICL and the corresponding ICL capacity are mostly elusive due to the technical challenges of analyzing the nonconvex training problems resulting from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers using data from a subset of these tasks and quantify the impact of various factors on the ICL generalization performance on the remaining unseen tasks with and without data distribution shifts. We also analyze how different components in the learned Transformers contribute to the ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are justified through numerical experiments. Copyright 2024 by the author(s).
Integrated photonic chip-based ONNs were employed for early pancreatic cancer detection and achieved an 80% Dice score, demonstrating efficient, high-speed alternatives to traditional electrical training systems for medical ...
The fast deployment of Phasor Measurement Units (PMUs), especially at the transmission level of power systems, enables the development of wide area monitoring, protection and control (WAMPC) applications that ...
Energy storage systems (ESSs) are an emerging technology that enables increased and effective penetration of renewable energy sources into power systems. ESSs integrated in wind power plants can reduce power generation...
Breast cancer is a cancer that attacks breast tissue and is the most common cancer among women worldwide, affecting one in eight women. In this modern world, breast cancer image classification simplifie...
This paper focuses on the navigation of a Dubins vehicle (DV) to intercept a moving target on a sphere. The uncertainty in the target motion is described by a Brownian motion model. An Itô-type stochastic different...
This paper investigates a two-way full-duplex decode-and-forward relaying system under Rician fading channels with imperfect channel state information. In this system, multiple pairs of users exchange information via ...
The decentralization of power systems and the rapid deployment of renewable energy sources (RES) have upgraded the role of the Distribution System Operator (DSO) from a passive network observer to an active system oper...
This study presents the development of an innovative system designed to assist customers, especially the elderly and people with disabilities, in their shopping experience. The proposed solution employs deep lear...