Details
ISBN:
(Print) 9798350386288; 9798350386271
Recently, transformer models have been widely deployed in natural language processing and image processing. However, their superior performance comes with a large number of parameters and heavy computation, which makes it difficult to deploy transformer models on resource-limited devices. To reduce the computation cost of transformer models, this paper proposes an improved network pruning method. In the proposed method, the parameter matrix is decomposed into blocks of a specific size, and pruning is then applied to each block so that the number of parameters remaining in every block is the same. To further reduce the memory requirement of the parameters, an efficient memory storage pattern for sparse parameters is also proposed. Finally, by combining the proposed methods, an energy-efficient transformer accelerator architecture is proposed. The proposed accelerator is implemented on FPGA devices, and implementation results show that the proposed design significantly improves speed and energy efficiency compared with previous designs.
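The abstract does not give implementation details, so the following is only a minimal NumPy sketch of the block-balanced pruning it describes, plus a hypothetical fixed-length storage layout that such balanced sparsity enables. The function names `block_balanced_prune` and `pack_blocks`, the magnitude-based importance criterion, the tile shape, and the parameters `block_size` and `keep_per_block` are all illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def block_balanced_prune(weights, block_size=16, keep_per_block=64):
    """Prune a 2-D weight matrix so that every block keeps the same
    number of parameters.

    Assumptions (not specified in the abstract): blocks are
    non-overlapping square tiles, and importance is |weight|.
    """
    rows, cols = weights.shape
    assert rows % block_size == 0 and cols % block_size == 0
    pruned = np.zeros_like(weights)
    for r in range(0, rows, block_size):
        for c in range(0, cols, block_size):
            block = weights[r:r + block_size, c:c + block_size]
            flat = np.abs(block).ravel()
            # Indices of the keep_per_block largest-magnitude entries.
            keep = np.argpartition(flat, -keep_per_block)[-keep_per_block:]
            mask = np.zeros(flat.shape, dtype=bool)
            mask[keep] = True
            pruned[r:r + block_size, c:c + block_size] = \
                block * mask.reshape(block.shape)
    return pruned

def pack_blocks(pruned, block_size=16, keep_per_block=64):
    """Hypothetical storage pattern: each block contributes exactly
    keep_per_block (value, index) pairs, so block offsets in memory
    are implicit and no per-block pointer array is needed. The
    paper's actual format is not described in the abstract.
    """
    values, indices = [], []
    rows, cols = pruned.shape
    for r in range(0, rows, block_size):
        for c in range(0, cols, block_size):
            flat = pruned[r:r + block_size, c:c + block_size].ravel()
            nz = np.flatnonzero(flat)
            assert len(nz) == keep_per_block  # balanced by construction
            values.append(flat[nz])
            # uint8 suffices for in-block offsets when block_size**2 <= 256.
            indices.append(nz.astype(np.uint8))
    return np.concatenate(values), np.concatenate(indices)

# Example: 25% density (64 of 256 weights kept per 16x16 block).
w = np.random.randn(64, 64).astype(np.float32)
w_sparse = block_balanced_prune(w, block_size=16, keep_per_block=64)
vals, idxs = pack_blocks(w_sparse, block_size=16, keep_per_block=64)
```

Because every block stores the same number of nonzeros, a hardware accelerator can fetch each block's values and indices with a fixed stride, which is one plausible reason balanced per-block pruning pairs naturally with an FPGA datapath.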