In recent WLAN standards (such as IEEE 802.11n), MIMO (Multiple Input Multiple Output) is deployed to provide high data transmission rates. It is, however, challenging to efficiently share the channel resources among dif...
ISBN (digital): 9798331509712
ISBN (print): 9798331509729
The self-attention mechanism is the core component of the Transformer, providing a powerful ability to understand sequence context. However, self-attention also incurs a large amount of redundant computation. Model sparsification can effectively reduce the computational load, but the irregularity of the non-zeros introduced by sparsification significantly decreases hardware efficiency. This paper proposes Funnel, an accelerator that dynamically predicts sparse attention patterns and efficiently processes unstructured sparse data. Firstly, we adopt a fast quantization method based on a lookup table to minimize the cost of sparse pattern prediction. Secondly, we propose the Funnel Computing Unit (FCU), a hardware architecture that efficiently handles sparse attention through multi-dataflow fusion. Sampled Dense-Dense Matrix Multiplication (SDDMM) and Sparse-Dense Matrix Multiplication (SpMM) are the core operations of the sparse attention mechanism. The FCU unifies the inner-product and row-wise-product matrix computation patterns to support both SDDMM and SpMM, which greatly reduces the storage and movement overhead of intermediate results. Lastly, we devise a lightweight buffer and data tiling strategy tailored to the proposed accelerator, aimed at enhancing data reuse. Experiments demonstrate that our accelerator achieves 0.10-0.25 sparsity with small accuracy loss. When computing the self-attention layer, it attains hardware efficiency ranging from 60% to 85%. Compared to a CPU and a GPU, it achieves 5.60x and 8.20x speedup, respectively. Compared to the state-of-the-art attention accelerators A3, SpAtten, FTRANS, and Sanger, it achieves 7.37x, 4.52x, 9.58x, and 3.08x speedup, respectively.
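The SDDMM-then-SpMM structure of sparse attention described in the abstract can be sketched in NumPy. This is a minimal illustrative sketch, not Funnel's actual dataflow: the mask here stands in for the predicted sparse pattern, and the function name is hypothetical.

```python
import numpy as np

def sparse_attention(Q, K, V, mask):
    """Sparse attention as SDDMM followed by SpMM (illustrative).

    mask: boolean (L, L) array, True where the predicted sparse
    pattern keeps an attention score; each row must keep at least one.
    """
    d = Q.shape[-1]
    # SDDMM: scores Q @ K^T are only meaningful where the mask is set;
    # masked-out positions are driven to -inf so softmax zeroes them.
    scores = np.where(mask, Q @ K.T / np.sqrt(d), -np.inf)
    # Numerically stable row-wise softmax over the surviving entries.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # SpMM: multiply the (conceptually sparse) probabilities by V.
    return probs @ V
```

With a diagonal mask each query attends only to itself, so the output reduces to V, which makes the masking behavior easy to check.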
Time series data are pervasive in varied real-world applications, and accurately identifying anomalies in time series is of great importance. Many current methods are insufficient to model long-term dependence, whereas some anomalies can only be identified through long temporal contextual information. This may ultimately lead to disastrous outcomes due to false negatives on these anomalies. Prior art employs Transformers (i.e., a neural network architecture with powerful capability in modeling long-term dependence and global association) to alleviate this problem; however, Transformers are insensitive to local context, which may cause subtle anomalies to be neglected. Therefore, in this paper, we propose a local-adaptive Transformer based on cross-correlation for time series anomaly detection, which unifies global and local information to capture comprehensive time series patterns. Specifically, we devise a cross-correlation mechanism that employs causal convolution to adaptively capture local pattern variation, injecting diverse local information into the long-term temporal learning process. Furthermore, a novel optimization objective jointly optimizes reconstruction of the entire time series and of the matrix derived from the cross-correlation mechanism, which prevents the cross-correlation from becoming trivial during training. The generated cross-correlation matrix reveals underlying interactions between the dimensions of multivariate time series, providing valuable insights for anomaly diagnosis. Extensive experiments on six real-world datasets demonstrate that our model outperforms state-of-the-art competing methods, achieving a 6.8%-27.5% $F_{1}$ score improvement. Our method also offers good anomaly interpretability and is effective for anomaly diagnosis.
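The causal convolution that the abstract uses to extract local context can be sketched as follows. This is a minimal fixed-kernel version for illustration; the paper's actual layer is learned, and the function name is hypothetical.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[t-k+1 .. t],
    never on future samples (cross-correlation form, as is conventional
    in deep learning)."""
    k = len(kernel)
    # Left-pad with zeros so the output has the same length as x
    # and no position sees the future.
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])
```

Because of the left-only padding, anomaly scores at time t are built strictly from past and present values, which is what makes the local-context extraction causal.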
As deep learning grows rapidly, model training relies heavily on parallel methods, and numerous cluster configurations exist. However, current work on parallel training focuses on data centers, overlooking the financial constraints faced by most researchers. To attain the best performance within a cost limitation, we introduce a throughput-cost metric to accurately characterize a cluster's cost-effectiveness. Based on this metric, we design a cost-effective cluster built around the NVIDIA RTX 3090 with NVLink. Experimental results demonstrate that our cluster achieves remarkable cost-effectiveness across various distributed model training schemes.
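A throughput-cost metric of the kind the abstract describes can be written as a simple ratio. The exact definition in the paper may differ; the function name, units, and sample figures below are illustrative assumptions, not the paper's numbers.

```python
def cost_effectiveness(throughput_samples_per_s, cluster_price_usd):
    """Throughput per unit of hardware cost (samples/s per dollar).
    Higher is better; two clusters with different prices and speeds
    become directly comparable under this single number."""
    return throughput_samples_per_s / cluster_price_usd
```

Under such a metric, a cheaper cluster with moderately lower throughput can dominate an expensive data-center node, which is the motivation the abstract gives for the 3090-based design.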
Encryption technology has become an important mechanism for securing data stored in outsourced databases. However, it is difficult to query encrypted data efficiently, and many researchers take it into conside...
Multidimensional parallel training has been widely applied to train large-scale deep learning models such as GPT-3. The efficiency of parameter communication among training devices/processes is often the performance bottleneck of large-model training. Analyzing parameter communication modes and traffic provides an important reference for interconnection network design and computing-task scheduling aimed at improving training performance. In this paper, we analyze the parameter communication modes in typical 3D parallel training (data parallelism, pipeline parallelism, and tensor parallelism) and model the traffic of each communication mode. Finally, taking GPT-3 as an example, we characterize the communication in its 3D parallel training.
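One ingredient of such traffic modeling is the communication volume of the gradient all-reduce used by data parallelism. The sketch below uses the standard ring all-reduce volume formula (each of N devices sends and receives 2(N-1)/N times the model size); it is a generic textbook model, not necessarily the exact model the paper derives.

```python
def ring_allreduce_traffic(model_bytes, n_devices):
    """Per-device bytes sent in one ring all-reduce of a gradient
    buffer of size model_bytes across n_devices: the reduce-scatter
    phase moves (N-1)/N * M and the all-gather phase moves the same,
    for a total of 2*(N-1)/N * M per device."""
    return 2.0 * (n_devices - 1) / n_devices * model_bytes
```

The formula shows why data-parallel traffic per device is nearly constant in cluster size: as N grows, 2(N-1)/N approaches 2, so each device sends close to twice the model size per step regardless of scale.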
In this paper, we introduce a generic model to deal with the event matching problem of content-based publish/subscribe systems over structured P2P overlays. In this model, we claim that there are three methods (event...
The deep neural named entity recognition model automatically learns and extracts the features of entities and solves the problem of the traditional model relying heavily on complex feature engineering and obscure prof...
Neural Radiance Fields (NeRF) have received widespread attention for their photo-realistic novel view synthesis quality. Current methods mainly represent the scene based on point sampling along cast rays, ignoring how the observed area changes with distance. In addition, current sampling strategies focus on the distribution of sample points along a ray, without considering how the rays themselves are sampled. We find that the prevailing ray sampling strategy severely reduces convergence speed for scenes captured with a forward-moving camera. In this work, we extend the point representation to an area representation using relative positional encoding, and propose a ray sampling strategy suited to forward-moving camera trajectories. We validate the effectiveness of our method on multiple public datasets.
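For context, the standard NeRF frequency encoding that such area/relative variants build on maps each coordinate through sines and cosines at geometrically spaced frequencies. The sketch below is the vanilla encoding only; the paper's relative, distance-aware modification is not reproduced here.

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """Vanilla NeRF positional encoding:
    gamma(x) = (sin(2^0 pi x), cos(2^0 pi x), ...,
                sin(2^(L-1) pi x), cos(2^(L-1) pi x)),
    applied elementwise to a coordinate vector x."""
    out = []
    for k in range(num_freqs):
        out.append(np.sin(2.0 ** k * np.pi * x))
        out.append(np.cos(2.0 ** k * np.pi * x))
    return np.concatenate(out, axis=-1)
```

A 3-D point with L frequency bands expands to 2 * 3 * L features, which is what lets the MLP represent high-frequency scene detail from low-dimensional coordinates.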
Constant-degree peer-to-peer (P2P) systems are becoming a promising hotspot in the P2P domain because constant-degree digraphs have good properties. However, it is often hard to convert a standard constant-degree digraph into a DHT scheme. Thus, most research focuses on DHT construction and maintenance, leaving optimization and support for complex queries behind. The underlying topology strongly affects the characteristics of the upper layers. For constant-degree P2P topologies, their inherent properties make a system built with classical techniques poor in data locality and unfit for efficient, low-cost complex queries. To address this shortcoming, a general-purpose construction technique for efficient complex queries is proposed, which adds an embedding transformation layer between the data layer and the DHT overlay. In this way, adjacent data are stored in adjacent peers of the overlay and data locality is improved, so that the number of peers involved in complex queries can be minimized with limited time overhead. To validate this technique, FissionE, the first constant-degree P2P system based on the Kautz digraph, is reconstructed as a typical example, including resource re-allocation, query algorithms, and locality maintenance strategies. Experimental results show that this construction technique ensures data locality, reduces query cost, and improves system efficiency without changing the underlying DHT layer.