In the task of Knowledge Graph Completion (KGC), the existing datasets and their inherent subtasks carry a wealth of shared knowledge that can be utilized to enhance the representation of knowledge triplets and overal...
ISBN (digital): 9798331509712
ISBN (print): 9798331509729
The self-attention mechanism is the core component of the Transformer, providing a powerful ability to understand sequence context. However, it also incurs a large amount of redundant computation. Model sparsification can effectively reduce the computational load, but the irregular non-zeros introduced by sparsification significantly decrease hardware efficiency. This paper proposes Funnel, an accelerator that dynamically predicts sparse attention patterns and efficiently processes unstructured sparse data. First, we adopt a fast quantization method based on a lookup table to minimize the cost of sparse-pattern prediction. Second, we propose the Funnel Computing Unit (FCU), a hardware architecture that efficiently handles sparse attention through multi-dataflow fusion. Sampled Dense-Dense Matrix Multiplication (SDDMM) and Sparse-Dense Matrix Multiplication (SpMM) are the core kernels of the sparse attention mechanism. The FCU unifies the matrix inner-product and row-wise-product dataflows to support SDDMM and SpMM simultaneously, greatly reducing the storage and movement overhead of intermediate results. Lastly, we devise a lightweight buffer and data-tiling strategy tailored to the proposed accelerator to enhance data reuse. Experiments demonstrate that our accelerator achieves 0.10-0.25 sparsity with small accuracy loss. When computing the self-attention layer, it attains hardware efficiency ranging from 60% to 85%. Compared to CPU and GPU, it achieves 5.60x and 8.20x speedups, respectively. Compared to the state-of-the-art attention accelerators A³, SpAtten, FTRANS, and Sanger, it achieves 7.37x, 4.52x, 9.58x, and 3.08x speedups, respectively.
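The SDDMM/SpMM pairing described above can be pictured in a few lines of NumPy. The sketch below is a minimal functional illustration of mask-based sparse attention, not Funnel's hardware dataflow; the boolean `mask` stands in for the accelerator's predicted sparsity pattern, and all names are hypothetical.

```python
import numpy as np

def sparse_attention(Q, K, V, mask):
    """Mask-based sparse attention built from the two sparse kernels.

    mask: boolean (seq, seq) array standing in for the predicted
    sparsity pattern; every row must keep at least one position
    (e.g., the diagonal), or the softmax is undefined.
    """
    d = Q.shape[-1]
    # SDDMM: evaluate Q @ K.T only at the positions the mask keeps.
    scores = np.full(mask.shape, -np.inf)
    rows, cols = np.nonzero(mask)
    scores[rows, cols] = np.einsum("ij,ij->i", Q[rows], K[cols]) / np.sqrt(d)
    # Row-wise softmax over the surviving scores.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = np.where(mask, probs, 0.0)
    probs /= probs.sum(axis=-1, keepdims=True)
    # SpMM: sparse attention probabilities times dense V.
    return probs @ V
```

A real accelerator would hold `scores` in a compressed sparse format rather than a dense array; the dense layout here is only for readability.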
The key to anomaly detection in time series data streams (TSDS) lies in the ability to adapt to evolving data. Active learning for anomaly detection has shown such ability by leveraging expert feedback. However, many studies in this research line strive to optimize performance by exhausting the query budget without considering query necessity: unnecessary queries may mislead the model into overfitting trivial information and incur additional costs in both human labeling and model execution. This paper proposes Boundary-driven Active Learning for Anomaly Detection (BALAD). BALAD utilizes deep one-class classification to construct a hypersphere boundary that senses data abnormality, and filters out unnecessary queries by dividing the boundary region. We further harness the hypersphere boundary to quantitatively measure data difficulty, and a focal loss is introduced to prioritize hard samples. The boundary is flexibly adapted during each feedback iteration to accommodate changes in the TSDS. Extensive experiments on six datasets demonstrate that BALAD significantly outperforms state-of-the-art anomaly detection methods.
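A rough sketch of the boundary-driven query filter, in the style of deep one-class classification (Deep SVDD): the center, radius, and margin below are illustrative assumptions, not BALAD's actual parameters or update rules.

```python
import numpy as np

def boundary_queries(embeddings, center, radius, margin=0.1):
    """Keep only boundary-region samples as labeling queries.

    Samples deep inside the hypersphere (clearly normal) or far
    outside it (clearly anomalous) are treated as unnecessary
    queries; only ambiguous samples near the boundary go to the
    expert.
    """
    dist = np.linalg.norm(embeddings - center, axis=1)
    lo, hi = radius * (1 - margin), radius * (1 + margin)
    return np.nonzero((dist >= lo) & (dist <= hi))[0]

def focal_weight(dist, radius, gamma=2.0):
    """Focal-style weighting: samples near the boundary are 'hard'
    and get weights near 1; samples far from it are 'easy' and
    their weights decay as (1 - p)**gamma."""
    p = np.clip(np.abs(dist - radius) / radius, 1e-6, 1.0)
    return (1.0 - p) ** gamma
```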
The breadth-first search (BFS) algorithm is a fundamental algorithm in graph theory, and its parallelization can significantly improve performance. Therefore, there have been numerous efforts to leverage the powerfu...
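For context, parallel BFS implementations are usually built on the level-synchronous frontier formulation, where each level's frontier can be expanded concurrently. A minimal sequential sketch of that structure (parallel versions distribute the inner loop across threads):

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: each iteration expands one frontier.

    adj: dict mapping a vertex to its neighbor list.
    """
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        next_frontier = []
        for u in frontier:                  # parallelizable per vertex
            for v in adj.get(u, []):
                if v not in depth:          # needs an atomic visited
                    depth[v] = level + 1    # check when parallelized
                    next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return depth

print(bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, source=0))
```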
ISBN (digital): 9798331509712
ISBN (print): 9798331509729
Large deep neural network (DNN) models have demonstrated exceptional performance across diverse downstream tasks. Sharded data parallelism (SDP) has been widely used to reduce the memory footprint of model states. In a DNN training cluster, a device usually has multiple inter-device links connecting it to other devices, such as NVLink and InfiniBand. However, existing SDP approaches employ only a single link at any given time, incurring significant communication overheads that hinder efficient training. We observe that the inter-device links can work independently without affecting each other. To reduce the substantial communication overhead of distributed training of large DNNs, this paper introduces HSDP, an efficient SDP training approach that utilizes multiple inter-device links simultaneously. HSDP partitions models in a novel fine-grained manner and orchestrates the communication of partitioned parameters with the available inter-device links in mind. This design enables concurrent communication execution and reduces communication overhead. To further optimize training performance, we propose an HSDP planner. The planner first abstracts HSDP's model partitioning and execution into a communication-parallel strategy and builds a cost model to estimate the performance of each strategy. We then formulate strategy search as an optimization problem and solve it with an off-the-shelf solver. Evaluations on representative DNN workloads demonstrate that HSDP achieves up to 1.30× speedup compared to state-of-the-art SDP training approaches.
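The opportunity HSDP exploits shows up even in a back-of-the-envelope cost model: if the links really are independent, splitting one parameter's traffic across NVLink and InfiniBand is bounded by the slower portion rather than executed serially on one link. The bandwidth numbers and the linear latency-bandwidth model below are illustrative assumptions, not measurements from the paper.

```python
def comm_time(volume_gb, bandwidth_gbps, latency_s=5e-6):
    """Linear latency-bandwidth cost model for one collective."""
    return latency_s + volume_gb / bandwidth_gbps

def single_link(volume_gb, bw_fast):
    """Baseline: all traffic on the fastest link."""
    return comm_time(volume_gb, bw_fast)

def multi_link(volume_gb, bw_fast, bw_slow, split):
    """Split one parameter's traffic across two independent links;
    the collective finishes when the slower portion does."""
    t_fast = comm_time(volume_gb * split, bw_fast)
    t_slow = comm_time(volume_gb * (1 - split), bw_slow)
    return max(t_fast, t_slow)

# Illustrative numbers: 1 GB of model states, 300 GB/s NVLink, 25 GB/s IB.
vol, nv, ib = 1.0, 300.0, 25.0
best_split = nv / (nv + ib)   # equalize finish times under the linear model
print(single_link(vol, nv), multi_link(vol, nv, ib, best_split))
```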
Transformer models, such as BERT, GPT, and ViT, have been applied to a wide range of areas in recent years due to their efficacy. To improve the training efficiency of Transformer models, various distributed training approaches have been proposed, such as Megatron-LM [8]. However, when multi-dimensional parallelism strategies are considered, their complexity prevents existing works from harmonizing the different strategies well enough to obtain a globally optimal solution. In this paper, we propose PTIP, a parallelism-strategy search algorithm that generates operator-level parallelism strategies combining three schemes: data parallelism, tensor parallelism, and pipeline parallelism. PTIP abstracts these three parallelism schemes simultaneously into an auxiliary graph, reformulates the search problem as a mixed-integer programming (MIP) problem, and uses a MIP solver to obtain a high-quality multi-dimensional strategy. Experiments conducted on Transformers demonstrate that PTIP obtains a 13.9%-24.7% performance improvement over Megatron-LM [8].
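The MIP reformulation can be illustrated with a toy instance: choose exactly one parallelism scheme per operator to minimize estimated step time under a memory budget. The cost and memory numbers, and the PuLP formulation itself, are illustrative stand-ins for PTIP's actual auxiliary-graph model.

```python
import pulp

ops = ["embed", "attn", "mlp", "head"]
schemes = ["data", "tensor", "pipeline"]
# Hypothetical per-(op, scheme) time costs and memory footprints.
cost = {("embed", "data"): 4, ("embed", "tensor"): 3, ("embed", "pipeline"): 5,
        ("attn", "data"): 9, ("attn", "tensor"): 6, ("attn", "pipeline"): 7,
        ("mlp", "data"): 8, ("mlp", "tensor"): 5, ("mlp", "pipeline"): 6,
        ("head", "data"): 3, ("head", "tensor"): 4, ("head", "pipeline"): 4}
mem = {k: {"data": 4, "tensor": 2, "pipeline": 3}[k[1]] for k in cost}
BUDGET = 11

prob = pulp.LpProblem("strategy_search", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", cost, cat="Binary")  # x[op, scheme] = chosen?
prob += pulp.lpSum(cost[k] * x[k] for k in cost)    # minimize estimated time
for op in ops:                                      # exactly one scheme per op
    prob += pulp.lpSum(x[(op, s)] for s in schemes) == 1
prob += pulp.lpSum(mem[k] * x[k] for k in cost) <= BUDGET  # memory budget
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({op: s for (op, s) in cost if x[(op, s)].value() == 1})
```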
The size of deep learning models has been increasing to enhance model quality. Since the training computation budget grows linearly with model size, training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Experts (MoE) has drawn significant attention, as it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, a considerable dynamic load imbalance occurs among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address this challenge. System-level solutions draw more attention for their hardware affinity and non-disruption of model convergence compared to algorithm-level ones. However, they suffer from high communication costs and poor communication-computation overlapping. To address these challenges, we propose a systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models. To adapt to the dynamic load imbalance, we profile training statistics and use them to design Pro-Prophet. For lower communication volume, the Pro-Prophet planner determines a series of lightweight load-balancing strategies and, based on the statistics, efficiently searches for a communication-efficient one for training. For sufficient overlap of communication and computation, the Pro-Prophet scheduler schedules data-dependent operations based on the statistics and operation features, further improving training throughput. We conduct extensive experiments on four clusters and five MoE models. The results indicate that Pro-Prophet achieves up to 2.66x speedup compared to two popular MoE frameworks, Deepspeed-MoE and FasterMoE. Furthermore, Pro-Prophet demonstrates a load-balancing improvement of up to 11.01x compared to a representative load-balancing work.
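The planner's job can be approximated by a toy greedy variant: given profiled per-expert token counts, replicate hot experts while the saved computation outweighs the extra parameter communication. All numbers and the halving heuristic below are assumptions for illustration, not Pro-Prophet's actual strategy search.

```python
def plan_replication(expert_load, capacity, comm_cost_per_expert):
    """Greedily replicate the most overloaded experts.

    expert_load: profiled tokens routed to each expert this step.
    Returns the replication plan, the balanced loads, and the
    estimated communication spent; a crude stand-in for a
    cost-model-driven load-balancing search.
    """
    plan, total_comm = [], 0.0
    load = dict(expert_load)
    while max(load.values()) > capacity:
        e = max(load, key=load.get)          # hottest expert
        saved = load[e] / 2                  # one replica halves its load
        if saved <= comm_cost_per_expert:    # not worth the broadcast
            break
        load[e] -= saved
        total_comm += comm_cost_per_expert
        plan.append(e)
    return plan, load, total_comm

# Hypothetical profile: expert 2 is a hotspot.
print(plan_replication({0: 120, 1: 90, 2: 640, 3: 110}, capacity=256,
                       comm_cost_per_expert=50.0))
```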
ISBN (digital): 9798350359312
ISBN (print): 9798350359329
Temporal Knowledge Graph Completion (TKGC) aims to predict the missing parts of quadruples, which is crucial for real-life knowledge graphs. Compared with methods that only use graph neural networks, the emergence of pre-trained language models has introduced a trend of leveraging text and graph structure information simultaneously. However, most current methods based on pre-trained models struggle to effectively utilize both text and multi-hop graph structure information concurrently, resulting in insufficient mining of associations between relations. To address this challenge, we propose a novel model: Temporal Closing Path for Pre-trained Language Model-based TKGC (TCP-PLM). We obtain the temporal closing relation path of the target relation through sampling and use the relation path as a bridge to exploit text and multi-hop graph structure information simultaneously. Moreover, the relation path serves as a tool for mining associations between relations. In addition, because the relation paths are designed to be entity-independent, our model can also handle the inductive setting. Our experiments on three benchmarks, along with extensive analysis, demonstrate that our model not only achieves substantial performance gains across four metrics compared to other models but also adeptly handles inductive settings.
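The "temporal closing relation path" idea can be pictured with a small sampler: given a target quadruple (h, r, t, τ), collect alternative relation paths from h to t whose edges all occur no later than τ, keeping only the relation sequence; discarding the intermediate entities is what makes the paths entity-independent and hence usable in the inductive setting. The data layout and names below are assumptions for illustration.

```python
from collections import defaultdict

def sample_closing_paths(quads, head, tail, t_max, max_hops=2):
    """Collect entity-independent relation paths head -> tail whose
    edges all occur no later than t_max (temporally closing the
    target relation). Only the relation sequence is kept."""
    out = defaultdict(list)                       # entity -> [(rel, dst)]
    for h, r, t, ts in quads:
        if ts <= t_max:
            out[h].append((r, t))
    paths, stack = set(), [(head, ())]
    while stack:
        node, rels = stack.pop()
        if node == tail and rels:
            paths.add(rels)                       # record relation sequence
        if len(rels) < max_hops:
            for r, nxt in out[node]:
                stack.append((nxt, rels + (r,)))
    return paths

quads = [("A", "member_of", "B", 1), ("B", "ally_of", "C", 2),
         ("A", "visits", "C", 3)]
print(sample_closing_paths(quads, "A", "C", t_max=3))
```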
Communication and coordination between OSS developers who do not work physically in the same location have always been a challenging issue. The pull-based development model, as the state-of-the-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors' work. However, duplicate contributions may still be submitted by more than one contributor to solve the same problem, due to the parallel and uncoordinated nature of this mechanism. If not detected in time, duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach combining textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time. For a new-arriving contribution, we first compute the textual similarity and change similarity between it and each existing contribution. Our method then returns a list of candidate duplicate contributions that are most similar to the new contribution in terms of the combined textual and change similarity. The evaluation shows that 83.4% of the duplicates can be found on average when we use the combined textual and change similarity, compared to 54.8% using only textual similarity and 78.2% using only change similarity.
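The combined ranking can be approximated in a few lines: one similarity over the pull-request text, another over the set of touched files, merged into a single score. The token-level Jaccard measures and the equal weighting below are deliberate simplifications of the paper's actual features.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def duplicate_candidates(new_pr, existing_prs, w_text=0.5, top_k=5):
    """Rank existing pull-requests by combined textual + change
    similarity against a newly submitted one."""
    def score(pr):
        text_sim = jaccard(new_pr["text"].lower().split(),
                           pr["text"].lower().split())
        change_sim = jaccard(new_pr["files"], pr["files"])
        return w_text * text_sim + (1 - w_text) * change_sim
    return sorted(existing_prs, key=score, reverse=True)[:top_k]

new_pr = {"text": "fix crash when config file is missing",
          "files": ["core/config.py", "tests/test_config.py"]}
existing = [{"text": "handle missing config file crash",
             "files": ["core/config.py"]},
            {"text": "update readme", "files": ["README.md"]}]
print(duplicate_candidates(new_pr, existing))
```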
Temporal knowledge graph question answering (TKGQA) is a significantly challenging task, due to the temporal constraints hidden in questions and the answers sought from dynamic structured knowledge. Although large lang...