In the task of Knowledge Graph Completion (KGC), the existing datasets and their inherent subtasks carry a wealth of shared knowledge that can be utilized to enhance the representation of knowledge triplets and overal...
ISBN (digital): 9798331509712
ISBN (print): 9798331509729
The self-attention mechanism is the core component of the Transformer, providing a powerful ability to understand sequence context. However, it also incurs a large amount of redundant computation. Model sparsification can effectively reduce the computational load, but the irregular non-zeros introduced by sparsification significantly decrease hardware efficiency. This paper proposes Funnel, an accelerator that dynamically predicts sparse attention patterns and efficiently processes unstructured sparse data. First, we adopt a fast quantization method based on a lookup table to minimize the cost of sparse-pattern prediction. Second, we propose the Funnel Computing Unit (FCU), a hardware architecture that efficiently handles sparse attention through multi-dataflow fusion. Sampled Dense-Dense Matrix Multiplication (SDDMM) and Sparse-Dense Matrix Multiplication (SpMM) are the core kernels of the sparse attention mechanism. The FCU unifies the matrix inner-product and row-wise-product dataflows to support SDDMM and SpMM simultaneously, greatly reducing the storage and movement overhead of intermediate results. Lastly, we devise a lightweight buffer and data-tiling strategy tailored to the proposed accelerator to enhance data reuse. Experiments demonstrate that our accelerator achieves 0.10-0.25 sparsity with small accuracy loss. When computing the self-attention layer, it attains hardware efficiency ranging from 60% to 85%. Compared to CPU and GPU, it achieves 5.60x and 8.20x speedups, respectively. Compared to the state-of-the-art attention accelerators A³, SpAtten, FTRANS, and Sanger, it achieves 7.37x, 4.52x, 9.58x, and 3.08x speedups, respectively.
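The SDDMM/SpMM pairing described above can be pictured in a few lines of NumPy. The sketch below is a minimal functional illustration of mask-based sparse attention, not Funnel's hardware dataflow; the boolean `mask` stands in for the accelerator's predicted sparsity pattern, and all names are hypothetical.

```python
import numpy as np

def sparse_attention(Q, K, V, mask):
    """Mask-based sparse attention built from the two sparse kernels.

    mask: boolean (seq, seq) array standing in for the predicted
    sparsity pattern; every row must keep at least one position
    (e.g., the diagonal), or the softmax is undefined.
    """
    d = Q.shape[-1]
    # SDDMM: evaluate Q @ K.T only at the positions the mask keeps.
    scores = np.full(mask.shape, -np.inf)
    rows, cols = np.nonzero(mask)
    scores[rows, cols] = np.einsum("ij,ij->i", Q[rows], K[cols]) / np.sqrt(d)
    # Row-wise softmax over the surviving scores.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = np.where(mask, probs, 0.0)
    probs /= probs.sum(axis=-1, keepdims=True)
    # SpMM: sparse attention probabilities times dense V.
    return probs @ V
```

A real accelerator would hold `scores` in a compressed sparse format rather than a dense array; the dense layout here is only for readability.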
The key to anomaly detection in time series data streams (TSDS) lies in the ability to adapt to evolving data. Active learning for anomaly detection has shown such ability by leveraging expert feedback. However, many studies in this research line strive to optimize performance by exhausting the query budget without considering query necessity: unnecessary queries may mislead the model into overfitting trivial information and incur additional costs in both human labeling and model execution. This paper proposes Boundary-driven Active Learning for Anomaly Detection (BALAD). BALAD utilizes deep one-class classification to construct a hypersphere boundary that senses data abnormality, and filters out unnecessary queries by dividing the boundary region. We further harness the hypersphere boundary to quantitatively measure data difficulty, and a focal loss is introduced to prioritize hard samples. The boundary is flexibly adapted during each feedback iteration to accommodate changes in the TSDS. Extensive experiments on six datasets demonstrate that BALAD significantly outperforms state-of-the-art anomaly detection methods.
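A rough sketch of the boundary-driven query filter, in the style of deep one-class classification (Deep SVDD): the center, radius, and margin below are illustrative assumptions, not BALAD's actual parameters or update rules.

```python
import numpy as np

def boundary_queries(embeddings, center, radius, margin=0.1):
    """Keep only boundary-region samples as labeling queries.

    Samples deep inside the hypersphere (clearly normal) or far
    outside it (clearly anomalous) are treated as unnecessary
    queries; only ambiguous samples near the boundary go to the
    expert.
    """
    dist = np.linalg.norm(embeddings - center, axis=1)
    lo, hi = radius * (1 - margin), radius * (1 + margin)
    return np.nonzero((dist >= lo) & (dist <= hi))[0]

def focal_weight(dist, radius, gamma=2.0):
    """Focal-style weighting: samples near the boundary are 'hard'
    and get weights near 1; samples far from it are 'easy' and
    their weights decay as (1 - p)**gamma."""
    p = np.clip(np.abs(dist - radius) / radius, 1e-6, 1.0)
    return (1.0 - p) ** gamma
```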
The breadth-first search (BFS) algorithm is a fundamental algorithm in graph theory, and its parallelization can significantly improve performance. Therefore, there have been numerous efforts to leverage the powerfu...
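For context, parallel BFS implementations are usually built on the level-synchronous frontier formulation, where each level's frontier can be expanded concurrently. A minimal sequential sketch of that structure (parallel versions distribute the inner loop across threads):

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: each iteration expands one frontier.

    adj: dict mapping a vertex to its neighbor list.
    """
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        next_frontier = []
        for u in frontier:                  # parallelizable per vertex
            for v in adj.get(u, []):
                if v not in depth:          # needs an atomic visited
                    depth[v] = level + 1    # check when parallelized
                    next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return depth

print(bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, source=0))
```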
ISBN (digital): 9798331509712
ISBN (print): 9798331509729
Large deep neural network (DNN) models have demonstrated exceptional performance across diverse downstream tasks. Sharded data parallelism (SDP) has been widely used to reduce the memory footprint of model states. In a DNN training cluster, a device usually has multiple inter-device links connecting it to other devices, such as NVLink and InfiniBand. However, existing SDP approaches employ only a single link at any given time, incurring significant communication overheads that hinder efficient training. We observe that the inter-device links can work independently without affecting each other. To reduce the substantial communication overhead of distributed training of large DNNs, this paper introduces HSDP, an efficient SDP training approach that utilizes multiple inter-device links simultaneously. HSDP partitions models in a novel fine-grained manner and orchestrates the communication of partitioned parameters with the available inter-device links in mind. This design enables concurrent communication execution and reduces communication overhead. To further optimize training performance, we propose an HSDP planner. The planner first abstracts HSDP's model partitioning and execution into a communication-parallel strategy and builds a cost model to estimate the performance of each strategy. We then formulate strategy search as an optimization problem and solve it with an off-the-shelf solver. Evaluations on representative DNN workloads demonstrate that HSDP achieves up to 1.30× speedup compared to state-of-the-art SDP training approaches.
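The opportunity HSDP exploits shows up even in a back-of-the-envelope cost model: if the links really are independent, splitting one parameter's traffic across NVLink and InfiniBand is bounded by the slower portion rather than executed serially on one link. The bandwidth numbers and the linear latency-bandwidth model below are illustrative assumptions, not measurements from the paper.

```python
def comm_time(volume_gb, bandwidth_gbps, latency_s=5e-6):
    """Linear latency-bandwidth cost model for one collective."""
    return latency_s + volume_gb / bandwidth_gbps

def single_link(volume_gb, bw_fast):
    """Baseline: all traffic on the fastest link."""
    return comm_time(volume_gb, bw_fast)

def multi_link(volume_gb, bw_fast, bw_slow, split):
    """Split one parameter's traffic across two independent links;
    the collective finishes when the slower portion does."""
    t_fast = comm_time(volume_gb * split, bw_fast)
    t_slow = comm_time(volume_gb * (1 - split), bw_slow)
    return max(t_fast, t_slow)

# Illustrative numbers: 1 GB of model states, 300 GB/s NVLink, 25 GB/s IB.
vol, nv, ib = 1.0, 300.0, 25.0
best_split = nv / (nv + ib)   # equalize finish times under the linear model
print(single_link(vol, nv), multi_link(vol, nv, ib, best_split))
```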
Transformer models, such as BERT, GPT, and ViT, have been applied to a wide range of areas in recent years due to their efficacy. To improve the training efficiency of Transformer models, various distributed training approaches have been proposed, such as Megatron-LM [8]. However, when multi-dimensional parallelism strategies are considered, their complexity prevents existing works from harmonizing the different strategies well enough to obtain a globally optimal solution. In this paper, we propose PTIP, a parallelism-strategy search algorithm that generates operator-level parallelism strategies combining three schemes: data parallelism, tensor parallelism, and pipeline parallelism. PTIP abstracts these three parallelism schemes simultaneously into an auxiliary graph, reformulates the search problem as a mixed-integer programming (MIP) problem, and uses a MIP solver to obtain a high-quality multi-dimensional strategy. Experiments conducted on Transformers demonstrate that PTIP obtains a 13.9%-24.7% performance improvement over Megatron-LM [8].
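The MIP reformulation can be illustrated with a toy instance: choose exactly one parallelism scheme per operator to minimize estimated step time under a memory budget. The cost and memory numbers, and the PuLP formulation itself, are illustrative stand-ins for PTIP's actual auxiliary-graph model.

```python
import pulp

ops = ["embed", "attn", "mlp", "head"]
schemes = ["data", "tensor", "pipeline"]
# Hypothetical per-(op, scheme) time costs and memory footprints.
cost = {("embed", "data"): 4, ("embed", "tensor"): 3, ("embed", "pipeline"): 5,
        ("attn", "data"): 9, ("attn", "tensor"): 6, ("attn", "pipeline"): 7,
        ("mlp", "data"): 8, ("mlp", "tensor"): 5, ("mlp", "pipeline"): 6,
        ("head", "data"): 3, ("head", "tensor"): 4, ("head", "pipeline"): 4}
mem = {k: {"data": 4, "tensor": 2, "pipeline": 3}[k[1]] for k in cost}
BUDGET = 11

prob = pulp.LpProblem("strategy_search", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", cost, cat="Binary")  # x[op, scheme] = chosen?
prob += pulp.lpSum(cost[k] * x[k] for k in cost)    # minimize estimated time
for op in ops:                                      # exactly one scheme per op
    prob += pulp.lpSum(x[(op, s)] for s in schemes) == 1
prob += pulp.lpSum(mem[k] * x[k] for k in cost) <= BUDGET  # memory budget
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({op: s for (op, s) in cost if x[(op, s)].value() == 1})
```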
The size of deep learning models has been increasing to enhance model quality. Since the training computation budget grows linearly with model size, training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Experts (MoE) has drawn significant attention, as it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, a considerable dynamic load imbalance occurs among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address this challenge. System-level solutions draw more attention for their hardware affinity and non-disruption of model convergence compared to algorithm-level ones. However, they suffer from high communication costs and poor communication-computation overlapping. To address these challenges, we propose a systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models. To adapt to the dynamic load imbalance, we profile training statistics and use them to design Pro-Prophet. For lower communication volume, the Pro-Prophet planner determines a series of lightweight load-balancing strategies and, based on the statistics, efficiently searches for a communication-efficient one for training. For sufficient overlap of communication and computation, the Pro-Prophet scheduler schedules data-dependent operations based on the statistics and operation features, further improving training throughput. We conduct extensive experiments on four clusters and five MoE models. The results indicate that Pro-Prophet achieves up to 2.66x speedup compared to two popular MoE frameworks, Deepspeed-MoE and FasterMoE. Furthermore, Pro-Prophet demonstrates a load-balancing improvement of up to 11.01x compared to a representative load-balancing work.
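The planner's job can be approximated by a toy greedy variant: given profiled per-expert token counts, replicate hot experts while the saved computation outweighs the extra parameter communication. All numbers and the halving heuristic below are assumptions for illustration, not Pro-Prophet's actual strategy search.

```python
def plan_replication(expert_load, capacity, comm_cost_per_expert):
    """Greedily replicate the most overloaded experts.

    expert_load: profiled tokens routed to each expert this step.
    Returns the replication plan, the balanced loads, and the
    estimated communication spent; a crude stand-in for a
    cost-model-driven load-balancing search.
    """
    plan, total_comm = [], 0.0
    load = dict(expert_load)
    while max(load.values()) > capacity:
        e = max(load, key=load.get)          # hottest expert
        saved = load[e] / 2                  # one replica halves its load
        if saved <= comm_cost_per_expert:    # not worth the broadcast
            break
        load[e] -= saved
        total_comm += comm_cost_per_expert
        plan.append(e)
    return plan, load, total_comm

# Hypothetical profile: expert 2 is a hotspot.
print(plan_replication({0: 120, 1: 90, 2: 640, 3: 110}, capacity=256,
                       comm_cost_per_expert=50.0))
```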
ISBN (digital): 9798350359312
ISBN (print): 9798350359329
Temporal Knowledge Graph Completion (TKGC) aims to predict the missing parts of quadruples, which is crucial for real-life knowledge graphs. Compared with methods that only use graph neural networks, the emergence of pre-trained language models has introduced a trend of leveraging text and graph structure information simultaneously. However, most current methods based on pre-trained models struggle to effectively utilize both text and multi-hop graph structure information concurrently, resulting in insufficient mining of associations between relations. To address this challenge, we propose a novel model: Temporal Closing Path for Pre-trained Language Model-based TKGC (TCP-PLM). We obtain the temporal closing relation path of the target relation through sampling and use the relation path as a bridge to exploit text and multi-hop graph structure information simultaneously. Moreover, the relation path serves as a tool for mining associations between relations. In addition, because the relation paths are designed to be entity-independent, our model can also handle the inductive setting. Our experiments on three benchmarks, along with extensive analysis, demonstrate that our model not only achieves substantial performance gains across four metrics compared to other models but also adeptly handles inductive settings.
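The "temporal closing relation path" idea can be pictured with a small sampler: given a target quadruple (h, r, t, τ), collect alternative relation paths from h to t whose edges all occur no later than τ, keeping only the relation sequence; discarding the intermediate entities is what makes the paths entity-independent and hence usable in the inductive setting. The data layout and names below are assumptions for illustration.

```python
from collections import defaultdict

def sample_closing_paths(quads, head, tail, t_max, max_hops=2):
    """Collect entity-independent relation paths head -> tail whose
    edges all occur no later than t_max (temporally closing the
    target relation). Only the relation sequence is kept."""
    out = defaultdict(list)                       # entity -> [(rel, dst)]
    for h, r, t, ts in quads:
        if ts <= t_max:
            out[h].append((r, t))
    paths, stack = set(), [(head, ())]
    while stack:
        node, rels = stack.pop()
        if node == tail and rels:
            paths.add(rels)                       # record relation sequence
        if len(rels) < max_hops:
            for r, nxt in out[node]:
                stack.append((nxt, rels + (r,)))
    return paths

quads = [("A", "member_of", "B", 1), ("B", "ally_of", "C", 2),
         ("A", "visits", "C", 3)]
print(sample_closing_paths(quads, "A", "C", t_max=3))
```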
Communication and coordination between OSS developers who do not work physically in the same location have always been a challenging issue. The pull-based development model, as the state-of-the-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors' work. However, duplicate contributions may still be submitted by more than one contributor to solve the same problem, due to the parallel and uncoordinated nature of this mechanism. If not detected in time, duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach combining textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time. For a new-arriving contribution, we first compute the textual similarity and change similarity between it and each existing contribution. Our method then returns a list of candidate duplicate contributions that are most similar to the new contribution in terms of the combined textual and change similarity. The evaluation shows that 83.4% of the duplicates can be found on average when we use the combined textual and change similarity, compared to 54.8% using only textual similarity and 78.2% using only change similarity.
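The combined ranking can be approximated in a few lines: one similarity over the pull-request text, another over the set of touched files, merged into a single score. The token-level Jaccard measures and the equal weighting below are deliberate simplifications of the paper's actual features.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def duplicate_candidates(new_pr, existing_prs, w_text=0.5, top_k=5):
    """Rank existing pull-requests by combined textual + change
    similarity against a newly submitted one."""
    def score(pr):
        text_sim = jaccard(new_pr["text"].lower().split(),
                           pr["text"].lower().split())
        change_sim = jaccard(new_pr["files"], pr["files"])
        return w_text * text_sim + (1 - w_text) * change_sim
    return sorted(existing_prs, key=score, reverse=True)[:top_k]

new_pr = {"text": "fix crash when config file is missing",
          "files": ["core/config.py", "tests/test_config.py"]}
existing = [{"text": "handle missing config file crash",
             "files": ["core/config.py"]},
            {"text": "update readme", "files": ["README.md"]}]
print(duplicate_candidates(new_pr, existing))
```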
Temporal knowledge graph question answering (TKGQA) is a significantly challenging task, due to the temporal constraints hidden in questions and the answers sought from dynamic structured knowledge. Although large lang...