Search results - Inner Mongolia University Library

IEEE International Conference on Joint Cloud Computing (JCC)

Authors: Zhilin Yang, Yu Tang, Linbo Qiao, Xi Yang, Zhen Huang. National Key Laboratory of Parallel and Distributed Computing, College of Computer Science, National University of Defense Technology, Changsha 410073, China

The scale of model parameters and the amount of training data are increasing exponentially, and the GPU memory required grows with the number of model parameters. Recomputation and swapping are the two main memory optimization methods; both have been studied extensively, and some optimization strategies combine them. However, most of these strategies are based on heuristic search, which does not explore the complete solution space and cannot guarantee an optimal solution. Large-scale model training calls for an optimal search strategy with tensor-level recomputation and swapping. In this paper, we propose an optimal strategy searching algorithm combining tensor-based recomputation and swapping. Specifically, the memory swapping strategy is reformulated as an optimization problem, which converts the memory constraints into mixed integer programming, to find the optimal memory optimization strategy. By leveraging the advantages of both recomputation and swapping, this approach minimizes computation consumption without exceeding the available memory limit. Experimental results show that our method achieves about a 60% reduction in memory requirements during training. Furthermore, our method reduces overall training time relative to existing algorithms. Compared to Checkmate, our approach achieves about a 0.3–0.9% reduction in computation cost per iteration.
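To make the kind of search described in this abstract concrete, here is a minimal sketch, not the authors' algorithm: each tensor is either kept in GPU memory, recomputed, or swapped, and the cheapest plan that fits a memory budget is found by exhaustive enumeration instead of a mixed integer programming solver. The tensor sizes, the costs, and the names `TENSORS` and `best_plan` are hypothetical, and tensor lifetimes are ignored for simplicity.

```python
from itertools import product

# Hypothetical per-tensor data; none of these numbers come from the paper.
# Each tensor: (memory footprint, recompute cost, swap cost)
TENSORS = [
    (4, 3.0, 5.0),
    (2, 1.0, 4.0),
    (6, 8.0, 2.0),
    (3, 2.0, 2.5),
]

KEEP, RECOMPUTE, SWAP = 0, 1, 2


def best_plan(tensors, budget):
    """Exhaustively pick keep/recompute/swap per tensor to minimize overhead.

    A kept tensor occupies memory and adds no cost; a recomputed or swapped
    tensor frees its memory but adds the corresponding time cost.
    Returns (total_overhead, plan), or None if no plan fits the budget.
    """
    best = None
    for plan in product((KEEP, RECOMPUTE, SWAP), repeat=len(tensors)):
        # Only kept tensors count against the memory budget.
        memory = sum(t[0] for t, c in zip(tensors, plan) if c == KEEP)
        if memory > budget:
            continue
        # Index 1 is the recompute cost and index 2 the swap cost.
        overhead = sum(t[c] for t, c in zip(tensors, plan) if c != KEEP)
        if best is None or overhead < best[0]:
            best = (overhead, plan)
    return best
```

With a budget of 9 memory units, `best_plan(TENSORS, 9)` swaps only the third tensor, because that alone frees enough memory at the lowest cost. Enumeration is exponential in the number of tensors, which is why a real system would hand the same constraints to an integer programming solver.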

A Novel Interactive Recurrent Attention Network for Emotion-Cause Pair Extraction

3rd International Conference on Algorithms, Computing and Artificial Intelligence, ACAI 2020

Authors: Jia, Xiangyu; Chen, Xinhai; Wan, Qian; Liu, Jie. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China

ISBN: (print) 9781450388115

Unlike the Emotion Cause Extraction (ECE) task, which consists of pre-annotated emotions and a passage, emotion-cause pair extraction (ECPE) aims at extracting potential emotions and their corresponding causes from a document without the need for pre-annotations. Traditional ECPE solutions divide the extraction of emotions and causes into two separate parts. However, separating the bidirectional dependence between emotion and cause may lose a lot of potentially useful information. In this paper, we propose a novel interactive recurrent attention network (IRAN). Our approach focuses on the bidirectional impact between emotions and causes, and extracts emotions and causes simultaneously. The information in the document can be fully exploited through multiple rounds of modeling and information extraction. Our emotion-specific transformation and distance fusion correlation can adaptively focus on the emotions and the distance, and gracefully incorporate them into a distinguishable neural network attention framework. The experimental results show that our proposed model achieves better performance than other widely used models on the ECPE corpus. © 2020 ACM.

Keywords: Sentiment analysis

A Connectivity-Enhanced Multi-Task Learning based on Anatomical Priors for 3D Class-Balanced Pulmonary Airway Segmentation

IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Authors: Yan Jia, Yong Dou, Peng Qiao, Yuqing Cheng, Kele Xu, Zhouyu He. National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China; College of Systems Engineering, National University of Defense Technology, Changsha, China

ISBN: (digital) 9798350386226

ISBN: (print) 9798350386233

Accurate and efficient airway segmentation is essential for evaluating pulmonary diseases, aiding diagnosis, reducing the preoperative burden of airway identification, and minimizing patient discomfort during prolonged surgeries. However, current pulmonary airway reconstruction techniques are hindered by two major challenges: difficulty in accurately reconstructing fine airway branches due to the tendency to overlook small targets, and insufficient structural connectivity leading to frequent branch discontinuities within the airway tree. These limitations directly affect the clinical applicability of reconstructed airways. To overcome these challenges, a novel multi-task framework for 3D pulmonary airway segmentation is proposed, designed to enhance the performance of existing backbone models. This approach integrates Anatomical Prior-Based Multi-Task Learning (AP-MTL) through the use of Gaussian-constructed connectivity-enhanced isosurfaces, significantly improving the network’s ability to maintain airway continuity. Additionally, a Class-Balanced CT Density Distribution Reconstruction mechanism (DDR-CB) is introduced, further refining the model’s capability to detect and segment fine airway branches. As a result of these enhancements, the model demonstrates an 11.5% average improvement in segmentation accuracy and connectivity compared to the baseline. The source code is publicly accessible at https://***/inexhaustible419/APMTLAirwaySegment.

Keywords: Image segmentation; Three-dimensional displays; Accuracy; Lungs; Atmospheric modeling; Supervised learning; Surgery; Multitasking; Image reconstruction; Biomedical imaging

Merak: An Efficient distributed DNN Training Framework with Automated 3D parallelism for Giant Foundation Models

arXiv, 2022

Authors: Lai, Zhiquan; Li, Shengwei; Tang, Xudong; Ge, Keshi; Liu, Weijie; Duan, Yabo; Qiao, Linbo; Li, Dongsheng. The National Laboratory for Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha, Hunan, China

Foundation models are in the process of becoming the dominant deep learning technology. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being computing-intensive, the pretraining process is extremely memory- and communication-intensive. These challenges make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. However, current 3D parallelism frameworks still encounter two issues: i) they are not transparent to model developers, requiring manual model modification to parallelize training, and ii) their utilization of computation resources, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak automatically deploys 3D parallelism with an automatic model partitioner, which includes a graph-sharding algorithm and a proxy node-based model graph. Merak also offers a non-intrusive API to scale out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine that employs several techniques to exploit available training resources, including a shifted critical path pipeline schedule that increases computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs demonstrate that Merak speeds up training of models with 1.5, 2.5, 8.3, and 20 billion parameters over state-of-the-art 3D parallelism frameworks by up to 1.42×, 1.39×, 1.43×, and 1.61×, respectively. The code for Merak has been open-sourced at https://***/hpdl-group/Merak. Copyright © 2022, The Authors. All rights reserved.

Keywords: Pipelines

SCGraph: Accelerating Sample-based GNN Training by Staged Caching of Features on GPUs

IEEE International Conference on Big Data and Cloud Computing (BdCloud)

Authors: Yuqi He, Zhiquan Lai, Zhejiang Ran, Lizhi Zhang, Dongsheng Li. National Key Laboratory of Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha, China

Graph neural networks (GNNs) have become important tools for processing structured graph data and have been successfully applied to multiple graph-based application scenarios. Existing GNN systems adopt sample-based training on large-scale graphs over multiple GPUs. Although they support large-scale graph training, the large data-loading overhead of transferring vertex features between CPUs and GPUs is still a bottleneck. In this work, we propose SCGraph, a method that supports high-speed feature caching on GPUs. SCGraph classifies the graph vertices after sorting them by out-degree. For high out-degree vertices, SCGraph sets up graded caches across different GPUs, increasing the overall cache content through high-speed NVLink data transmission between them. For low out-degree vertices, SCGraph expands the neighborhood of the training vertices in advance to regenerate the cache. We evaluate SCGraph against two state-of-the-art industrial GNN frameworks, i.e., DGL and PaGraph, on various benchmarks. Experimental results show that SCGraph improves the cache hit rate on GPUs by up to 23.6% and achieves up to a 1.71x speedup over the state-of-the-art baselines, while convergence remains almost unchanged.
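The idea of ranking vertices by out-degree to decide what to cache can be sketched as follows. This is only an illustration of degree-based feature caching on a single device, not SCGraph's staged multi-GPU design; the function names and the access trace are hypothetical.

```python
from collections import Counter


def out_degrees(edges):
    """Count outgoing edges per source vertex from (src, dst) pairs."""
    return Counter(src for src, _ in edges)


def pick_cache(edges, capacity):
    """Return the set of `capacity` vertices with the highest out-degree.

    Ties are broken by vertex id so the result is deterministic.
    """
    deg = out_degrees(edges)
    ranked = sorted(deg, key=lambda v: (-deg[v], v))
    return set(ranked[:capacity])


def hit_rate(cache, accesses):
    """Fraction of feature lookups in `accesses` served from the cache."""
    if not accesses:
        return 0.0
    return sum(1 for v in accesses if v in cache) / len(accesses)
```

For example, with edges `[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]` and a capacity of 2, vertices 0 and 1 are cached. The assumption behind the heuristic is that a vertex with many outgoing edges is sampled as a neighbor more often, so its features are requested more often.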

Keywords: Training; Memory management; Loading; Graphics processing units; Benchmark testing; Graph neural networks; Data communication

Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on Image Restoration

arXiv, 2023

Authors: Qiao, Peng; Liu, Sidun; Sun, Tao; Yang, Ke; Dou, Yong. The Science and Technology on Parallel and Distributed Laboratory, School of Computer, National University of Defense Technology, Changsha, China; National Innovation Institute of Defense Technology, Beijing, China

The great success of Deep Neural Networks (DNNs) has inspired the algorithmic development of DNN-based Fixed-Point (DNN-FP) methods for computer vision tasks. DNN-FP methods, trained by Back-Propagation Through Time or by computing an inaccurate inversion of the Jacobian, suffer from inferior representation ability. Motivated by the representation power of the Transformer, we propose a framework, called FPformer, that unrolls the FP and approximates each unrolled process via Transformer blocks. To reduce the high consumption of memory and computation, we propose FPRformer, which shares parameters between successive blocks. We further design a module that adapts Anderson acceleration to FPRformer to enlarge the number of unrolled iterations and improve performance, called FPAformer. To fully exploit the capability of the Transformer, we apply the proposed model to image restoration, using self-supervised pre-training and supervised fine-tuning. 161 tasks from 4 categories of image restoration problems are used in the pre-training phase. Thereafter, the pre-trained FPformer, FPRformer, and FPAformer are further fine-tuned for the comparison scenarios. Using self-supervised pre-training and supervised fine-tuning, the proposed FPformer, FPRformer, and FPAformer achieve competitive performance with state-of-the-art image restoration methods and better training efficiency. FPAformer uses only 29.82% of the parameters of SwinIR models and provides superior performance after fine-tuning. Training these comparison models takes only 26.9% of the time used to train SwinIR models. It provides a promising way to introduce the Transformer in low-level vision tasks. © 2023, CC BY.

Keywords: Image reconstruction

A Dynamic Mapping Model for General CNN Accelerator Based on FPGA

17th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2020

Authors: Zhao, Xiaoqiang; Jiang, Jingfei; Han, Zhe; Xu, Jinwei; Liu, Zhiqiang. National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, China; Artificial Intelligence Research Center, National Innovation Institute of Defense Technology, Beijing, China

ISBN: (print) 9783030794774

As the application scenarios of convolutional neural networks (CNNs) become more and more complex, the general CNN accelerator based on matrix multiplication has become a new research focus. The existing mapping methods for converting convolution calculation into matrix multiplication need to be improved. This paper proposes a new dynamic mapping model to improve the flexibility and versatility of matrix multiplication. The dynamic mapping model implements two algorithms: the dynamic residue processing mapping algorithm (DRPMA) and the dilated convolution mapping algorithm (DCMA). The former dynamically adjusts the mapping method according to the number of output channels of the convolution layer, improving the utilization of the multiply-accumulate (MAC) array. The latter adds efficient support for dilated CNNs. For demonstration, we implement an accelerator in Verilog on a Xilinx VC709 FPGA board and test some typical CNN models. Experimental results show that the general accelerator achieves high performance and energy efficiency. © 2021, IFIP International Federation for Information Processing.
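For readers unfamiliar with mapping convolution onto matrix multiplication, the following is a minimal sketch of the generic im2col approach with an optional dilation factor. It is a textbook technique shown for illustration, not the paper's DRPMA or DCMA; the function names are hypothetical, and the example assumes a single channel, stride 1, and no padding.

```python
def im2col(image, k, dilation=1):
    """Unfold a 2D image into rows, one row per output position.

    Each row holds the k*k input values touched by a k-by-k kernel with the
    given dilation (stride 1, no padding).
    """
    span = dilation * (k - 1) + 1  # effective extent of the dilated kernel
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - span + 1):
        for j in range(w - span + 1):
            rows.append([image[i + di * dilation][j + dj * dilation]
                         for di in range(k) for dj in range(k)])
    return rows


def conv2d_as_matmul(image, kernel, dilation=1):
    """Compute a single-channel convolution (cross-correlation form, as in
    CNNs) as a matrix-vector product over the unfolded image."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    cols = im2col(image, k, dilation)
    out_w = len(image[0]) - (dilation * (k - 1) + 1) + 1
    # One dot product per unfolded row: the matrix-vector product.
    flat_out = [sum(a * b for a, b in zip(row, flat_kernel)) for row in cols]
    # Reshape the flat result back into a 2D output map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]
```

With multiple output channels, the flattened kernels become the columns of a weight matrix and the product becomes a full matrix multiplication. The number of output channels then sets one dimension of that product, which is the quantity the abstract says the mapping adapts to.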

Keywords: Field programmable gate arrays (FPGA)

DrugProtKGE: Weakly Supervised Knowledge Graph Embedding for Highly-Effective Drug-Protein Interaction Representation

2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023

Authors: Qiu, Yanlong; Wang, Siqi; Yang, Xi; Qiu, Xinyuan; Wu, Chengkun; Cui, Yingbo; Yang, Canqun. National University of Defense Technology, Institute for Quantum Information, State Key Laboratory of High-Performance Computing, College of Computer Science, Hunan, Changsha 410073, China; National Supercomputer Center in Tianjin, Tianjin 300457, China; National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science, Hunan, Changsha 410073, China; National University of Defense Technology, Department of Biology and Chemistry, College of Science, Hunan, Changsha 410073, China

ISBN: (print) 9798350337488

With the exponential growth of biomedical knowledge in unstructured text repositories such as PubMed, it is imperative to establish a knowledge-graph-style, efficiently searchable, and targeted database that can support the information retrieval needs of researchers and clinicians. To mine knowledge from graph databases, most previous methods view a triple in a graph (see Fig. 1) as the basic processing unit and embed the triple elements (i.e. drugs/chemicals, proteins/genes, and their interactions) as separate embedding matrices, which cannot capture the semantic correlation among triple elements. To remedy the loss of semantic correlation caused by disjoint embeddings, we propose a novel approach to learn triple embeddings by combining entities and interactions into a unified representation. Furthermore, traditional methods usually learn triple embeddings from scratch, which cannot take advantage of the rich domain knowledge embedded in pre-trained models; this is another significant reason why they cannot distinguish the differences implied by the same entity in multi-interaction triples. In this paper, we propose a novel fine-tuning based approach to learn better triple embeddings by creating weakly supervised signals from pre-trained knowledge graph embeddings. The method automatically samples triples from knowledge graphs and estimates their pairwise similarity from pre-trained embedding models. The triples are then fed pairwise into a Siamese-like neural architecture, where the triple representation is fine-tuned in a manner bootstrapped by triple similarity scores. Finally, we demonstrate that triple embeddings learned with our method can be readily applied to several downstream applications (e.g. triple classification and triple clustering). We evaluated the proposed method on two open-source drug-protein knowledge graphs constructed from PubMed abstracts, as provided by BioCreative. Our method achieves consistent improvement in both t

Keywords: Drug-Protein Interaction; Knowledge Graph Embedding; Triple Embedding; Weakly Supervised Learning

Research on accelerating CEE-DGCNN event extraction algorithm based on multi-CPU platform

2020 2nd International Conference on Electronics and Communication, Network and Computer Technology, ECNCT 2020

Authors: Li, Yan; Jiang, Jinfei; Xu, Jinwei; Gao, Zikai; Dou, Yong; Zhou, Haifang. Science and Technology on Parallel and Distributed Laboratory, National University of Defense Technology, Changsha 410005, China; Department of Computational Science, National University of Defense Technology, Changsha 410005, China

With the development of the Internet, the volume of information is expanding rapidly, and its complexity makes it particularly important to extract information quickly and intelligently. Event extraction is one such class of fast and accurate information acquisition algorithms. With the development of deep learning in recent years, deep learning has also excelled in the field of event extraction. The CEE-DGCNN network is an efficient and accurate event extraction algorithm for Chinese text. At present, academic research on accelerating deep learning mainly focuses on GPU, FPGA, and other platforms. In fact, compared with these platforms, the CPU is more versatile. Considering that the algorithm should be generally applicable in industrial settings, this paper makes an in-depth analysis of the CEE-DGCNN algorithm to study its acceleration on CPUs. Experiments were carried out on x86 and ARM architectures respectively, and good acceleration was achieved. © 2020 Published under licence by IOP Publishing Ltd.

Keywords: Extraction

Optimizing Irregular-Shaped Matrix-Matrix Multiplication on Multi-Core DSPs

arXiv, 2022

Authors: Yin, Shangfei; Wang, Qinglin; Hao, Ruochen; Zhou, Tianyang; Mei, Songzhu; Liu, Jie. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China; College of Computer, National University of Defense Technology, Changsha 410073, China

General Matrix Multiplication (GEMM) has a wide range of applications in scientific simulation and artificial intelligence. Although traditional libraries can achieve high performance on large regular-shaped GEMMs, they often perform poorly on irregular-shaped GEMMs, which are common in new algorithms and applications of high-performance computing (HPC). Due to energy efficiency constraints, low-power multi-core digital signal processors (DSPs) have become an alternative architecture in HPC systems. Targeting the multi-core DSPs in FT-m7032, a prototype CPU-DSP heterogeneous processor for HPC, an efficient implementation, ftIMM, is proposed for three types of irregular-shaped GEMMs. FtIMM supports automatic generation of assembly micro-kernels, two parallelization strategies, and auto-tuning of block sizes and parallelization strategies. The experiments show that ftIMM outperforms the traditional GEMM implementations on the multi-core DSPs in FT-m7032, yielding up to a 7.2× performance improvement on irregular-shaped GEMMs. ftIMM on multi-core DSPs also far outperforms the open-source library on the multi-core CPUs in FT-m7032, delivering up to 3.1× higher efficiency. Copyright © 2022, The Authors. All rights reserved.

Keywords: Digital signal processors
