Search results - Inner Mongolia University Library

Detailed and clock-driven simulation for HPC interconnection network

Frontiers of Computer Science, 2016, Vol. 10, No. 5, pp. 797-811

Authors: Wenhao ZHOU, Juan CHEN, Chen CUI, Qian WANG, Dezun DONG, Yuhua TANG. State Key Laboratory of High Performance Computing, School of Computer, National University of Defense Technology, Changsha 410073, China; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China

The performance and energy consumption of high performance computing (HPC) interconnection networks are of great significance to the whole supercomputer, and building an HPC interconnection network simulation platform is very important for research on HPC software and hardware technologies. To effectively evaluate the performance and energy consumption of HPC interconnection networks, this article designs and implements a detailed and clock-driven HPC interconnection network simulation platform, called HPC-NetSim. HPC-NetSim uses application-driven workloads and inherits the characteristics of the detailed and flexible cycle-accurate network simulator. Besides, it offers a large set of configurable network parameters in terms of topology and routing, and supports routers' on/off states. We compare the simulated execution time with the real execution time of a Tianhe-2 subsystem, and the mean error is only 2.7%. In addition, we simulate the network behaviors with different network structures and low-power modes. The results are also consistent with the theoretical analyses.

Keywords: high performance computing; clock-driven simulation; interconnection network; BookSim

Detecting Duplicate Contributions in Pull-Based Model Combining Textual and Change Similarities

Journal of Computer Science & Technology, 2021, Vol. 36, No. 1, pp. 191-206

Authors: Zhi-Xing Li, Yue Yu, Tao Wang, Gang Yin, Xin-Jun Mao, Huai-Min Wang. Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, Changsha 410073, China; Laboratory of Software Engineering for Complex Systems, College of Computer, National University of Defense Technology, Changsha 410073, China

Communication and coordination between OSS developers who do not work physically in the same location have always been the challenging *** pull-based development model, as the state-of-the-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors' ***, duplicate contributions may still be submitted by more than one contributor to solve the same problem due to the parallel and uncoordinated nature of this *** not detected in time, duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant *** this paper, we propose an approach combining textual and change similarities to automatically detect duplicate contributions in pull-based model at submission *** a new-arriving contribution, we first compute textual similarity and change similarity between it and other existing *** then our method returns a list of candidate duplicate contributions that are most similar to the new contribution in terms of the combined textual and change *** evaluation shows that 83.4% of the duplicates can be found on average when we use the combined textual and change similarity, compared to 54.8% using only textual similarity and 78.2% using only change similarity.

Keywords: Pull-request; Duplicate detection; textual similarity; change similarity
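
The ranking step described in this abstract (score a new contribution against existing ones on both textual and change similarity, then return the most similar candidates) can be illustrated with a minimal sketch. The abstract does not say which similarity measures or weights the authors use, so everything here is an assumption for illustration: bag-of-words cosine similarity for text, Jaccard overlap of changed file paths for changes, an equal-weight linear combination, and the hypothetical names `rank_candidates`, `weight`, and `top_k`.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens of a title or description."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists (bag-of-words counts)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap of two collections of changed file paths."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def rank_candidates(new_pr, existing, weight=0.5, top_k=3):
    """Return ids of the existing pull requests most similar to new_pr.

    Assumption: the combined score is a weighted sum of the textual and
    change similarities; the paper may combine them differently.
    """
    scored = []
    for pr in existing:
        textual = cosine(tokens(new_pr["text"]), tokens(pr["text"]))
        change = jaccard(new_pr["files"], pr["files"])
        scored.append((weight * textual + (1 - weight) * change, pr["id"]))
    scored.sort(reverse=True)  # highest combined similarity first
    return [pr_id for _, pr_id in scored[:top_k]]
```

Each pull request is assumed to be a dictionary with `id`, `text`, and `files` keys; a real system would draw these from the hosting platform's records and would likely tune the weight on labeled duplicates.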


U-shaped Dual Attention Transformer: An Efficient Transformer Based on Channel and Spatial Attention

4th International Conference on Artificial Intelligence, Robotics, and Communication, ICAIRC 2024

Authors: Zhai, Zhaoyuan; Qiao, Peng; Li, Rongchun; Zhou, Zhen. National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, Changsha, China

ISBN: (print) 9798331531225

Transformer-based methods have demonstrated remarkable performance on image super-resolution tasks. Due to high computational complexity, researchers have been working to achieve a balance between computation costs and performance. Restormer has achieved a commendable balance by utilizing global channel attention. However, its performance is limited by insufficient local pixel reconstruction. In this paper, we propose a U-shaped Dual Attention Transformer (UDAT) with a local-global receptive field, addressing the limitation of Restormer in local pixel reconstruction. We propose a dense window channel attention that enhances the local feature representation and is more efficient in computational complexity. Experiments demonstrate that our UDAT achieves superior performance compared with Restormer on benchmark datasets, surpassing Restormer by 0.57 dB on the Urban100 dataset. On par with SwinIR, our method reduces computational complexity by 3.2 times and improves inference speed by 2 times, achieving a better balance between computational costs and performance. © 2024 IEEE.

Keywords: Pixels

A multidimensional approach of evaluating developers

2nd International Conference on Big Data Engineering, BDE 2020

Authors: Zhang, Changqiang; Chen, Ming. Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, China

ISBN: (print) 9781450377225

In this paper, we propose an approach to assess the ability of developers based on their behavior data from OSS. Specifically, we classify developers' ability into code ability, project management ability, and social ability. Code efficiency is related to the developer's commit record and pull-request record. The developer's project management ability is obtained by tracking the developer's commit record. We use regular-expression matching to map commit behavior to project management behavior and calculate the developer's project management ability according to the proportion of different behaviors. The social ability of developers is related to the data that developers interact with in the open-source community. We mined developer reviews on commits, issues, and gist fragments. By calculating the proportion of positive emotions in developer reviews and the proportion of developers interacting with others in the reviews, the social ability of developers is obtained. We collect behavioral data from 50 random developers. Twitter data is used to test the effect of different machine learning algorithms on the accuracy of developer comment polarity judgments. It is found that the combination of SVM, XGBoost, and random forest has the highest prediction accuracy. Finally, we select 5 students to score the results on a Likert scale. Their scores show that the results are basically in line with expectations. © 2020 ACM.

Keywords: Decision trees

Communication Analysis for Multidimensional Parallel Training of Large-scale DNN Models

25th IEEE International Conferences on High Performance Computing and Communications, 9th International Conference on Data Science and Systems, 21st IEEE International Conference on Smart City and 9th IEEE International Conference on Dependability in Sensor, Cloud and Big Data Systems and Applications, HPCC/DSS/SmartCity/DependSys 2023

Authors: Lai, Zhiquan; Hao, Yanqi; Li, Shengwei; Li, Dongsheng. College of Computer, National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, Changsha, China

ISBN: (print) 9798350330014

Multidimensional parallel training has been widely applied to train large-scale deep learning models like GPT-3. The efficiency of parameter communication among training devices/processes is often the performance bottleneck of large model training. Analysis of parameter communication mode and traffic has important reference significance for the research of interconnection network design and computing task scheduling to improve the training performance. In this paper, we analyze the parametric communication modes in typical 3D parallel training (data parallelism, pipeline parallelism, and tensor parallelism), and model the traffic in different communication modes. Finally, taking GPT-3 as an example, we present the communication in its 3D parallel training. © 2023 IEEE.

Keywords: Deep neural networks

Funnel: An Efficient Sparse Attention Accelerator with Multi-Dataflow Fusion

22nd IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2024

Authors: Ma, Shenghong; Xu, Jinwei; Jiang, Jingfei; Wang, Yaohua; Li, Dongsheng. National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, College of Computer, Changsha, China

ISBN: (print) 9798331509712

The self-attention mechanism is the core component of Transformer, which provides a powerful ability to understand the sequence context. However, the self-attention mechanism also suffers from a large amount of redundant computation. Model sparsification can effectively reduce computational load, but the irregularity of non-zeros introduced by sparsification significantly decreases hardware efficiency. This paper proposes Funnel, an accelerator that dynamically predicts sparse attention patterns and efficiently processes unstructured sparse data. Firstly, we adopt a fast quantization method based on a lookup table to minimize the cost of sparse pattern prediction. Secondly, we propose the Funnel Computing Unit (FCU), a hardware architecture that efficiently handles sparse attention through multi-dataflow fusion. Sampled Dense-Dense Matrix Multiplication (SDDMM) and Sparse-Dense Matrix Multiplication (SpMM) are core components of the sparse attention mechanism. FCU unifies the computation of matrix inner product and row-wise product to support SDDMM and SpMM at the same time, which greatly reduces the storage and movement overhead of intermediate results. Lastly, we devise a lightweight buffer and data tiling strategy tailored to the proposed accelerator, aimed at enhancing data reuse. Experiments demonstrate that our accelerator achieves 0.10-0.25 sparsity with small accuracy loss. When computing the self-attention layer, it attains hardware efficiency ranging from 60% to 85%. Compared to CPU and GPU, it achieves 5.60x and 8.20x speedup. Compared to the state-of-the-art attention accelerators A3, SpAtten, FTRANS, and Sanger, it achieves 7.37x, 4.52x, 9.58x, and 3.08x speedup. © 2024 IEEE.

Keywords: FPGA; Funnel computing unit; Sparse attention; Transformer

Speculative symbolic execution

2012 IEEE 23rd International Symposium on Software Reliability Engineering, ISSRE 2012

Authors: Zhang, Yufeng; Chen, Zhenbang; Wang, Ji. National Laboratory for Parallel and Distributed Processing, Department of Computing Science, National University of Defense Technology, Changsha, China

ISBN: (print) 9780769548883

Symbolic execution is an effective path-oriented and constraint-based program analysis technique. Recently, there has been significant development in the research and application of symbolic execution. However, symbolic execution still suffers from the scalability problem in practice, especially when applied to large-scale or very complex programs. In this paper, we propose a new form of symbolic execution, named Speculative Symbolic Execution (SSE), to speed up symbolic execution by reducing the number of constraint solver invocations. In SSE, when encountering a branch statement, the search procedure may speculatively explore the branch without regard to its feasibility. The constraint solver is invoked only when the speculated branches have accumulated to a specified number. In addition, we present a key optimization technique that enhances SSE greatly. We have implemented SSE and the optimization technique on Symbolic Pathfinder (SPF). Experimental results on six programs show that our method can reduce the number of constraint solver invocations by 20.7% to 48.7% (with an average of 29.9%), and reduce the search time by 23.6% to 43.6% (with an average of 30%). © 2012 IEEE.

Keywords: Java programming language

High Performance Interconnect Network for Tianhe System

Journal of Computer Science & Technology, 2015, Vol. 30, No. 2, pp. 259-272

Authors: 廖湘科, 庞征, 王克非, 卢宇彤, 谢旻, 夏军, 董德尊, 所光. College of Computer, National University of Defense Technology, Changsha 410073, China; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China; State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, China

In this paper, we present the Tianhe-2 interconnect network and message passing services. We describe the architecture of the router and network interface chips, and highlight a set of hardware and software features that effectively support high performance communications, including remote direct memory access, collective optimization, hardware-enabled reliable end-to-end communication, user-level message passing services, etc. Measured hardware performance results are also presented.

Keywords: Tianhe-2 supercomputer; interconnect network; router architecture; network interface architecture; user-level message passing

A clustering-based approach for mining dockerfile evolutionary trajectories

Science China (Information Sciences), 2019, Vol. 62, No. 1, pp. 211-213

Authors: Yang ZHANG, Huaimin WANG, Vladimir FILKOV. Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology; College of Computer, National University of Defense Technology; DECAL Lab, University of California; Computer Science Department, University of California

Dear editor, Docker, as a de facto industry standard [1], enables the packaging of an application with all its dependencies and execution environment in a light-weight, self-contained unit, i.e., *** launching the container from Docker image, developers can easily share the same operating system, libraries, and binaries [2]. As the configuration file, the dockerfile plays an important role,

Keywords: A clustering-based approach for mining dockerfile evolutionary trajectories

Efficient Large Models Fine-tuning on Commodity Servers via Memory-balanced Pipeline Parallelism

Authors: Liu, Yujie; Lai, Zhiquan; Liu, Weijie; Wang, Wei; Li, Dongsheng. College of Computer, National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, Changsha, China

ISBN: (print) 9798350330014

Large models have achieved impressive performance in many downstream tasks. Using pipeline parallelism to fine-tune large models on commodity GPU servers is an important way to make the excellent performance of large models available to the general public. Previous solutions fail to achieve an efficient memory-balanced pipeline parallelism. In this poster, we introduce a memory load-balanced pipeline parallel solution. This solution balances memory consumption across stages on commodity GPU servers via NVLink bridges. It establishes a new pathway to offload data from GPU to CPU by using the PCIe link of adjacent GPUs connected by the NVLink bridge. Furthermore, our method orchestrates offload operations to minimize the offload latency during large model fine-tuning. Experiments demonstrate that our solution can balance the memory footprint among pipeline stages without sacrificing training performance. © 2023 IEEE.

Keywords: Program processors
