Search Results - Inner Mongolia University Library

Increasing Momentum-Like Factors: A Method for Reducing Training Errors on Multiple GPUs

Tsinghua Science and Technology, 2022, Vol. 27, No. 1, pp. 114-126

Authors: Yu Tang, Zhigang Kan, Lujia Yin, Zhiquan Lai, Zhaoning Zhang, Linbo Qiao, Dongsheng Li (Science and Technology on Parallel and Distributed Processing Laboratory and College of Computer Science and Technology, National University of Defense Technology, Changsha 473000, China)

In distributed training, increasing batch size can improve parallelism, but it can also bring many difficulties to the training process and cause training *** this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 by using Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) while keeping the total batch size in the parameter server constant and lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum to eliminate training errors in distributed training is *** define a Momentum-like Factor (MF) to represent the influence of former gradients on parameter updates in each ***, we modify the MF values and conduct experiments to explore how different MF values influence the training performance based on SGD, Adam, and Nesterov accelerated *** results reveal that increasing MFs is a reliable method for reducing training errors in distributed *** analysis of convergent conditions in distributed training with consideration of a large batch size and multiple GPUs is presented in this paper.

Keywords: multiple Graphics Processing Units (GPUs); batch size; training error; distributed training; momentum-like factors
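
To make the idea of "the influence of former gradients on parameter updates" concrete, here is a minimal sketch of classical heavy-ball SGD on a toy quadratic. The abstract does not give the paper's definition of the Momentum-like Factor, so the coefficient `beta` below is only a stand-in for it, and the function name, learning rate, and objective are illustrative assumptions.

```python
def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=100):
    """Heavy-ball SGD: v <- beta * v + g, then w <- w - lr * v.

    `beta` weights how strongly former gradients carry into the current
    update (an assumed stand-in for a momentum-like factor).
    """
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        v = beta * v + g   # accumulate former gradients, scaled by beta
        w = w - lr * v     # parameter update uses the accumulated velocity
    return w


# Toy objective f(w) = 0.5 * w**2, whose gradient is w and minimum is 0.
for beta in (0.0, 0.5, 0.9):
    print(beta, sgd_momentum(lambda w: w, w0=1.0, beta=beta))
```

With `beta=0.0` the update reduces to plain gradient descent; larger values let earlier gradients contribute more to each step.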

An intelligent mesh-smoothing method with graph neural networks

Frontiers of Information Technology & Electronic Engineering, 2025, Vol. 26, No. 3, pp. 367-384

Authors: Zhichao WANG, Xinhai CHEN, Junjun YAN, Jie LIU (Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China; Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha 410073, China)

In computational fluid dynamics (CFD), mesh-smoothing methods are widely used to refine the mesh quality for achieving high-precision numerical ***, optimization-based smoothing is used for high-quality mesh smoothing, but it incurs significant computational *** works have improved its smoothing efficiency by adopting supervised learning to learn smoothing methods from high-quality ***, they pose difficulties in smoothing the mesh nodes with varying degrees and require data augmentation to address the node input sequence ***, the required labeled high-quality meshes further limit the applicability of the proposed *** this paper, we present graph-based smoothing mesh net (GMSNet), a lightweight neural network model for intelligent mesh *** adopts graph neural networks (GNNs) to extract features of the node’s neighbors and outputs the optimal node *** smoothing, we also introduce a fault-tolerance mechanism to prevent GMSNet from generating negative volume *** a lightweight model, GMSNet can effectively smooth mesh nodes with varying degrees and remain unaffected by the order of input data. A novel loss function, MetricLoss, is developed to eliminate the need for high-quality meshes, which provides stable and rapid convergence during *** compare GMSNet with commonly used mesh-smoothing methods on two-dimensional (2D) triangle *** results show that GMSNet achieves outstanding mesh-smoothing performances with 5% of the model parameters compared to the previous model, but offers a speedup of 13.56 times over the optimization-based smoothing.

Keywords: Unstructured mesh; Mesh smoothing; Graph neural network; Optimization-based smoothing

FMCC-RT: a scalable and fine-grained all-reduce algorithm for large-scale SMP clusters

Science China (Information Sciences), 2025, Vol. 68, No. 5, pp. 362-379

Authors: Jintao PENG, Jie LIU, Jianbin FANG, Min XIE, Yi DAI, Zhiquan LAI, Bo YANG, Chunye GONG, Xinjun MAO, Guo MAO, Jie REN (School of Computer Science and Technology, National University of Defense Technology; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology; Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology; National Supercomputer Center in Tianjin; School of Computer Science, Shaanxi Normal University)

All-reduce is a widely used communication technique for distributed and parallel applications, typically implemented using either a tree-based or a ring-based scheme. Each of these approaches has its own limitations: tree-based schemes struggle to exchange large messages efficiently, while ring-based solutions assume constant communication throughput, an unrealistic expectation in modern network communication infrastructures. We present FMCC-RT, an all-reduce approach that combines the advantages of tree- and ring-based implementations while mitigating their drawbacks. FMCC-RT dynamically switches between tree- and ring-based implementations depending on the size of the message being processed. It utilizes an analytical model to assess the impact of message sizes on the achieved throughput, enabling the derivation of optimal work partitioning parameters. Furthermore, FMCC-RT is designed with an Open MPI-compatible API, requiring no modification to user code. We evaluated FMCC-RT through micro-benchmarks and real-world application tests. Experimental results show that FMCC-RT outperforms state-of-the-art tree- and ring-based methods, achieving speedups of up to 5.6×.

Keywords: all-reduce; collective communication; MPI; scalability
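
The contrast between tree-based and ring-based all-reduce, and the idea of switching by message size, can be sketched with a small single-process simulation. This is not FMCC-RT: its analytical model and work partitioning are not described in the abstract, and the threshold `SWITCH_BYTES`, the function names, and the list-of-lists "ranks" are all assumptions made for illustration.

```python
def chunk_bounds(n, p):
    """Split range(n) into p contiguous chunks (some may be empty)."""
    return [(i * n // p, (i + 1) * n // p) for i in range(p)]


def ring_allreduce(bufs):
    """Ring all-reduce (sum): reduce-scatter followed by all-gather."""
    p, n = len(bufs), len(bufs[0])
    bufs = [list(b) for b in bufs]
    bounds = chunk_bounds(n, p)
    for s in range(p - 1):  # reduce-scatter: each rank adds one chunk per step
        msgs = [((r - s) % p, r) for r in range(p)]
        data = [bufs[r][bounds[c][0]:bounds[c][1]] for c, r in msgs]
        for (c, r), payload in zip(msgs, data):
            dst, lo = (r + 1) % p, bounds[c][0]
            for i, x in enumerate(payload):
                bufs[dst][lo + i] += x
    for s in range(p - 1):  # all-gather: circulate the fully reduced chunks
        msgs = [((r + 1 - s) % p, r) for r in range(p)]
        data = [bufs[r][bounds[c][0]:bounds[c][1]] for c, r in msgs]
        for (c, r), payload in zip(msgs, data):
            lo, hi = bounds[c]
            bufs[(r + 1) % p][lo:hi] = payload
    return bufs


def tree_allreduce(bufs):
    """Binomial-tree reduce to rank 0, then broadcast (modeled as a copy)."""
    p = len(bufs)
    bufs = [list(b) for b in bufs]
    step = 1
    while step < p:
        for r in range(0, p, 2 * step):
            if r + step < p:
                bufs[r] = [a + b for a, b in zip(bufs[r], bufs[r + step])]
        step *= 2
    return [list(bufs[0]) for _ in range(p)]


SWITCH_BYTES = 64 * 1024  # hypothetical threshold, not taken from the paper


def allreduce(bufs, item_bytes=8):
    """Pick the tree for small messages and the ring for large ones."""
    nbytes = len(bufs[0]) * item_bytes
    return tree_allreduce(bufs) if nbytes < SWITCH_BYTES else ring_allreduce(bufs)


ranks = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
print(allreduce(ranks))
```

Both routines produce the same sums on every rank; they differ in how many steps they take and how much data moves per step, which is why the message size is a natural switching criterion.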

A Heterogeneous KBA Parallel Algorithm for the Cartesian Discrete Ordinates for Multizone Heterogeneous System

8th International Conference on Computer and Communication Systems, ICCCS 2023

Authors: Li, Runhua; Liu, Jie (National University of Defense Technology, Science and Technology on Parallel and Distributed Processing Laboratory, Changsha, China)

ISBN: (print) 9781665456128

Innovations in powerful high-performance computing (HPC) architectures are enabling high-fidelity whole-core neutron transport simulations in reasonable time. In particular, the currently popular heterogeneous architectures bring the cost of such simulations down to a very low level. The neutron distribution of a reactor core is governed by the Boltzmann neutron transport equation (BTE), viable solutions of which require tremendous computing resources. Among the high-fidelity numerical methods, the discrete ordinates method (SN) is becoming popular in the reactor design community because it strikes a good balance between computational cost and accuracy. Recently, MT-3000, a multizone heterogeneous architecture with a peak double-precision performance of 11.6 TFLOPS, was proposed. In this work, the BTE is solved by the SN method with heterogeneous Koch-Baker-Alcouffe (KBA) parallel algorithms on the MT-3000 architecture. A communication mechanism has been established to efficiently transmit data among the acceleration cores and the CPU cores. The kernel computation is largely accelerated by vectorization and instruction pipelining techniques. Numerical experiments show that our formulation achieves 1.37 TFLOPS on a single MT-3000, that is, 11.8% of its peak performance. © 2023 IEEE.

Keywords: Parallel algorithms

AFMA-Track: Adaptive Fusion of Motion and Appearance for Robust Multi-object Tracking

27th International Conference on Pattern Recognition, ICPR 2024

Authors: Liao, Wei; Luo, Lei; Zhang, Chunyuan (College of Computer Science and Technology, National University of Defense Technology, Changsha, China; Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer Science and Technology, National University of Defense Technology, Changsha, China)

ISBN: (print) 9783031784439

Motion and appearance cues play a crucial role in Multi-object Tracking (MOT) algorithms for associating objects across consecutive frames. While most MOT methods prioritize accurate motion modeling and distinctive appearance representations, the use of appearance and motion cues is often confined to simplistic association techniques. For instance, fixed weights are commonly employed to combine the intersection-over-union (IoU) matrix and appearance similarity matrix, yielding an association cost matrix. To harness the full potential of motion and appearance cues across diverse scenarios, we propose an innovative approach that dynamically balances motion and appearance cues based on scene and object information during the association process. Furthermore, we introduce a new mechanism for updating appearance representations, effectively mitigating noise introduced by occlusion. Our method demonstrates state-of-the-art performance on the MOT17 and MOT20 test sets. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

Keywords: Object detection
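
The fixed-weight baseline that the abstract criticizes (blending an IoU matrix with an appearance similarity matrix to obtain an association cost matrix) can be sketched as follows. This illustrates only that baseline, not AFMA-Track's adaptive fusion, which the abstract does not specify. The function names, the `(x1, y1, x2, y2)` box layout, the cosine similarity choice, and the weight `w_motion` are assumptions.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two appearance embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def association_cost(track_boxes, det_boxes, track_feats, det_feats, w_motion=0.5):
    """Cost[i][j] = 1 - (w * IoU + (1 - w) * appearance similarity).

    w_motion is a fixed, scene-independent weight (hypothetical value).
    """
    return [
        [1.0 - (w_motion * iou(tb, db) + (1.0 - w_motion) * cosine(tf, df))
         for db, df in zip(det_boxes, det_feats)]
        for tb, tf in zip(track_boxes, track_feats)
    ]


tracks = [(0, 0, 2, 2)]
dets = [(0, 0, 2, 2), (5, 5, 7, 7)]
print(association_cost(tracks, dets, [[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Because `w_motion` never changes, a crowded scene with unreliable boxes and a sparse scene with occluded appearance are treated identically, which is the limitation the abstract points at.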

An unsupervised deep learning framework for gene regulatory network inference from single-cell expression data

2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023

Authors: Mao, Guo; Liu, Jie (National University of Defense Technology, Science and Technology on Parallel and Distributed Processing Laboratory, Changsha 410073, China; National University of Defense Technology, Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Laboratory of Software Engineering for Complex System, Changsha 410073, China)

ISBN: (print) 9798350337488

Recent advances in single-cell RNA sequencing (scRNA-seq) technology provide unprecedented opportunities for reconstructing gene regulatory networks (GRNs). Many different models have been proposed to infer GRNs from large amounts of RNA-seq data, but most deep learning models rely on an a priori gene regulatory network to infer potential GRNs. Reconstructing GRNs from scRNA-seq data is challenging because of the noise and sparsity introduced by the dropout effect. Here, we propose GAALink, a novel unsupervised deep learning method. It first constructs the gene similarity matrix and then refines it with a threshold value. It then learns feature representations of genes through a graphical attention autoencoder that propagates information across genes with different weights. Finally, we use the gene feature representations for matrix completion so that the GRNs are reconstructed. Compared with seven existing GRN reconstruction methods, GAALink achieves more accurate performance on seven scRNA-seq datasets with four ground truth networks. GAALink can provide a useful tool for inferring GRNs from scRNA-seq expression data. © 2023 IEEE.

Keywords: Feature representations; Gene similarity matrix; Graphical attention autoencoder; Matrix completion; Unsupervised deep learning method

Guided Spatio-Temporal Learning Method for 4K Video Super-Resolution

5th ACM International Conference on Multimedia in Asia, MMAsia 2023

Authors: Jiang, Qin; Wang, Qinglin; Liu, Jie (Science and Technology on Parallel and Distributed Processing Laboratory, Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, China)

ISBN: (print) 9798400702051

4K Video Super-Resolution (VSR) presents a challenging task in video processing, as most existing VSR models have high computational complexity, limiting their application to high-resolution videos, particularly for 4K resolution videos. To address this issue, we propose a novel Guided Spatio-Temporal Video Super-Resolution network (GST-VSR) designed to perform 4K VSR on a single GPU. The proposed method comprises two key components: the Spatio-Temporal Alignment Network (STAN) and the Super-resolution Reconstruction Network (SRN), which work together to enhance the quality of the output frames. The STAN is responsible for extracting highly relevant features in frames and aligning the reference frame with the neighboring frames at the feature level to maintain temporal consistency. The SRN fuses high-quality features into the final high-resolution frames. Unlike existing methods, our proposed approach does not require explicit optical flow estimation, making it more efficient and less computationally demanding. To facilitate the training and testing of the compared models, we have established a new dataset, Pixabay-Set, consisting of 145 videos suitable for the 4K VSR task. Experimental results on the test dataset show that the proposed method achieves competitive performance compared to state-of-the-art models. In summary, our proposed GST-VSR network provides an effective solution to the challenging task of 4K VSR. © 2023 Copyright held by the owner/author(s).

Keywords: Statistical tests

Smoothing Point Adjustment-Based Evaluation of Time Series Anomaly Detection

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Liu, Mingyu; Wang, Yijie; Xu, Hongzuo; Zhou, Xiaohui; Li, Bin; Wang, Yongjun (National University of Defense Technology, Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, Changsha, China)

ISBN: (print) 9781728163277

Anomalies in time series appear consecutively, forming anomaly segments. Applying the classical point-based evaluation metrics to evaluate the detection performance of segments leads to considerable underestimation, so most related studies resort to point adjustment. This operation treats all points as true positives within a segment equally when only one individual point alarms, resulting in significant overestimation and creating an illusion of superior performance. This paper proposes smoothing point adjustment, a novel range-based evaluation protocol for time series anomaly detection. Our protocol reflects detection performance impartially by carefully considering the specific location and frequency of alarms in the raw results. It is achieved by smoothly determining the adjustment range and rewarding early detection via a ranging function and a rewarding function. Compared with other evaluation metrics, experiments on different datasets show that our protocol can yield a performance ranking of various methods more consistent with the desired situation. © 2023 IEEE.

Keywords: Anomaly Detection; Evaluation Protocol; Point Adjustment; Time Series
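
The overestimation caused by classical point adjustment can be seen in a few lines. The sketch below implements the classical protocol as the abstract describes it (one alarm inside a ground-truth segment marks the whole segment as detected) and compares point-wise F1 before and after adjustment. It does not implement the paper's smoothing variant, whose ranging and rewarding functions are not given in the abstract; the function names and the toy series are assumptions.

```python
def point_adjust(pred, label):
    """Classical point adjustment: if any point inside a ground-truth anomaly
    segment is flagged, mark every point of that segment as detected."""
    adjusted = list(pred)
    n, i = len(label), 0
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:  # find the end of this segment
                j += 1
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


def f1(pred, label):
    """Point-wise F1 = 2TP / (2TP + FP + FN)."""
    tp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, label) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


label = [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]  # two anomaly segments
pred = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]   # one late hit, one false alarm
print(f1(pred, label), f1(point_adjust(pred, label), label))
```

On this toy series a single late alarm lifts the score from 0.25 to 8/11 (about 0.73), even though the detector flagged only one of six anomalous points.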

DaCP: Accelerating Synchronization-Free SpTRSV via GPU-Friendly Data Communication and Parallelism Strategies

20th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2024

Authors: Guo, Mingfeng; Deng, Liang; Dai, Zhe; Li, Ruitian; Lin, Gaofeng; Liu, Jie (Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang, China; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China)

ISBN: (print) 9789819628292

Sparse triangular solve (SpTRSV) is a vital component in various scientific applications, and numerous GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSV is currently the mainstream algorithm on GPUs due to its short preprocessing time and outstanding performance. However, we observed that this algorithm still has two performance bottlenecks. First, the thread-level parallel mode can introduce thread divergence within GPU warps during the writing phase. Second, the thread-level and warp-level fusion mode may struggle to fully exploit GPU resources due to suboptimal mapping relationships between rows and threads. To address these issues, this paper proposes DaCPSpTRSV, a new synchronization-free algorithm with GPU-friendly data communication and parallelism strategies. Specifically, we first develop a fast-forward thread-level approach, incorporating an efficient global memory access pattern and a lightweight dependency control mechanism, to optimize data communication and alleviate thread divergence. A fine-grained fusion strategy is then proposed to maximize GPU parallelism by adaptively selecting the suitable thread-level or warp-level mode. Moreover, the commonly used compressed sparse row (CSR) format is employed in DaCPSpTRSV, enhancing the versatility of our algorithm. We evaluate our approach using 245 matrices from the SuiteSparse Matrix Collection on two NVIDIA GPUs, demonstrating speedup ratios of up to 4.77×, 4.94×, 1.67×, and 1.62× compared to cuSPARSE, Sync-Free, CapelliniSpTRSV, and YuenyeungSpTRSV, respectively. The project is open-sourced at https://***/gmfff12334/DaCP. © IFIP International Federation for Information Processing 2025.

Keywords: Graphics processing unit
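
For readers unfamiliar with the operation, a sequential reference for a lower-triangular solve on a CSR matrix is sketched below. It is the plain forward-substitution baseline, not the synchronization-free GPU algorithm or DaCPSpTRSV; the function name and the small test matrix are assumptions. The dependence of row `i` on earlier entries of `x` is what makes parallelizing this kernel hard.

```python
def sptrsv_csr_lower(row_ptr, col_idx, vals, b):
    """Solve L x = b by forward substitution, with L lower-triangular in CSR."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc, diag = b[i], None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]          # diagonal entry of row i
            elif j < i:
                acc -= vals[k] * x[j]   # depends on already-solved x[j]
            else:
                raise ValueError("matrix is not lower-triangular")
        if diag is None or diag == 0:
            raise ValueError("missing or zero diagonal in row %d" % i)
        x[i] = acc / diag
    return x


# L = [[2, 0, 0],
#      [1, 3, 0],
#      [0, 4, 5]]
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
print(sptrsv_csr_lower(row_ptr, col_idx, vals, [2.0, 7.0, 23.0]))
```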

EdgeAnchor: A Rapid and Balanced File Storage Strategy at the Network Edge

29th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2023

Authors: Liu, Han; Xie, Xingrui; Zhang, Zhuopu; Cheng, Geyao; Luo, Lailong; Guo, Deke (National University of Defense Technology, Science and Technology on Information Systems Engineering Laboratory, China; National University of Defense Technology, National Laboratory for Parallel and Distributed Processing, China)

ISBN: (print) 9798350330717

Storing files at the network edge has become a new paradigm for storage systems, one that promises to mitigate network congestion and reduce file retrieval latency. However, the traditional file storage scheme cannot effectively meet the requirements of rapid indexing and load balance when applied directly to the edge. Moreover, because the edge environment is dynamic and edge servers can join or leave at will, the storage scheme must be able to adjust with minimal disruption. In this paper, we propose EdgeAnchor, a novel edge storage strategy composed of two-layer hash mappings. The first layer, file-to-bucket mapping, adopts the pseudo-deletion algorithm to deal with variations in file size, while the second layer utilizes the multiple bucket-to-server mapping to adapt to the heterogeneity in the servers' storage capacities. Furthermore, EdgeAnchor constructs a list of deleted or added working sets for each bucket and creates a dictionary for the mappings between buckets and edge servers. In this manner, EdgeAnchor ensures rapid file indexing and balances server load at the dynamic network edge. We also provide a mathematical analysis of EdgeAnchor, which theoretically proves the logarithmic complexity of its hash operations and memory accesses. Experiments conducted on real-world datasets demonstrate that EdgeAnchor achieves a file index throughput twice as high as that of Consistent Hashing under load-balance constraints. Additionally, it ensures a low and stable data migration volume when edge servers are added or removed consecutively. © 2023 IEEE.

Keywords: dynamic network edge; file storage; load balance; rapid indexing
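
The general shape of a two-layer mapping (file to bucket, then bucket to server through a dictionary) can be sketched as below. This is a generic illustration, not EdgeAnchor: the pseudo-deletion algorithm, the multiple bucket-to-server mapping, and the per-bucket working-set lists are not specified in the abstract and are not modeled. The bucket count, server names, and capacity-proportional assignment rule are all assumptions.

```python
import hashlib


def file_to_bucket(name, num_buckets):
    """Layer 1: hash a file name to a bucket id (stable across runs)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def build_bucket_table(capacities, num_buckets):
    """Layer 2: dictionary bucket -> server, with each server receiving a
    share of buckets roughly proportional to its storage capacity."""
    servers = sorted(capacities)
    total = sum(capacities[s] for s in servers)
    table = {}
    for b in range(num_buckets):
        point = (b + 0.5) * total / num_buckets  # position on the capacity line
        acc = 0
        for s in servers:
            acc += capacities[s]
            if point < acc:
                table[b] = s
                break
    return table


def locate(name, table):
    """Resolve a file name to the server that stores it."""
    return table[file_to_bucket(name, len(table))]


# Hypothetical edge servers with a 1:3 capacity ratio.
table = build_bucket_table({"edge-a": 1, "edge-b": 3}, num_buckets=8)
print(table)
print(locate("video-0001.mp4", table))
```

Keeping the bucket layer fixed means that adding or removing a server only rewrites dictionary entries; how to do that with minimal migration is the part the paper addresses and this sketch leaves out.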
