Search results - Inner Mongolia University Library

GPU acceleration of subgraph isomorphism search in large scale graph

Journal of Central South University, 2015, Vol. 22, No. 6, pp. 2238-2249

Authors: 杨博, 卢凯, 高颖慧, 王小平, 徐凯. Affiliations: Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology; College of Computer, National University of Defense Technology; Department of Electronic Science and Engineering, National University of Defense Technology

A novel framework for parallel subgraph isomorphism on GPUs, named GPUSI, is proposed. It consists of GPU region exploration and GPU subgraph matching. GPUSI iteratively enumerates subgraph instances and solves subgraph isomorphism in a divide-and-conquer fashion. The framework relies entirely on graph traversal and avoids explicit join operations. To improve performance, a task-queue-based method and the virtual-CSR graph structure are used to balance the workload among warps, and a warp-centric programming model is used to balance the workload among the threads in a warp. A prototype of GPUSI is implemented, and comprehensive experiments with various graph isomorphism operations are carried out on diverse large graphs. The experiments demonstrate that GPUSI scales well and achieves a speed-up of 1.4–2.6 over state-of-the-art solutions.

Keywords: parallel graph isomorphism GPU backtrack paradigm
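
As a rough illustration of the traversal-based, join-free matching that the abstract describes, the following minimal sequential Python sketch enumerates subgraph instances by backtracking over a CSR (compressed sparse row) adjacency structure. It is not GPUSI itself: the function names (`build_csr`, `has_edge`, `count_matches`), the fixed matching order, and the undirected, non-induced matching semantics are assumptions made for illustration, and no GPU, warp, virtual-CSR, or task-queue logic is modelled.

```python
def build_csr(num_vertices, edges):
    """Build a CSR adjacency structure (offsets, neighbors) for an undirected graph."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    offsets, neighbors = [0], []
    for nbrs in adj:
        neighbors.extend(sorted(nbrs))
        offsets.append(len(neighbors))
    return offsets, neighbors


def has_edge(csr, u, v):
    """Check adjacency by scanning u's slice of the CSR neighbor array."""
    offsets, neighbors = csr
    return v in neighbors[offsets[u]:offsets[u + 1]]


def count_matches(query_n, query_edges, data_csr, data_n):
    """Count injective mappings of query vertices to data vertices
    that preserve every query edge (hypothetical helper, not from the paper)."""
    q_adj = [set() for _ in range(query_n)]
    for a, b in query_edges:
        q_adj[a].add(b)
        q_adj[b].add(a)

    mapping = [-1] * query_n   # mapping[q] = data vertex matched to query vertex q
    used = [False] * data_n    # data vertices already taken by the partial match

    def extend(q):
        if q == query_n:       # every query vertex is matched: one instance found
            return 1
        total = 0
        for d in range(data_n):
            if used[d]:
                continue
            # Each already-matched query neighbour must be adjacent to d in the data graph.
            if all(has_edge(data_csr, mapping[p], d) for p in q_adj[q] if p < q):
                mapping[q] = d
                used[d] = True
                total += extend(q + 1)
                used[d] = False   # backtrack
                mapping[q] = -1
        return total

    return extend(0)
```

Note that the count includes automorphic images of the same instance, so a single triangle in the data graph is reported six times for a triangle query.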

Speeding up the MATLAB complex networks package using graphic processors

Chinese Physics B, 2011, Vol. 20, No. 9, pp. 460-467

Authors: 张百达, 唐玉华, 吴俊杰, 李鑫. Affiliations: National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology; Department of Computer Science and Technology, School of Computer, National University of Defense Technology

The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than before. At present, statistics is believed to be a suitable method for analysing networks with millions of vertices or more. The MATLAB language, with its large collection of statistical functions, is a good choice for rapidly prototyping complex-network algorithms. The performance of MATLAB code can be further improved by using graphics processing units (GPUs). This paper presents the strategies and performance of a GPU implementation of a complex networks package, built with the Jacket toolbox for MATLAB. Compared with some commercially available CPU implementations, the GPU implementation achieves an average speedup of 11.3x. The experimental results show that the GPU platform combined with the MATLAB language is a good combination for complex network research.

Keywords: complex networks graphic processors unit MATLAB Jacket Toolbox

SPICE modeling of memristors with multilevel resistance states

Chinese Physics B, 2012, Vol. 21, No. 9, pp. 594-600

Authors: 方旭东, 唐玉华, 吴俊杰. Affiliations: National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology; Department of Computer Science and Technology, School of Computer, National University of Defense Technology

With CMOS technologies approaching the scaling ceiling, novel memory technologies have thrived in recent years, among which the memristor is a promising candidate for future resistive memory (RRAM). The memristor's potential to store multiple bits of information as different resistance levels allows its application in multilevel cell (MLC) technology, which can significantly increase memory capacity. However, most existing memristor models are built for binary or continuous memristance switching. In this paper, we propose Simulation Program with Integrated Circuit Emphasis (SPICE) models of charge-controlled and flux-controlled memristors with multilevel resistance states, based on the memristance versus state map. In our model, the memristance switches abruptly between neighboring resistance states. The proposed model lets users set the number of resistance levels as a parameter and makes the resistance switching time predictable when the input current/voltage waveform is given. The functionality of our models has been validated in HSPICE. The models can be used in multilevel RRAM modeling as well as in artificial neural network simulations.

Keywords: memristor multilevel cell SPICE model
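
To make the idea of a memristance versus state map with abrupt switching more concrete, here is a minimal Python sketch of a charge-controlled device whose resistance is a staircase function of accumulated charge. It is not the SPICE model from the paper: the evenly spaced levels and thresholds, the direction of switching (resistance falling as charge accumulates), the default values of `r_on`, `r_off`, and `q_max`, and the function names are all assumptions chosen for illustration.

```python
def memristance(q, n_levels, r_on=100.0, r_off=16000.0, q_max=1.0):
    """Staircase memristance (ohms) of a hypothetical charge-controlled device.

    The charge range [0, q_max] is split into n_levels equal bins (n_levels >= 2);
    each bin maps to one resistance level between r_off and r_on.
    """
    q = min(max(q, 0.0), q_max)              # clamp charge to the modelled range
    step = q_max / n_levels                  # width of one charge bin
    level = min(int(q / step), n_levels - 1) # abrupt jump at each bin boundary
    return r_off - level * (r_off - r_on) / (n_levels - 1)


def time_to_next_level(q, current, n_levels, q_max=1.0):
    """Time until the next switch under a constant positive current.

    Assumes 0 <= q <= q_max. Returns None when the device is already
    at its last level and cannot switch further.
    """
    step = q_max / n_levels
    level = int(q / step)
    if level >= n_levels - 1:
        return None
    return ((level + 1) * step - q) / current
```

Because the thresholds are fixed in charge, the switching time under a known input waveform follows from integrating the current up to the next threshold; the constant-current case above is the simplest instance of that.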

Betweenness-based algorithm for a partition scale-free graph

Chinese Physics B, 2011, Vol. 20, No. 11, pp. 556-564

Authors: 张百达, 吴俊杰, 唐玉华, 周静. Affiliations: National Laboratory for Parallel and Distributed Processing, School of Computers, National University of Defense Technology; Department of Computer Science and Technology, School of Computers, National University of Defense Technology

Many real-world networks are found to be scale-free. However, graph partitioning, a technique that enables parallel computing, performs poorly on scale-free graphs. The reason is that traditional partitioning algorithms are designed for random and regular networks rather than for scale-free networks. Multilevel graph-partitioning algorithms are currently considered the state of the art and are used extensively. In this paper, we analyse why traditional multilevel graph-partitioning algorithms perform poorly and present a new multilevel graph-partitioning paradigm, top-down partitioning, so named in contrast to the traditional bottom-up partitioning. A new multilevel partitioning algorithm, the betweenness-based partitioning algorithm, is also presented as an implementation of the top-down partitioning paradigm. An experimental evaluation on seven different real-world scale-free networks shows that the betweenness-based partitioning algorithm significantly outperforms the existing state-of-the-art approaches.

Keywords: graph partitioning betweenness-based partitioning algorithm scale free network

A pipelining strategy for accelerating convolution neural networks on ARM CPUs

Authors: Zhou, Xin; Dou, Yong; Li, Rongchun; Zhang, Peng; Liu, Yuntao. Affiliation: National Key Laboratory for Parallel and Distribution Processing, National University of Defense Technology, Changsha, China

Convolution is a primary operation in convolutional neural networks, and inference speed is determined mainly by the speed of the convolutional layers. Improvements in embedded processor performance make it possible to run inference on embedded devices. In this article, a pipelining strategy for single instruction, multiple data (SIMD) instructions is proposed to finely optimize 3 × 3 convolution on ARM-based CPUs. We implement the SIMD group to improve the efficiency of the SIMD pipeline. A tiling method is exploited to increase data reuse during processing. An evaluation model is proposed to guide the design of the tiling method and register allocation. On RK3288, our implementation is 5.18 times as fast as the unoptimized version compiled with the GNU Compiler Collection. The effect of our optimization is measured with a performance profiling tool; the performance information suggests that the pipelining strategy has a significant effect for both normal and depthwise separable convolution. With multithreaded processing, the speedup reaches 18.3 compared with the single-threaded unoptimized version. © 2021 John Wiley & Sons Ltd

Keywords: Convolution
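
The tiling idea mentioned in the abstract can be sketched in plain Python: the output is computed one tile at a time so that each input patch is fetched once and reused by every output element in the tile. This is only an illustration of tiling for data reuse, under assumed tile sizes and function names (`conv3x3_naive`, `conv3x3_tiled`); it does not model the SIMD pipelining, the SIMD group, or the register allocation that the article proposes.

```python
def conv3x3_naive(image, kernel):
    """Reference 3x3 'valid' convolution (cross-correlation, as used in CNNs)."""
    h, w = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(w - 2)]
            for r in range(h - 2)]


def conv3x3_tiled(image, kernel, tile_h=2, tile_w=4):
    """Same result as conv3x3_naive, computed tile by tile.

    tile_h and tile_w are arbitrary illustrative values, not tuned parameters.
    """
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for r0 in range(0, h - 2, tile_h):
        for c0 in range(0, w - 2, tile_w):
            r1 = min(r0 + tile_h, h - 2)   # clip the tile at the output border
            c1 = min(c0 + tile_w, w - 2)
            # Fetch the input patch covering this tile once (tile plus a 2-pixel halo)
            # and reuse it for every output element in the tile.
            patch = [row[c0:c1 + 2] for row in image[r0:r1 + 2]]
            for r in range(r1 - r0):
                for c in range(c1 - c0):
                    acc = 0
                    for i in range(3):
                        for j in range(3):
                            acc += patch[r + i][c + j] * kernel[i][j]
                    out[r0 + r][c0 + c] = acc
    return out
```

On real hardware the patch would live in registers or cache rather than a Python list, which is where the data reuse pays off.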

A parallel turbo product codes decoder based on graphics processing units

21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data science and Systems, HPCC/SmartCity/DSS 2019

Authors: Zhou, Xin; Li, Rongchun. Affiliation: National Key Laboratory for Parallel and Distribution Processing, National University of Defense Technology, Changsha, China

ISBN: (print) 9781728120584

Turbo product codes (TPC) are a class of forward error correction (FEC) codes. They have good bit error rate (BER) performance at high code rates. The TPC encoder is relatively simple to implement, and the decoding complexity is reasonable. Therefore, TPC are widely used in applications such as satellite communication systems and data storage systems. In this paper, a parallel TPC decoder based on GPUs is proposed. All rows or columns of the two-dimensional product code matrix are decoded simultaneously in the proposed decoder. A parallel elementary decoder is designed to simplify the decoding of TPC constructed from extended Hamming codes. The calculations of test patterns and valid codewords are parallelized to reduce decoding latency. To further improve decoding throughput, we present a multi-channel TPC decoder. The performance of the parallel decoder is measured on different GPUs. The experimental results show that the decoding latency is reduced significantly compared with a CPU-based TPC decoder. In addition, the proposed GPU decoder achieves throughputs of 30 Mbps on an Nvidia RTX 2080 Ti and 38 Mbps on an Nvidia Titan V, which are 44 times and 54 times that of the CPU-based decoder, respectively. © 2019 IEEE.

Keywords: Graphics processing unit

Simulation study of N-hit SET variation in differential cascade voltage switch logical circuits

Science China (Information Sciences), 2015, Vol. 58, No. 2, pp. 165-173

Authors: HUANG PengCheng, CHEN ShuMing, CHEN JianJun, WU ZhenYu, LIANG ZhengFa, HU ChunMei, LIANG Bin, LIU BiWei. Affiliations: Micro-electronics and Microprocessor Institute, College of Computer Science, National University of Defense Technology; National Laboratory for Parallel and Distributed Processing, College of Computer Science, National University of Defense Technology

Advances in process technology have raised concern about the single event (SE) sensitivity of differential cascade voltage switch logic (DCVSL) circuits. Simulation results indicate that the single event transient (SET) generated at a DCVSL gate is much larger than that at an ordinary CMOS gate, and that their SET variation differs. Based on charge collection, this paper proposes the effective collection time theory to explain the SET pulse generated at the DCVSL gate. Through 3D TCAD mixed-mode simulation in a 65 nm twin-well bulk CMOS process, the effects on SET variation of device parameters such as well contact size and of environment parameters such as voltage are investigated.

Keywords: differential cascade voltage switch logic (DCVSL) single event transient (SET) effective collection time pulse feedback feature (PFF) cross-coupled structure

MilkyWay-2 supercomputer: system and application

Frontiers of Computer Science, 2014, Vol. 8, No. 3, pp. 345-356

Authors: Xiangke LIAO, Liquan XIAO, Canqun YANG, Yutong LU. Affiliations: Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China; College of Computer, National University of Defense Technology, Changsha 410073, China

On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity-off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, and efficient software stacks that provide a high performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications, from LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.

Keywords: MilkyWay-2 supercomputer petaflops computing neo-heterogeneous architecture interconnect network heterogeneous programming model system management benchmark optimization performance evaluation

The TH Express high performance interconnect networks

Frontiers of Computer Science, 2014, Vol. 8, No. 3, pp. 357-366

Authors: Zhengbin PANG, Min XIE, Jun ZHANG, Yi ZHENG, Guibin WANG, Dezun DONG, Guang SUO. Affiliations: Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China; College of Computer, National University of Defense Technology, Changsha 410073, China

Interconnection networks play an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in the MilkyWay-2 system to provide high-bandwidth and low-latency interprocessor communications, and continuous efforts are devoted to the development of our proprietary interconnect. This paper describes the state of the art of our proprietary interconnect, with emphasis on the design of the network interface. Several key features are introduced, such as user-level communication, remote direct memory access, offload collective operation, and hardware reliable end-to-end communication. The design of a low-level message passing infrastructure and upper-level message passing services is also proposed. The preliminary performance results demonstrate the efficiency of the TH interconnect interface.

Keywords: HPC network interface chip (NIC) TH Express interconnect offload collective operation

Jammer Localization for Wireless Sensor Networks

电子学报 (English Edition), 2011, Vol. 20, No. 4, pp. 735-738

Authors: SUN Yanqiang, WANG Xiaodong, ZHOU Xingming. Affiliation: National Key Laboratory for Parallel and Distributed Processing, College of Computer Science, National University of Defense Technology, Changsha, China

Jamming attacks can severely affect the performance of wireless sensor networks (WSNs) due to the broadcast nature of the wireless medium. To localize the source of the attack, this paper proposes a jammer localization algorithm named Minimum-circle-covering based localization (MCCL). Compared with existing solutions that rely on wireless propagation parameters, MCCL depends only on the location information of sensor nodes at the border of the jammed region. MCCL uses plane geometry, in particular the minimum circle covering technique, to form an approximate jammed region, and the center of that region is taken as the estimated position of the jammer. Simulation results show that MCCL achieves higher accuracy than other existing solutions in terms of the jammer's transmission range and sensitivity to node density.

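Keywords: wireless sensor networks; jammer localization; sensor nodes; location information; covering technique; localization algorithm; wireless propagation; geometry knowledge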
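
The geometric step described in the abstract, covering the border nodes with a minimum circle and taking its center as the jammer estimate, can be sketched as follows. This is a generic incremental minimum-enclosing-circle routine, not the authors' implementation: how border nodes are identified is not stated here, so the sketch simply takes their coordinates as input, and the names `minimum_enclosing_circle` and `estimate_jammer` are hypothetical.

```python
import math


def _circle_two(a, b):
    """Smallest circle through two points: the segment ab is its diameter."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0, math.dist(a, b) / 2.0)


def _circle_three(a, b, c):
    """Circumcircle of three points; falls back to the farthest pair if collinear."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        pair = max(((a, b), (a, c), (b, c)), key=lambda pq: math.dist(*pq))
        return _circle_two(*pair)
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy, math.dist((ux, uy), a))


def _contains(circle, p, eps=1e-9):
    return math.dist((circle[0], circle[1]), p) <= circle[2] + eps


def minimum_enclosing_circle(points):
    """Return (cx, cy, radius) of the smallest circle covering all points."""
    pts = list(points)
    circle = None
    for i, p in enumerate(pts):
        if circle is not None and _contains(circle, p):
            continue
        circle = (p[0], p[1], 0.0)          # p must lie on the new boundary
        for j in range(i):
            q = pts[j]
            if _contains(circle, q):
                continue
            circle = _circle_two(p, q)      # p and q must lie on the boundary
            for k in range(j):
                r = pts[k]
                if not _contains(circle, r):
                    circle = _circle_three(p, q, r)
    return circle


def estimate_jammer(border_nodes):
    """Estimated jammer position: the center of the covering circle."""
    cx, cy, _ = minimum_enclosing_circle(border_nodes)
    return (cx, cy)
```

For example, four border nodes at the corners of a 2 × 2 square yield an estimated jammer position at the center of the square. Shuffling the input first gives the usual expected linear running time without changing the result.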
