A novel framework for parallel subgraph isomorphism on GPUs is proposed, named GPUSI, which consists of GPU region exploration and GPU subgraph matching. The GPUSI iteratively enumerates subgraph instances and solves ...
详细信息
A novel framework for parallel subgraph isomorphism on GPUs is proposed, named GPUSI, which consists of GPU region exploration and GPU subgraph matching. The GPUSI iteratively enumerates subgraph instances and solves the subgraph isomorphism in a divide-and-conquer fashion. The framework completely relies on the graph traversal, and avoids the explicit join operation. Moreover, in order to improve its performance, a task-queue based method and the virtual-CSR graph structure are used to balance the workload among warps, and warp-centric programming model is used to balance the workload among threads in a warp. The prototype of GPUSI is implemented, and comprehensive experiments of various graph isomorphism operations are carried on diverse large graphs. The experiments clearly demonstrate that GPUSI has good scalability and can achieve speed-up of 1.4–2.6 compared to the state-of-the-art solutions.
Sparse bundle adjustment(SBA) is a key but time-and memory-consuming step in three-dimensional(3 D) reconstruction. In this paper, we propose a 3 D point-based distributed SBA algorithm(DSBA) to improve the speed and ...
详细信息
Sparse bundle adjustment(SBA) is a key but time-and memory-consuming step in three-dimensional(3 D) reconstruction. In this paper, we propose a 3 D point-based distributed SBA algorithm(DSBA) to improve the speed and scalability of SBA. The algorithm uses an asynchronously distributed sparse bundle adjustment(A-DSBA)to overlap data communication with equation computation. Compared with the synchronous DSBA mechanism(SDSBA), A-DSBA reduces the running time by 46%. The experimental results on several 3 D reconstruction datasets reveal that our distributed algorithm running on eight nodes is up to five times faster than that of the stand-alone parallel SBA. Furthermore, the speedup of the proposed algorithm(running on eight nodes with 48 cores) is up to41 times that of the serial SBA(running on a single node).
With CMOS technologies approaching the scaling ceiling, novel memory technologies have thrived in recent years, among which the memristor is a rather promising candidate for future resistive memory (RRAM). Memristor...
详细信息
With CMOS technologies approaching the scaling ceiling, novel memory technologies have thrived in recent years, among which the memristor is a rather promising candidate for future resistive memory (RRAM). Memristor's potential to store multiple bits of information as different resistance levels allows its application in multilevel cell (MCL) tech- nology, which can significantly increase the memory capacity. However, most existing memristor models are built for binary or continuous memristance switching. In this paper, we propose the simulation program with integrated circuits emphasis (SPICE) modeling of charge-controlled and flux-controlled memristors with multilevel resistance states based on the memristance versus state map. In our model, the memristance switches abruptly between neighboring resistance states. The proposed model allows users to easily set the number of the resistance levels as parameters, and provides the predictability of resistance switching time if the input current/voltage waveform is given. The functionality of our models has been validated in HSPICE. The models can be used in multilevel RRAM modeling as well as in artificial neural network simulations.
Many real-world networks are found to be scale-free. However, graph partition technology, as a technology capable of parallel computing, performs poorly when scale-free graphs are provided. The reason for this is that...
详细信息
Many real-world networks are found to be scale-free. However, graph partition technology, as a technology capable of parallel computing, performs poorly when scale-free graphs are provided. The reason for this is that traditional partitioning algorithms are designed for random networks and regular networks, rather than for scale-free networks. Multilevel graph-partitioning algorithms are currently considered to be the state of the art and are used extensively. In this paper, we analyse the reasons why traditional multilevel graph-partitioning algorithms perform poorly and present a new multilevel graph-partitioning paradigm, top down partitioning, which derives its name from the comparison with the traditional bottom-up partitioning. A new multilevel partitioning algorithm, named betweenness-based partitioning algorithm, is also presented as an implementation of top-down partitioning paradigm. An experimental evaluation of seven different real-world scale-free networks shows that the betweenness-based partitioning algorithm significantly outperforms the existing state-of-the-art approaches.
The advancement in the process leads to more concern about the Single Event(SE) sensitivity of the Differential Cascade Voltage Switch Logic(DCVSL) circuits. The simulation results indicate that the Single Event Trans...
详细信息
The advancement in the process leads to more concern about the Single Event(SE) sensitivity of the Differential Cascade Voltage Switch Logic(DCVSL) circuits. The simulation results indicate that the Single Event Transient(SET) generated at the DCVSL gate is much larger than that at the ordinary CMOS gate, and their SET variation is different. Based on charge collection, in this paper, the effective collection time theory is proposed to set forth the SET pulse generated at the DCVSL gate. Through 3D TCAD mixed-mode simulation in 65 nm twin-well bulk CMOS process, the effects on SET variation of device parameters such as well contact size and environment parameters such as voltage are investigated.
On June 17, 2013, MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41th TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design...
详细信息
On June 17, 2013, MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41th TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity- off-the-shelf processors and accelerators that share similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, proprietary 16- core processor designed for scientific computing, efficient software stacks that provide high performance file system, emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications from LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
Jamming attack can severely affect the performance of Wireless sensor networks (WSNs) due to the broadcast nature of wireless medium. In order to localize the source of the attacker, we in this paper propose a jammer ...
详细信息
Jamming attack can severely affect the performance of Wireless sensor networks (WSNs) due to the broadcast nature of wireless medium. In order to localize the source of the attacker, we in this paper propose a jammer localization algorithm named as Minimum-circlecovering based localization (MCCL). Comparing with the existing solutions that rely on the wireless propagation parameters, MCCL only depends on the location information of sensor nodes at the border of the jammed region. MCCL uses the plane geometry knowledge, especially the minimum circle covering technique, to form an approximate jammed region, and hence the center of the jammed region is treated as the estimated position of the jammer. Simulation results showed that MCCL is able to achieve higher accuracy than other existing solutions in terms of jammer's transmission range and sensitivity to nodes' density.
Interconnection network plays an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in MilkyWay-2 system to provide high-bandwidth and low-latency interpr...
详细信息
Interconnection network plays an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in MilkyWay-2 system to provide high-bandwidth and low-latency interprocessot communications, and continuous efforts are devoted to the development of our proprietary interconnect. This paper describes the state-of-the-art of our proprietary interconnect, especially emphasizing on the design of network interface. Several key features are introduced, such as user-level communication, remote direct memory access, offload collective operation, and hardware reliable end-to-end communication, etc. The design of a low level message passing infrastructures and an upper message passing services are also proposed. The preliminary performance results demonstrate the efficiency of the TH interconnect interface.
On the 41st Top500 list announced in June 2013, the MilkyWay-2 system produced by National University of Defense technology (NUDT) in China won the first place with a LINPACK test result of 33.86 PFLOPS. It has been...
On the 41st Top500 list announced in June 2013, the MilkyWay-2 system produced by National University of Defense technology (NUDT) in China won the first place with a LINPACK test result of 33.86 PFLOPS. It has been one and a half year since its predecessor, MilkyWay-1 (TH-1), reached the same place for the first time. On the newest Top500 list published in November 2013, MilkyWay-2 continued to win the champion.
Aero engine blade structure is prone to severe fretting wear under centrifugal forces and vibration loads, resulting in greatly reduced engine service life, which is a typical fretting damage problem. In this paper, w...
详细信息
Aero engine blade structure is prone to severe fretting wear under centrifugal forces and vibration loads, resulting in greatly reduced engine service life, which is a typical fretting damage problem. In this paper, without Absorbing layer Nanosecond Laser Shock Peening (wAN-LSP) and without Absorbing layer Nanosecond superimposed Femtosecond hybrid Laser Shock Peening (NF-LSP) are utilized to strengthen the surface of GH4169 dovetail joint specimens, which is a common material for engine blades, to explore its strengthening mechanism and fretting wear evolution law. The findings indicate that the surface roughness of the wAN-LSP specimen rises from 0.28 μm to 1.16 μm. The thermal effects of wAN-LSP lead to the formation of a molten layer roughly 10.97 μm thick on the surface, accompanied by numerous pits, ablation holes, and micro-cracks. In contrast, NF-LSP treatment effectively eliminates the surface oxidation defects made by the thermal effects of wAN-LSP, leading to a slight decrease in surface roughness. Both wAN-LSP and NF-LSP treatments enhance the surface hardness of the specimens, creating a plastic deformation layer roughly 400 μm deep, but the near surface hardness of NF-LSP treated specimen is further improved compared to that of the wAN-LSP specimen. The wear mechanism observed in the GH4169 dovetail joint specimens following wAN-LSP and NF-LSP treatment are identified as abrasive wear and oxidation wear. Initially, as the number of cycles raises, the wear volume of GH4169 specimen treated with wAN-LSP is higher than that of untreated specimen, but subsequently, it becomes lesser than that of the untreated specimen. This phenomenon results from the interplay between the molten layer and hardened layer. In the initial stage of wear, despite the surface strengthening of the wAN-LSP specimen, the presence of molten layer significantly diminishes the fretting wear resistance of the material. As wear progresses, the molten layer is removed, revealing the be
暂无评论