Breadth-first search (BFS) is an important kernel for graph traversal and is used by many graph processing applications. Extensive work has been devoted to boosting the performance of BFS. As the most effective solution, GPU acceleration achieves a state-of-the-art result of 3.3×10^9 traversed edges per second on an NVIDIA Tesla C2050 GPU. A novel vertex-frontier-based GPU BFS algorithm is proposed, with three main features. First, to balance the workload better on irregular graphs, a virtual-queue task decomposition and mapping strategy is introduced for vertex frontier expansion. Second, a global duplicate-detection scheme is proposed to remove duplicate vertices from the vertex frontier effectively. Finally, a GPU-based bottom-up BFS approach is employed to process large frontiers. The experimental results demonstrate that the algorithm achieves a 10% improvement over the state-of-the-art method on diverse graphs. In particular, it exhibits a 2-3 times speedup over the state of the art on low-diameter and scale-free graphs on an NVIDIA Tesla K20c GPU, reaching a peak traversal rate of 11.2×10^9 edges/s.
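The frontier expansion, duplicate removal, and bottom-up processing described in this abstract can be sketched on the CPU as follows. This is a minimal single-threaded Python illustration, not the GPU algorithm of the paper: the function name, the `alpha` switch threshold, and the use of the depth array for duplicate detection are assumptions made for illustration, and the graph is assumed to be undirected so that the bottom-up pass can reuse the same adjacency lists.

```python
from typing import List


def bfs_direction_optimizing(adj: List[List[int]], source: int,
                             alpha: float = 0.25) -> List[int]:
    """Level-synchronous BFS; returns the depth of each vertex (-1 if unreachable).

    alpha is a hypothetical threshold: a level is processed bottom-up when the
    frontier holds more than alpha * n vertices, and top-down otherwise.
    """
    n = len(adj)
    depth = [-1] * n
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier: List[int] = []
        if len(frontier) > alpha * n:
            # Bottom-up: each unvisited vertex looks for a neighbor in the frontier.
            in_frontier = [False] * n
            for u in frontier:
                in_frontier[u] = True
            for v in range(n):
                if depth[v] == -1 and any(in_frontier[u] for u in adj[v]):
                    depth[v] = level
                    next_frontier.append(v)
        else:
            # Top-down: expand every frontier vertex; the depth check keeps
            # a vertex from entering the next frontier twice.
            for u in frontier:
                for v in adj[u]:
                    if depth[v] == -1:
                        depth[v] = level
                        next_frontier.append(v)
        frontier = next_frontier
    return depth


# Small undirected example: a 4-cycle (0-1-3-2-0), a tail 3-4, and isolated vertex 5.
example_adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
example_depths = bfs_direction_optimizing(example_adj, 0)
```

Setting `alpha` to 0 forces every level to run bottom-up and setting it to 1 forces top-down, which gives a simple way to check that both directions produce the same depths.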
A basic technique for designing synchronous parallel algorithms, the so-called bisection technique, is proposed. The basic pattern for designing parallel algorithms is described. The relationship between this design idea and the I Ching (principles of change) is discussed.
On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was ranked the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes that integrate commodity off-the-shelf processors and accelerators sharing a similar instruction set architecture; powerful networks that employ proprietary interconnection chips to support massively parallel message-passing communication; a proprietary 16-core processor designed for scientific computing; and efficient software stacks that provide a high-performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform an extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed on the system.
The interconnection network plays an important role in scalable high-performance computer (HPC) systems. The TH Express-2 interconnect is used in the MilkyWay-2 system to provide high-bandwidth, low-latency interprocessor communication, and continuous effort is devoted to the development of our proprietary interconnect. This paper describes the state of the art of our proprietary interconnect, with particular emphasis on the design of the network interface. Several key features are introduced, such as user-level communication, remote direct memory access, offloaded collective operations, and hardware-reliable end-to-end communication. The design of a low-level message-passing infrastructure and of upper-level message-passing services is also proposed. Preliminary performance results demonstrate the efficiency of the TH interconnect interface.
The fully coupled pressure-based algorithm is widely recognised for its superior convergence and robustness in solving incompressible flow problems. However, the increased scale of equations and the difficulty in solv...
The determination of virtual constraints has always been one of the key difficulties in traditional mobility calculation. To simplify mobility calculation by avoiding virtual constraints, some new formulae have been presented; however, these formulae can hardly reflect intuitively the restrictions that a general link group imposes on the output member or its influence on the independence of the output parameters, which is the premise for judging the properties of mobility. To reveal the intrinsic relationship between the degree of freedom (DOF) of a mechanism, the link group, and the dimension of the output parameters, and to avoid determining virtual constraints, a new formula for calculating the mobility of mechanisms is presented, based on the new concepts of the "DOF of general link group" and "node parameters". The formula is expressed in terms of the DOFs of the general link groups and the rank of the motion parameters of the base point of the output link, and is named the GOM (mobility of groups and output parameter) formula. On the basis of the new concepts of "effective parameters" and "invalid parameters", a rule is put forward for solving the DOF of mechanisms with invalid parameters by the GOM formula: the base point parameters are a subset of the effective parameters of the link group. Several examples are then given, and the results coincide with the prototype data, which verifies the validity of the proposed formula. It is also shown that the necessary and sufficient condition for the independence of the output parameters is that the DOF of each link group is not less than zero. The proposed formula, which is simple to calculate, provides a theoretical basis for judging the independence of output parameters and a reference for the type synthesis of novel parallel mechanisms with independence requirements on their output parameters.
On the 41st TOP500 list, announced in June 2013, the MilkyWay-2 system produced by the National University of Defense Technology (NUDT) in China won first place with a LINPACK test result of 33.86 PFLOPS. It has been one and a half years since its predecessor, MilkyWay-1 (TH-1), reached the same place for the first time. On the newest TOP500 list, published in November 2013, MilkyWay-2 retained first place.
In this paper, the alternating group explicit (AGE) iterative method is applied to the set of linear three-term recurrence equations derived from the cubic spline approximations to the one-dimensional diffusion equation. Convergence and stability of the method are proved, and the derivation and existence of the optimal acceleration parameters for the stationary and nonstationary forms of the method are established.
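To make the idea of an AGE sweep concrete, the following is a minimal pure-Python sketch of the stationary form (one fixed acceleration parameter `r`) applied to a generic symmetric tridiagonal system. The abstract does not give the coefficients of the spline-derived recurrence, so the constant diagonal `d`, the constant off-diagonal `e`, the default value of `r`, and all function names are assumptions chosen for illustration. The matrix is split as A = G1 + G2, where G1 and G2 consist of 2×2 diagonal blocks offset by one row, so each half-sweep only inverts 2×2 blocks explicitly.

```python
from typing import List


def apply_g(x: List[float], d: float, e: float, start: int) -> List[float]:
    """Multiply by G1 (start=0) or G2 (start=1): d/2 on the diagonal, e inside each 2x2 block."""
    y = [0.5 * d * xi for xi in x]
    for i in range(start, len(x) - 1, 2):
        y[i] += e * x[i + 1]
        y[i + 1] += e * x[i]
    return y


def solve_shifted(rhs: List[float], d: float, e: float, r: float, start: int) -> List[float]:
    """Solve (G + r I) x = rhs explicitly, block by block."""
    p = 0.5 * d + r
    det = p * p - e * e
    x = [v / p for v in rhs]  # rows outside any 2x2 block are 1x1 systems
    for i in range(start, len(rhs) - 1, 2):
        x[i] = (p * rhs[i] - e * rhs[i + 1]) / det
        x[i + 1] = (p * rhs[i + 1] - e * rhs[i]) / det
    return x


def age_solve(d: float, e: float, b: List[float],
              r: float = 1.0, sweeps: int = 50) -> List[float]:
    """Stationary AGE iteration for the tridiagonal system with diagonal d and off-diagonals e."""
    n = len(b)
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even number of unknowns")
    u = [0.0] * n
    for _ in range(sweeps):
        # First half-sweep: (G1 + r I) u_half = b - (G2 - r I) u
        g2u = apply_g(u, d, e, 1)
        rhs = [bi - gi + r * ui for bi, gi, ui in zip(b, g2u, u)]
        u_half = solve_shifted(rhs, d, e, r, 0)
        # Second half-sweep: (G2 + r I) u_new = b - (G1 - r I) u_half
        g1u = apply_g(u_half, d, e, 0)
        rhs = [bi - gi + r * ui for bi, gi, ui in zip(b, g1u, u_half)]
        u = solve_shifted(rhs, d, e, r, 1)
    return u


# Example: diagonal 4, off-diagonals -1; the right-hand side is built so that
# the exact solution is [1, 2, 3, 4, 5, 6].
age_example = age_solve(4.0, -1.0, [2.0, 4.0, 6.0, 8.0, 10.0, 19.0])
```

A nonstationary variant would vary `r` from sweep to sweep; this sketch keeps it fixed for simplicity.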
Two types of structures, Turing patterns and spiral waves, are obtained in the chlorite-iodide-malonic acid (CIMA) reaction-diffusion model using the lattice Bhatnagar-Gross-Krook (LBGK) method.
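As an illustration of how an LBGK scheme advances a reaction-diffusion field, the sketch below performs collide-and-stream steps for a single species on a one-dimensional, three-velocity (D1Q3) periodic lattice. It is not the model used in the paper: the abstract gives neither the lattice nor the CIMA kinetics, so the lattice choice, the relaxation time `tau`, the user-supplied `reaction` callable, and all function names are assumptions. A two-species model would keep one such set of populations per species and couple them through the reaction terms. In lattice units the diffusivity of this scheme is (tau - 1/2)/3.

```python
from typing import Callable, List

# D1Q3 weights for the rest, right-moving, and left-moving populations.
W = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)


def init_populations(u0: List[float]) -> List[List[float]]:
    """Start from local equilibrium, f_i = w_i * u."""
    return [[w * v for v in u0] for w in W]


def density(f: List[List[float]]) -> List[float]:
    """Concentration at each node: the sum of the three populations."""
    return [a + b + c for a, b, c in zip(f[0], f[1], f[2])]


def lbgk_step(f: List[List[float]], tau: float,
              reaction: Callable[[float], float]) -> List[List[float]]:
    """One BGK collision with a reaction source, followed by periodic streaming."""
    n = len(f[0])
    u = density(f)
    post = [[0.0] * n for _ in range(3)]
    for i in range(3):
        for x in range(n):
            f_eq = W[i] * u[x]
            post[i][x] = f[i][x] - (f[i][x] - f_eq) / tau + W[i] * reaction(u[x])
    return [
        post[0][:],                                # rest population stays in place
        [post[1][(x - 1) % n] for x in range(n)],  # right-movers arrive from x - 1
        [post[2][(x + 1) % n] for x in range(n)],  # left-movers arrive from x + 1
    ]


# Pure diffusion of a unit spike on a 16-node periodic lattice.
u_start = [0.0] * 16
u_start[8] = 1.0
populations = init_populations(u_start)
for _ in range(20):
    populations = lbgk_step(populations, 1.0, lambda u: 0.0)
u_end = density(populations)
```

With a zero reaction term the total concentration is conserved while the spike spreads, which is a quick sanity check before adding kinetics.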
We have developed a SIMD-type neural-network processor (NEURO4) and its software environment. With the SIMD architecture, the chip executes 24 operations in a clock cycle and achieves 1.2 GFLOPS peak performance. An accelerator board, which contains four NEURO4 chips, achieves 3.2 GFLOPS. In this paper we describe features of the neural network chip, accelerator board, software environment and performance evaluation for several neural network models (LVQ, BP and Hopfield). The 3.2 GFLOPS neural network accelerator board demonstrates 1.7 GCPS and 261 MCUPS for Hopfield networks.