Withthe increasing of scale and complexity of high performance computing (HPC) systems, the programming, debugging, and tuning of large-scale parallel programs face a series of challenges, one of which is that progra...
详细信息
ISBN:
(纸本)9781728111414
Withthe increasing of scale and complexity of high performance computing (HPC) systems, the programming, debugging, and tuning of large-scale parallel programs face a series of challenges, one of which is that programmers often need to repeatedly run their programs with large number of processes on HPC systems to identify sources of errors and performance bottlenecks in their programs, which means large amounts of resource consumptions. Furthermore, since most HPC systems use job scheduling system to manage their resources and schedule multiple jobs from different users, programmers cannot interact withtheir programs during the execution of programs, which further increases complexities of debugging and tuning. To address this challenge, this paper proposes a system that reruns large-scale MPI parallel programs using two nodes. According to an approach of one real-execution + multiple emulation-executions, the parallel program is firstly executed with desired number of processes on an HPC system, which is referred as real-execution, and during the execution, our system records MPI messages transmitted among processes as well as control information of processes;after that, one or more processes can be re-run on a two-node local system under the scale the same withthe real-execution. In the meantime, programmers can interact withtheir programs by attaching the GDB, a commonly used debugger, to the re-running process. therefore, not only can our system reduce resource-consumptions in debugging and tuning of large-scale parallel programs significantly, but also support interactions between developers and their programs during the execution of the programs, which makes programmers easier to identify sources of the errors and performance bottlenecks in their parallel programs.
Index-shuffle graphs are introduced as candidate interconnection networks for parallel computers. the comparative advantages of index-shuffle graphs over the standard bounded-degree ''approximations'' ...
详细信息
ISBN:
(纸本)0818676833
Index-shuffle graphs are introduced as candidate interconnection networks for parallel computers. the comparative advantages of index-shuffle graphs over the standard bounded-degree ''approximations'' of the hypercube, namely butterfly-like and shuffle-like graphs, are demonstrated in the theoretical framework of graph embedding and network emulations. An N-node index-shuffle graph emulates: (1) an N-node shuffle-exchange graph with no slowdown, while the currently best emulations of shuffle-like graphs by hypercubes and butterflies incur a slowdown of Omega(log N). (2) its like-sized buttelfly graph with a slowdown O(log log log N), while the currently best emulations of butterfly-like graphs by shuffle-like graphs incur a slow-down of Omega (log log N). (3) an N-node hypercube that executes an on-line leveled algorithm with a slowdown O(log log N) and without data circulation, while the slowdown of currently best such emulations of the hypercube by its bounded-degree shuffle-like and butterfly-like derivatives remains Omega(log N), and only if the entire local data set of every processor is allowed to circulate through the network.
Lattice quantum chromodynamics (lattice QCD) is a mainstream non-perturbative theoretical calculation method. It's for studying quantum chromodynamics (QCD) using lattice quantum field theory, by defining field va...
详细信息
ISBN:
(纸本)9781728111414
Lattice quantum chromodynamics (lattice QCD) is a mainstream non-perturbative theoretical calculation method. It's for studying quantum chromodynamics (QCD) using lattice quantum field theory, by defining field variables at discrete time-space points and large-scale Monte Carlo numerical simulation calculation. It's computing results can directly compare withthe experimental results, but the conventional computing platform is difficult to meet the large-scale and high-precision lattice QCD computational simulation demand. Sunway TaihuLight is the first supercomputer with peak performance over 100Pflops in the world, which provides a new platform for the calculation of lattice QCD. But the efficient large-scale parallel lattice QCD computing faces many difficult problems in implement. In order to realize the efficient calculation of the lattice QCD in Sunway many-core processor, we designs a parallel acceleration calculation method of lattice QCD on Sunway Architecture in this paper. A new parallel computing method is put forward, and the method of data segmentation, data transmission and parallel computing is improved and optimized. Finally, the test data is used to test the optimized parallelization computing method proposed in this paper and the original serial computing method. Experiments show that the parallel optimized computing method can achieve 63 times performance improvement compared withthe original serial computing method.
Testing for parallel software system is very difficult, because the number of states and execution sequences expands significantly caused by parallel behaviors. In this paper, a test sequences generation method based ...
详细信息
ISBN:
(纸本)9781728111414
Testing for parallel software system is very difficult, because the number of states and execution sequences expands significantly caused by parallel behaviors. In this paper, a test sequences generation method based on Coloured Petri Net (CPN) is proposed, target at full coverage of occurrences in different orders to all behaviors under testing, and generating non-redundant test sequences. Firstly, test objective is described as segments of behaviors under testing, and the coverage criterion is defined as occurrences in different orders to all behavior under testing should be covered. Secondly, the Projection operation is defined to get the state space subgraph for each segment of behavior under testing, and all occurrences of behaviors in the segment are in the subgraph, so that test sequences could be got in the state space subgraph but not the full state space of the system model, the efficiency has improved to some extent. However, the state space subgraph is always big caused by complex parallel behaviors including behaviors under testing and other behaviors. then State Node Pruning operation and Arc Pruning operation are performed in the subgraph to greatly reduce the scale of subgraph, and all states caused by other behaviors are pruned. the simplified subgraph still covers the occurrences in different orders to all behaviors under testing. So test sequences can be generated in the simplified subgraph. Finally, the FullPath operation defines the generation of all test sequences in the subgraph after simplification. By acquiring all sequences of each simplified subgraph and any sequence between adjacent subgraphs, the test sequence that covers all behaviors under testing is generated, and these test sequences do not include any redundant sequences. the experimental results show that this method is effective and efficient.
To avoid signal interference in mobile communication it is necessary that the channels used by base stations for broadcast communication within their cells are chosen so that the same channel is never concurrently use...
详细信息
ISBN:
(纸本)0818676833
To avoid signal interference in mobile communication it is necessary that the channels used by base stations for broadcast communication within their cells are chosen so that the same channel is never concurrently used by two neighboring stations. We model this channel allocation problem as a generalized list coloring problem and we provide two distributed solutions which are also able to cope with crash failures by limiting the size of the network affected by a faulty station in terms of the distance from that station. Our first solution uses a powerful synchronization mechanism to achieve a response time that depends only on Delta, the maximum degree of the signal interference graph, and a failure locality of 4. Our second solution is a simple randomized solution in which each node can expect to pick f/(4 Delta) colors where f is the size of the list at the node;the response time of this solution is a constant and the failure locality 1. Besides being efficient (their complexity measures involve only small constants), the protocols presented in this work are simple and easy to apply in practice, provided the existence of distributed infrastructure in networks that are in use.
this paper addresses the problem of analyzing the performance of parallel algorithms for the training procedure of a neural network based Fingerprint Image Comparison (FIC) system. the target architecture is assumed t...
详细信息
this paper addresses the problem of analyzing the performance of parallel algorithms for the training procedure of a neural network based Fingerprint Image Comparison (FIC) system. the target architecture is assumed to be a coarse-grain distributed memory parallel architecture. Two types of parallelism: node parallelism and training set parallelism (TSP) are investigated. these algorithms are implemented on a 32 node CM-5. theoretical analysis and experimental results comparing the performance of these algorithms are presented.
B-trees are used for accessing large database files, stored in lexicographic order on the secondary storage devices. Algorithms for concurrent B-tree data structures achieve only limited speedup when implemented on a ...
详细信息
B-trees are used for accessing large database files, stored in lexicographic order on the secondary storage devices. Algorithms for concurrent B-tree data structures achieve only limited speedup when implemented on a parallel computer. To improve the performance, a variant of Blink-tree, called Bmad-tree, which allows insertion without node splits, with multiple access in its leaf nodes, and dilation in boththe index and the leaf nodes is proposed.
We consider the problem of designing efficient parallel algorithms for summing and prefix summing. In this paper, we present optimal algorithms for summing on a latency-dependent distributed-memory model and show that...
详细信息
We consider the problem of designing efficient parallel algorithms for summing and prefix summing. In this paper, we present optimal algorithms for summing on a latency-dependent distributed-memory model and show that any optimal summing algorithm must have an inherent structure. Moreover, we present optimal or near-optimal algorithms for prefix summing for both non-commutative and commutative binary operators. Furthermore, we show that the optimal algorithms for prefix summing for these two types of operators are not equivalent.
For large planar adaptive arrays, the computational load and the terabyte data transmission are two challenging problems in the implementation of adaptive beamforming system. this paper considers the problem of design...
详细信息
ISBN:
(纸本)9781424421923
For large planar adaptive arrays, the computational load and the terabyte data transmission are two challenging problems in the implementation of adaptive beamforming system. this paper considers the problem of designing an efficient parallel adaptive algorithm using the two dimensional(2D) planar array. this paper extends the efficient parallel adaptive beamforming algorithm based on LCMV to adaptive 2D sensor array processing. the proposed 2D parallel adaptive algorithm can be easily implemented in distributed-parallel-processing system due to its parallel structure. It can save the data transmission as well as computing cost.
Class incremental learning needs to deal with a dynamic environment where data class appears incrementally, it is a challenge to learn new knowledge while preserving what has already been learned. On the other hand, d...
详细信息
ISBN:
(纸本)9781728111414
Class incremental learning needs to deal with a dynamic environment where data class appears incrementally, it is a challenge to learn new knowledge while preserving what has already been learned. On the other hand, due to the limited storage of the online scenario, algorithm is usually obstructed to frequently scan or simply store all historical data, it is another challenge to reduce the historical data storage for algorithm. Few existing work have addressed above challenges simultaneously. In this paper, we propose Fisher Discriminant Analysis Random Forest (FDARF), which consists of two parts, GHS (Generate Hierarchical Split) and RRS (Random Reform Subtree), that cooperatively operate. GHS combines FDA (Fisher Discriminant Analysis) with tree hierarchy to learn a hierarchical split of data space that provides strong ability for classification. the statistics in leaves (i.e. historical data) can be described by covariance matrix and further optimized by matrix sketching algorithm to reduce storage;for every tree initialized by GHS, RRS randomly reforms certain state subtree, which creates diversity that can be ensemble for ensuring effectiveness of class incremental learning. Extensive experiments on diverse datasets validate that FDARF can well adapt to the online class incremental learning.
暂无评论