Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this article, we present Dmodc, a fast deterministic routing algorithm for parallel generalized fat trees (PGFTs), which minimizes congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete rerouting of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralized fabric management react to faults with high-quality routing tables and has no impact on running applications in current and future very large scale high-performance computing clusters.
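To illustrate what a closed-form arithmetic routing rule can look like, the sketch below shows the classic D-mod-k rule for a complete k-ary fat tree. This is not the Dmodc formula, which the abstract does not give; the function names, the 0-based level numbering, and the assumption of a fault-free tree with exactly `k` up ports per switch are all hypothetical choices made for the example.

```python
def dmodk_up_port(dest: int, level: int, k: int) -> int:
    """Up-port index picked at a switch on `level` for destination `dest`.

    Assumption: a complete k-ary fat tree, level 0 = leaf switches,
    every switch has exactly k up ports. Classic D-mod-k, not Dmodc.
    """
    return (dest // k ** level) % k


def up_ports_to_top(dest: int, k: int, levels: int) -> list:
    """Up ports chosen at each level on the way toward the top of the tree."""
    return [dmodk_up_port(dest, level, k) for level in range(levels)]


if __name__ == "__main__":
    # Destination 27 in a 4-ary tree with three switch levels below the top.
    print(up_ports_to_top(27, k=4, levels=3))
```

Because the port depends only on the destination and the level, every switch can fill its forwarding table independently and without searching the graph, which is the property that makes formula-based rerouting fast.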
Incremental 4D-Var is a data assimilation algorithm used routinely at operational numerical weather prediction (NWP) centres worldwide. The algorithm solves a series of quadratic minimization problems (inner-loops) obtained from linear approximations of the forward model around nonlinear trajectories (outer-loops). Since most of the computational burden is associated with the inner-loops, many studies have focused on developing computationally efficient algorithms to solve the least-square quadratic minimization problem, in particular through time parallelization. This paper presents the first implementation and testing of a recently proposed method for parallelizing incremental 4D-Var, the Randomized Incremental Optimal Technique (RIOT), which replaces the traditional sequential conjugate gradient (CG) iterations in the inner-loop of the minimization with fully parallel randomized singular value decomposition (RSVD) of the preconditioned Hessian of the cost function. RIOT is tested using the standard Lorenz-96 model (L-96) as well as two realistic high-dimensional atmospheric source inversion problems based on aircraft observations of black carbon concentrations. A new outer-loop preconditioning technique tailored to RSVD was introduced to improve convergence stability and performance. Results obtained with the L-96 system show that the performance improvement from RIOT compared to standard CG algorithms increases significantly with nonlinearities. Overall, in the realistic black carbon source inversion experiments, RIOT reduces the wall-clock time of the 4D-Var minimization by a factor of 2 to 3, at the cost of a factor of 4 to 10 increase in energy cost due to the large number of parallel cores used. Furthermore, RIOT enables reduction of the wall-clock time computation of the analysis-error covariance matrix by a factor of 40 compared to a standard iterative Lanczos approach. Finally, as evidenced in this study, implementation of RIOT in an operational NWP syste
The locally twisted cube LTQn is a variation of the hypercube Qn, and the diameter of LTQn is only about half of the diameter of Qn. For interconnection networks, some efficient communication algorithms can be designe...
Based on current serial algorithms for electromagnetic field computation, the parallel algorithm concept where the "divide and conquer" approach is adopted has designed and implemented electromagnetic field ...
The standard implementation of the conjugate gradient algorithm suffers from communication bottlenecks on parallel architectures, due primarily to the two global reductions required every iteration. In this paper, we study conjugate gradient variants which decrease the runtime per iteration by overlapping global synchronizations, and in the case of pipelined variants, matrix-vector products. Through the use of a predict-and-recompute scheme, whereby recursively updated quantities are first used as a predictor for their true values and then recomputed exactly at a later point in the iteration, these variants are observed to have convergence behavior nearly as good as the standard conjugate gradient implementation on a variety of test problems. We provide a rounding error analysis that gives insight into this observation. We also verify experimentally that the variants studied do indeed reduce the runtime per iteration in practice and that they scale similarly to previously studied communication-hiding variants. Finally, because these variants achieve good convergence without the use of any additional input parameters, they have the potential to be used in place of the standard conjugate gradient implementation in a range of applications.
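For reference, the following is a minimal sketch of the standard (textbook) conjugate gradient iteration, written in plain Python so that the two inner products per iteration are easy to spot. It is not one of the predict-and-recompute variants studied in the paper; the helper names `dot` and `matvec` and the small test system are assumptions made for illustration. On a distributed machine each call to `dot` marked below would become a global reduction.

```python
def dot(u, v):
    """Inner product; a global reduction when vectors are distributed."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product with A given as a list of rows."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Textbook CG for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)               # residual b - A x for the zero initial guess
    p = list(r)               # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                      # reduction 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)                           # reduction 2
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

The second reduction cannot start until the residual update has finished, and the next direction cannot be formed until that reduction returns; this dependency chain is what communication-hiding variants rearrange.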
De Bruijn graph based genome assembly has gained popularity as short read sequencers become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to a logarithmic number of iterations in the length of the longest chain, and the ability to differentiate cycles from chains within a logarithmic number of iterations in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is 3.7x and 2.0x faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://***/ParBLiSS/bruno.
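One standard way to reach every chain end in a logarithmic number of rounds is pointer jumping (pointer doubling). The sketch below simulates it sequentially on a plain successor array; it is offered only as an illustration of the logarithmic bound, not as the paper's algorithm. The function name, the array encoding, and the restriction to acyclic, one-directional chains are assumptions, and cycle detection is deliberately left out.

```python
def pointer_jump(succ):
    """Find each node's chain end and its distance to it by pointer jumping.

    Assumptions: succ[i] is the next node on the chain, the last node of a
    chain points to itself, and the input contains no cycles.
    Returns (end, dist, rounds).
    """
    n = len(succ)
    nxt = list(succ)
    dist = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    # Stop once every pointer already targets a chain end (a fixed point).
    while any(nxt[nxt[i]] != nxt[i] for i in range(n)):
        # Every node updates from the previous round's values, which is
        # what lets all nodes proceed in parallel in a real implementation.
        dist = [dist[i] + dist[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
        rounds += 1
    return nxt, dist, rounds


if __name__ == "__main__":
    # One chain 0 -> 1 -> 2 -> 3 -> 4, where node 4 is the chain end.
    print(pointer_jump([1, 2, 3, 4, 4]))
```

Each round doubles the distance a pointer spans, so a chain of length L needs about log2(L) rounds rather than L sequential steps.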
As the volume of next generation sequencing data increases, an urgent need for algorithms to efficiently process the data arises. Universal hitting sets (UHS) were recently introduced as an alternative to the central ...
This paper aims to present an updated review of parallel algorithms for solving square and rectangular single and double precision matrix linear systems using multi-core central processing units and graphic processing...
In this work, we present an FPGA-based Generalized Hough Transform custom processor to calculate similarities between arbitrary shapes. Raw data are 44 x 36 DC images extracted directly from low-resolution compressed video (352 x 288). The outputs are two numbers per frame that quantify the image similarity in terms of scale and rotation. The proposed architecture efficiently resolves the detection of pixel pairs, and the voting of distances and rotations, without memory access conflicts. These operations are inherent to the Hough transform. The paper condenses some circuit solutions suitable for hardwiring video processing. They take full advantage of using small embedded memories as look-up tables. The complete processor is validated with benchmark video samples that cover different scenarios and problems: sport, drama, and news. The final version internally operates at 100 MHz and fits inside a small FPGA chip. The highly concurrent architecture employs both pipelining and parallelism using hardware replication. The final performance is over 40 Giga fixed-point operations per second.
Aiming at the problems of complex and variable bio-electromagnetic computing, large amount of calculation, and insufficient calculation accuracy to meet the actual clinical needs, a parallel algorithm based on OpenMP ...