As the volume of next generation sequencing data increases, an urgent need for algorithms to efficiently process the data arises. Universal hitting sets (UHS) were recently introduced as an alternative to the central ...
Based on current serial algorithms for electromagnetic field computation, the parallel algorithm concept where the "divide and conquer" approach is adopted has designed and implemented electromagnetic field ...
The standard implementation of the conjugate gradient algorithm suffers from communication bottlenecks on parallel architectures, due primarily to the two global reductions required every iteration. In this paper, we study conjugate gradient variants that decrease the runtime per iteration by overlapping global synchronizations and, in the case of pipelined variants, matrix-vector products. Through the use of a predict-and-recompute scheme, whereby recursively updated quantities are first used as a predictor for their true values and then recomputed exactly at a later point in the iteration, these variants are observed to have convergence behavior nearly as good as the standard conjugate gradient implementation on a variety of test problems. We give a rounding error analysis that offers insight into this observation. It is also verified experimentally that the variants studied do indeed reduce the runtime per iteration in practice and that they scale similarly to previously studied communication-hiding variants. Finally, because these variants achieve good convergence without the use of any additional input parameters, they have the potential to be used in place of the standard conjugate gradient implementation in a range of applications.
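For orientation, the following is a minimal serial sketch of the standard conjugate gradient iteration, not the predict-and-recompute variants studied in the paper. It marks the two inner products per iteration that would each become a global reduction on a distributed-memory machine. The function names (`dot`, `matvec`, `conjugate_gradient`) and the dense list-of-lists matrix format are assumptions made for illustration.

```python
def dot(u, v):
    # In a distributed setting, each call would be a global reduction.
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    # Dense matrix-vector product; A is a list of rows (illustrative format).
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Standard CG for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        pAp = dot(p, Ap)     # global reduction 1
        alpha = rr / pAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]  # recursively updated residual
        rr_new = dot(r, r)   # global reduction 2
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In this sketch the second reduction cannot start until the residual update that depends on the first has finished, which is the per-iteration synchronization the abstract refers to.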
Background: This paper puts forward a parallel algorithm of association rules applicable for sales data analysis based on association rules by utilizing the idea of division and designs a sales management system for m...
De Bruijn graph-based genome assembly has gained popularity as short read sequencers become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to a logarithmic number of iterations in the length of the longest chain, and the ability to differentiate cycles from chains within a logarithmic number of iterations in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is 3.7x and 2.0x faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://***/ParBLiSS/bruno.
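A logarithmic iteration bound of this kind is characteristic of pointer jumping (pointer doubling). The sketch below is a simplified serial illustration of that general technique under stated assumptions. It is not the paper's distributed algorithm. It assumes a directed successor array rather than a bi-directed graph, and it assumes each chain end points to itself. The name `chain_ends` is hypothetical.

```python
def chain_ends(succ):
    """Pointer jumping on a successor array.

    succ[i] is the next node after i; a chain end is marked by succ[i] == i
    (an assumed convention). Returns, for every node, the node reached after
    more than len(succ) hops, and a flag telling whether that node is a
    chain end (True) or lies on a cycle (False).
    """
    n = len(succ)
    ptr = list(succ)
    # Each round doubles the hop distance, so n.bit_length() rounds
    # cover more than n hops, enough for any chain in the graph.
    for _ in range(n.bit_length()):
        ptr = [ptr[ptr[i]] for i in range(n)]
    # A node on a chain now points at a terminal; a node on a cycle
    # (length >= 2) still points at a non-terminal cycle member.
    on_chain = [succ[ptr[i]] == ptr[i] for i in range(n)]
    return ptr, on_chain


# Chain 0 -> 1 -> 2 -> 3 -> 4 (end) and cycle 5 -> 6 -> 7 -> 5.
ptr, on_chain = chain_ends([1, 2, 3, 4, 4, 6, 7, 5])
```

Each round's list comprehension reads only the previous round's pointers, so the per-node updates are independent of one another. That independence is what makes the scheme attractive for parallel execution.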
In this work, we present an FPGA-based Generalized Hough Transform custom processor to calculate similarities between arbitrary shapes. Raw data are 44 x 36 DC images extracted directly from low-resolution compressed video (352 x 288). The outputs are two numbers per frame that quantify the image similarity in terms of scale and rotation. The proposed architecture efficiently resolves the detection of pixel pairs, and the voting of distances and rotations, without memory access conflicts. These operations are inherent to the Hough transform. The paper condenses some circuit solutions suitable for hardwiring video processing. They take full advantage of using small embedded memories as look-up tables. The complete processor is validated with benchmark video samples that cover different scenarios and problems: sport, drama, and news. The final version internally operates at 100 MHz and fits inside a small FPGA chip. The highly concurrent architecture employs both pipelining and parallelism using hardware replication. The final performance is over 40 Giga fixed-point operations per second.
This paper aims to present an updated review of parallel algorithms for solving square and rectangular single and double precision matrix linear systems using multi-core central processing units and graphic processing...
Aiming at the problems of complex and variable bio-electromagnetic computing, large amount of calculation, and insufficient calculation accuracy to meet the actual clinical needs, a parallel algorithm based on OpenMP ...
The determination of flow directions is an essential step for drainage network extraction, and flat surfaces are common features in flow direction determination. With the challenge of a massive volume of digital eleva...
We propose a novel algorithm for extracting data from images of tabular documents having a specific structure. Our proposed method is able to maintain the original table format and structure, and offers better efficie...