We present a message-passing-based parallel version of the Space Saving algorithm designed to solve the k-majority problem. The algorithm determines frequent items in parallel, i.e., those whose frequency is greater than a given threshold, and is therefore useful for iceberg queries and many other contexts. We apply our algorithm to the detection of frequent items in both real and synthetic datasets whose probability distribution functions are a Hurwitz and a Zipf distribution, respectively. We also compare its parallel performance and accuracy against a parallel algorithm recently proposed for merging summaries derived by the Space Saving or Frequent algorithms. (C) 2015 Elsevier Inc. All rights reserved.
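For readers unfamiliar with the technique, the following is a minimal Python sketch of sequential Space Saving plus one way to combine two per-process summaries. It is not the paper's implementation: the abstract does not describe the reduction step, so the `merge` rule here (add the other summary's minimum counter for items it does not monitor, then keep the k largest) is an assumption chosen because it preserves the overestimate property. The names `space_saving`, `merge`, and `frequent_candidates` are hypothetical.

```python
def space_saving(stream, k):
    """Summarize a stream with at most k counters (Space Saving).

    Returns {item: estimated_count}; estimates never undercount.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the item with the smallest counter; the newcomer
            # inherits that count plus one.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


def merge(s1, s2, k):
    """Combine two summaries (assumed merge rule, not from the abstract).

    An item missing from a full summary occurred at most min-counter
    times in that block, so the minimum is a safe upper bound.
    """
    m1 = min(s1.values()) if len(s1) == k else 0
    m2 = min(s2.values()) if len(s2) == k else 0
    merged = {item: s1.get(item, m1) + s2.get(item, m2)
              for item in set(s1) | set(s2)}
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


def frequent_candidates(summary, n, k):
    """Items whose estimated count exceeds the k-majority threshold n/k."""
    return {item for item, count in summary.items() if count > n / k}


# Two blocks, as if held by two processes.
block1 = ["a", "a", "b", "a", "c", "a", "d"]
block2 = ["a", "b", "a", "e", "a", "b"]
k = 3
summary = merge(space_saving(block1, k), space_saving(block2, k), k)
print(frequent_candidates(summary, len(block1) + len(block2), k))
```

In a message-passing setting, each process would run `space_saving` on its own block and the summaries would be combined pairwise in a reduction; the sketch simulates that with two in-memory lists.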
ISBN: (print) 9783031213946; 9783031213953
High-throughput DNA sequencing is a crucial technology for genomics research. As genetic data grows to hundreds of gigabytes or even terabytes, beyond what traditional devices can handle, high-performance computing plays an important role. However, current high-performance approaches to extracting k-mers incur a large memory footprint due to the high error rate of short-read sequences. This paper proposes Top-Kmer, a parallel k-mer counting workflow that indexes high-frequency k-mers within a tiny counting structure. On 2048 cores of Tianhe-2, we construct k-mer index tables in 18 s for 174 GB of fastq files and complete queries in 1 s for 1 billion k-mers, with a scaling efficiency of 95%. Compared with the state of the art, the counting table's memory usage is reduced by 50% with no performance degradation.
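To make the task concrete, here is a small Python sketch of k-mer extraction and top-frequency selection. It is a baseline illustration only: it counts every k-mer exactly with `collections.Counter`, which is the memory-hungry approach the abstract contrasts against, and it does not reproduce Top-Kmer's counting structure, which the abstract does not describe. The function names and the choice to skip k-mers containing `N` are assumptions.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read, skipping ambiguous bases."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if "N" not in kmer:  # assumption: drop k-mers with unknown bases
            yield kmer


def top_kmers(reads, k, top):
    """Exact counts for all k-mers, then the `top` most frequent ones."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts.most_common(top)


reads = ["ACGTACGT", "CGTAN"]
print(top_kmers(reads, k=3, top=2))
```

Because sequencing errors generate many k-mers that occur only once, the exact table above grows far faster than the set of genuinely frequent k-mers, which is the motivation for a bounded structure that keeps only the high-frequency entries.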