Sorting any m target suffixes of an input string X of n characters over a constant alphabet is a key task in building the sparse suffix array SSA(X) for index construction. A number of probabilistic and deterministic algorithms have been proposed for sorting sparse suffixes with varying time and space complexities, but only limited experimental results are available for evaluating their performance. We design a divide-and-conquer algorithm called sSAIS for computing SSA(X) in O(n log m log(n/m)) time and O(m) workspace using the induced sorting principle, and conduct an experimental performance study on real and artificial datasets. This work reveals that designing an efficient deterministic algorithm for sorting sparse suffixes is a tough challenge, and that the density of target suffixes should be treated as a critical design parameter.
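As a point of reference for what SSA(X) contains, here is a minimal sketch that sorts a chosen set of suffix start positions by naive comparison; the `sparse_suffix_array` name and the comparison sort are illustrative only, not the induced-sorting sSAIS algorithm described above:

```python
def sparse_suffix_array(X, positions):
    """Return the given suffix start positions of X in lexicographic
    order of their suffixes.

    Naive illustrative version: each comparison may inspect O(n)
    characters, unlike the O(n log m log(n/m))-time sSAIS algorithm.
    """
    return sorted(positions, key=lambda i: X[i:])

# Suffixes of "banana" at 0, 2, 4: "banana", "nana", "na"
print(sparse_suffix_array("banana", [0, 2, 4]))  # → [0, 4, 2]
```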
Sampling (evenly) the suffixes from the suffix array is an old idea that trades pattern search time for reduced index space. A few years ago, Claude et al. presented an alphabet sampling scheme that allows more efficient pattern searches than the sparse suffix array for sufficiently long patterns. A drawback of their approach is the requirement that sought patterns contain at least one character from the chosen subalphabet. In this work, we propose an alternative suffix sampling approach whose only requirement is a minimum pattern length, which is more convenient in practice. Experiments show that our algorithm (in a few variants) achieves competitive time-space tradeoffs on most standard benchmark data. Copyright (C) 2017 John Wiley & Sons, Ltd.
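The evenly sampled scheme that serves as the baseline here can be sketched as follows; `build_sparse_sa` and `search` are hypothetical names, and the linear scan stands in for the binary search a real index would use. The key invariant is that any pattern of length at least k must cover a sampled position at one of k offsets:

```python
def build_sparse_sa(text, k):
    """Index only every k-th suffix, sorted lexicographically."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])

def search(text, ssa, k, pattern):
    """Find all occurrences of a pattern of length >= k.

    A match starting at position p covers some sampled position p + j
    with 0 <= j < k, so we look up each of the k pattern tails and
    verify the j preceding characters.
    """
    assert len(pattern) >= k  # the minimum pattern length requirement
    hits = set()
    for j in range(k):
        tail = pattern[j:]
        for i in ssa:  # a real index would binary-search here
            if text[i:i + len(tail)] == tail:
                start = i - j
                if start >= 0 and text[start:start + len(pattern)] == pattern:
                    hits.add(start)
    return sorted(hits)

text = "abracadabra"
ssa = build_sparse_sa(text, 2)
print(search(text, ssa, 2, "abra"))  # → [0, 7]
```

Space drops by a factor of k, at the price of k lookups per query — the time-space tradeoff the abstract refers to.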
ISBN: 9781424483075 (print)
We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named PSI-RA (parallel sparse index read aligner). The motivation for using SSAs is the ability to trade memory against time: the space consumption of the index can be tuned to the available memory of the machine and the minimum length of the arriving pattern queries. Although SSAs have been studied before for exact matching of short reads, an elegant way to support approximate matching was missing. We provide this by defining the right-most mismatch criterion, which prioritizes errors towards the end of the reads, since errors are known to be more probable in that region. PSI-RA supports any number of mismatches in aligning reads. We give comparisons with some well-known short read aligners, and show that indexing the genome with an SSA is a good alternative to Burrows-Wheeler transform or seed-based solutions.
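A right-most mismatch preference can be illustrated as below; the `align` function and its tie-breaking key are assumptions for the sketch, not PSI-RA's actual implementation. Among alignments with equally few mismatches, it prefers the one whose errors sit closest to the read's end:

```python
def align(read, ref, max_mm):
    """Return the best alignment position of read in ref, or None.

    Candidates with at most max_mm mismatches are ranked by
    (mismatch count, distances of errors from the read's 3' end),
    so later errors are preferred -- a sketch of a right-most
    mismatch criterion, not PSI-RA's actual scoring.
    """
    best = None
    for pos in range(len(ref) - len(read) + 1):
        mm = [i for i, c in enumerate(read) if ref[pos + i] != c]
        if len(mm) <= max_mm:
            # rightmost error first; smaller distance-from-end is better
            key = (len(mm), tuple(len(read) - 1 - i for i in reversed(mm)))
            if best is None or key < best[0]:
                best = (key, pos)
    return None if best is None else best[1]

# Both candidate alignments of "AAAA" in "TAAAT" have one mismatch;
# position 1 wins because its error falls on the read's last base.
print(align("AAAA", "TAAAT", 1))  # → 1
```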
ISBN: 9781728118673 (print)
Clustering is a fundamental yet time-consuming operation in biological sequence analysis. New sequencing technologies such as NGS and 3GS have dramatically increased both dataset sizes and the length of a single read sequence. However, existing tools lack the scalability to handle large-scale datasets as well as long sequences. A feasible solution to this problem is to use parallel and distributed systems, but deploying them efficiently requires high parallelism in both the software implementation and the algorithmic optimizations. In this paper, we propose DGCF, a Distributed Greedy Clustering Framework capable of handling large-scale datasets and long sequences. Our framework adopts a greedy clustering strategy that overlaps communication with computation among many distributed computing nodes. We also design and implement a sparse suffix array (SSA)-based alignment algorithm that supports long sequences. Experiments show that our framework achieves near-linear speedups on a distributed memory cluster.
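Greedy sequence clustering in this style can be sketched as follows; the `greedy_cluster` signature and the identity-fraction similarity are illustrative assumptions. DGCF itself distributes this loop across computing nodes and uses its SSA-based alignment for the similarity test:

```python
def greedy_cluster(seqs, threshold, similarity):
    """Greedy incremental clustering sketch.

    Process sequences longest first; each joins the first cluster whose
    representative is at least `threshold` similar, otherwise it founds
    a new cluster with itself as representative.
    """
    reps, clusters = [], []
    for s in sorted(seqs, key=len, reverse=True):
        for ci, r in enumerate(reps):
            if similarity(s, r) >= threshold:
                clusters[ci].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters

def identity(a, b):
    """Toy similarity: fraction of matching positions (no alignment)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

print(greedy_cluster(["AAAA", "AAAT", "GGGG"], 0.7, identity))
# → [['AAAA', 'AAAT'], ['GGGG']]
```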
In this work, we present efficient algorithms for constructing sparse suffix trees, sparse suffix arrays, and sparse position heaps for b arbitrary positions of a text T of length n while using only O(b) words of space during construction. Attempts at breaking the naive bound of Omega(nb) time for constructing sparse suffix trees in O(b) space can be traced back to the origins of string indexing in 1968. The first results were not obtained until 1996, and only for the case in which the b suffixes were evenly spaced in T. In this article, there is no constraint on the locations of the suffixes. Our main contribution is to show that the sparse suffix tree (and array) can be constructed in O(n log^2 b) time. To achieve this, we develop a technique for efficiently answering b longest common prefix queries on suffixes of T using only O(b) space. We expect that this technique will prove useful in many other applications in which space usage is a concern. Our first solution is Monte Carlo and outputs the correct tree with high probability. We then give a Las Vegas algorithm, which also uses O(b) space and runs in the same time bounds with high probability when b = O(sqrt(n)). Additional trade-offs between space usage and construction time are given for the Monte Carlo algorithm. Finally, we show that, at the expense of slower pattern queries, sparse position heaps can be constructed in O(n + b log b) time and O(b) space.
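The flavor of such a Monte Carlo LCP technique can be shown with Karp-Rabin fingerprints: prefix hashes give constant-time substring fingerprints, and a binary search then finds the longest common prefix of two suffixes in O(log n) comparisons, correct with high probability. This is a minimal single-query sketch under assumed parameters (`lcp_oracle` is a hypothetical name; the actual construction batches the b queries in O(b) space):

```python
def lcp_oracle(text, base=256, mod=(1 << 61) - 1):
    """Return an lcp(i, j) function for suffixes of text.

    Karp-Rabin prefix fingerprints allow comparing any two substrings
    in O(1); binary search over the prefix length then yields the LCP
    in O(log n) time. Monte Carlo: a hash collision (improbable for
    mod ~ 2^61) could overstate the LCP.
    """
    n = len(text)
    h = [0] * (n + 1)  # h[i] = fingerprint of text[:i]
    p = [1] * (n + 1)  # p[i] = base**i mod mod
    for i, c in enumerate(text):
        h[i + 1] = (h[i] * base + ord(c)) % mod
        p[i + 1] = p[i] * base % mod

    def fp(i, j):  # fingerprint of text[i:j]
        return (h[j] - h[i] * p[j - i]) % mod

    def lcp(i, j):  # largest L with text[i:i+L] == text[j:j+L] (whp)
        lo, hi = 0, min(n - i, n - j)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if fp(i, i + mid) == fp(j, j + mid):
                lo = mid
            else:
                hi = mid - 1
        return lo

    return lcp

lcp = lcp_oracle("banana")
print(lcp(1, 3))  # "anana" vs "ana" → 3
```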