We present a simple algorithm for computing the document array given a string collection and its suffix array as input. Our algorithm runs in linear time using constant additional space for strings from constant alpha...
详细信息
We present a simple algorithm for computing the document array given a string collection and its suffix array as input. Our algorithm runs in linear time using constant additional space for strings from constant alphabets. (C) 2019 Elsevier B.V. All rights reserved.
Constructing the suffix array for a string collection is an important task that may be performed by sorting the concatenation of all strings. In this article we present algorithms gSAIS and gSACA-K, which extend SAIS ...
详细信息
Constructing the suffix array for a string collection is an important task that may be performed by sorting the concatenation of all strings. In this article we present algorithms gSAIS and gSACA-K, which extend SAIS (Nong et al., 2011) [8] and SACA-IC (Nong, 2013) [10] to construct the suffix array for a string collection maintaining their theoretical bounds, respecting the order among all suffixes, and improving their practical performance. gSACA-K is an optimal suffix array construction algorithm for string collections from constant alphabets. Moreover, we show how to modify gSAIS and gSACA-K to also compute the longest common prefix array and the document array as a byproduct of suffix sorting, with the same theoretical bounds. We performed experiments that showed that our algorithms have an outstanding practical performance. (C) 2017 Elsevier B.V. All rights reserved.
Background The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, as the Longest Common Pre...
详细信息
Background The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Result In this paper we introduce gsufsort, an open source software for constructing the suffix array and related data indexing structures for a string collection withNsymbols inO(N) time. Our tool is written in ANSI/C and is based on the algorithm gSACA-K(Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
暂无评论