The suffix array is a crucial data structure for efficient string analysis. Over the course of twenty-six years, sequential suffix array construction algorithms have achieved O(n) time complexity and in-place sorting. In this paper, we present the Tunnel algorithm, the first large-scale parallel suffix array construction algorithm with a time complexity of O(n/p) based on the parallel random access machine (PRAM) model. The Tunnel algorithm is built on three key ideas: dividing the problem of size O(n) into p sub-problems of reduced size O(n/p) by replacing long suffixes with shorter prefixes of size at most a constant D; introducing a Tunnel mechanism to efficiently induce the order of a set of suffixes with long common prefixes; and developing a strategy to transform a partially ordered suffix set into a total order relation by iteratively applying the Tunnel inducing method. We provide a detailed description of the algorithm, along with a thorough analysis of its time and space complexity, to demonstrate its correctness and efficiency. The proposed Tunnel algorithm exhibits scalable performance, making it suitable for large string analytics on large-scale parallel systems. (C) 2023 Elsevier B.V. All rights reserved.
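As a point of reference for the data structure itself (not the Tunnel algorithm, which is parallel and considerably more involved), the following minimal Python sketch builds a suffix array by plain comparison sorting of suffixes; the example string and names are illustrative only.

```python
def suffix_array(s: str) -> list[int]:
    """Naive suffix array: suffix start positions sorted by suffix content.

    Only an illustration of the structure being built; large-scale
    constructions (such as the Tunnel algorithm) never materialize and
    compare whole suffixes like this.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


if __name__ == "__main__":
    text = "banana"
    for rank, i in enumerate(suffix_array(text)):
        print(rank, i, text[i:])   # 0 5 a / 1 3 ana / 2 1 anana / 3 0 banana / ...
```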
Suffix trees are an important data structure at the core of optimal solutions to many fundamental string problems, such as exact pattern matching, longest common substring, matching statistics, and longest repeated substring. Recent lines of research have focused on extending some of these problems to vertex-labeled graphs, either by using efficient ad-hoc approaches that do not generalize to all input graphs, or by indexing difficult graphs with worst-case exponential complexities. In the absence of a ubiquitous and polynomial tool like the suffix tree for labeled graphs, we introduce the labeled direct product of two graphs as a general tool for obtaining optimal algorithms in the worst case: we obtain conceptually simpler algorithms for the quadratic problems of string matching (SMLG) and longest common substring (LCSP) in labeled graphs. Our algorithms run in time linear in the size of the labeled product graph, which may be smaller than quadratic for some inputs, and their run-time is predictable, because the size of the labeled direct product graph can be precomputed efficiently. We also solve LCSP on graphs containing cycles, which was left as an open problem by Shimohira et al. in 2011. To show the power of the labeled product graph, we also apply it to solve the matching statistics (MSP) and the longest repeated string (LRSP) problems in labeled graphs. Moreover, we show that our (worst-case quadratic) algorithms are also optimal, conditioned on the Orthogonal Vectors Hypothesis. Finally, we complete the complexity picture around LRSP by studying it on undirected graphs.
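To make the central tool concrete, here is a small sketch of the labeled direct product under the assumed definition that product vertices are pairs of equally labeled vertices and product edges require an edge in both input graphs; the function name and graph encoding are illustrative, not taken from the paper.

```python
from collections import defaultdict

def labeled_direct_product(labels1, edges1, labels2, edges2):
    """Labeled direct product of two vertex-labeled directed graphs.

    labels*: dict vertex -> label; edges*: iterable of directed (u, v) pairs.
    Assumed definition: product vertices are the pairs (u, v) with equal
    labels, and ((u1, v1), (u2, v2)) is a product edge iff (u1, u2) and
    (v1, v2) are edges of the respective input graphs.
    """
    nodes = {(u, v) for u in labels1 for v in labels2
             if labels1[u] == labels2[v]}
    adj2 = defaultdict(list)
    for v1, v2 in edges2:
        adj2[v1].append(v2)
    prod_edges = [((u1, v1), (u2, v2))
                  for (u1, u2) in edges1
                  for v1 in list(adj2)
                  for v2 in adj2[v1]
                  if (u1, v1) in nodes and (u2, v2) in nodes]
    return nodes, prod_edges


if __name__ == "__main__":
    # Text graph: a tiny labeled cycle a -> b -> a -> ...
    t_labels = {0: "a", 1: "b"}
    t_edges = [(0, 1), (1, 0)]
    # Pattern "ab" encoded as a path graph.
    p_labels = {0: "a", 1: "b"}
    p_edges = [(0, 1)]
    nodes, edges = labeled_direct_product(p_labels, p_edges, t_labels, t_edges)
    print(sorted(nodes))   # [(0, 0), (1, 1)]
    print(edges)           # [((0, 0), (1, 1))]: the pattern spells a path in the product
```

For SMLG, one of the two inputs can be the path graph spelling the pattern, as above: an occurrence then corresponds to a product path that starts at a pair containing the first pattern position and ends at a pair containing the last.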
ISBN (print): 9783642163203
A string with many repetitions can be written compactly by replacing h-fold contiguous repetitions of a substring r with (r)^h. We refer to such a compact representation as a repetition representation string, or RRS, by which a set of disjoint or nested tandem arrays can be compacted. In this paper, we study the problem of finding a minimum RRS, or MRRS, where the size of an RRS is defined to be the sum of its component letter sizes and the sizes needed to describe the repetitions (.)^h, which are defined as w_R(h) using a repetition weight function w_R. We develop two dynamic programming algorithms to solve the problem. One is CMR, which works for any repetition weight function; the other is CMR-C, which is faster but can be applied only when the repetition weight function is constant. CMR-C is an O(w(n + z))-time algorithm using O(n + z) space for a given string of length n, where w and z are the number of distinct primitive tandem repeats and the number of their occurrences, respectively. Since w = O(n) and z = O(n log n) in the worst case, CMR-C is an O(n^2 log n)-time, O(n log n)-space algorithm, which is faster than CMR by a ((log n)/n) factor.
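The following brute-force dynamic program is only meant to make the MRRS objective concrete under the assumption of a constant repetition weight (every letter costs 1, every repetition construct costs REP_COST); it is not the CMR or CMR-C algorithm.

```python
from functools import lru_cache

REP_COST = 1   # assumed constant repetition weight: w_R(h) = 1 for every exponent h

def min_rrs_size(s: str) -> int:
    """Size of a minimum repetition representation of s under a constant
    repetition weight.  Brute-force memoized DP over substrings; didactic only."""

    @lru_cache(maxsize=None)
    def cost(t: str) -> int:
        if len(t) <= 1:
            return len(t)
        # Write t as a concatenation of two shorter representations ...
        best = min(cost(t[:i]) + cost(t[i:]) for i in range(1, len(t)))
        # ... or as a repetition (t[:p])^(len(t)//p) of a period p of t.
        n = len(t)
        for p in range(1, n // 2 + 1):
            if n % p == 0 and t == t[:p] * (n // p):
                best = min(best, cost(t[:p]) + REP_COST)
        return best

    return cost(s)


if __name__ == "__main__":
    print(min_rrs_size("abababab"))      # 3: written as (ab)^4 under these assumed weights
    print(min_rrs_size("xaaaayaaaay"))   # 5: written as x((a)^4y)^2
```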
Simon's congruence problem is to determine whether or not two strings have the same set of subsequences of length no greater than a given integer; the problem can be answered in linear time. We consider the Simon's congruence pattern matching problem, which looks for all substrings of a text that are congruent to a pattern under Simon's congruence. We propose a linear-time algorithm by reusing results from previous computations with the help of new data structures called X-trees and Y-trees. Moreover, we investigate several variants of the problem, such as identifying the shortest substring or subsequence of the text that is congruent to the pattern under Simon's congruence, or finding frequent matchings. We design efficient algorithms for these problems. We conclude the paper with two open problems: finding the longest congruent subsequence and optimizing the pattern matching problem.
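A brute-force check of Simon's congruence for tiny inputs, only to pin down the relation the paper works with; the algorithms above are linear time, whereas this sketch is exponential.

```python
from itertools import combinations

def subseqs_up_to(s: str, k: int) -> frozenset:
    """All distinct subsequences of s of length at most k (incl. the empty one).
    Exponential brute force -- for tiny inputs only."""
    out = {""}
    for length in range(1, k + 1):
        for idx in combinations(range(len(s)), length):
            out.add("".join(s[i] for i in idx))
    return frozenset(out)

def simon_congruent(u: str, v: str, k: int) -> bool:
    """u ~_k v: u and v have the same set of subsequences of length at most k."""
    return subseqs_up_to(u, k) == subseqs_up_to(v, k)


if __name__ == "__main__":
    print(simon_congruent("abab", "abba", 2))   # True
    print(simon_congruent("abab", "abba", 3))   # False: "aab" occurs only in "abab"
```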
The Hamming distance with shifts was introduced by Bookstein et al. as a generalization of the traditional Hamming distance to allow a tunable degree of fuzziness when comparing two binary sequences of the same length. We present a linear-time algorithm for computing this distance. The previous best time bound was quadratic.
A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1, ..., y_k over an alphabet Σ, we are asked to compute the set M^ℓ_{y1,...,yk} of minimal absent words of length at most ℓ of the collection {y_1, ..., y_k}. The set M^ℓ_{y1,...,yk} contains all the words x such that x is absent from all the words of the collection while there exist i, j such that the maximal proper suffix of x is a factor of y_i and the maximal proper prefix of x is a factor of y_j. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set M^ℓ_y of minimal absent words of a word y is equal to M^ℓ_{y1,...,yk} for any decomposition of y into a collection of words y_1, ..., y_k such that there is an overlap of length at least ℓ - 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the many available O(n)-time algorithms, because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_{y1,...,yN}|| = o(n) for all N ∈ [1, k], where ||S|| denotes the sum of the lengths of the words in set S. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12_{y1,...,yk}|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all of M^ℓ_{y1}, ..., M^ℓ_{y1,...,yk} can be computed in O(kn + Σ_{N=1}^{k} ||M^ℓ_{y1,...,yN}||) total time using O(MAXIN + MAXOUT) space, where MAXIN is the length of the longest word in {y_1, ..., y_k} and MAXOUT = max{||M^ℓ_{y1,...,yN}|| : N ∈ [1, k]}.
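For a single word, the definition can be made concrete with a brute-force enumeration (x is a minimal absent word exactly when x is absent while its maximal proper prefix and maximal proper suffix both occur); this sketch is not the output-sensitive incremental algorithm of the paper, and the names are illustrative.

```python
def minimal_absent_words(y: str, ell: int, alphabet=None):
    """Minimal absent words of y of length at most ell (brute force).

    x is a minimal absent word iff x does not occur in y while both its
    maximal proper prefix and maximal proper suffix do.
    """
    if alphabet is None:
        alphabet = sorted(set(y))
    factors = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}
    maws = set()
    for f in factors:                     # candidate maximal proper prefix
        if len(f) + 1 > ell:
            continue
        for c in alphabet:
            x = f + c
            if x not in factors and x[1:] in factors:
                maws.add(x)
    return sorted(maws)


if __name__ == "__main__":
    print(minimal_absent_words("abaab", 3))   # ['aaa', 'bab', 'bb']
```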
We present a comprehensive review of probabilistic arithmetic automata (PAAs), a general model to describe chains of operations whose operands depend on chance, along with two algorithms to numerically compute the distribution of the results of such probabilistic calculations. PAAs provide a unifying framework to approach many problems arising in computational biology and elsewhere. We present five different applications, namely 1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting times, and clump sizes under hidden Markov background models; 2) exact analysis of window-based pattern matching algorithms; 3) sensitivity of filtration seeds used to detect candidate sequence alignments; 4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and 5) read length statistics of 454 and IonTorrent sequencing reads. The diversity of these applications indicates the flexibility and unifying character of the presented framework. While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method: we introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. This procedure is used for all five discussed applications and greatly simplifies the construction of PAAs. Implementations are available as part of the MoSDi package. Its application programming interface facilitates the rapid development of new applications based on the PAA framework.
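The spirit of the DAA-plus-random-text-model construction can be sketched for the simplest of the listed applications, the distribution of pattern occurrence counts, using an i.i.d. text model instead of the finite-memory or hidden Markov models the paper handles; everything below is an illustrative toy, not the MoSDi implementation.

```python
from collections import defaultdict

def kmp_automaton(pattern: str, alphabet):
    """delta[q][c] = length of the longest prefix of `pattern` that is a
    suffix of pattern[:q] + c.  States are 0..len(pattern)."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            k = min(m, q + 1)
            while k > 0 and pattern[:k] != (pattern[:q] + c)[-k:]:
                k -= 1
            delta[q][c] = k
    return delta

def occurrence_count_distribution(pattern, text_len, char_probs):
    """Exact distribution of the number of (possibly overlapping) occurrences
    of `pattern` in an i.i.d. random text of length `text_len`.

    Toy instance of the PAA idea: the deterministic part is the pattern
    matching automaton plus a counter, and the joint distribution over
    (state, count) is propagated one text position at a time."""
    alphabet = list(char_probs)
    delta = kmp_automaton(pattern, alphabet)
    m = len(pattern)
    dist = defaultdict(float)
    dist[(0, 0)] = 1.0                       # (automaton state, #occurrences)
    for _ in range(text_len):
        nxt = defaultdict(float)
        for (q, cnt), p in dist.items():
            for c, pc in char_probs.items():
                q2 = delta[q][c]
                nxt[(q2, cnt + (q2 == m))] += p * pc
        dist = nxt
    counts = defaultdict(float)
    for (q, cnt), p in dist.items():
        counts[cnt] += p
    return dict(counts)


if __name__ == "__main__":
    # Pattern "aa" in a uniform random text of length 3 over {a, b}:
    # 5/8 no occurrence, 2/8 one occurrence, 1/8 two occurrences.
    print(occurrence_count_distribution("aa", 3, {"a": 0.5, "b": 0.5}))
```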
In the segment-based approach to sequence alignment, nucleic acid and protein sequence alignments are constructed from fragments, i.e., from pairs of ungapped segments of the input sequences. Given a set F of candidate fragments and a weighting function w : F → R_0^+, the score of an alignment is defined as the sum of the weights of the fragments it consists of, and the optimization problem is to find a consistent collection of pairwise disjoint fragments with maximum sum of weights. Herein, a sparse dynamic programming algorithm is described that solves the pairwise segment-alignment problem in O(L + N_max) space, where L is the maximum length of the input sequences and N_max ≤ #F holds. With a recently introduced weighting function w, small sets F of candidate fragments are sufficient to obtain alignments of high quality. As a result, the proposed algorithm runs in essentially linear space. (C) 2001 Elsevier Science Ltd. All rights reserved.
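A plain O(|F|^2) chaining dynamic program makes the optimization objective concrete; it computes the same kind of optimum but without the sparse, O(L + N_max)-space machinery of the paper, and the fragment encoding used here is an assumption for illustration.

```python
def max_weight_chain(fragments):
    """Maximum total weight of a chain of pairwise compatible fragments.

    A fragment is a tuple (i, j, length, weight): an ungapped pair of
    segments seq1[i:i+length] / seq2[j:j+length].  Fragment g may follow
    fragment f if g starts at or after the end of f in both sequences,
    so the chained segments are pairwise disjoint and co-linear.
    Simple O(|F|^2) dynamic programming; didactic only.
    """
    frags = sorted(fragments, key=lambda f: (f[0] + f[2], f[1] + f[2]))
    best = [f[3] for f in frags]          # best chain weight ending with frags[k]
    for k, (i, j, ln, w) in enumerate(frags):
        for q in range(k):
            pi, pj, pln, pw = frags[q]
            if pi + pln <= i and pj + pln <= j:
                best[k] = max(best[k], best[q] + w)
    return max(best, default=0.0)


if __name__ == "__main__":
    # (i, j, length, weight) -- toy candidate fragments
    F = [(0, 0, 3, 2.0), (2, 1, 2, 1.5), (4, 4, 3, 2.5), (8, 8, 2, 1.0)]
    print(max_weight_chain(F))            # 5.5 = 2.0 + 2.5 + 1.0
```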
Recently, Chowdhury et al. [5] proposed the longest common palindromic subsequence problem. It is a variant of the well-known LCS problem, which refers to finding a palindromic LCS between two strings T_1 and T_2. In this paper, we present a new O(n + R^2)-time algorithm, where n = |T_1| = |T_2| and R is the number of matches between T_1 and T_2. We also show that the average running time of our algorithm is O(n^4 / |Σ|^2), where Σ is the alphabet of T_1 and T_2. This improves the previously best algorithms, whose running times are O(n^4) and O(R^2 log^2 n log log n). (C) 2017 Elsevier B.V. All rights reserved.
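The straightforward O(n^4) recurrence referred to above can be written as a short memoized program; it only pins down the problem and is not the O(n + R^2) algorithm of the paper.

```python
from functools import lru_cache

def lcps(t1: str, t2: str) -> int:
    """Length of a longest common palindromic subsequence of t1 and t2.

    Straightforward O(n^4)-time memoized recurrence over the two substring
    ranges; shown only to define the problem."""

    @lru_cache(maxsize=None)
    def f(i, j, k, l):                    # ranges t1[i..j] and t2[k..l], inclusive
        if i > j or k > l:
            return 0
        if t1[i] == t1[j] == t2[k] == t2[l]:
            if i == j or k == l:          # one of the ranges contributes a single character
                return 1
            return 2 + f(i + 1, j - 1, k + 1, l - 1)
        return max(f(i + 1, j, k, l), f(i, j - 1, k, l),
                   f(i, j, k + 1, l), f(i, j, k, l - 1))

    if not t1 or not t2:
        return 0
    return f(0, len(t1) - 1, 0, len(t2) - 1)


if __name__ == "__main__":
    print(lcps("abacaba", "bacab"))   # 5: "bacab"
```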
Eliminating the possible redundancy from a set of candidate motifs occurring in an input string is fundamental in many applications. The existing techniques proposed to extract irredundant motifs are not suitable when the motifs to search for are structured, i.e., when they are made of two (or several) subwords that co-occur in a text string s of length n. The main effort of this work is studying and characterizing a compact class of tandem motifs, that is, pairs of substrings <m_1, m_2> occurring in tandem within a maximum distance of d symbols in s, where d is an integer constant given in input. To this aim, we first introduce the concept of maximality, related to four specific conditions that hold only for this class of motifs. Then, we eliminate the remaining redundancy by defining the notion of irredundancy for tandem motifs. We prove that the number of non-overlapping irredundant tandem motifs is O(d^2 n), which, considering d as a constant, leads to a linear number of tandems in the length of the input string. This is an order of magnitude less than previously developed compact indexes for tandem extraction. The notions and bounds provided for tandem motifs are generalized to the case r >= 2, where r is the number of subwords composing the motifs. Finally, we also provide an algorithm to extract irredundant tandem motifs. (C) 2013 Elsevier B.V. All rights reserved.
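A brute-force enumeration of tandem occurrences shows the kind of object being compacted and why redundancy elimination matters; the occurrence definition used here (a gap of at most d symbols between the two subword occurrences) is an assumption based on the abstract, and no maximality or irredundancy filtering is performed.

```python
def tandem_occurrences(s: str, d: int, max_len: int = 3):
    """Enumerate pairs <m1, m2> of substrings (each of length <= max_len)
    that occur in tandem in s, with at most d symbols between the end of
    the m1 occurrence and the start of the m2 occurrence.

    Brute force over all occurrence pairs; the output is highly redundant,
    which is exactly what the paper's compact classes avoid.  The occurrence
    definition is an assumption, not taken verbatim from the paper.
    """
    n = len(s)
    tandems = set()
    for i in range(n):
        for li in range(1, min(max_len, n - i) + 1):
            end1 = i + li
            for j in range(end1, min(end1 + d, n - 1) + 1):
                for lj in range(1, min(max_len, n - j) + 1):
                    tandems.add((s[i:i + li], s[j:j + lj]))
    return tandems


if __name__ == "__main__":
    ts = tandem_occurrences("abcabc", d=1, max_len=2)
    print(("ab", "ca") in ts)   # True: "ab" at 0, "ca" at 2, gap of 0 <= d symbols
    print(("ab", "bc") in ts)   # False: no "bc" starts within d symbols after an "ab" ends
```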