One of the most fundamental methods for comparing two given strings A and B is the longest common subsequence (LCS), where the task is to find (the length of) an LCS of A and B. In this paper, we deal with the STR-IC-LCS problem, which is one of the constrained LCS problems proposed by Chen and Chao [J. Comb. Optim, 2011]. A string Z is said to be an STR-IC-LCS of three given strings A, B, and P if Z is a longest string such that (1) Z includes P as a substring and (2) Z is a common subsequence of A and B. We present three efficient algorithms for this problem. First, we begin with a space-efficient solution which computes the length of an STR-IC-LCS in O(n^2) time and O((e + 1)(n - e + 1)) space, where n is the length of A and B and e is the length of an LCS of A and B. When e = O(1) or n - e = O(1), this algorithm uses only linear O(n) space. Second, we present a faster algorithm that works in O(nr / log r + n(n - e + 1)) time, where r is the length of P, while retaining the O((e + 1)(n - e + 1)) space efficiency. Third, we give an alternative algorithm that runs in O(nr / log r + n(n - e' + 1)) time with O((e' + 1)(n - e' + 1)) space, where e' denotes the STR-IC-LCS length for input strings A, B, and P.
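To make the problem definition concrete, here is a minimal sketch of a plain dynamic program that computes the STR-IC-LCS length. It is not one of the three algorithms announced in the abstract: it uses full quadratic-size tables and makes no attempt at the stated space bounds. The function names (`lcs_table`, `earliest_end`, `str_ic_lcs_length`) are hypothetical, and returning `None` when no valid string exists is an assumption of this sketch.

```python
def lcs_table(a, b):
    """t[i][j] = LCS length of the prefixes a[:i] and b[:j]."""
    n, m = len(a), len(b)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def earliest_end(s, p, start):
    """Smallest e such that p is a subsequence of s[start:e], or None."""
    k = 0
    for e in range(start, len(s)):
        if s[e] == p[k]:
            k += 1
            if k == len(p):
                return e + 1
    return None


def str_ic_lcs_length(a, b, p):
    """Length of a longest common subsequence of a and b containing p as a substring."""
    n, m = len(a), len(b)
    pre = lcs_table(a, b)
    if not p:
        return pre[n][m]  # no constraint: ordinary LCS
    # suf[x][y] = LCS length of the last x characters of a and the last y of b
    suf = lcs_table(a[::-1], b[::-1])
    ends_a = [earliest_end(a, p, i) for i in range(n)]
    ends_b = [earliest_end(b, p, j) for j in range(m)]
    best = None
    for i, ea in enumerate(ends_a):
        if ea is None:
            continue
        for j, eb in enumerate(ends_b):
            if eb is None:
                continue
            # common part before p + p itself + common part after p
            cand = pre[i][j] + len(p) + suf[n - ea][m - eb]
            if best is None or cand > best:
                best = cand
    return best
```

For every pair of start positions, the sketch embeds P greedily (earliest possible end) in both strings and glues an unconstrained LCS of the remaining prefixes and suffixes around it.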
The directed acyclic word graph (DAWG) of a string y of length n is the smallest (partial) DFA which recognizes all suffixes of y with only O(n) nodes and edges. In this paper, we show how to construct the DAWG for the input string y from the suffix tree for y, in O(n) time for integer alphabets of polynomial size in n. In so doing, we first describe a folklore algorithm which, given the suffix tree for y, constructs the DAWG for the reversed string of y in O(n) time. Then, we present our algorithm that builds the DAWG for y in O(n) time for integer alphabets, from the suffix tree for y. We also show that a straightforward modification to our DAWG construction algorithm leads to the first O(n)-time algorithm for constructing the affix tree of a given string y over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. We then discuss how our constructions can lead to linear-time algorithms for building other text indexing structures, such as linear-size suffix tries and symmetric CDAWGs, in the case of integer alphabets. As a further application of our O(n)-time DAWG construction algorithm, we show that the set MAW(y) of all minimal absent words (MAWs) of y can be computed in optimal, input- and output-sensitive O(n + |MAW(y)|) time and O(n) working space for integer alphabets.
Palindromes are strings that read the same forward and backward. Problems of computing palindromic structures in strings have been studied for many years, motivated by their applications to biology. The longest palindrome problem is one of the most important and classical problems regarding palindromic structures: compute the longest palindrome appearing in a string T of length n. The problem can be solved in O(n) time by the famous algorithm of Manacher (1975) [27]. This paper generalizes the longest palindrome problem to the problem of finding the top-k longest palindromes in an arbitrary substring, including the input string T itself. The internal top-k longest palindrome query is, given a substring T[i..j] of T and a positive integer k as a query, to compute the top-k longest palindromes appearing in T[i..j]. This paper proposes a linear-size data structure that can answer internal top-k longest palindrome queries in optimal O(k) time. Also, given the input string T, our data structure can be constructed in O(n log n) time. For k = 1, the construction time is reduced to O(n). © 2023 Elsevier B.V. All rights reserved.
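As background for the classical problem mentioned above, the following is a minimal sketch of Manacher's linear-time technique for the longest palindromic substring of the whole string. It does not implement the internal top-k query structure proposed in the paper. The function name `longest_palindrome` and the use of `#` as an interleaved separator are choices of this sketch.

```python
def longest_palindrome(t):
    """Return one longest palindromic substring of t (Manacher-style scan)."""
    if not t:
        return ""
    # Interleave separators so odd and even palindromes are treated alike.
    # Mirrored positions always have the same parity, so a '#' inside t is harmless.
    s = "#" + "#".join(t) + "#"
    n = len(s)
    rad = [0] * n          # rad[i] = palindrome radius around position i of s
    center = right = 0     # rightmost palindrome found so far: center and right end
    for i in range(n):
        if i < right:
            # Reuse the radius of the mirror position, clipped to the known window.
            rad[i] = min(right - i, rad[2 * center - i])
        lo, hi = i - rad[i] - 1, i + rad[i] + 1
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            rad[i] += 1
            lo -= 1
            hi += 1
        if i + rad[i] > right:
            center, right = i, i + rad[i]
    best = max(range(n), key=rad.__getitem__)
    start = (best - rad[best]) // 2   # map back to an index of t
    return t[start:start + rad[best]]
```

The radius in the interleaved string equals the palindrome length in the original string, which is why the slice uses `rad[best]` directly.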
The all-pairs suffix/prefix (APSP) problem is a classic problem in computer science with many applications in bioinformatics. Given a set {S_1, ..., S_k} of k strings of total length n, we are asked to find, for each string S_i, i ∈ [1, k], its longest suffix that is a prefix of string S_j, for all j ≠ i, j ∈ [1, k]. Several algorithms running in the optimal O(n + k^2) time for solving APSP are known. All of these algorithms are based on suffix sorting and thus require space Ω(n) in any case. We consider the parameterized version of the APSP problem, denoted by t-APSP, in which we are asked to output only the pairs whose suffix/prefix overlap is of length at least t. We give an algorithm for solving t-APSP that runs in the optimal O(n + |OUTPUT_t|) time using O(n) space, where OUTPUT_t is the set of output pairs. Our algorithm is thus optimal for the APSP problem as well by setting t = 0. Notably, our algorithm is fundamentally different from all optimal algorithms solving the APSP problem: it does not rely on sorting the suffixes of all input strings but on a novel traversal of the Aho-Corasick machine, and it thus requires space linear in the size of the machine. © 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://***/licenses/by/4.0/).
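The input and output of t-APSP can be illustrated with a brute-force sketch. It is far from the optimal bound stated above and does not use the Aho-Corasick machine; it only spells out the definition. The names `overlap` and `t_apsp`, the 0-based indices, and the dictionary output format are assumptions of this sketch.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def t_apsp(strings, t=0):
    """Map each ordered pair (i, j), i != j, to its overlap length if it is >= t."""
    out = {}
    for i, u in enumerate(strings):
        for j, v in enumerate(strings):
            if i != j:
                k = overlap(u, v)
                if k >= t:
                    out[(i, j)] = k
    return out
```

With `t = 0` every ordered pair is reported, which corresponds to the unparameterized APSP problem.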
Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n) bits of space, for a string of size n. This is considerably more than the n log_2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but the linear extra bits are still unsatisfactory when σ is small, as in DNA sequences. In this article, we introduce the first compressed suffix tree representation that breaks this Θ(n)-bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time. This includes extracting arbitrary text substrings, so the FCST replaces the text using almost the same space as the compressed text. An essential ingredient of FCSTs is the lowest common ancestor (LCA) operation. We reveal important connections between LCAs and suffix tree navigation. We also describe how to make FCSTs dynamic, that is, support updates to the text. The dynamic FCST also supports several operations. In particular, it can build the static FCST within optimal space and polylogarithmic time per symbol. Our theoretical results are also validated experimentally, showing that FCSTs are very effective in practice as well.
We propose a simple linear-time on-line algorithm for constructing a position heap for a string (Ehrenfeucht et al., 2011 [8]). Our definition of position heap differs slightly from the one proposed in Ehrenfeucht et al. (2011) [8] in that it considers the suffixes ordered in the descending order of length. Our construction is based on classic suffix pointers and resembles Ukkonen's algorithm for suffix trees (Ukkonen, 1995 [17]). Using suffix pointers, the position heap can be extended into the augmented position heap that allows for a linear-time string matching algorithm (Ehrenfeucht et al., 2011 [8]). © 2012 Elsevier B.V. All rights reserved.
Computing approximate patterns in strings or sequences has important applications in DNA sequence analysis, data compression, musical text analysis, and so on. In this paper, we introduce approximate k-covers and study them under various commonly used distance measures. We propose the following problem: "Given a string x of length n, a set U of m strings of length k, and a distance measure, compute the minimum number t such that U is a set of approximate k-covers for x with distance t". To solve this problem, we present three algorithms with time complexity O(km(n - k)), O(mn^2) and O(mn^2) under Hamming, Levenshtein and edit distance, respectively. A World Wide Web server interface has been established at for automated use of the programs.
We introduce the problem of computing the Burrows-Wheeler Transform (BWT) using small additional space. Our in-place algorithm does not need the explicit storage for the suffix sort array and the output array, as typically required in previous work. It relies on the combinatorial properties of the BWT, and runs in O(n^2) time in the comparison model using O(1) extra memory cells, apart from the array of n cells storing the n characters of the input text. We then discuss the time-space trade-off when O(***(k)) extra memory cells are allowed with σ_k distinct characters, providing an O((n^2/k + n) log k)-time algorithm to obtain (and invert) the BWT. For example, in real systems where the alphabet size is a constant, for any arbitrarily small ε > 0, the BWT of a text of n bytes can be computed in O(n ε^{-1} log n) time using just εn extra bytes. © 2015 Elsevier B.V. All rights reserved.
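For readers unfamiliar with the transform itself, the following is a textbook sketch of the BWT and its inverse using sorted rotations. It is the opposite of the in-place algorithm described above: it materializes every rotation and therefore uses quadratic space. The function names `bwt` and `inverse_bwt`, and the convention of appending a `$` sentinel that does not occur in the text, are assumptions of this sketch.

```python
def bwt(text, sentinel="$"):
    """BWT of text + sentinel, taken as the last column of the sorted rotations."""
    s = text + sentinel  # assumes the sentinel does not occur in text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    # The original string is the rotation that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]
```

A quick round trip such as `inverse_bwt(bwt("banana"))` recovers the input, which is a convenient sanity check when experimenting with more space-efficient variants.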
We contribute a further step towards the plausible real-time construction of suffix trees by presenting an on-line algorithm that spends only O(log log n) time processing each input symbol and takes O(n log log n) time in total, where n is the length of the input text. Our results improve on a previously published algorithm that takes O(log n) time per symbol and O(n log n) time in total. The improvements are obtained by adapting Weiner's suffix tree construction algorithm to use a new data structure for the fringe marked ancestor problem, a special case of the nearest marked ancestor problem, which may be of independent interest. © 2012 Elsevier B.V. All rights reserved.
The problem of generalized function matching can be defined as follows: given a pattern p = p_1 ⋯ p_m and a text t = t_1 ⋯ t_n, find a mapping f : Σ_p → Σ_t^* and all text locations i such that f(p_1)f(p_2) ⋯ f(p_m) = t_i ⋯ t_j, a substring of t. By modifying the restrictions of the matching function f, one can obtain different matching problems, many of which have important applications. When f : Σ_p → Σ_t, we are faced with problems found in the well-established field of combinatorial pattern matching. If the single-character constraint is lifted and f : Σ_p → Σ_t^*, we obtain generalized function matching as introduced by Amir and Nor (JDA 2007). If we further constrain f to be injective, then we arrive at generalized parametrized matching as defined by Clifford et al. (SPIRE 2009). There are a number of important applications for pattern matching in computational biology, text editors and data compression, to name a few. Therefore, many efficient algorithms have been developed for a wide variety of specific problems, including finding tandem repeats in DNA sequences, optimizing embedded systems by reusing code, etc. In this work we present a heuristic algorithm illustrating a practical approach to tackling a variant of generalized function matching where f : Σ_p → Σ_t^+, and demonstrate its performance on human-produced text as well as random strings.