A classical measure of similarity between strings is the length of the longest common subsequence (LCS) of the two given strings. The search for efficient algorithms for finding the LCS has been going on for more than three decades. To date, all known algorithms may take quadratic time (shaved by logarithmic factors) to find a large LCS. In this paper, the problem of approximating the LCS is studied, focusing on the hard inputs for this problem, namely, approximating an LCS of near-linear size in strings over a relatively large alphabet (of size at least n^epsilon for some constant epsilon > 0, where n is the length of the string). We show that any given string over a relatively large alphabet can be embedded into a locally non-repetitive string. This embedding has a negligible additive distortion for strings that are not too dissimilar in terms of the edit distance. We also show that the LCS can be efficiently approximated in locally non-repetitive strings. Our new method (the embedding together with the approximation algorithm) gives a strictly sub-quadratic time algorithm (i.e., of complexity O(n^(2-epsilon)) for some constant epsilon) which can find common subsequences of linear (and near-linear) size that cannot be detected efficiently by the existing tools. (C) 2011 Elsevier Inc. All rights reserved.
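For reference, the quadratic-time baseline the paper improves on for this parameter regime is the textbook LCS dynamic program; a minimal sketch follows (this is the classical algorithm, not the paper's sub-quadratic approximation):

```python
# Textbook O(|a|*|b|) dynamic program for LCS length -- the quadratic
# baseline referred to above, NOT the paper's approximation algorithm.
def lcs_length(a: str, b: str) -> int:
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)   # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(b)]

assert lcs_length("ABCBDAB", "BDCABA") == 4   # e.g. "BCBA"
```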
Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Theta(n log n) bits of space for a string of size n. This is considerably more than the n log_2(sigma) bits needed for the string itself, where sigma is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Theta(n) extra bits. This is already spectacular, but the linear extra bits are still unsatisfactory when sigma is small, as in DNA sequences. In this article, we introduce the first compressed suffix tree representation that breaks this Theta(n)-bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time. This includes extracting arbitrary text substrings, so the FCST replaces the text using almost the same space as the compressed text. An essential ingredient of FCSTs is the lowest common ancestor (LCA) operation. We reveal important connections between LCAs and suffix tree navigation. We also describe how to make FCSTs dynamic, that is, support updates to the text. The dynamic FCST supports several further operations; in particular, it can build the static FCST within optimal space and polylogarithmic time per symbol. Our theoretical results are also validated experimentally, showing that FCSTs are very effective in practice as well.
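The LCA connection can be illustrated with plain, uncompressed structures: the string depth of the LCA of two suffix-tree leaves equals the longest common prefix of the two suffixes, which is a range minimum over the LCP array. A sketch using a deliberately naive suffix array construction (the FCST itself is far more involved):

```python
# Illustration only: lcp(suffix i, suffix j) = string depth of their
# suffix-tree LCA = min over an LCP-array range. Naive SA construction.
def suffix_array(s: str) -> list[int]:
    return sorted(range(len(s)), key=lambda i: s[i:])  # O(n^2 log n) demo build

def lcp_array(s: str, sa: list[int]) -> list[int]:
    # Kasai's algorithm: lcp[r] = |common prefix of suffixes sa[r-1], sa[r]|
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            h = max(h - 1, 0)
        else:
            h = 0
    return lcp

def lca_string_depth(s, sa, lcp, rank, i, j):
    # String depth of the suffix-tree LCA of the suffixes starting at i and j.
    if i == j:
        return len(s) - i
    lo, hi = sorted((rank[i], rank[j]))
    return min(lcp[lo + 1:hi + 1])

s = "mississippi"
sa = suffix_array(s)
rank = {i: r for r, i in enumerate(sa)}
assert lca_string_depth(s, sa, lcp_array(s, sa), rank, 1, 4) == 4  # shared "issi"
```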
ISBN (print): 9783642212536
In the present paper, we introduce and study the problem of computing, for any given finite set of words, a shuffle word with a minimum so-called scope coincidence degree. The scope coincidence degree is the maximum number of different symbols that parenthesise any position in the shuffle word. This problem is motivated by an application of a new automaton model and can be regarded as the problem of scheduling shared memory accesses of some parallel processes in a way that minimises the number of memory cells required. We investigate the complexity of this problem and show that it can be solved in polynomial time.
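To make the measure concrete, here is a small sketch based on our reading of the definition above (not code from the paper): a symbol parenthesises a position when that position lies between the symbol's first and last occurrence, and the degree is the maximum number of such symbols over all positions.

```python
# Scope coincidence degree of a word, under the assumption that symbol x
# "parenthesises" position p iff first(x) <= p <= last(x). Our own reading
# of the definition, for illustration only.
def scope_coincidence_degree(w: str) -> int:
    first, last = {}, {}
    for p, x in enumerate(w):
        first.setdefault(x, p)
        last[x] = p
    return max(sum(first[x] <= p <= last[x] for x in first)
               for p in range(len(w)))

# "aabb" keeps the two scopes disjoint (degree 1, one memory cell suffices),
# while "abab" forces them to overlap (degree 2).
assert scope_coincidence_degree("aabb") == 1
assert scope_coincidence_degree("abab") == 2
```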
ISBN (print): 9781457716133
This paper provides a deterministic algorithm for finding a longest open reading frame (ORF) among all alternative splicings of a given DNA sequence. Finding protein-encoding regions is a fundamental problem in genomic DNA sequence analysis, and long ORFs generally provide good predictions of such regions. Although the number of splice variants is exponential in the number of optionally spliced regions, we are able, in many cases, to obtain quadratic or even linear performance. This efficiency is achieved by limiting the size of the search space for potential ORFs: by properly pruning the search space we can reduce the number of frames considered at any one time while guaranteeing that a longest open reading frame must be among the considered frames.
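As a point of reference, finding the longest ORF in one fixed, already-spliced sequence is a simple linear scan over the three reading frames; the paper's contribution is doing this across exponentially many splice variants. A minimal baseline sketch, assuming the standard ATG start and TAA/TAG/TGA stop codons on the forward strand only:

```python
# Baseline longest-ORF scan for a SINGLE fixed sequence (forward strand,
# three frames). The paper's algorithm handles all alternative splicings,
# which this sketch does not attempt.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq: str) -> str:
    best_start, best_len = -1, 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                       # open a frame at the first ATG
            elif codon in STOP_CODONS and start is not None:
                if i + 3 - start > best_len:    # close it at the next stop
                    best_start, best_len = start, i + 3 - start
                start = None
    return seq[best_start:best_start + best_len] if best_start >= 0 else ""

assert longest_orf("CCATGAAATGA") == "ATGAAATGA"
```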
It is shown how to compute the lexicographically maximum suffix of a string of n >= 2 characters over a totally ordered alphabet using at most (4/3) n - 5/3 three-way character comparisons. The best previous bound, which has stood unchallenged for more than 25 years, is (3/2) n - O(1) comparisons. We also prove an interesting property of an algorithm for computing the maximum suffix both with respect to a total order < and with respect to its inverse order >. (C) 2011 Elsevier B.V. All rights reserved.
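For context, the standard linear-time maximum-suffix scan can be sketched as follows; note that this textbook version makes up to roughly 2n three-way comparisons, so it does not attain the (4/3)n - 5/3 bound of the paper:

```python
# Classical maximum-suffix computation (textbook version; up to ~2n
# three-way comparisons, NOT the paper's (4/3)n - 5/3 algorithm).
def max_suffix_start(s: str) -> int:
    n, i, j, k = len(s), 0, 1, 0   # i: best start, j: candidate, k: matched length
    while j + k < n:
        a, b = s[i + k], s[j + k]  # one three-way comparison per iteration
        if a == b:
            k += 1
        elif a > b:
            j += k + 1             # candidate loses; skip past the mismatch
            k = 0
        else:
            i = max(i + k + 1, j)  # current best loses; promote a new start
            j = i + 1
            k = 0
    return i

assert "banana"[max_suffix_start("banana"):] == "nana"
```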
We present a filtering-based algorithm for the k-mismatch pattern matching problem with don't cares. Given a text t of length n and a pattern p of length m with don't care symbols in either p or t (but not both), and a bound k, our algorithm finds all the places where the pattern matches the text with at most k mismatches. The algorithm is deterministic and runs in Theta(n m^(1/3) k^(1/3) log^(2/3) m) time. (C) 2010 Elsevier B.V. All rights reserved.
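The problem statement is easy to pin down with a naive quadratic checker, useful as a correctness oracle (the wildcard symbol '?' below is an arbitrary choice of ours); the paper's filtering algorithm is what beats this O(nm) bound:

```python
# Naive O(nm) oracle for k-mismatch matching with don't cares.
# '?' as the wildcard is an assumption made for this sketch.
def k_mismatch_positions(t: str, p: str, k: int, wc: str = "?") -> list[int]:
    n, m, out = len(t), len(p), []
    for j in range(n - m + 1):
        mismatches = 0
        for i in range(m):
            if wc not in (p[i], t[j + i]) and p[i] != t[j + i]:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            out.append(j)
    return out

assert k_mismatch_positions("abcabd", "a?d", 1) == [0, 3]
```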
We present solutions for the k-mismatch pattern matching problem with don't cares. Given a text t of length n and a pattern p of length m with don't care symbols and a bound k, our algorithms find all the places where the pattern matches the text with at most k mismatches. We first give a Theta(n(k + log m log k) log n) time randomised algorithm which finds the correct answer with high probability. We then present a new deterministic Theta(n k^2 log^2 m) time solution that uses tools originally developed for group testing. Taking our derandomisation approach further, we develop an approach based on k-selectors that runs in Theta(n k polylog m) time. Further, in each case the location of the mismatches at each alignment is also given at no extra cost. (C) 2009 Elsevier Inc. All rights reserved.
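A classical building block behind this line of work is exact matching with don't cares via convolutions: encode the wildcard as 0, and alignment j is a match iff the non-negative sum over i of p_i * t_{i+j} * (p_i - t_{i+j})^2 vanishes. A sketch of that building block (not the paper's k-mismatch algorithms); np.convolve keeps the arithmetic exact, while FFT-based convolution would give the fast O(n log m) bound:

```python
import numpy as np

# Exact matching with don't cares via three convolutions. Wildcard = 0,
# other symbols = positive integer codes. Alignment j matches iff
#   sum_i p_i * t_{i+j} * (p_i - t_{i+j})^2 == 0.
def exact_wildcard_matches(t, p):
    t = np.asarray(t, dtype=np.int64)
    p = np.asarray(p, dtype=np.int64)
    n, m = len(t), len(p)

    def corr(x, y):  # corr[j] = sum_i y[i] * x[i + j], for j = 0 .. n - m
        return np.convolve(x, y[::-1])[m - 1:n]

    scores = corr(t, p**3) - 2 * corr(t**2, p**2) + corr(t**3, p)
    return np.flatnonzero(scores == 0)

# 'a'->1, 'b'->2, wildcard->0: text "abab", pattern "a?" matches at 0 and 2.
assert list(exact_wildcard_matches([1, 2, 1, 2], [1, 0])) == [0, 2]
```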
The Longest Common Subsequence (LCS) of two strings A, B is a well-studied problem with a wide range of applications. When each symbol of the input strings is assigned a positive weight, the problem becomes the Heaviest Common Subsequence (HCS) problem. In this paper we consider a different version of weighted LCS on Position Weight Matrices (PWM). The Position Weight Matrix was introduced as a tool to handle a set of sequences that are not identical, yet have many local similarities. Such a weighted sequence is a 'statistical image' of this set, where we are given the probability of every symbol's occurrence at every text location. We consider two possible definitions of LCS on PWM. For the first, we solve the LCS problem of z sequences in time O(z n^(z+1)). For the second, we consider the log-probability version of the problem, prove NP-hardness and provide an approximation algorithm. (C) 2010 Elsevier B.V. All rights reserved.
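The HCS variant mentioned above differs from plain LCS only in the match case of the recurrence, which adds the symbol's weight instead of 1. A minimal sketch of that variant (the PWM versions studied in the paper are different and more involved):

```python
# Heaviest Common Subsequence: LCS where a match on symbol c scores w(c)
# instead of 1. Minimal DP sketch, not the paper's PWM algorithms.
def hcs(a: str, b: str, w: dict[str, float]) -> float:
    prev = [0.0] * (len(b) + 1)
    for ca in a:
        cur = [0.0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + w[ca])   # weighted match case
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# With unit weights, HCS degenerates to the LCS length.
assert hcs("ABCBDAB", "BDCABA", {c: 1 for c in "ABCD"}) == 4
```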
ISBN (print): 9783642130724
It is shown how to compute the lexicographically maximum suffix of a string of n >= 2 characters over a totally ordered alphabet using at most (4/3)n - 5/3 three-way character comparisons. The best previous bound, which has stood unchallenged for more than 25 years, is (3/2)n - O(1) comparisons. We also prove an interesting property of an algorithm for computing the maximum suffix both with respect to a total order < and with respect to its inverse order >.
ISBN (print): 9788001044032
The Parikh vector of a string s over a finite ordered alphabet Sigma = {a_1, ..., a_sigma} is defined as the vector of multiplicities of the characters, i.e. p(s) = (p_1, ..., p_sigma), where p_i = |{j | s_j = a_i}|. A Parikh vector q occurs in s if s has a substring t with p(t) = q. The problem of searching for a query q in a text s of length n can be solved simply and optimally with a sliding-window approach in O(n) time. We present two new algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm finds all occurrences of a given Parikh vector in a text (over a fixed alphabet of size sigma >= 2) and appears to have a sublinear expected time complexity. The second algorithm only decides whether a given Parikh vector appears in a binary text; it iteratively constructs a linear-size data structure which then allows answering queries in constant time, for many queries even during the construction phase.
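The O(n) sliding-window baseline mentioned in the abstract is easy to state: grow the window on the right, shrink it from the left whenever some character count exceeds its budget in q, and report when the window length equals the total size of q. A sketch:

```python
from collections import Counter

# O(n) sliding-window search for one occurrence of a Parikh vector q
# (given as a Counter) in s -- the simple baseline the abstract refers to.
def find_parikh_occurrence(s: str, q: Counter):
    target = sum(q.values())
    window, l = Counter(), 0
    for r, c in enumerate(s):
        window[c] += 1
        while window[c] > q[c]:        # also evicts characters not in q
            window[s[l]] -= 1
            l += 1
        if r - l + 1 == target:
            return l                   # s[l : l + target] realises q
    return None

assert find_parikh_occurrence("abcab", Counter("ba")) == 0
assert find_parikh_occurrence("aab", Counter("bb")) is None
```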