Constructing evolutionary trees for species sets is a fundamental problem in biology. Unfortunately, there is no single agreed upon method for this task, and many methods are in use. Current practice dictates that tre...
详细信息
Constructing evolutionary trees for species sets is a fundamental problem in biology. Unfortunately, there is no single agreed upon method for this task, and many methods are in use. Current practice dictates that trees be constructed using different methods and that the resulting trees should be compared for consensus. It has become necessary to automate this process as the number of species under consideration has grown. We study one formalization of the problem: the maximum agreement-subtree (MAST) problem. The MAST problem is as follows: given a set A and two rooted trees T-0 and T-1 leaf-labeled by the elements of A, find a maximum-cardinality subset B of A such that the topological restrictions of T-0 and T-1 to B are isomorphic. In this paper, we will show that this problem reduces to unary weighted bipartite matching (UWBM) with an O(n(1+o(1))) additive overhead. We also show that UWBM reduces linearly to MAST. Thus our algorithm is optimal unless UWBM can be solved in near linear time. The overall running time of our algorithm is O(n(1.5)log n), improving on the previous best algorithm, which runs in O(n(2)). We also derive an O(nc(root log n))-time algorithm for the case of bounded degrees, whereas the previously best algorithm runs in O(n(2)), as in the unbounded case.
Calculating the length l of a longest common subsequence (LCS) of two strings, A of length n and B of length m, is a classic research topic, with many known worst-case oriented results. We present three algorithms for...
详细信息
Calculating the length l of a longest common subsequence (LCS) of two strings, A of length n and B of length m, is a classic research topic, with many known worst-case oriented results. We present three algorithms for LCS length calculation with respectively O(mn 1g 1g n/ lg(2) n), O(mn/ lg(2) n + r) and O(n + r) time complexity, where the second one works for r = o(mn/(lg n lg lg n)), and the third one for r = Theta(mn/ lg(k) n), for a real constant 1 <= k <= 3, and l = O(n/(lg(k-1) n(lg lg n)(2))), where r is the number of matches in the dynamicprogramming matrix. We also describe conditions for a given problem sufficient to apply our techniques, with several concrete examples presented, namely the edit distance, the longest common transposition-invariant subsequence (LCTS) and the merged longest common subsequence (MerLCS) problems. (C) 2015 Elsevier B.V. All rights reserved.
Calculating the length of a longest common subsequence (LCS) of two strings, A of length n and B of length m, is a classic research topic, with many worstcase oriented results known. We present two algorithms for LCS ...
详细信息
Given two strings of lengths m and n, with m <= n, the longest common subsequence problem consists of computing a common subsequence of maximum length by deleting symbols from both strings. While the O(mn) algorith...
详细信息
Given two strings of lengths m and n, with m <= n, the longest common subsequence problem consists of computing a common subsequence of maximum length by deleting symbols from both strings. While the O(mn) algorithm devised in 1974 is optimal in the most general setting, algorithms that depend on parameters other than m and n have been proposed since then. In the word RAM model, let w be the word size, s be the alphabet size, d be the number of dominant symbol matches between the strings, and p be the length of the longest common subsequence. Fast algorithms for this problem have complexities O(mn/ log n), O(mn/w), O(ns + min(p(n - p), pm)), O(n logs + d log log min(d, mn/d)), O(ns+min(ds, pm)), and O(ns+s!2s+d logs). In this work, we present an O(n(s+log & lowast;n)+ min(d logs, pm)) algorithm when s is an element of O(w), and also an O(n(s + log & lowast;n) + d) algorithm when s <= w which uses bitwise instructions that became recently available in modern processors.
Given strings A = a(1)a(2)...a(m) and B=b(1)b(2)...b(n) over an alphabet Sigma subset of U, where U is some numerical universe closed under addition and subtraction, and a distance function d(A, B) that gives the scor...
详细信息
Given strings A = a(1)a(2)...a(m) and B=b(1)b(2)...b(n) over an alphabet Sigma subset of U, where U is some numerical universe closed under addition and subtraction, and a distance function d(A, B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is min(t is an element of U){d(A + t, B)}, where A + t = (a(1) + t)(a(2) + t)...(a(m) + t). We study the problem of computing the transposition invariant distance for various distance (and similarity) functions d, including Hamming distance, longest common subsequence (LCS), Levenshtein distance, and their versions where the exact matching condition is replaced by an approximate one. For all these problems we give algorithms whose time complexities are close to the known upper bounds without transposition invariance, and for some we achieve these upper bounds. In particular, we show how sparse dynamic programming can be used to solve transposition invariant problems, and its connection with multidimensional range-minimum search. As a byproduct, we give improved sparse dynamic programming algorithms to compute LCS and Levenshtein distance. (c) 2004 Elsevier Inc. All rights reserved.
We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty l, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(...
详细信息
We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty l, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p vertical bar M vertical bar log vertical bar t vertical bar), where M = {(i,j)vertical bar s(i) = t(j)} is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamicprogramming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamicprogramming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted.
The LCS of two rooted, ordered, and labeled trees F and G is the largest forest that can be obtained from both trees by deleting nodes. We present algorithms for computing tree LCS which exploit the sparsity inherent ...
详细信息
The LCS of two rooted, ordered, and labeled trees F and G is the largest forest that can be obtained from both trees by deleting nodes. We present algorithms for computing tree LCS which exploit the sparsity inherent to the tree LCS problem. Assuming G is smaller than F, our first algorithm runs in time O(r . height(F) . height(G) . lg lg vertical bar G vertical bar), where r is the number of pairs (upsilon is an element of F, omega is an element of G) such that upsilon and omega) have the same label. Our second algorithm runs in time O(Lr lg r . lg lg vertical bar G vertical bar), where L is the size of the LCS of F and G. For this algorithm we present a novel three-dimensional alignment graph. Our third algorithm is intended for the constrained variant of the problem in which only nodes with zero or one children can be deleted. For this case we obtain an O(rh lg lg vertical bar G vertical bar) time algorithm, where h = height(F) + height(G). (C) 2009 Elsevier B.V. All rights reserved.
A peak is a pair of real values (x, y), where x is the time when peak of height y is registered. In the peak alignment problem, we are given two sequences of peaks, and our task is to align the sequences allowing some...
详细信息
A peak is a pair of real values (x, y), where x is the time when peak of height y is registered. In the peak alignment problem, we are given two sequences of peaks, and our task is to align the sequences allowing some basic edit operations on the peaks. We study an instance of the peak alignment problem that arises in the analysis of Mass Spectrometry data in Systems Biology. There the measurement technique guarantees that two peaks (x, y), (x', y') can only be considered the same if x is close enough toe, and y is close enough to y'. We review some methods to do alignment under such restrictions on matches. (C) 2007 Elsevier B.V. All rights reserved.
Finding the longest common subsequence in k-length substrings (LCSk) is a recently proposed problem motivated by computational biology. This is a generalization of the well-known LCS problem in which matching symbols ...
详细信息
Finding the longest common subsequence in k-length substrings (LCSk) is a recently proposed problem motivated by computational biology. This is a generalization of the well-known LCS problem in which matching symbols from two sequences A and B are replaced with matching non-overlapping substrings of length k from A and B. We propose several algorithms for LCSk, being non-trivial incarnations of the major concepts known from LCS research (dynamicprogramming, sparse dynamic programming, tabulation). Our algorithms make use of a linear-time and linear-space preprocessing finding the occurrences of all the substrings of length k from one sequence in the other sequence. (C) 2014 Elsevier B.V. All rights reserved.
We develop efficient dynamicprogramming algorithms for pattern matching with general gaps and character classes. We consider patterns of the form p(0) g(a(0),b(0))p(1)g(a(1),b(1)) ... p(m-1), where p(i) subset of Sig...
详细信息
We develop efficient dynamicprogramming algorithms for pattern matching with general gaps and character classes. We consider patterns of the form p(0) g(a(0),b(0))p(1)g(a(1),b(1)) ... p(m-1), where p(i) subset of Sigma, Sigma is some finite alphabet, and g(a(i) ,b(i)) denotes a gap of length a(i) ...b(i) between symbols p(i) and p(i+1). The text symbol t(j) matches p(i) iff t(j) is an element of p(i) . Moreover, we require that if p(i) matches t(j) , then p(i+1) should match one of the text symbols t(j+ai+1) ... t(j+bi+1). Either or both of a(i) and b (i) can be negative. We also consider transposition invariant matching, i.e., the matching condition becomes t(j) is an element of p(i) + tau, for some constant tau determined by the algorithms. We give algorithms that have efficient average and worst case running times. The algorithms have important applications in music information retrieval and computational biology. We give experimental results showing that the algorithms work well in practice.
暂无评论