Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close ...
详细信息
Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most strings problem is, given a set S of the same-length strings, and a parameter d, find a string x that maximizes the number of "non-outliers" within Hamming distance d of x. We prove that this problem has no polynomial-time approximation scheme (PTAS) unless NP has randomized polynomial-time algorithms, correcting a decade-old erroneous proof made previously in the literature. The Most strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k;we show it has no PTAS unless P = NP. We also observe Closest to k strings has no efficient PTAS (EPTAS) unless a parameterized complexity hierarchy collapses. In sum, outliers help model problems associated with using biological data, but we show the problem of finding an approximate solution is computationally difficult. (C) 2013 Elsevier B.V. All rights reserved.
As discussed at length in Christodoulakis et al. (2015) [3], there is a natural one-many correspondence between simple undirected graphs G with vertex set V = {l, 2,..., n) and indeterminate strings x = x[1..n] - that...
详细信息
As discussed at length in Christodoulakis et al. (2015) [3], there is a natural one-many correspondence between simple undirected graphs G with vertex set V = {l, 2,..., n) and indeterminate strings x = x[1..n] - that is, sequences of subsets of some alphabet Sigma. In this paper, given g, we consider the "reverse engineering" problem of computing a corresponding x on an alphabet Sigma(min) of minimum cardinality. This turns out to be equivalent to the NP-hard problem of computing the intersection number of G, thus in turn equivalent to the clique cover problem. We describe a heuristic algorithm that computes an approximation to Sigma(min) and a corresponding x. We give various properties of our algorithm, including some experimental evidence that on average it requires O(n(2) logn) time. We compare it with other heuristics, and state some conjectures and open problems. (C) 2017 Elsevier B.V. All rights reserved.
Two linear time algorithms are presented. One for determining, for every position in a given square matrix, the longest prefix of a given pattern (also a square matrix) that occurs at that position and one for computi...
详细信息
Two linear time algorithms are presented. One for determining, for every position in a given square matrix, the longest prefix of a given pattern (also a square matrix) that occurs at that position and one for computing all square covers of a given two-dimensional square matrix.
In the Minimum Common string Partition problem (MCSP), we are given two strings on input, and we wish to partition them into the same collection of substrings, minimizing the number of the substrings in the partition....
详细信息
In the Minimum Common string Partition problem (MCSP), we are given two strings on input, and we wish to partition them into the same collection of substrings, minimizing the number of the substrings in the partition. This problem is NP-hard, even for a special case, denoted 2-MCSP, where each letter occurs at most twice in each input string. We study a greedy algorithm for MCSP that at each step extracts a longest common substring from the given strings. We show that the approximation ratio of this algorithm is between Omega(n(0.43)) and O (n(0)(.6)(9)). In the case of 2-MCSP, we show that the approximation ratio is equal to 3. For 4-MCSP, we give a lower bound of Omega(log n).
Cadences are syntactic regularities in strings, of the family of periods, squares, and repetitions. We say a string has a cadence if a certain character is repeated at regular intervals, possibly with intervening occu...
详细信息
Cadences are syntactic regularities in strings, of the family of periods, squares, and repetitions. We say a string has a cadence if a certain character is repeated at regular intervals, possibly with intervening occurrences of that character. We call the cadence anchored if the first interval must be the same length as the others. Although cadences' combinatorial properties have been explored, little work was done regarding the efficiency of their discovery. Recently, implementations involving cadences appeared in works on phylogenetic reconstruction, periodic subgraph mining, and monitoring events in computer networks. In this paper we begin a systematic study of the efficiency of finding cadences. We first give some basic definitions;we then give a sub-quadratic algorithm for determining whether a string has any cadence consisting of at least three occurrences of a character, and a nearly linear algorithm for finding all anchored cadences;finally, we propose a data structure that captures many features of cadences and allows for the efficient detection of many types of cadences. In particular, all sub-cadences can be detected and reported in time proportional to the sum of their lengths. (C) 2017 Elsevier B.V. All rights reserved.
The Binary Jumbled string Matching Problem is defined as follows: Given a string s over {a, b} of length n and a query (x, y), with x, y non-negative integers, decide whether s has a substring t with exactly x a's...
详细信息
The Binary Jumbled string Matching Problem is defined as follows: Given a string s over {a, b} of length n and a query (x, y), with x, y non-negative integers, decide whether s has a substring t with exactly x a's and y b's. Previous solutions created an index of size O(n) in a pre-processing step, which was then used to answer queries in constant time. The fastest algorithms for construction of this index have running time O(n(2)/logn) (Burcsi et al., 2010 [1];Moosa and Rahman, 2010 [7]), or O(n(2)/log(2) n) in the word-RAM model (Moosa and Rahman, 2012 [8]). We propose an index constructed directly from the run-length encoding of s. The construction time of our index is O(n + rho(2) log rho), where O(n) is the time for computing the run-length encoding of s and rho is the length of this encoding-this is no worse than previous solutions if rho = O(n/logn) and better if rho = O(n/ logn). Our index L can be queried in O(log rho) time. While vertical bar L vertical bar = O(min(n, rho(2))) in the worst case, preliminary investigations have indicated that vertical bar L vertical bar may often be close to rho. Furthermore, the algorithm for constructing the index is conceptually simple and easy to implement. In an attempt to shed light on the structure and size of our index, we characterize it in terms of the prefix normal forms of s introduced in Fici and Liptak (2011) [6]. (C) 2013 Elsevier B.V. All rights reserved.
Given a string s on an alphabet Sigma, a word-length k and a budget D, we want to determine the smallest number of distinct k-mers that can be left in s, if we are allowed to replace up to D letters of s. This problem...
详细信息
Given a string s on an alphabet Sigma, a word-length k and a budget D, we want to determine the smallest number of distinct k-mers that can be left in s, if we are allowed to replace up to D letters of s. This problem has several parameters, and we discuss its complexity under all sorts of restrictions on the parameters values. We prove that some versions of the problem axe polynomial, while others are NP-hard. We also introduce some Integer Programming formulations to model the NP-hard cases.
A string UU for a non-empty string U is called a square. Squares have been well-studied both from a combinatorial and an algorithmic perspective. In this paper, we are the first to consider the problem of maintaining ...
详细信息
ISBN:
(纸本)9783959771245
A string UU for a non-empty string U is called a square. Squares have been well-studied both from a combinatorial and an algorithmic perspective. In this paper, we are the first to consider the problem of maintaining a representation of the squares in a dynamic string S of length at most n. We present an algorithm that updates this representation in n(o)(1) time. This representation allows us to report a longest square-substring of S in O(1) time and all square-substrings of S in O(output) time. We achieve this by introducing a novel tool - maintaining prefix-suffix matches of two dynamic strings. We extend the above result to address the problem of maintaining a representation of all runs (maximal repetitions) of the string. Runs are known to capture the periodic structure of a string, and, as an application, we show that our representation of runs allows us to efficiently answer periodicity queries for substrings of a dynamic string. These queries have proven useful in static pattern matching problems and our techniques have the potential of offering solutions to these problems in a dynamic text setting.
An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem c...
详细信息
ISBN:
(纸本)9783031206238;9783031206245
An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an (O) over tilde (nm(omega-1)) + O(N)-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where omega denotes the matrix multiplication exponent and the (O) over tilde(center dot) notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in O(k(2) mG + kN) time under edit distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k = 1, the existing algorithm runs in Omega(mN) time in the worst case. Here we make progress in this direction. We show that 1-EDSM can be solved in O((nm(2) + N) log m) or O(nm(3) + N) time under edit distance. For the decision version of the problem, we present a faster O(nm(2) root log m + N log log m)-time algorithm. Our algorithms rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or range emptiness), which we show how to solve efficiently.
A factorization of a string w is a partition of w into substrings u(1),..., u(k) such that w = u(1)u(2) ... u(k). Such a partition is called equality-free if no two factors are equal: u(i) not equal u(j), for all i, j...
详细信息
ISBN:
(纸本)9783030389192;9783030389185
A factorization of a string w is a partition of w into substrings u(1),..., u(k) such that w = u(1)u(2) ... u(k). Such a partition is called equality-free if no two factors are equal: u(i) not equal u(j), for all i, j with i not equal j. The maximum equality-free factorization problem is to decide, for a given string w and integer k, whether w admits an equality-free factorization with k factors. Equality-free factorizations have lately received attention because of their application in DNA self-assembly. Condon et al. (CPM 2012) study a version of the problem and show that it is NP-complete to decide if there exists an equality-free factorization with an upper bound on the length of the factors. At STACS 2015, Fernau et al. show that the maximum equality-free factorization problem with a lower bound on the number of factors is NP-complete. Shortly after, Schmid (CiE 2015) presents results concerning the Fixed Parameter Tractability of the problems. In this paper we approach equality free factorizations from a practical point of view i.e. we wish to obtain good solutions on given instances. To this end, we provide approximation algorithms, heuristics, Integer Programming models, an improved FPT algorithm and we also conduct experiments to analyze the performance of our proposed algorithms. Additionally, we study a relaxed version of the problem where gaps are allowed between factors and we design a constant factor approximation algorithm for this case. Surprisingly, after extensive experiments we conjecture that the relaxed problem has the same optimum as the original.
暂无评论