Advancements in sequencing technology have brought a large increase in the quantity of raw sequencing data. To cope with the ever growing sequence data sets, efficient read mappers are developed to accurately reconstr...
详细信息
Advancements in sequencing technology have brought a large increase in the quantity of raw sequencing data. To cope with the ever growing sequence data sets, efficient read mappers are developed to accurately reconstruct entire genomes from fragments, by mapping fragments of the genome of an organism to a high quality reference genome. For each fragment, a mapper tries to find a distinct correspond- ing highly similar sequence in the reference, or multiple highly similar sequences if applicable. Despite improvements of the mapping software, modern state-of-the-art mappers struggle between speed and sensitivity: as a mapper increases sensitiv- ity to tolerate more sequencer errors and genetic variations, it inevitably enlarges the search space and spend more time comparing fragments against increasingly dissimilar reference sequences. Comparisons over such false mappings are often unnecessary and wasteful, as dissimilar reference sequences are not considered as meaningful mappings. Furthermore, although only a small fraction of all DNA frag- ments require high mapping sensitivity due to large number of embedded errors, mappers have to raise sensitivity against all fragments, as the error profile of each fragment is unknown to the mapper. In this dissertation, we provide multiple algorithms and implementations that aim to reduce the amount of unnecessary computation spent on false mappings while achieving high sensitivity by 1) quickly and accurately filtering out false mappings and 2) reducing the total number of false mappings without sacrificing sensitivity through improved seeding mechanisms. Specifically, we designed SIMD-friendly algorithms that quickly identify false mappings with high accuracy. We also ex- tended a previously proposed approximate string matching algorithm to better suit biological applications. Finally, we developed multiple methods that enhance the popular seed-and-extend mapping strategy by increasing seed selectivity without re- ducing mapp
Cadences are syntactic regularities in strings, of the family of periods, squares, and repetitions. We say a string has a cadence if a certain character is repeated at regular intervals, possibly with intervening occu...
详细信息
Cadences are syntactic regularities in strings, of the family of periods, squares, and repetitions. We say a string has a cadence if a certain character is repeated at regular intervals, possibly with intervening occurrences of that character. We call the cadence anchored if the first interval must be the same length as the others. Although cadences' combinatorial properties have been explored, little work was done regarding the efficiency of their discovery. Recently, implementations involving cadences appeared in works on phylogenetic reconstruction, periodic subgraph mining, and monitoring events in computer networks. In this paper we begin a systematic study of the efficiency of finding cadences. We first give some basic definitions;we then give a sub-quadratic algorithm for determining whether a string has any cadence consisting of at least three occurrences of a character, and a nearly linear algorithm for finding all anchored cadences;finally, we propose a data structure that captures many features of cadences and allows for the efficient detection of many types of cadences. In particular, all sub-cadences can be detected and reported in time proportional to the sum of their lengths. (C) 2017 Elsevier B.V. All rights reserved.
y Range minimum query is an important building brick of many compressed data structures and string matching algorithms. Although this problem is essentially solved in theory, with sophisticated data structures allowin...
详细信息
y Range minimum query is an important building brick of many compressed data structures and string matching algorithms. Although this problem is essentially solved in theory, with sophisticated data structures allowing for constant time queries, practical performance and construction time also matter. Additionally, there are offline scenarios in which the number of queries, ie, q, is rather small and given beforehand, which encourages to use a simpler approach. In this work, we present a simple data structure, with very fast construction, which allows to handle queries in constant time on average. This algorithm, however, requires access to the input data during queries (which is not the case of sophisticated range minimum query solutions). We subsequently refine our technique, combining it with one of the existing succinct solutions with O(1) worst-case time queries and no access to the input array. The resulting hybrid is still a memory frugal data structure, spending usually up to about 3n bits and providing competitive query times, especially for wide ranges. We also show how to make our baseline data structure more compact. Experimental results demonstrate that the proposed block-based sparse table (BbST) variants are competitive to existing solutions, also in the offline scenario.
We give a new characterization of maximal repetitions (or runs) in strings based on Lyndon words. The characterization leads to a proof of what was known as the "runs" conjecture [R. M. Kolpakov and G. Kuche...
详细信息
We give a new characterization of maximal repetitions (or runs) in strings based on Lyndon words. The characterization leads to a proof of what was known as the "runs" conjecture [R. M. Kolpakov and G. Kucherov, Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), IEEE Computer Society, Los Alamitos, CA, 1999, pp. 596-604]), which states that the maximum number of runs rho(n) in a string of length n is less than n. The proof is remarkably simple, considering the numerous endeavors to tackle this problem in the last 15 years, and significantly improves our understanding of how runs can occur in strings. In addition, we obtain an upper bound of 3n for the maximum sum of exponents sigma(n) of runs in a string of length n, improving on the best known bound of 4.1n by Crochemore et al. [J. Discrete algorithms, 14 (2012), pp. 29-36], as well as other improved bounds on related problems. The characterization also gives rise to a new, conceptually simple linear-time algorithm for computing all the runs in a string. A notable characteristic of our algorithm is that, unlike all existing linear-time algorithms, it does not utilize the Lempel-Ziv factorization of the string. We also establish a relationship between runs and nodes of the Lyndon tree, which gives a simple optimal solution to the 2-period query problem that was recently solved by Kociumaka et al.
We present an improved approximation for the Maximum Duo-Preservation string Mapping Problem (MPSM). This problem was introduced in [7] as the complement to the well-studied Minimum Common string Partition problem (MC...
详细信息
ISBN:
(数字)9783319436814
ISBN:
(纸本)9783319436814;9783319436807
We present an improved approximation for the Maximum Duo-Preservation string Mapping Problem (MPSM). This problem was introduced in [7] as the complement to the well-studied Minimum Common string Partition problem (MCSP). Prior work also considers the k-MPSM and k-MCSP variants in which each letter occurs at most k times. The authors of [7] showed a k(2)-appoximation for k >= 3 and 2-approximation for k = 2. A 4-approximation independent of k was shown in [4]. In [4], they also showed that k-MPSM is APX-Hard and achieved approximation ratios of 8/5 for k = 2 and 3 for k = 3. In this paper, we show an algorithm which achieves a 13/4-approximation for the general MPSM problem using a new combinatorial triplet matching approach. During publication of this paper, [3] presented a local search algorithm yielding 7/2, which falls in between the previous best and this paper. The remainder of the paper has not been altered to reflect this.
We consider Conditional random fields (CRFs) with pattern-based potentials defined on a chain. In this model the energy of a string (labeling) is the sum of terms over intervals [i, j] where each term is non-zero only...
详细信息
We consider Conditional random fields (CRFs) with pattern-based potentials defined on a chain. In this model the energy of a string (labeling) is the sum of terms over intervals [i, j] where each term is non-zero only if the substring equals a prespecified pattern w. Such CRFs can be naturally applied to many sequence tagging problems. We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) marginals, and (iii) computing the MAP. Their complexities are respectively , and where L is the combined length of input patterns, is the maximum length of a pattern, and D is the input alphabet. This improves on the previous algorithms of Ye et al. (NIPS, 2009) whose complexities are respectively , and , where is the number of input patterns. In addition, we give an efficient algorithm for sampling, and revisit the case of MAP with non-positive weights.
In this paper we investigate jumbled (Abelian) versions of three classical strings problems. In all these problems we assume the input string S[1..n] is given in its run-length format S'[1..r]. The Jumbled Indexin...
详细信息
In this paper we investigate jumbled (Abelian) versions of three classical strings problems. In all these problems we assume the input string S[1..n] is given in its run-length format S'[1..r]. The Jumbled Indexing problem is the problem of indexing a string S'[1..r] over vertical bar Sigma vertical bar for histogram queries, i.e. given a pattern P, we want to find all substrings of S that are permutations of P. We provide an algorithm that constructs an index of size O (r(2)vertical bar Sigma vertical bar) in time O (r(2)(logr + vertical bar Sigma vertical bar log vertical bar Sigma vertical bar)), which allows answering histogram queries in O(vertical bar E vertical bar(3) log r)-time. The Jumbled Border problem is the problem of finding for every location j in S, the longest proper prefix of S[1.. j] that is also a permutation of a proper suffix of 5[1.. j], if such exists. We provide an algorithm that solves this problem in O(vertical bar Sigma vertical bar(r(2) +n)) time, and O(vertical bar Sigma vertical bar n) space. A Jumbled Square is a string of the form x (x) over bar, where (x) over bar is a permutation of x. The Jumbled Square problem is the problem of finding for every location j in S, the longest jumbled square that ends in j, if such exists. We provide an algorithm that solves this problem in O(vertical bar Sigma vertical bar(r(2) + n)) time, and O(vertical bar Sigma vertical bar n) space. (C) 2016 Elsevier B.V. All rights reserved.
We study a simple algorithm generating square-free words from a random source. The source produces uniformly distributed random letters from a k-ary alphabet, and the algorithm outputs a (k+1)-ary square-free word. We...
详细信息
We study a simple algorithm generating square-free words from a random source. The source produces uniformly distributed random letters from a k-ary alphabet, and the algorithm outputs a (k+1)-ary square-free word. We are interested in the "conversion ratio" between the lengths of the input random word and the output square-free word. For any k >= 3 we prove the expected value of this ratio to be a constant and calculate it up to an O(1/k(5)) term. For the extremal case of ternary square-free words, we suggest this ratio to have a constant expectation as well and conjecture its actual value from computer experiments. (C) 2015 Elsevier B.V. All rights reserved.
Finding an approximate period in a given string S of length n is defined as follows. Let S' be a periodic string closest to Sunder some distance metric, find the smallest period of S'. This period is called an...
详细信息
Finding an approximate period in a given string S of length n is defined as follows. Let S' be a periodic string closest to Sunder some distance metric, find the smallest period of S'. This period is called an approximate period of Sunder the given metric. Let the distance between the input string Sand a closest periodic string under the Hamming distance S' be k. We develop algorithms that construct an approximate period of Sunder the Hamming distance in time O (nk log logn) and under the swap distance in time O (n(2)). Finally, we show an O (n logn) algorithm for finite alphabets, and an O (n log(3) n) algorithm for infinite alphabets, that approximate the minimum number of mismatches between the input string and a closest periodic string under the Hamming distance. (C) 2015 Elsevier Inc. All rights reserved.
The recently introduced longest common substring with k-mismatches (k-LCF) problem is to find, given two sequences S-1 and S-2 of length n each, a longest substring A(1) of S-1 and A(2) of S-2 such that the Hamming di...
详细信息
The recently introduced longest common substring with k-mismatches (k-LCF) problem is to find, given two sequences S-1 and S-2 of length n each, a longest substring A(1) of S-1 and A(2) of S-2 such that the Hamming distance between A(1) and A(2) is at most k. So far, the only subquadratic time result for this problem was known for k = 1 [5]. We present two output-dependent algorithms solving the k-LCF problem and show that for k = O (log(1-epsilon) n), where epsilon > 0, at least one of them works in subquadratic time, using O(n) words of space. The choice of one of these two algorithms to be applied for a given input can be done after linear time and space preprocessing. (c) 2015 Elsevier B.V. All rights reserved.
暂无评论