检索结果-内蒙古大学图书馆

作者： Hongyi Xin Carnegie Mellon University

学位级别：博士

Advancements in sequencing technology have brought a large increase in the quantity of raw sequencing data. To cope with the ever growing sequence data sets, efficient read mappers are developed to accurately reconstruct entire genomes from fragments, by mapping fragments of the genome of an organism to a high quality reference genome. For each fragment, a mapper tries to find a distinct correspond- ing highly similar sequence in the reference, or multiple highly similar sequences if applicable. Despite improvements of the mapping software, modern state-of-the-art mappers struggle between speed and sensitivity: as a mapper increases sensitiv- ity to tolerate more sequencer errors and genetic variations, it inevitably enlarges the search space and spend more time comparing fragments against increasingly dissimilar reference sequences. Comparisons over such false mappings are often unnecessary and wasteful, as dissimilar reference sequences are not considered as meaningful mappings. Furthermore, although only a small fraction of all DNA frag- ments require high mapping sensitivity due to large number of embedded errors, mappers have to raise sensitivity against all fragments, as the error profile of each fragment is unknown to the mapper. In this dissertation, we provide multiple algorithms and implementations that aim to reduce the amount of unnecessary computation spent on false mappings while achieving high sensitivity by 1) quickly and accurately filtering out false mappings and 2) reducing the total number of false mappings without sacrificing sensitivity through improved seeding mechanisms. Specifically, we designed SIMD-friendly algorithms that quickly identify false mappings with high accuracy. We also ex- tended a previously proposed approximate string matching algorithm to better suit biological applications. Finally, we developed multiple methods that enhance the popular seed-and-extend mapping strategy by increasing seed selectivity without re- ducing mapp

关键词： Read Mapping Approximate string Matching Seed-and-Extend Landau-Vishkin Error Tolerance string algorithms Suffix Trie Edit Distance Affine Gap

来源：评论

学校读者我要写书评

暂无评论

string cadences

引用

THEORETICAL COMPUTER SCIENCE 2017年 698卷 4-8页

作者： Amir, Amihood Apostolico, Alberto Gagie, Travis Landau, Gad M. Bar Ilan Univ Dept Comp Sci IL-52900 Ramat Gan Israel Johns Hopkins Univ Dept Comp Sci Baltimore MD 21218 USA Georgia Inst Technol Sch Computat Sci & Engn Coll Comp Klaus Adv Comp Bldg266 Ferst Dr Atlanta GA 30332 USA Diego Portales Univ Sch Comp Sci & Telecommun Santiago Chile Univ Haifa Dept Comp Sci IL-31905 Haifa Israel NYU Tandon Sch Engn Dept Comp Sci & Engn MetroTech Ctr 2 Brooklyn NY 11201 USA

Cadences are syntactic regularities in strings, of the family of periods, squares, and repetitions. We say a string has a cadence if a certain character is repeated at regular intervals, possibly with intervening occurrences of that character. We call the cadence anchored if the first interval must be the same length as the others. Although cadences' combinatorial properties have been explored, little work was done regarding the efficiency of their discovery. Recently, implementations involving cadences appeared in works on phylogenetic reconstruction, periodic subgraph mining, and monitoring events in computer networks. In this paper we begin a systematic study of the efficiency of finding cadences. We first give some basic definitions;we then give a sub-quadratic algorithm for determining whether a string has any cadence consisting of at least three occurrences of a character, and a nearly linear algorithm for finding all anchored cadences;finally, we propose a data structure that captures many features of cadences and allows for the efficient detection of many types of cadences. In particular, all sub-cadences can be detected and reported in time proportional to the sum of their lengths. (C) 2017 Elsevier B.V. All rights reserved.

关键词： string algorithms string regularities Pattern mining Cadences

来源：评论

学校读者我要写书评

暂无评论

Faster range minimum queries

引用

SOFTWARE-PRACTICE & EXPERIENCE 2018年第11期48卷 2043-2060页

作者： Kowalski, Tomasz M. Grabowski, Szymon Lodz Univ Technol Inst Appl Comp Sci Al Politech 11 PL-90924 Lodz Poland

y Range minimum query is an important building brick of many compressed data structures and string matching algorithms. Although this problem is essentially solved in theory, with sophisticated data structures allowing for constant time queries, practical performance and construction time also matter. Additionally, there are offline scenarios in which the number of queries, ie, q, is rather small and given beforehand, which encourages to use a simpler approach. In this work, we present a simple data structure, with very fast construction, which allows to handle queries in constant time on average. This algorithm, however, requires access to the input data during queries (which is not the case of sophisticated range minimum query solutions). We subsequently refine our technique, combining it with one of the existing succinct solutions with O(1) worst-case time queries and no access to the input array. The resulting hybrid is still a memory frugal data structure, spending usually up to about 3n bits and providing competitive query times, especially for wide ranges. We also show how to make our baseline data structure more compact. Experimental results demonstrate that the proposed block-based sparse table (BbST) variants are competitive to existing solutions, also in the offline scenario.

关键词： bulk queries range minimum query string algorithms

来源：评论

学校读者我要写书评

暂无评论

THE "RUNS" THEOREM

引用

SIAM JOURNAL ON COMPUTING 2017年第5期46卷 1501-1514页

作者： Bannai, Hideo Tomohiro, I Inenaga, Shunsuke Nakashima, Yuto Takeda, Masayuki Tsuruta, Kazuya Kyushu Univ Dept Informat Nishi Ku Fukkuoka 8190395 Japan TU Dortmund Dept Comp Sci D-44221 Dortmund Germany Kyushu Inst Technol Dept Artificial Intelligence Fukkuoka Japan

We give a new characterization of maximal repetitions (or runs) in strings based on Lyndon words. The characterization leads to a proof of what was known as the "runs" conjecture [R. M. Kolpakov and G. Kucherov, Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), IEEE Computer Society, Los Alamitos, CA, 1999, pp. 596-604]), which states that the maximum number of runs rho(n) in a string of length n is less than n. The proof is remarkably simple, considering the numerous endeavors to tackle this problem in the last 15 years, and significantly improves our understanding of how runs can occur in strings. In addition, we obtain an upper bound of 3n for the maximum sum of exponents sigma(n) of runs in a string of length n, improving on the best known bound of 4.1n by Crochemore et al. [J. Discrete algorithms, 14 (2012), pp. 29-36], as well as other improved bounds on related problems. The characterization also gives rise to a new, conceptually simple linear-time algorithm for computing all the runs in a string. A notable characteristic of our algorithm is that, unlike all existing linear-time algorithms, it does not utilize the Lempel-Ziv factorization of the string. We also establish a relationship between runs and nodes of the Lyndon tree, which gives a simple optimal solution to the 2-period query problem that was recently solved by Kociumaka et al.

关键词： maximal repetitions Lyndon words combinatorics on words string algorithms

来源：评论

学校读者我要写书评

暂无评论

Further Improvement in Approximating the Maximum Duo-Preservation string Mapping Problem 1

引用

16th International Workshop on algorithms in Bioinformatics (WABI)

作者： Brubach, Brian Univ Maryland Dept Comp Sci College Pk MD 20742 USA

ISBN: (数字)9783319436814

ISBN: (纸本)9783319436814;9783319436807

We present an improved approximation for the Maximum Duo-Preservation string Mapping Problem (MPSM). This problem was introduced in [7] as the complement to the well-studied Minimum Common string Partition problem (MCSP). Prior work also considers the k-MPSM and k-MCSP variants in which each letter occurs at most k times. The authors of [7] showed a k(2)-appoximation for k >= 3 and 2-approximation for k = 2. A 4-approximation independent of k was shown in [4]. In [4], they also showed that k-MPSM is APX-Hard and achieved approximation ratios of 8/5 for k = 2 and 3 for k = 3. In this paper, we show an algorithm which achieves a 13/4-approximation for the general MPSM problem using a new combinatorial triplet matching approach. During publication of this paper, [3] presented a local search algorithm yielding 7/2, which falls in between the previous best and this paper. The remainder of the paper has not been altered to reflect this.

关键词： string algorithms Polynomial-time approximation Max Duo-Preservation string Mapping Problem Min Common string Partition Problem

来源：评论

学校读者我要写书评

暂无评论

Inference algorithms for Pattern-Based CRFs on Sequence Data

引用

ALGORITHMICA 2016年第1期76卷 17-46页

作者： Kolmogorov, Vladimir Takhanov, Rustem IST Austria Campus 1 A-3400 Klosterneuburg Austria Nazarbayev Univ Dept Math 53 Qabanbay Batyr Ave Astana 010000 Kazakhstan

We consider Conditional random fields (CRFs) with pattern-based potentials defined on a chain. In this model the energy of a string (labeling) is the sum of terms over intervals [i, j] where each term is non-zero only if the substring equals a prespecified pattern w. Such CRFs can be naturally applied to many sequence tagging problems. We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) marginals, and (iii) computing the MAP. Their complexities are respectively , and where L is the combined length of input patterns, is the maximum length of a pattern, and D is the input alphabet. This improves on the previous algorithms of Ye et al. (NIPS, 2009) whose complexities are respectively , and , where is the number of input patterns. In addition, we give an efficient algorithm for sampling, and revisit the case of MAP with non-positive weights.

关键词： Conditional random fields Sequence tagging string algorithms

来源：评论

学校读者我要写书评

暂无评论

algorithms for Jumbled Indexing, Jumbled Border and Jumbled Square on run-length encoded strings

引用

THEORETICAL COMPUTER SCIENCE 2016年 656卷 146-159页

作者： Amir, Amihood Apostolico, Alberto Hirst, Tirza Landau, Gad M. Lewenstein, Noa Rozenberg, Liat Bar Ilan Univ IL-52100 Ramat Gan Israel Johns Hopkins Univ Baltimore MD 21218 USA Georgia Tech Atlanta GA USA CNR IASI I-00185 Rome Italy Machon Tal Jerusalem Israel Univ Haifa IL-31999 Haifa Israel NYU New York NY 10003 USA Netanya Coll Netanya Israel

In this paper we investigate jumbled (Abelian) versions of three classical strings problems. In all these problems we assume the input string S[1..n] is given in its run-length format S'[1..r]. The Jumbled Indexing problem is the problem of indexing a string S'[1..r] over vertical bar Sigma vertical bar for histogram queries, i.e. given a pattern P, we want to find all substrings of S that are permutations of P. We provide an algorithm that constructs an index of size O (r(2)vertical bar Sigma vertical bar) in time O (r(2)(logr + vertical bar Sigma vertical bar log vertical bar Sigma vertical bar)), which allows answering histogram queries in O(vertical bar E vertical bar(3) log r)-time. The Jumbled Border problem is the problem of finding for every location j in S, the longest proper prefix of S[1.. j] that is also a permutation of a proper suffix of 5[1.. j], if such exists. We provide an algorithm that solves this problem in O(vertical bar Sigma vertical bar(r(2) +n)) time, and O(vertical bar Sigma vertical bar n) space. A Jumbled Square is a string of the form x (x) over bar, where (x) over bar is a permutation of x. The Jumbled Square problem is the problem of finding for every location j in S, the longest jumbled square that ends in j, if such exists. We provide an algorithm that solves this problem in O(vertical bar Sigma vertical bar(r(2) + n)) time, and O(vertical bar Sigma vertical bar n) space. (C) 2016 Elsevier B.V. All rights reserved.

关键词： Pattern Matching Jumbled Indexing Abelian border Abelian square string algorithms

来源：评论

学校读者我要写书评

暂无评论

Generating square-free words efficiently

引用

THEORETICAL COMPUTER SCIENCE 2015年 601卷 67-72页

作者： Shur, Arseny M. Ural Fed Univ Ekaterinburg Russia

We study a simple algorithm generating square-free words from a random source. The source produces uniformly distributed random letters from a k-ary alphabet, and the algorithm outputs a (k+1)-ary square-free word. We are interested in the "conversion ratio" between the lengths of the input random word and the output square-free word. For any k >= 3 we prove the expected value of this ratio to be a constant and calculate it up to an O(1/k(5)) term. For the extremal case of ternary square-free words, we suggest this ratio to have a constant expectation as well and conjecture its actual value from computer experiments. (C) 2015 Elsevier B.V. All rights reserved.

关键词： Square-free word Random word string algorithms

来源：评论

学校读者我要写书评

暂无评论

Approximate periodicity

引用

INFORMATION AND COMPUTATION 2015年 241卷 215-226页

作者： Amir, Amihood Eisenberg, Estrella Levy, A. Bar Ilan Univ Dept Comp Sci IL-52900 Ramat Gan Israel Shenkar Coll Dept Software Engn Ramat Gan Israel Univ Haifa CRI IL-31905 Haifa Israel Johns Hopkins Univ Dept Comp Sci Baltimore MD 21218 USA

Finding an approximate period in a given string S of length n is defined as follows. Let S' be a periodic string closest to Sunder some distance metric, find the smallest period of S'. This period is called an approximate period of Sunder the given metric. Let the distance between the input string Sand a closest periodic string under the Hamming distance S' be k. We develop algorithms that construct an approximate period of Sunder the Hamming distance in time O (nk log logn) and under the swap distance in time O (n(2)). Finally, we show an O (n logn) algorithm for finite alphabets, and an O (n log(3) n) algorithm for infinite alphabets, that approximate the minimum number of mismatches between the input string and a closest periodic string under the Hamming distance. (C) 2015 Elsevier Inc. All rights reserved.

关键词： string algorithms Periodicity Approximate periodicity Approximate matching Hamming distance Swap distance

来源：评论

学校读者我要写书评

暂无评论

A note on the longest common substring with k-mismatches problem

引用

INFORMATION PROCESSING LETTERS 2015年第6-8期115卷 640-642页

作者： Grabowski, Szymon Lodz Univ Technol Inst Appl Comp Sci PL-90924 Lodz Poland

The recently introduced longest common substring with k-mismatches (k-LCF) problem is to find, given two sequences S-1 and S-2 of length n each, a longest substring A(1) of S-1 and A(2) of S-2 such that the Hamming distance between A(1) and A(2) is at most k. So far, the only subquadratic time result for this problem was known for k = 1 [5]. We present two output-dependent algorithms solving the k-LCF problem and show that for k = O (log(1-epsilon) n), where epsilon > 0, at least one of them works in subquadratic time, using O(n) words of space. The choice of one of these two algorithms to be applied for a given input can be done after linear time and space preprocessing. (c) 2015 Elsevier B.V. All rights reserved.

关键词： Combinatorial problems string algorithms Longest common substring with k-mismatches Word-level parallelism

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：