Search results - Inner Mongolia University Library

ALGORITHMS FOR JUMBLED PATTERN MATCHING IN STRINGS

INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2012, Vol. 23, No. 2, pp. 357-374

Authors: Burcsi, Peter; Cicalese, Ferdinando; Fici, Gabriele; Liptak, Zsuzsanna
Affiliations: Eotvos Lorand Univ Fac Informat Dept Comp Algebra H-1117 Budapest Hungary; Univ Salerno Dipartimento Informat & Applicaz I-84084 Fisciano SA Italy; Univ Nice Sophia Antipolis CNRS Lab I3S F-06903 Sophia Antipolis France; Univ Bielefeld AG Genominformat Tech Fak D-33501 Bielefeld Germany

The Parikh vector $p(s)$ of a string $s$ over a finite ordered alphabet $\Sigma = \{a_1, \ldots, a_\sigma\}$ is defined as the vector of multiplicities of the characters, $p(s) = (p_1, \ldots, p_\sigma)$, where $p_i = |\{j \mid s_j = a_i\}|$. A Parikh vector $q$ occurs in $s$ if $s$ has a substring $t$ with $p(t) = q$. The problem of searching for a query $q$ in a text $s$ of length $n$ can be solved simply and worst-case optimally with a sliding window approach in $O(n)$ time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in $O(1)$ time. The preprocessing can be done trivially in $\Theta(n^2)$ time. The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size $\sigma \geq 2$ and has sub-linear expected time complexity. More precisely, we present two variants of the algorithm, both using an $O(n)$ size data structure, each of which can be constructed in $O(n)$ time. The first solution is very simple and easy to implement and leads to an expected query time of $O\left(n \left(\frac{\sigma}{\log \sigma}\right)^{1/2} \frac{\log m}{\sqrt{m}}\right)$, where $m = \sum_i q_i$ is the length of a string with Parikh vector $q$. The second uses wavelet trees and improves the expected runtime to $O\left(n \left(\frac{\sigma}{\log \sigma}\right)^{1/2} \frac{1}{\sqrt{m}}\right)$, i.e., by a factor of $\log m$.
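
The sliding window baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's index-based algorithms; the function name `jumbled_occurrences` and the representation of the query as a character-to-count dictionary are assumptions made for the example.

```python
from collections import Counter

def jumbled_occurrences(text, query):
    """Return start indices of substrings of `text` whose Parikh vector equals `query`.

    `query` maps characters to multiplicities (an assumed encoding of q).
    """
    target = Counter({ch: k for ch, k in query.items() if k > 0})
    m = sum(target.values())  # window length m = sum of the q_i
    n = len(text)
    if m == 0 or m > n:
        return []
    window = Counter(text[:m])
    hits = [0] if window == target else []
    for start in range(1, n - m + 1):
        outgoing = text[start - 1]
        window[outgoing] -= 1
        if window[outgoing] == 0:
            del window[outgoing]  # keep only positive counts so equality is exact
        window[text[start + m - 1]] += 1
        if window == target:
            hits.append(start)
    return hits

print(jumbled_occurrences("abbaab", {"a": 1, "b": 1}))
```

For simplicity the sketch compares the two count tables at every step, which costs a factor of the alphabet size; maintaining a counter of mismatching characters instead would give the O(n) bound stated above.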

Keywords: Parikh vectors; permuted strings; pattern matching; string algorithms; average case analysis

ELASTIC-DEGENERATE STRING MATCHING VIA FAST MATRIX MULTIPLICATION

SIAM JOURNAL ON COMPUTING, 2022, Vol. 51, No. 3, pp. 549-576

Authors: Bernardini, Giulia; Gawrychowski, Paweł; Pisanti, Nadia; Pissis, Solon P.; Rosone, Giovanna
Affiliations: CWI NL-1090 GB Amsterdam Netherlands; Univ Wroclaw PL-50370 Wroclaw Poland; Univ Pisa I-56126 Pisa Italy; Vrije Univ Amsterdam NL-1081 HV Amsterdam Netherlands

An elastic-degenerate (ED) string is a sequence of $n$ sets of strings of total length $N$ which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length $m$ in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an $O(nm^{1.5}\sqrt{\log m} + N)$-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that $N$ is substantially larger than both $n$ and $m$, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction, we show that a combinatorial algorithm solving the EDSM problem in $O(nm^{1.5-\epsilon} + N)$ time, for any $\epsilon > 0$, refutes this conjecture. Our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. String periodicity and fast Fourier transform are two standard tools in string algorithms. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a noncombinatorial $\tilde{O}(nm^{\omega-1} + N)$-time algorithm for EDSM, where $\omega$ denotes the matrix multiplication exponent and the $\tilde{O}(\cdot)$ notation suppresses polylog factors. To the best of our knowledge, we are the first to combine these tools. In particular, using the fact that $\omega < 2.373$ [Alman and Williams, SODA 2021; Le Gall, ISSAC
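
To make the EDSM problem statement concrete, here is a brute-force sketch. It is not the algorithm of the paper and uses none of its tools (periodicity, FFT, fast matrix multiplication). The function name `edsm_end_segments`, the encoding of the ED text as a list of Python sets, and the choice to report the index of the segment in which an occurrence ends are all assumptions made for illustration.

```python
def edsm_end_segments(ed_text, pattern):
    """Return sorted indices of ED-text segments in which an occurrence of `pattern` ends.

    `ed_text` is a list of sets of strings (a set may contain the empty string).
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # active holds lengths k (0 < k < m) such that pattern[:k] matches
    # a path through earlier segments ending exactly at the last boundary.
    active = set()
    ends = set()
    for idx, segment in enumerate(ed_text):
        new_active = set()
        for s in segment:
            # (a) Occurrence lying entirely inside one string of this segment.
            if pattern in s:
                ends.add(idx)
            # (b) Extend partial matches carried over the previous boundary.
            for k in active:
                rest = pattern[k:]
                if len(s) >= len(rest):
                    if s.startswith(rest):
                        ends.add(idx)
                elif rest.startswith(s):
                    new_active.add(k + len(s))
            # (c) Start new partial matches from suffixes of s.
            for i in range(len(s)):
                suffix = s[i:]
                if len(suffix) < m and pattern.startswith(suffix):
                    new_active.add(len(suffix))
        active = new_active
    return sorted(ends)

ed = [{"A"}, {"C", "CG", ""}, {"T"}]
print(edsm_end_segments(ed, "ACGT"))
```

In the example the pattern ACGT is spelled by choosing A, CG, and T from the three segments, while AT is spelled by choosing the empty string in the middle segment.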

Keywords: string algorithms; pattern matching; elastic-degenerate string; matrix multiplication; fast Fourier transform

An algorithm for mapping short reads to a dynamically changing genomic sequence

JOURNAL OF DISCRETE ALGORITHMS, 2012, Vol. 10, No. 1, pp. 15-22

Authors: Iliopoulos, Costas S.; Kourie, Derrick; Mouchard, Laurent; Musombuka, Themba K.; Pissis, Solon P.; de Ridder, Corne
Affiliations: Kings Coll London Dept Informat Strand London WC2R 2LS England; Curtin Univ Ctr Stringol Applicat Digital Ecosystems Business Intelligence Inst Perth WA 6845 Australia; Univ Pretoria Dept Comp Sci ZA-0002 Pretoria South Africa; Univ Rouen Syst & Informat Proc LITIS EA 4108 F-76821 Mont St Aignan Cedex France; Univ South Africa Sch Comp ZA-0003 Pretoria South Africa

Next-generation sequencing technologies have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads), during a single experiment, and with a much lower cost than previously possible. Due to the dramatic increase in the amount of data generated, a challenging task is to map (align) a set of reads to a reference genome. In this paper, we study a different version of this problem: mapping these reads to a dynamically changing genomic sequence. We propose a new practical algorithm, which employs a suitable data structure that takes into account potential dynamic effects (replacements, insertions, deletions) on the genomic sequence. The presented experimental results demonstrate that the proposed approach can be extended and applied to address the problem of mapping short reads to multiple related genomes. (C) 2011 Elsevier B.V. All rights reserved.

Keywords: string algorithms; Next-generation sequencing; Mapping

The three squares lemma revisited

JOURNAL OF DISCRETE ALGORITHMS, 2012, Vol. 11, No. 1, pp. 3-14

Authors: Kopylova, Evguenia; Smyth, W. F.
Affiliations: McMaster Univ Dept Comp & Software Algorithms Res Grp Hamilton ON L8S 4K1 Canada; Univ Lille 1 Bonsai LIFL Lille France; Curtin Univ Digital Ecosyst & Business Intelligence Inst Ctr Stringol & Applicat Perth WA 6845 Australia

A recent paper, Fan et al. (2006) [10], showed that the occurrence of two squares at the same position in a string, together with the occurrence of a third nearby, is possible only in very special circumstances, represented by 14 well-defined cases. Similar results were published in Simpson (2007) [19]. In this paper we begin the process of extending this research in two ways: first, by proving a "two squares" lemma for a case not considered in Fan et al. (2006) [10]; second, by showing that in other cases, when three squares occur, more precise results (a breakdown into highly periodic substrings easily recognized in a left-to-right scan of the string) can be obtained with weaker assumptions. The motivation for this research is, first, to show that the maximum number of runs (maximal periodicities) in a string is at most n; second, and more important, to provide a combinatorial basis for a new generation of algorithms that directly compute repetitions in strings without elaborate preprocessing. Based on extensive computation, we present conjectures that describe the combinatorial behavior in all 14 of the subcases that arise. We then prove the correctness of seven of these conjectures. Along the way we establish a new combinatorial lemma characterizing strings of which two rotations have the same period. (C) 2011 Elsevier B.V. All rights reserved.
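
As a small illustration of what "two squares at the same position" means, the following brute-force sketch lists the squares that start at a given position. It is only a definition check, not a technique from the paper; the function name `squares_at` is hypothetical.

```python
def squares_at(s, i):
    """Return all periods p such that s[i:i+2p] is a square uu with |u| = p."""
    periods = []
    max_p = (len(s) - i) // 2
    for p in range(1, max_p + 1):
        if s[i:i + p] == s[i + p:i + 2 * p]:
            periods.append(p)
    return periods

# In "aabaab" two squares start at position 0: "aa" (p = 1) and "aabaab" (p = 3).
print(squares_at("aabaab", 0))
```

The scan is quadratic per position, which is the kind of cost that the combinatorial results in the abstract aim to help avoid.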

Keywords: Combinatorics on words; string algorithms; Maximal periodicities; Runs; Repetitions; Three squares lemma

On Computing Average Common Substring Over Run Length Encoded Sequences

FUNDAMENTA INFORMATICAE, 2018, Vol. 163, No. 3, pp. 267-273

Authors: Hooshmand, Sahar; Tavakoli, Neda; Abedin, Paniz; Thankachan, Sharma V.
Affiliations: Univ Cent Florida Dept Comp Sci 117 Harris Ctr Bldg 1164000 Cent Florida Blvd Orlando FL 32816 USA; Georgia Inst Technol Sch Computat Sci & Engn Atlanta GA 30332 USA

The Average Common Substring (ACS) is a popular alignment-free distance measure for phylogeny reconstruction. The ACS of a sequence $X[1, x]$ w.r.t. another sequence $Y[1, y]$ is $\mathrm{ACS}(X, Y) = \frac{1}{x} \sum_{i=1}^{x} \max_{j} \mathrm{lcp}(X[i, x], Y[j, y])$. The $\mathrm{lcp}(\cdot, \cdot)$ of two input sequences is the length of their longest common prefix. The ACS can be computed in $O(n)$ space and time, where $n = x + y$ is the input size. Compressed string matching is the study of string matching problems with the following twist: the input data is in a compressed format and the underlying task must be performed with little or no decompression. In this paper, we revisit the ACS problem under this paradigm where the input sequences are given in their run-length encoded format. We present an algorithm to compute $\mathrm{ACS}(X, Y)$ in $O(N \log N)$ time using $O(N)$ space, where $N$ is the total length of sequences after run-length encoding.
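
The ACS definition can be evaluated directly, as in the sketch below. This is a naive evaluation of the formula on uncompressed strings, taking the maximum over all start positions in Y; it is neither the linear-time method nor the run-length encoded algorithm of the paper. The function names `lcp` and `acs` are chosen for the example, and indices are 0-based.

```python
def lcp(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:]."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k

def acs(x, y):
    """Average, over all suffixes of x, of the best lcp with any suffix of y."""
    if not x:
        raise ValueError("x must be non-empty")
    total = 0
    for i in range(len(x)):
        total += max((lcp(x, i, y, j) for j in range(len(y))), default=0)
    return total / len(x)

print(acs("abc", "abc"))
```

Note that the measure is not symmetric: `acs(x, y)` and `acs(y, x)` generally differ, which is why the abstract speaks of the ACS of X with respect to Y.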

Keywords: string algorithms; Suffix Trees; RL Encoding; Compression

A linear algorithm for string reconstruction in the reverse complement equivalence model

JOURNAL OF DISCRETE ALGORITHMS, 2012, Vol. 14, pp. 37-54

Authors: Cicalese, Ferdinando; Erdos, Peter L.; Liptak, Zsuzsanna
Affiliations: Univ Salerno Dipartimento Informat & Applicaz Fisciano Italy; Hungarian Acad Sci A Renyi Inst Math Budapest Hungary; Bielefeld Univ AG Genominformat Tech Fak Bielefeld Germany

In the reverse complement equivalence model, it is not possible to distinguish a string from its reverse complement. We show that one can still reconstruct a string of length n, up to reverse complement, using a linear number of subsequence queries of bounded length. We first give the proof for strings over a binary alphabet, and then extend it to arbitrary finite alphabets. A simple information theoretic lower bound proves the number of queries to be asymptotically tight. Furthermore, our result is optimal w.r.t. the bound on the query length given in Erdos et al. (2006) [6]. (C) 2011 Elsevier B.V. All rights reserved.
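
The indistinguishability at the heart of this model can be illustrated with a toy query oracle. The sketch assumes a DNA alphabet and assumes that a query asks whether a string u is a subsequence of the hidden string or of its reverse complement; the exact query definition in the paper may differ, and the names `reverse_complement`, `is_subsequence` and `rc_query` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(s):
    """Reverse the string and complement every character."""
    return "".join(COMPLEMENT[ch] for ch in reversed(s))

def is_subsequence(u, s):
    """True if u can be obtained from s by deleting characters."""
    it = iter(s)
    return all(ch in it for ch in u)

def rc_query(s, u):
    """Assumed oracle: answers for s and for its reverse complement at once."""
    return is_subsequence(u, s) or is_subsequence(u, reverse_complement(s))

hidden = "AACG"
# The oracle gives identical answers for a string and its reverse complement.
print(rc_query(hidden, "GT"), rc_query(reverse_complement(hidden), "GT"))
```

Because every answer is the same for a string and its reverse complement, reconstruction is only possible up to reverse complement, as the abstract states.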

Keywords: string reconstruction; Reverse complement; string algorithms; Subsequences; Subwords; Combinatorics on words

A Linear Space Data Structure for Range LCP Queries

FUNDAMENTA INFORMATICAE, 2018, Vol. 163, No. 3, pp. 245-251

Authors: Ganguly, Arnab; Patil, Manish; Shah, Rahul; Thankachan, Sharma V.
Affiliations: Univ Wisconsin Dept Comp Sci Whitewater WI 53190 USA; Facebook Inc Menlo Pk CA USA; Louisiana State Univ Dept Comp Sci Baton Rouge LA 70803 USA; Univ Cent Florida Dept Comp Sci 117 Harris Ctr Bldg 1164000 Cent Florida Blvd Orlando FL 32816 USA

Range LCP (longest common prefix) is an extension of the classical LCP problem and is defined as follows: Preprocess a string $S[1 \ldots n]$ of $n$ characters, such that whenever an interval $[i, j]$ comes as a query, we can report $\max\{|\mathrm{LCP}(S_p, S_q)| \mid i \le p < q \le j\}$. Here $\mathrm{LCP}(S_p, S_q)$ is the longest common prefix of the suffixes of $S$ starting at locations $p$ and $q$, and $|\mathrm{LCP}(S_p, S_q)|$ is its length. This problem was first addressed by Amir et al. [ISAAC, 2011]. They showed that the query can be answered in $O(\log \log n)$ time using an $O(n \log^{1+\epsilon} n)$ space data structure for an arbitrarily small constant $\epsilon > 0$. In an attempt to reduce the space bound, they presented a linear space data structure of $O(d \log \log n)$ query time, where $d = (j - i + 1)$. In this paper, we present a new linear space data structure with an improved query time of $O\left(\sqrt{d \log d} / (\log n)^{1/2-\epsilon}\right)$.
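
The query itself can be stated as a few lines of brute force, which may help when reading the bounds above. This sketch answers a query straight from the definition with no preprocessing; it is not the data structure of the paper. The function name `range_lcp` is an assumption, and positions are 0-based here whereas the abstract uses 1-based positions.

```python
def range_lcp(s, i, j):
    """Max LCP length over suffix pairs s[p:], s[q:] with i <= p < q <= j (0-based, inclusive)."""
    best = 0
    for p in range(i, j + 1):
        for q in range(p + 1, j + 1):
            k = 0
            while q + k < len(s) and s[p + k] == s[q + k]:
                k += 1
            best = max(best, k)
    return best

print(range_lcp("abab", 0, 3))
```

Only the starting positions are restricted to the interval; the common prefix itself may extend past position j, as in the first test case below.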

Keywords: string algorithms; Suffix Trees; Range Query

Computing Maximal Lyndon Substrings of a String

ALGORITHMS, 2020, Vol. 13, No. 11, p. 294

Authors: Franek, Frantisek; Liut, Michael
Affiliations: McMaster Univ Dept Comp & Software Hamilton ON L8S 4K1 Canada; Univ Toronto Dept Math & Computat Sci Mississauga ON L5L 1C6 Canada

There are two reasons to have an efficient algorithm for identifying all right-maximal Lyndon substrings of a string: firstly, Bannai et al. introduced in 2015 a linear algorithm to compute all runs of a string that relies on knowing all right-maximal Lyndon substrings of the input string, and secondly, Franek et al. showed in 2017 a linear equivalence of sorting suffixes and sorting right-maximal Lyndon substrings of a string, inspired by a novel suffix sorting algorithm of Baier. In 2016, Franek et al. presented a brief overview of algorithms for computing the Lyndon array that encodes the knowledge of right-maximal Lyndon substrings of the input string. Among those presented were two well-known algorithms for computing the Lyndon array: a quadratic in-place algorithm based on the iterated Duval algorithm for Lyndon factorization and a linear algorithmic scheme based on linear suffix sorting, computing the inverse suffix array, and applying to it the next smaller value algorithm. Duval's algorithm works for strings over any ordered alphabet, while for linear suffix sorting, a constant or an integer alphabet is required. The authors at that time were not aware of Baier's algorithm. In 2017, our research group proposed a novel algorithm for the Lyndon array. Though the proposed algorithm is linear in the average case and has O(nlog(n)) worst-case complexity, it is interesting as it emulates the fast Fourier algorithm's recursive approach and introduces tau-reduction, which might be of independent interest. In 2018, we presented a linear algorithm to compute the Lyndon array of a string inspired by Phase I of Baier's algorithm for suffix sorting. This paper presents the theoretical analysis of these two algorithms and provides empirical comparisons of both of their C++ implementations with respect to the iterated Duval algorithm.
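
The scheme described in the abstract (sort the suffixes, invert the suffix array, then apply next smaller value) can be sketched as below. To stay short the sketch sorts suffixes naively with Python's `sorted`, so it is not linear time and is not one of the algorithms analysed in the paper; the function name `lyndon_array` is an assumption.

```python
def lyndon_array(s):
    """Return L where L[i] is the length of the longest Lyndon substring starting at i."""
    n = len(s)
    # Naive suffix sorting; a linear-time suffix sorter would replace this step.
    suffix_array = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n  # inverse suffix array
    for r, i in enumerate(suffix_array):
        rank[i] = r
    result = [0] * n
    stack = []  # indices to the right of i, candidates for the next smaller rank
    for i in range(n - 1, -1, -1):
        while stack and rank[stack[-1]] > rank[i]:
            stack.pop()
        next_smaller = stack[-1] if stack else n
        result[i] = next_smaller - i
        stack.append(i)
    return result

print(lyndon_array("banana"))
```

For "banana" the longest Lyndon substrings are "b", "an", "n", "an", "n", "a", which matches the lengths the sketch returns.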

Keywords: combinatorics on words; string algorithms; regularities in strings; suffix sorting; Lyndon substrings; Lyndon arrays; right-maximal Lyndon substrings; tau-reduction algorithm; Baier's sort algorithm; iterative Duval algorithm

Parallel Longest Common SubSequence Analysis In Chapel

IEEE High Performance Extreme Computing Virtual Conference (HPEC)

Authors: Vahidi, Soroush; Schieber, Baruch; Du, Zhihui; Bader, David A.
Affiliations: New Jersey Inst Technol Dept Comp Sci Newark NJ 07102 USA; New Jersey Inst Technol Dept Data Sci Newark NJ USA

ISBN: (print) 9798350308600

One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu's parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm's performance.
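
For reference, the two-string LCS problem has a textbook dynamic-programming solution, sketched here sequentially in Python. This is not Lu and Liu's parallel algorithm and not the Chapel/Arkouda implementation described in the abstract; the function name `lcs_length` is an assumption, and only the length is returned.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

print(lcs_length("AGGTAB", "GXTXAYB"))
```

The table is filled row by row in time proportional to the product of the two lengths; each cell depends only on its left, upper and upper-left neighbours.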

Keywords: string algorithms; parallel computing; Chapel programming language

Hide and Mine in Strings: Hardness and Algorithms

20th IEEE International Conference on Data Mining (ICDM)

Authors: Bernardini, Giulia; Conte, Alessio; Gourdel, Garance; Grossi, Roberto; Loukides, Grigorios; Pisanti, Nadia; Pissis, Solon P.; Punzi, Giulia; Stougie, Leen; Sweering, Michelle
Affiliations: Univ Milano Bicocca Milan Italy; Univ Pisa Pisa Italy; ENS Paris Saclay Ecole Normale Super Inria Rennes Rennes France; ERABLE Team Montbonnot St Martin France; Kings Coll London London England; CWI Amsterdam Netherlands; Vrije Univ Amsterdam Netherlands

ISBN: (print) 9781728183169

We initiate a study on the fundamental relation between data sanitization (i.e., the process of hiding confidential information in a given dataset) and frequent pattern mining, in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns introducing, however, a number of spurious patterns that may harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is twofold. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under certain realistic assumptions on the problem parameters.

Keywords: data privacy; data sanitization; knowledge hiding; frequent pattern mining; string algorithms
