The Parikh vector p(s) of a string s over a finite ordered alphabet Sigma = {a(1), . . . , a(sigma)} is defined as the vector of multiplicities of the characters, p(s) = (p(1), . . . , p(sigma)), where p(i) = vertical...
详细信息
The Parikh vector p(s) of a string s over a finite ordered alphabet Sigma = {a(1), . . . , a(sigma)} is defined as the vector of multiplicities of the characters, p(s) = (p(1), . . . , p(sigma)), where p(i) = vertical bar{j vertical bar s(j) = a(i)}vertical bar. Parikh vector q occurs in s if s has a substring t with p(t) = q. The problem of searching for a query q in a text s of length n can be solved simply and worst-case optimally with a sliding window approach in O(n) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in O(1) time. The preprocessing can be done trivially in Theta(n(2)) time. The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size sigma >= 2 and has sub-linear expected time complexity. More precisely, we present two variants of the algorithm, both using an O(n) size data structure, each of which can be constructed in O(n) time. The first solution is very simple and easy to implement and leads to an expected query time of O(n(sigma/log sigma)(1/2) log m/root m), where m = Sigma(i) q(i) is the length of a string with Parikh vector q. The second uses wavelet trees and improves the expected runtime to O(n(sigma/log sigma)(1/2) 1 root m), i.e., by a factor of log m.
An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences ...
详细信息
An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an O(nm(1.5) root logm+N)-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that N is substantially larger than both n and m, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction, we show that a combinatorial algorithm solving the EDSM problem in O(nm(1.5-epsilon) + N) time, for any epsilon > 0, refutes this conjecture. Our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. string periodicity and fast Fourier transform are two standard tools in string algorithms. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a noncombinatorial (O) over tilde (nm(omega-1) + N)-time algorithm for EDSM, where \omega denotes the matrix multiplication exponent and the (O) over tilde(center dot) notation suppresses polylog factors. To the best of our knowledge, we are the first to combine these tools. In particular, using the fact that omega < 2.373 [Alman and Williams, SODA 2021;Le Gall, ISSAC
Next-generation sequencing technologies have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads), during a single experiment, and with a much lower co...
详细信息
Next-generation sequencing technologies have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads), during a single experiment, and with a much lower cost than previously possible. Due to the dramatic increase in the amount of data generated, a challenging task is to map (align) a set of reads to a reference genome. In this paper, we study a different version of this problem: mapping these reads to a dynamically changing genomic sequence. We propose a new practical algorithm, which employs a suitable data structure that takes into account potential dynamic effects (replacements, insertions, deletions) on the genomic sequence. The presented experimental results demonstrate that the proposed approach can be extended and applied to address the problem of mapping short reads to multiple related genomes. (C) 2011 Elsevier B.V. All rights reserved.
A recent paper Fan et al. (2006) [10] showed that the occurrence of two squares at the same position in a string, together with the occurrence of a third near by, is possible only in very special circumstances, repres...
详细信息
A recent paper Fan et al. (2006) [10] showed that the occurrence of two squares at the same position in a string, together with the occurrence of a third near by, is possible only in very special circumstances, represented by 14 well-defined cases. Similar results were published in Simpson (2007) [19]. In this paper we begin the process of extending this research in two ways: first, by proving a "two squares" lemma for a case not considered in Fan et al. (2006) [10];second, by showing that in other cases, when three squares occur, more precise results - a breakdown into highly periodic substrings easily recognized in a left-to-right scan of the string - can be obtained with weaker assumptions. The motivation for this research is, first, to show that the maximum number of runs (maximal periodicities) in a string is at most n;second, and more important, to provide a combinatorial basis for a new generation of algorithms that directly compute repetitions in strings without elaborate preprocessing. Based on extensive computation, we present conjectures that describe the combinatorial behavior in all 14 of the subcases that arise. We then prove the correctness of seven of these conjectures. Along the way we establish a new combinatorial lemma characterizing strings of which two rotations have the same period. (C) 2011 Elsevier B.V. All rights reserved.
The Average Common Substring (ACS) is a popular alignment-free distance measure for phylogeny reconstruction. The ACS of a sequence X[1, x] w.r.t. another sequence Y[1, y] is ACS(X, Y) = 1/x Sigma(x)(i=1) max lcp(X[i,...
详细信息
The Average Common Substring (ACS) is a popular alignment-free distance measure for phylogeny reconstruction. The ACS of a sequence X[1, x] w.r.t. another sequence Y[1, y] is ACS(X, Y) = 1/x Sigma(x)(i=1) max lcp(X[i, x], Y[j, y]) The lcp(., .) of two input sequences is the length of their longest common prefix. The ACS can be computed in O(n) space and time, where n = x + y is the input size. The compressed string matching is the study of string matching problems with the following twist: the input data is in a compressed format and the underling task must be performed with little or no decompression. In this paper, we revisit the ACS problem under this paradigm where the input sequences are given in their run-length encoded format. We present an algorithm to compute ACS(X, Y) in O(N logN) time using O(N) space, where N is the total length of sequences after run-length encoding.
In the reverse complement equivalence model, it is not possible to distinguish a string from its reverse complement. We show that one can still reconstruct a string of length n, up to reverse complement, using a linea...
详细信息
In the reverse complement equivalence model, it is not possible to distinguish a string from its reverse complement. We show that one can still reconstruct a string of length n, up to reverse complement, using a linear number of subsequence queries of bounded length. We first give the proof for strings over a binary alphabet, and then extend it to arbitrary finite alphabets. A simple information theoretic lower bound proves the number of queries to be asymptotically tight. Furthermore, our result is optimal w.r.t. the bound on the query length given in Erdos et al. (2006) [6]. (C) 2011 Elsevier B.V. All rights reserved.
Range LCP ( longest common prefix) is an extension of the classical LCP problem and is defined as follows: Preprocess a string S[1...n] of n characters, such that whenever an interval [i, j] comes as a query, we can r...
详细信息
Range LCP ( longest common prefix) is an extension of the classical LCP problem and is defined as follows: Preprocess a string S[1...n] of n characters, such that whenever an interval [i, j] comes as a query, we can report max{vertical bar LCP(S-p, S-q) vertical bar vertical bar i <= p < q <= j} Here LCP(S-p, S-q) is the longest common prefix of the suffixes of S starting at locations p and q, and vertical bar LCP(S-p, S-q)j is its length. This problem was first addressed by Amir et al. [ISAAC, 2011]. They showed that the query can be answered in O(log log n) time using an O(n log(1+epsilon) n) space data structure for an arbitrarily small constant epsilon > 0. In an attempt to reduce the space bound, they presented a linear space data structure of O(d log log n) query time, where d = (j-i+1) In this paper, we present a new linear space data structure with an improved query time of O (root dlog d/(log n)(1/2-epsilon)).
There are two reasons to have an efficient algorithm for identifying all right-maximal Lyndon substrings of a string: firstly, Bannai et al. introduced in 2015 a linear algorithm to compute all runs of a string that r...
详细信息
There are two reasons to have an efficient algorithm for identifying all right-maximal Lyndon substrings of a string: firstly, Bannai et al. introduced in 2015 a linear algorithm to compute all runs of a string that relies on knowing all right-maximal Lyndon substrings of the input string, and secondly, Franek et al. showed in 2017 a linear equivalence of sorting suffixes and sorting right-maximal Lyndon substrings of a string, inspired by a novel suffix sorting algorithm of Baier. In 2016, Franek et al. presented a brief overview of algorithms for computing the Lyndon array that encodes the knowledge of right-maximal Lyndon substrings of the input string. Among those presented were two well-known algorithms for computing the Lyndon array: a quadratic in-place algorithm based on the iterated Duval algorithm for Lyndon factorization and a linear algorithmic scheme based on linear suffix sorting, computing the inverse suffix array, and applying to it the next smaller value algorithm. Duval's algorithm works for strings over any ordered alphabet, while for linear suffix sorting, a constant or an integer alphabet is required. The authors at that time were not aware of Baier's algorithm. In 2017, our research group proposed a novel algorithm for the Lyndon array. Though the proposed algorithm is linear in the average case and has O(nlog(n)) worst-case complexity, it is interesting as it emulates the fast Fourier algorithm's recursive approach and introduces tau-reduction, which might be of independent interest. In 2018, we presented a linear algorithm to compute the Lyndon array of a string inspired by Phase I of Baier's algorithm for suffix sorting. This paper presents the theoretical analysis of these two algorithms and provides empirical comparisons of both of their C++ implementations with respect to the iterated Duval algorithm.
One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a...
详细信息
ISBN:
(纸本)9798350308600
One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu's parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm's performance.
We initiate a study on the fundamental relation between data sanitization (i.e., the process of hiding confidential information in a given dataset) and frequent pattern mining, in the context of sequential (string) da...
详细信息
ISBN:
(纸本)9781728183169
We initiate a study on the fundamental relation between data sanitization (i.e., the process of hiding confidential information in a given dataset) and frequent pattern mining, in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns introducing, however, a number of spurious patterns that may harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is twofold. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under certain realistic assumptions on the problem parameters.
暂无评论