Suffix array (SA) construction is a time-and-memory bottleneck in many string processing applications. In this paper we improve the runtime of a small-space (semi-external) SA construction algorithm by Kärkkäinen (TCS, 2007) [5]. We achieve a speedup in practice of 2-4 times, without increasing memory usage. Our main contribution is a way to implement the "pointer copying" heuristic, used in less space-efficient SA construction algorithms, in a memory-efficient way. © 2013 Elsevier B.V. All rights reserved.
We present in this article a linear time and space method for the computation of the length of a repeated suffix for each prefix of a given word p. Our method is based on the factor oracle of p, a new and very compact structure introduced in [1] for representing all the factors of p. We exhibit applications where our method really speeds up the computation of repetitions in words.
We present solutions for the k-mismatch pattern matching problem with don't cares. Given a text t of length n and a pattern p of length m with don't care symbols and a bound k, our algorithms find all the places that the pattern matches the text with at most k mismatches. We first give a Θ(n(k + log m log k) log n) time randomised algorithm which finds the correct answer with high probability. We then present a new deterministic Θ(nk^2 log^2 m) time solution that uses tools originally developed for group testing. Taking our derandomisation approach further we develop an approach based on k-selectors that runs in Θ(nk polylog m) time. Further, in each case the location of the mismatches at each alignment is also given at no extra cost. © 2009 Elsevier Inc. All rights reserved.
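To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not one of the algorithms from the abstract: it runs in O(nm) time and only serves as a reference for what a correct answer looks like. The function name, the `?` wildcard character, and the choice to accept wildcards in the text as well as the pattern are assumptions made for illustration.

```python
def k_mismatch_with_wildcards(text, pattern, k, wildcard="?"):
    """Naive O(nm) baseline (illustrative only, not the paper's method).

    Returns a list of (position, mismatch_offsets) for every alignment
    of pattern against text that has at most k mismatches. A wildcard
    on either side matches any symbol.
    """
    n, m = len(text), len(pattern)
    results = []
    for i in range(n - m + 1):
        mismatches = []
        for j in range(m):
            a, b = text[i + j], pattern[j]
            if a != wildcard and b != wildcard and a != b:
                mismatches.append(j)
                if len(mismatches) > k:
                    break  # this alignment already exceeds the bound
        if len(mismatches) <= k:
            results.append((i, mismatches))
    return results
```

Like the algorithms described above, the sketch reports the mismatch locations for each accepted alignment, so it can be used to cross-check a faster implementation on small inputs.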
We consider a class of pattern matching problems where a normalizing polynomial transformation can be applied at every alignment of the pattern and text. Normalized pattern matching plays a key role in fields as diverse as image processing and musical information processing, where application specific transformations are often applied to the input. By considering a wide range of such transformations, we provide fast algorithms and the first lower bounds for both new and old problems. Given a pattern of length m and a longer text of length n, where both are assumed to contain integer values only, we first show O(n log m) time algorithms for pattern matching under linear transformations even when wildcard symbols can occur in the input. We then show how to extend the technique to polynomial transformations of arbitrary degree. Next we consider the problem of finding the minimum Hamming distance under polynomial transformation. We show that, for any ε > 0, there cannot exist an O(nm^(1-ε)) time algorithm for additive and linear transformations conditional on the hardness of the classic 3SUM problem. Finally, we consider a version of the Hamming distance problem under additive transformations with a bound k on the maximum distance that needs to be reported. We give a deterministic O(nk log k) time solution, which we then improve by careful use of randomization to O(n√k log k log n) time for sufficiently small k. Our randomized solution outputs the correct answer at every position with high probability.
We present an almost linear time method of inductive synthesis restoring simple regular expressions from one representative (good) example. In particular, we consider synthesis of expressions of star-height one, where we allow one union operation under each iteration, and synthesis of expressions without union operations from examples that may contain mistakes. In both cases we provide sufficient conditions defining precisely the class of target expressions and the notion of good examples under which the synthesis algorithm works correctly, and present the proof of correctness. In the case of expressions with unions the proof is based on novel results in the combinatorics of words. A generalized algorithm that can synthesize simple expressions containing unions from noisy examples is implemented as a computer program. Computer experiments show that the algorithm is quite practical and may have applications in genome informatics.
We present a new sublinear-size index structure for finding all occurrences of a given q-gram in a text. Such a q-gram index is needed in many approximate pattern matching algorithms. All earlier q-gram indexes require at least O(n) space, where n is the length of the text. The new Lempel-Ziv index needs only O(n/log n) space while being as fast as previous methods. The new method takes advantage of repetitions in the text found by Lempel-Ziv parsing.
Data masking is a common technique for sanitizing sensitive data maintained in database systems which is becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, we are asked to compute a smallest set K ⊆ {1, …, ℓ}, so that if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from D.
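The PMDM problem statement can be illustrated with an exhaustive search. The sketch below is an assumption-laden baseline, not the method of the paper: it tries candidate sets K in order of increasing size, so its running time is exponential in the string length. The function name is hypothetical, and positions are 0-based here rather than the 1-based indices of the problem statement.

```python
from itertools import combinations


def pmdm_brute_force(dictionary, q, z):
    """Exhaustive PMDM baseline (illustrative only; exponential time).

    Returns a smallest set of 0-based positions K such that q, with the
    positions in K replaced by wildcards, matches at least z strings of
    the dictionary. Returns None when no such set exists (z > d).
    """
    ell = len(q)
    for size in range(ell + 1):  # smallest sets first
        for K in combinations(range(ell), size):
            masked = set(K)
            matches = sum(
                all(i in masked or s[i] == q[i] for i in range(ell))
                for s in dictionary
            )
            if matches >= z:
                return masked
    return None
```

For example, with the dictionary `["abc", "abd", "xbc"]`, the query `"abc"` and z = 2, masking position 0 alone already yields two matches (`abc` and `xbc`).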
When searching for information on the Web, it is often necessary to use one of the available search engines. Because the number of results is quite large for most queries, we need some measure of relevance with respect to the query. One of the most important relevance factors is the proximity score, i.e., how close the keywords appear together in a given document. A basic proximity score is given by the size of the smallest range containing all the keywords in the query. We generalize the proximity score to include many practically important cases and present an O(n log k)-time algorithm for the generalized problem, where k is the number of keywords and n is the number of occurrences of the keywords in a document. © 2004 Elsevier B.V. All rights reserved.
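The basic proximity score mentioned above (the smallest range containing all keywords) can be computed with a standard min-heap sweep in O(n log k) time. The sketch below covers only this basic score, not the generalized problem of the paper. It assumes each keyword comes with a sorted list of its positions in the document; the function name and input layout are hypothetical.

```python
import heapq


def smallest_covering_range(position_lists):
    """Smallest range [lo, hi] containing one position from every list.

    position_lists: one sorted list of document positions per keyword
    (an assumed input format). Returns (lo, hi), or None if some keyword
    never occurs. Runs in O(n log k) for n positions and k keywords.
    """
    if not position_lists or any(not lst for lst in position_lists):
        return None
    # The heap holds the current position of each keyword's pointer.
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(position_lists)]
    heapq.heapify(heap)
    current_max = max(lst[0] for lst in position_lists)
    best = (heap[0][0], current_max)
    while True:
        lo, idx, pos = heapq.heappop(heap)
        if current_max - lo < best[1] - best[0]:
            best = (lo, current_max)
        if pos + 1 == len(position_lists[idx]):
            return best  # the leftmost keyword cannot advance further
        nxt = position_lists[idx][pos + 1]
        current_max = max(current_max, nxt)
        heapq.heappush(heap, (nxt, idx, pos + 1))
```

The sweep always advances the keyword whose current occurrence is leftmost, since no range starting at that occurrence can be improved by moving any other pointer.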
Given a string T of length n over an alphabet Σ ⊆ {1, 2, ..., n^O(1)} of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i] ... T[j] of T. We present an O(n)-space data structure that answers such queries in constant time and can be constructed in O(n log_σ n) time. © 2022 Elsevier B.V. All rights reserved.
We consider a version of pattern matching useful in processing large musical data: delta-matching, which consists in finding matches which are delta-approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a - b|. We also consider (delta, gamma)-matching, where gamma is a bound on the total sum of the differences. We first consider "occurrence heuristics" by adapting exact string matching algorithms to the two notions of approximate string matching. The resulting algorithms are efficient in practice. Then we consider "substring heuristics". We present delta-matching algorithms that are fast on average provided that the pattern is "non-flat" and the alphabet interval is large. The pattern is "flat" if its structure does not vary substantially. The algorithms, named delta-BM1, delta-BM2 and delta-BM3, can be thought of as members of the generalized Boyer-Moore family of algorithms. This is the first paper on the subject; previously only "occurrence heuristics" have been considered. Our substring heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use delta-versions of suffix tries and subword graphs. Surprisingly, in the context of delta-matching subword graphs appear to be superior compared with compact suffix trees.
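The two matching notions defined above can be stated directly as a naive O(nm) check. This sketch is not one of the delta-BM algorithms and uses no heuristics; it only spells out the definitions. The function name and the convention that `gamma=None` means plain delta-matching are assumptions.

```python
def delta_gamma_matches(text, pattern, delta, gamma=None):
    """Naive delta- and (delta, gamma)-matching (illustrative only).

    text and pattern are sequences of integers, e.g. pitch values.
    An alignment matches if every symbol difference |a - b| is at most
    delta and, when gamma is given, the sum of differences is at most
    gamma. Returns the list of matching start positions.
    """
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        diffs = [abs(text[i + j] - pattern[j]) for j in range(m)]
        within_delta = all(d <= delta for d in diffs)
        within_gamma = gamma is None or sum(diffs) <= gamma
        if within_delta and within_gamma:
            matches.append(i)
    return matches
```

For instance, the pattern `[61, 62, 65]` delta-matches the text `[60, 62, 64, 65, 67]` at position 0 for delta = 1, with a total difference of 2, so it is also a (1, 2)-match but not a (1, 1)-match.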