In this article, we propose a novel indexing and querying method for trajectories constrained in a road network. We aim to provide efficient algorithms for various types of spatiotemporal queries that involve routing in road networks, such as (1) finding moving objects that have traveled along a given path during a given time interval, (2) extracting all paths traveled after a given spatiotemporal context, and (3) enumerating all paths between two locations traveled during a certain time interval. Unlike the existing methods in spatial database research, we employ indexing techniques and algorithms from string processing. This idea is based on the fact that we can represent spatial paths as strings, because trajectories in a network are represented as sequences of road segment IDs. The proposed SNT-index (suffix-array-based network-constrained trajectory index) introduces two novel concepts to trajectory indexing. The first is the FM-index, which is a compact in-memory data structure for pattern matching. The second is an inverse suffix array, which allows the FM-index to be integrated with the temporal information stored in a forest of B+-trees. Thanks to these concepts, we can reduce the number of B+-tree accesses required by the query processing algorithms to a constant number, something that cannot be achieved with existing methods. Although an FM-index is essentially a static index, we also propose a practical method of appending new data to the index. Finally, experiments show that our method can process the target queries for more than 1 million trajectories in a few tens of milliseconds, which is significantly faster than what the baseline algorithms can achieve without string algorithms.
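The idea of treating trajectories as strings of road segment IDs can be illustrated with a minimal sketch. It uses a plain, naively built suffix array rather than the FM-index of the SNT-index, and it ignores the temporal information entirely. The function names `build_index` and `objects_on_path`, and the assumption that segment IDs are non-negative integers, are hypothetical choices made for this illustration only.

```python
from bisect import bisect_left, bisect_right

def build_index(trajectories):
    """Concatenate trajectories (lists of non-negative segment IDs) and
    build a suffix array naively. Illustrative only: not an FM-index."""
    text, owner = [], []
    for tid, segments in enumerate(trajectories):
        for seg in segments:
            text.append(seg)
            owner.append(tid)
        # A unique negative separator keeps matches from spanning
        # two different trajectories.
        text.append(-1 - tid)
        owner.append(tid)
    # Naive construction: sort suffix start positions by suffix content.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, owner, sa

def objects_on_path(index, path):
    """Return IDs of trajectories that traverse `path` contiguously."""
    text, owner, sa = index
    k = len(path)
    # Length-k prefixes of sorted suffixes are themselves sorted,
    # so binary search finds the block of matching suffixes.
    prefixes = [text[i:i + k] for i in sa]
    lo = bisect_left(prefixes, path)
    hi = bisect_right(prefixes, path)
    return sorted({owner[sa[r]] for r in range(lo, hi)})

index = build_index([[1, 2, 3, 4], [2, 3, 5], [7, 1, 2, 3]])
print(objects_on_path(index, [1, 2, 3]))
```

A real index would replace the sorted prefix list with FM-index backward search and would then filter the matching suffix range by time, which is the step the article handles with the inverse suffix array and B+-trees.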
Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value with one or more valid letters in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.
The divide-and-conquer framework, used extensively in classical algorithm design, recursively breaks a problem of size $n$ into smaller subproblems (say, $a$ copies of size $n/b$ each), along with some auxiliary work of cost $C^{\mathrm{aux}}(n)$, to give a recurrence relation $C(n) \le a\,C(n/b) + C^{\mathrm{aux}}(n)$ for the classical complexity $C(n)$. We describe a quantum divide-and-conquer framework that, in certain cases, yields an analogous recurrence relation $C_Q(n) \le \sqrt{a}\,C_Q(n/b) + O(C^{\mathrm{aux}}_Q(n))$ that characterizes the quantum query complexity. We apply this framework to obtain near-optimal quantum query complexities for various string problems, such as (i) recognizing the regular language $\Sigma^* 2 0^* 2 \Sigma^*$ over the alphabet $\Sigma = \{0, 1, 2\}$; (ii) decision versions of String Rotation and String Suffix; and natural parameterized versions of (iii) Longest Increasing Subsequence and (iv) Longest Common Subsequence.
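To see what the $\sqrt{a}$ factor buys, the two recurrences can be unrolled for one concrete choice of parameters. The values $a = b = 2$ and the auxiliary costs $O(n)$ and $O(\sqrt{n})$ below are assumptions picked for illustration; they are not taken from the article.

```latex
% Classical, assuming a = b = 2 and C^{aux}(n) = O(n):
C(n) \le 2\,C(n/2) + O(n)
\quad\Longrightarrow\quad
C(n) = O(n \log n).

% Quantum, assuming a = b = 2 and C^{aux}_Q(n) = O(\sqrt{n}):
C_Q(n) \le \sqrt{2}\,C_Q(n/2) + O(\sqrt{n}).

% Level i of the recursion contributes
(\sqrt{2})^{i} \cdot O\!\left(\sqrt{n/2^{i}}\right) = O(\sqrt{n}),
\qquad i = 0, 1, \dots, \log_2 n,

% so summing over the \log_2 n + 1 levels gives
C_Q(n) = O(\sqrt{n}\,\log n).
```

Under these assumed parameters every level of the quantum recursion costs the same $O(\sqrt{n})$, which mirrors the balanced case of the classical master theorem.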
Longest common substring (LCS), longest palindrome substring (LPS), and Ulam distance (UL) are three fundamental string problems that can be classically solved in near-linear time. In this work, we present sublinear-time quantum algorithms for these problems along with quantum lower bounds. Our results shed light on a very surprising fact: although the classic solutions for LCS and LPS are almost identical (via suffix trees), their quantum computational complexities are different. While we give an exact $\tilde{O}(\sqrt{n})$-time algorithm for LPS, we prove that LCS needs at least $\tilde{\Omega}(n^{2/3})$ time even for 0/1 strings.
A string is said to be closed if its length is one, or if it has a non-empty factor that occurs both as a prefix and as a suffix of the string, but does not occur elsewhere. The notion of closed words was introduced b...
ISBN: 9781450390682 (print)
The longest common subsequence (LCS) problem on a pair of strings is a classical problem in string algorithms. Its extension, the semi-local LCS problem, provides a more detailed comparison of the input strings, without any increase in asymptotic running time. Several semi-local LCS algorithms have been proposed previously; however, to the best of our knowledge, none have yet been implemented. In this paper, we explore a new hybrid approach to the semi-local LCS problem. We also propose a novel bit-parallel LCS algorithm. In the experimental part of the paper, we present an implementation of several existing and new parallel LCS algorithms and evaluate their performance.
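For readers unfamiliar with bit-parallel LCS, the following is a minimal sketch of the classical bit-vector technique for computing the LCS length, using Python's arbitrary-precision integers as the bit vector. It is not the novel algorithm proposed in the paper and does not address the semi-local problem; the function name `lcs_length_bitparallel` is a hypothetical choice for this illustration.

```python
def lcs_length_bitparallel(a, b):
    """LCS length via the classical bit-vector recurrence
    V' = (V + (V & M[c])) | (V & ~M[c]).
    Illustrative only: not the paper's algorithm."""
    m = len(a)
    if m == 0 or not b:
        return 0
    full = (1 << m) - 1  # mask keeping only the m low bits
    # match[c] has bit i set exactly when a[i] == c.
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = full  # all ones: no matched positions yet
    for ch in b:
        mc = match.get(ch, 0)
        u = v & mc
        # The addition propagates carries, processing a whole DP row at once.
        v = ((v + u) | (v & ~mc)) & full
    # Each zero bit of v corresponds to one unit of LCS length.
    return m - bin(v).count("1")

print(lcs_length_bitparallel("AGGTAB", "GXTXAYB"))
```

With fixed-width machine words in place of Python integers, the same recurrence processes one word of DP cells per operation, which is where the practical speed-up of bit-parallel methods comes from.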
The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given an integer $\ell$, our mechanism selects the lexicographically smallest rotation in every length-$\ell$ fragment. We show that, like minimizers samples, bd-anchors samples are approximately uniform, locally consistent, and computable in linear time. Furthermore, our experiments demonstrate that the bd-anchors sample sizes decrease proportionally to $\ell$, and that these sizes are competitive with or smaller than the minimizers sample sizes. We theoretically justify these results by analyzing the expected size of bd-anchors samples. We also prove that computing a total order on the input alphabet which minimizes the bd-anchors sample size is NP-hard. We next highlight the benefits of bd-anchors in two important applications: text indexing and top-K similarity search. For the first application, we develop an index for performing on-line pattern searches in near-optimal time, and show experimentally that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index; we also show that it is substantially faster than two classic text indexes. For the second application, we develop a heuristic for top-K similarity search under edit distance, and show experimentally that it is generally as accurate as the state-of-the-art tool for the same purpose but more than one order of magnitude faster.
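The selection rule described above can be sketched directly from its definition. The version below is a naive quadratic-time illustration, not the linear-time computation the abstract refers to; the function name `bd_anchors` and the choice of breaking ties by the leftmost rotation are assumptions made for this sketch.

```python
def bd_anchors(s, ell):
    """Return the sorted set of bd-anchor positions of s for window length ell.
    Naive illustration: compares all rotations of every fragment explicitly."""
    anchors = set()
    for i in range(len(s) - ell + 1):
        frag = s[i:i + ell]
        # Start offset of the lexicographically smallest rotation of the
        # fragment; min() keeps the first (leftmost) offset on ties.
        best = min(range(ell), key=lambda j: frag[j:] + frag[:j])
        anchors.add(i + best)
    return sorted(anchors)

print(bd_anchors("banana", 3))
```

Because consecutive fragments often share the same smallest rotation start, many windows select the same position, which is why the sample is much smaller than the number of fragments.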
In this paper, we introduce new types of approximate palindromes called single-arm-gapped palindromes (SAGPs for short). A SAGP contains a gap in either its left or right arm, and is in the form of either $w g u c u^R w^R$ or $w u c u^R g w^R$, where $w$ and $u$ are non-empty strings, $w^R$ and $u^R$ are respectively the reversed strings of $w$ and $u$, $g$ is a string called a gap, and $c$ is either a single character or the empty string. Here we call $wu$ and $u^R w^R$ the arm of the SAGP, and $|wu|$ the length of the arm. We classify SAGPs into two groups: those which have $u c u^R$ as a maximal palindrome (type-1), and the others (type-2). We propose several algorithms to compute type-1 SAGPs with longest arms occurring in a given string, based on suffix arrays. Then, we propose a linear-time algorithm to compute all type-1 SAGPs with longest arms, based on suffix trees. Also, we show how to compute type-2 SAGPs with longest arms in linear time. We also perform some preliminary experiments to show the practical performance of the proposed methods. © 2019 Elsevier B.V. All rights reserved.
We consider the problem of finding the repetitive structures of a given string $x$. The period $u$ of the string $x$ grasps the repetitiveness of $x$, since $x$ is a prefix of a string constructed by concatenations of $u$. We generalize the concept of repetitiveness as follows: a string $w$ covers a string $x$ if there is a superstring of $x$ which is constructed by concatenations and superpositions of $w$. A substring $w$ of $x$ is called a seed of $x$ if $w$ covers $x$. We present an $O(n \log n)$-time algorithm for finding all the seeds of a given string of length $n$.
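The definition of a seed can be checked directly with a naive test. The sketch below runs in $O(|w| \cdot |x|)$ time per candidate and is not the $O(n \log n)$ algorithm of the paper; the function name `is_seed` is a hypothetical choice, and the check does not itself require $w$ to be a substring of $x$.

```python
def is_seed(w, x):
    """Return True if some superstring of x is built from concatenations
    and superpositions of w. Naive illustration of the definition."""
    m, n = len(w), len(x)
    if m == 0:
        return False
    covered_until = 0  # x[0:covered_until] is covered so far
    # Try every placement of w, including ones overhanging either end of x.
    for p in range(-(m - 1), n):
        lo, hi = max(p, 0), min(p + m, n)
        # The placement is valid if w agrees with x wherever they overlap.
        if all(w[i - p] == x[i] for i in range(lo, hi)):
            if lo > covered_until:
                return False  # position covered_until can never be covered
            covered_until = max(covered_until, hi)
    return covered_until == n

print(is_seed("aab", "baabaa"))
```

Here "aab" is a seed of "baabaa" because the superstring "aabaabaab" is a concatenation of copies of "aab", whereas "ab" is not a seed of "abb" since the final "b" cannot be reached by any placement.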
An elastic-degenerate (ED) string is a sequence of $n$ finite sets of strings of total length $N$, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length $m$ in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{O}(nm^{\omega-1}) + O(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication exponent and the $\tilde{O}(\cdot)$ notation suppresses poly-log factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be solved in $O(k^2 mG + kN)$ time under edit distance, or $O(kmG + kN)$ time under Hamming distance, where $G$ denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, $G$ is only bounded by $N$, and so even for $k = 1$, the existing algorithms run in $\Omega(mN)$ time in the worst case. In this paper we make progress in this direction. We show that 1-EDSM can be solved in $O((nm^2 + N)\log m)$ or $O(nm^3 + N)$ time under edit distance. For the decision version of the problem, we present a faster $O(nm^2\sqrt{\log m} + N \log\log m)$-time algorithm. We also show that 1-EDSM can be solved in $O(nm^2 + N \log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the $k$-errata trees for indexing with errors [Cole et al., STOC 2004]. This is an extended version of a paper presented at LATIN 2022.