In the present paper, we introduce and study the problem of computing, for any given finite set of words, a shuffle word with a minimum so-called scope coincidence degree. The scope coincidence degree is the maximum number of different symbols that parenthesize any position in the shuffle word. This problem is motivated by an application of a new automaton model and can be regarded as the problem of scheduling shared memory accesses of some parallel processes in a way that minimizes the number of memory cells required. We investigate the complexity of this problem and show that it can be solved in polynomial time.
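One plausible reading of the definition above can be illustrated with a short sketch (our own, not the paper's algorithm): a symbol "parenthesizes" a position if that position lies strictly between the symbol's first and last occurrence, and the scope coincidence degree is the maximum such count over all positions.

```python
def scope_coincidence_degree(word):
    """Max number of symbols whose scope strictly surrounds some position.

    The scope of a symbol is the interval between its first and last
    occurrence; this interpretation of "parenthesize" is an assumption.
    """
    if not word:
        return 0
    first, last = {}, {}
    for i, a in enumerate(word):
        first.setdefault(a, i)
        last[a] = i
    return max(sum(1 for a in first if first[a] < i < last[a])
               for i in range(len(word)))
```

For the two shuffle words of "ab" with "ab", the non-interleaved one is better: `scope_coincidence_degree("aabb")` is 0 while `scope_coincidence_degree("abab")` is 1, matching the intuition that interleaving processes forces more memory cells to stay live simultaneously.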
Let T_1 and T_2 be two rooted trees with an equal number of leaves. The leaves are labeled, and the labeling of the leaves in T_2 is a permutation of those in T_1. Each node is associated with a weight, such that the weight of a node u, denoted by W(u), is more than the weight of its parent. A node x in T_1 and a node y in T_2 are induced if their subtrees have at least one common leaf label. A heaviest induced ancestor query HIA(u_1, u_2), with input nodes u_1 in T_1 and u_2 in T_2, asks to output the pair (u_1*, u_2*) of induced nodes with the highest combined weight W(u_1*) + W(u_2*) such that u_1* is an ancestor of u_1 and u_2* is an ancestor of u_2. This is a useful primitive in several text processing applications. Gagie et al. (Proceedings of the 25th Canadian Conference on Computational Geometry, CCCG 2013, Waterloo, Ontario, Canada, 2013) introduced this problem and proposed three data structures with the following space-time trade-offs: (i) O(n log^2 n) space and O(log n log log n) query time, (ii) O(n log n) space and O(log^2 n) query time, and (iii) O(n) space and O(log^{3+epsilon} n) query time. Here n is the number of nodes in both trees combined and epsilon > 0 is an arbitrarily small constant. We present two new data structures with better space-time trade-offs: (i) O(n log n) space and O(log n log log n) query time, and (ii) O(n) space and O(log^2 n / log log n) query time. Additionally, we present new applications of these results.
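The query semantics can be pinned down with a brute-force reference implementation (ours, not one of the data structures above; the names `Tree` and `hia` are illustrative). Induced pairs are detected via subtree leaf-label sets, and the query simply tries every ancestor pair:

```python
class Tree:
    def __init__(self, parent, weight, leaf_label):
        # parent[v]: parent of node v (None for the root);
        # leaf_label[v]: label for leaves, None for internal nodes.
        self.parent, self.weight = parent, weight
        self.labels = [set() for _ in parent]
        for v, lab in enumerate(leaf_label):
            if lab is not None:
                u = v
                while u is not None:          # propagate label to ancestors
                    self.labels[u].add(lab)
                    u = parent[u]

    def ancestors(self, v):                   # v together with all ancestors
        while v is not None:
            yield v
            v = self.parent[v]

def hia(t1, u1, t2, u2):
    """Heaviest induced ancestor pair of (u1, u2), by brute force."""
    best = None
    for a in t1.ancestors(u1):
        for b in t2.ancestors(u2):
            if t1.labels[a] & t2.labels[b]:   # induced: share a leaf label
                w = t1.weight[a] + t2.weight[b]
                if best is None or w > best[0]:
                    best = (w, a, b)
    return best   # never None: the two roots are always induced
```

The root pair contains every leaf label, so an answer always exists; the sophisticated structures in the abstract replace this quadratic ancestor scan with logarithmic-time machinery.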
We consider conditional random fields (CRFs) with pattern-based potentials defined on a chain. In this model the energy of a string (labeling) is the sum of terms over intervals [i, j], where each term is non-zero only if the substring equals a prespecified pattern w. Such CRFs can be naturally applied to many sequence tagging problems. We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) marginals, and (iii) the MAP labeling. Their complexities depend on n, on the combined length L of the input patterns, on the maximum length of a pattern, and on the input alphabet D. This improves on the previous algorithms of Ye et al. (NIPS, 2009), whose complexities additionally depend on the number of input patterns. In addition, we give an efficient algorithm for sampling, and revisit the case of MAP with non-positive weights.
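The model itself is easy to make concrete by exhaustive enumeration over a tiny chain (a reference sketch, not the paper's efficient algorithms; the energy-minimization and exp(-energy) Gibbs-weight conventions are our assumptions):

```python
import itertools
import math

def energy(labeling, patterns):
    # patterns: dict mapping pattern string -> weight; a pattern
    # contributes its weight once per interval [i, j] it matches.
    s = "".join(labeling)
    return sum(w * sum(1 for i in range(len(s) - len(p) + 1)
                       if s[i:i + len(p)] == p)
               for p, w in patterns.items())

def brute_force_inference(n, alphabet, patterns):
    """Partition function and MAP labeling by exhaustive enumeration."""
    Z, best = 0.0, None
    for lab in itertools.product(alphabet, repeat=n):
        e = energy(lab, patterns)
        Z += math.exp(-e)                 # Gibbs weight of this labeling
        if best is None or e < best[0]:
            best = (e, "".join(lab))
    return Z, best
```

This costs |D|^n and exists only to define the quantities; the point of the paper is computing the same three quantities in time polynomial in n and L.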
We study a simple algorithm generating square-free words from a random source. The source produces uniformly distributed random letters from a k-ary alphabet, and the algorithm outputs a (k+1)-ary square-free word. We are interested in the "conversion ratio" between the lengths of the input random word and the output square-free word. For any k >= 3 we prove that the expected value of this ratio is a constant and calculate it up to an O(1/k^5) term. For the extremal case of ternary square-free words, we conjecture, supported by computer experiments, that this ratio has a constant expectation as well, and estimate its actual value. (C) 2015 Elsevier B.V. All rights reserved.
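The conversion-ratio notion can be demonstrated with a simple erase-on-square generator. This is not necessarily the paper's algorithm (which converts a k-ary source into a (k+1)-ary word); it is the classic backtracking scheme over a single alphabet: append a random letter, and whenever the word ends with a square xx, erase the trailing copy of x. Since the word was square-free before the append, any square must be a suffix, and erasing it leaves a prefix of the old word, which is again square-free.

```python
import random

def shortest_square_suffix(w):
    # Length l such that w ends with a square xx with |x| = l (0 if none).
    n = len(w)
    for l in range(1, n // 2 + 1):
        if w[n - l:] == w[n - 2 * l:n - l]:
            return l
    return 0

def generate_square_free(k, target_len, seed=0):
    """Square-free k-ary word of the given length, plus the conversion
    ratio: random letters consumed per output letter."""
    rng = random.Random(seed)
    alphabet = [chr(ord('a') + i) for i in range(k)]
    w, consumed = "", 0
    while len(w) < target_len:
        w += rng.choice(alphabet)
        consumed += 1
        l = shortest_square_suffix(w)
        if l:
            w = w[:-l]   # erase the repeated half; the rest stays square-free
    return w, consumed / target_len
```

The returned ratio is always at least 1; the abstract's question is whether (and to what constant) its expectation converges as the output length grows.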
Inspired by scaffold filling, a recent approach for genome reconstruction from incomplete data, we consider a variant of the well-known longest common subsequence problem for the comparison of two sequences. The new problem, called Longest Filled Common Subsequence (LFCS), aims to compare a complete sequence with an incomplete one, i.e., one with some missing elements. Given a complete sequence A, an incomplete sequence B, and a multiset M of symbols missing in B, LFCS asks for a sequence B* obtained by inserting the symbols of M into B so that B* induces a common subsequence with A of maximum length. We investigate the computational and approximation complexity of the problem: we show that it is NP-hard and APX-hard when A contains at most two occurrences of each symbol, and we give a polynomial-time algorithm when the input sequences are over a constant-size alphabet. We also give a 3/5-approximation algorithm for the problem. Finally, we present a fixed-parameter algorithm, parameterized by the number of symbols inserted in B that "match" symbols of A. (C) 2019 Elsevier B.V. All rights reserved.
We study the Submass Finding Problem: given a string s over a weighted alphabet, i.e., an alphabet Sigma with a weight function mu : Sigma -> N, we refer to a mass M in N as a submass of s if s has a substring whose weights sum up to M. Now, for a set of input masses {M_1, ..., M_k}, we want to find those M_i which are submasses of s, and return one or all occurrences of substrings with mass M_i. We present efficient algorithms for both the decision and the search problem. Furthermore, our approach allows us to compute efficiently the number of different submasses of s. The main idea of our algorithms is to define appropriate polynomials such that we can determine the solution for the Submass Finding Problem from the coefficients of the product of these polynomials. We obtain very efficient running times by using the Fast Fourier Transform to compute this product. Our main algorithm for the decision problem runs in time O(mu(s) log mu(s)), where mu(s) is the total mass of string s. Employing methods for compressing sparse polynomials, this runtime can be viewed as O(sigma(s) log^2 sigma(s)), where sigma(s) denotes the number of different submasses of s. In this case, the runtime is independent of the size of the individual masses of characters. (c) 2006 Elsevier B.V. All rights reserved.
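The polynomial idea is simple to sketch: every submass is a difference of two prefix masses, and with P(x) = sum_i x^{p_i} and Q(x) = sum_j x^{mu(s) - p_j}, the coefficient of x^{mu(s)+M} in P*Q is non-zero exactly when some p_i - p_j = M. The sketch below uses a naive quadratic convolution for clarity; the paper gets O(mu(s) log mu(s)) by computing the same product with the FFT.

```python
def find_submasses(s, mu, queries):
    """Decide which query masses are submasses of s via the coefficients
    of a product of two polynomials (naive convolution, FFT in the paper)."""
    prefix = [0]                          # prefix masses p_0 < ... < p_n
    for c in s:
        prefix.append(prefix[-1] + mu[c])
    total = prefix[-1]                    # mu(s), the total mass
    P = [0] * (total + 1)                 # P(x) = sum_i x^{p_i}
    Q = [0] * (total + 1)                 # Q(x) = sum_j x^{total - p_j}
    for p in prefix:
        P[p] += 1
        Q[total - p] += 1
    R = [0] * (2 * total + 1)             # R = P * Q
    for i, a in enumerate(P):
        if a:
            for j, b in enumerate(Q):
                if b:
                    R[i + j] += a * b
    # x^{p_i} * x^{total - p_j} = x^{total + (p_i - p_j)}, so M > 0 is a
    # submass iff the coefficient of x^{total + M} is non-zero.
    return {M: 0 < M <= total and R[total + M] > 0 for M in queries}
```

Counting the positive exponents M with R[total + M] > 0 likewise yields the number of different submasses mentioned in the abstract.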
Authors: Lefebvre, A.; Lecroq, T.
Univ Rouen, Fac Sci & Tech, CNRS UMR 6037, ABISS, F-76821 Mont St Aignan, France; Univ Rouen, Fac Sci & Tech, ABISS, LIFAR, F-76821 Mont St Aignan, France
We present in this article a linear-time and linear-space data compression method. The method, based on a factor oracle and the computation of the lengths of repeated suffixes, is easy to implement, fast, and gives good compression ratios. (C) 2001 Elsevier Science B.V. All rights reserved.
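The factor oracle underlying the method has a well-known on-line construction (Allauzen, Crochemore, Raffinot); the sketch below builds only the automaton, not the compression scheme built on top of it:

```python
def factor_oracle(word):
    """On-line factor-oracle construction: an automaton with len(word)+1
    states that accepts at least every factor (substring) of word."""
    n = len(word)
    delta = [dict() for _ in range(n + 1)]   # transition function
    S = [-1] * (n + 1)                       # supply (suffix-link) function
    for i, a in enumerate(word):
        delta[i][a] = i + 1                  # internal transition
        k = S[i]
        while k > -1 and a not in delta[k]:
            delta[k][a] = i + 1              # external transition
            k = S[k]
        S[i + 1] = 0 if k == -1 else delta[k][a]
    return delta, S

def oracle_accepts(delta, p):
    q = 0
    for c in p:
        if c not in delta[q]:
            return False
        q = delta[q][c]
    return True
```

Note the automaton is an "oracle": it accepts every factor but may also accept a few non-factors (e.g. "abab" for the word "abbab"), which is the price of its linear size and construction time.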
Suffix trees are inherently asymmetric: prefix extensions only cause a few updates, while suffix extensions affect all suffixes, causing a wave of updates. In his elegant linear-time on-line suffix tree algorithm, Ukkonen relaxed the prevailing suffix tree representation and introduced two changes to avoid repeated structural updates and circumvent the inherent complexity of suffix extensions: (1) open-ended edges that enjoy gratuitous leaf updates, and (2) the omission of implicit nodes. In this paper we study the implicit nodes as the suffix tree evolves. We partition the suffix tree's edges into collections of similar edges called bands, where implicit nodes exhibit identical behavior, and generalize the notion of open-ended edges to allow implicit nodes to "float" within bands, only requiring updates when moving from one band to the next, adding up to only O(n) updates. We also show that internal implicit nodes are separated from each other by explicit suffix tree nodes and that all external implicit nodes are related to the same periodicity. These new properties may be used to keep track of the waves of implicit node updates and to build the suffix tree on-line in amortized linear time, providing access to all the implicit nodes in worst-case constant time. (C) 2012 Elsevier B.V. All rights reserved.
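The two representational ideas, open-ended edges and implicit (unexpanded) nodes, can be seen in a deliberately naive O(n^2) builder (ours, not Ukkonen's amortized-linear algorithm and not this paper's band structure): leaf edges store an OPEN end that always means "the current end of the text", and suffixes that end in the middle of an edge are left as implicit nodes rather than materialized.

```python
OPEN = object()   # sentinel: edge end == "current end of the text"

class Node:
    def __init__(self):
        self.children = {}   # first character -> (start, end, child node)

class SuffixTreeSketch:
    """Naive O(n^2) suffix tree illustrating open-ended leaf edges and
    implicit nodes; not an on-line linear-time construction."""

    def __init__(self, text):
        self.text = text
        self.root = Node()
        for i in range(len(text)):
            self._insert(i)

    def _end(self, end):
        # Open edges grow for free as the text grows: Ukkonen's
        # "gratuitous leaf updates" of change (1).
        return len(self.text) if end is OPEN else end

    def _insert(self, i):
        node, pos, t = self.root, i, self.text
        while pos < len(t):
            c = t[pos]
            if c not in node.children:
                node.children[c] = (pos, OPEN, Node())   # open leaf edge
                return
            start, end, child = node.children[c]
            e, j = self._end(end), start
            while j < e and pos < len(t) and t[j] == t[pos]:
                j += 1
                pos += 1
            if j == e:
                node = child    # consumed the whole edge, descend
            elif pos == len(t):
                return          # suffix ends inside an edge: implicit node
            else:
                mid = Node()    # mismatch inside the edge: split it
                mid.children[t[j]] = (j, end, child)
                node.children[c] = (start, j, mid)
                node = mid

    def contains(self, p):
        node, pos, t = self.root, 0, self.text
        while pos < len(p):
            c = p[pos]
            if c not in node.children:
                return False
            start, end, child = node.children[c]
            j, e = start, self._end(end)
            while j < e and pos < len(p):
                if t[j] != p[pos]:
                    return False
                j += 1
                pos += 1
            node = child
        return True
```

For "abab", the suffixes "ab" and "b" end as implicit nodes inside the two open leaf edges; tracking how exactly such nodes move as the text grows is the subject of the paper.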
Range minimum query is an important building block of many compressed data structures and string matching algorithms. Although this problem is essentially solved in theory, with sophisticated data structures allowing for constant-time queries, practical performance and construction time also matter. Additionally, there are offline scenarios in which the number of queries, q, is rather small and given beforehand, which encourages the use of a simpler approach. In this work, we present a simple data structure, with very fast construction, which handles queries in constant time on average. This algorithm, however, requires access to the input data during queries (which is not the case for sophisticated range minimum query solutions). We subsequently refine our technique, combining it with one of the existing succinct solutions with O(1) worst-case query time and no access to the input array. The resulting hybrid is still a memory-frugal data structure, usually spending up to about 3n bits, and provides competitive query times, especially for wide ranges. We also show how to make our baseline data structure more compact. Experimental results demonstrate that the proposed block-based sparse table (BbST) variants are competitive with existing solutions, also in the offline scenario.
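A block-based sparse table of the kind named above can be sketched in a few lines (our simplified reading, not the authors' tuned implementation): block minima go into a standard sparse table, and a query combines the sparse-table answer over fully covered blocks with linear scans of the at most two boundary blocks, which is why queries need access to the input array.

```python
class BbST:
    """Block-based sparse table for range-minimum queries (RMQ)."""

    def __init__(self, a, block=32):
        self.a, self.block = a, block
        mins = [min(a[i:i + block]) for i in range(0, len(a), block)]
        m = len(mins)
        self.table = [mins]          # table[k][i] = min over 2^k blocks
        k = 1
        while (1 << k) <= m:
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + (1 << (k - 1))])
                               for i in range(m - (1 << k) + 1)])
            k += 1

    def _block_min(self, lo, hi):    # min over whole blocks lo..hi
        k = (hi - lo + 1).bit_length() - 1
        return min(self.table[k][lo], self.table[k][hi - (1 << k) + 1])

    def query(self, l, r):           # min of a[l..r], both ends inclusive
        bl, br = l // self.block, r // self.block
        if bl == br:                 # range inside one block: plain scan
            return min(self.a[l:r + 1])
        res = min(min(self.a[l:(bl + 1) * self.block]),   # left boundary
                  min(self.a[br * self.block:r + 1]))     # right boundary
        if bl + 1 <= br - 1:
            res = min(res, self._block_min(bl + 1, br - 1))
        return res
```

Construction touches each element once plus an O((n/b) log(n/b)) sparse table, and wide ranges are answered almost entirely from the table, matching the abstract's observation that the structure shines on wide queries.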
The suffix array is arguably one of the most important data structures in sequence analysis, and consequently there is a multitude of suffix sorting algorithms. However, to date the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort to improve GSACA's non-competitive real-world performance. There is a super-linear algorithm, DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this article is twofold: we analyse the sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm, and we show that it can be used very elegantly to compute both the original extended Burrows-Wheeler transform (eBWT) and a bijective version of the Burrows-Wheeler transform (BBWT) in linear time. We call the algorithm "generic", since it can be used to compute the regular suffix array and the variants used for the BBWT and eBWT. Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our BBWT algorithm is faster than or competitive with all other tested BBWT construction implementations on large or repetitive data, and our eBWT algorithm is faster than all other programs on data that is not extremely repetitive.
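The objects being computed can be stated as naive reference implementations (nowhere near GSACA's linear time, but they fix the definitions): the suffix array sorts suffix start positions, and the BBWT takes the last characters of all rotations of all Lyndon factors, sorted in omega-order, where u^inf < v^inf can be tested by comparing u+v with v+u.

```python
from functools import cmp_to_key

def suffix_array(s):
    # Naive reference construction: sort suffix start positions directly.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lyndon_factors(s):
    # Duval's algorithm: the unique non-increasing Lyndon factorization.
    i, n, out = 0, len(s), []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            out.append(s[i:i + j - k])
            i += j - k
    return out

def bbwt(s):
    """Bijective BWT: last characters of all rotations of all Lyndon
    factors, sorted in omega-order (u^inf < v^inf iff u+v < v+u)."""
    rots = [f[i:] + f[:i] for f in lyndon_factors(s) for i in range(len(f))]
    rots.sort(key=cmp_to_key(lambda u, v: (u + v > v + u) - (u + v < v + u)))
    return "".join(r[-1] for r in rots)
```

Unlike the classic BWT, the BBWT needs no end-of-string sentinel and is invertible as-is, which is the "bijective" in the name; the article's contribution is computing these objects in linear time with one generic algorithm.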