ISBN:
(Print) 9783031206429; 9783031206436
The suffix array is arguably one of the most important data structures in sequence analysis and consequently there is a multitude of suffix sorting algorithms. However, to this date the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort in improving the algorithm's sub-par real-world performance. There is a super-linear algorithm DSH which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. This paper is concerned with analysing the sorting principle used in GSACA and DSH and exploiting its properties in order to give an optimised linear-time algorithm. Our algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH.
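The object every SACA computes is the permutation listing all suffix start positions in lexicographic order. A deliberately naive sketch using Python's built-in sort (O(n^2 log n) in the worst case, nothing like GSACA's linear-time machinery) makes the output concrete:

```python
def suffix_array(s: str) -> list[int]:
    # Naive construction: sort all suffix start positions by the
    # suffixes they denote. Linear-time algorithms such as GSACA
    # produce the same permutation far faster.
    return sorted(range(len(s)), key=lambda i: s[i:])

# Suffixes of "banana" in lexicographic order start at positions
# 5 ("a"), 3 ("ana"), 1 ("anana"), 0 ("banana"), 4 ("na"), 2 ("nana").
print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```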
ISBN:
(Print) 9783642130724
It is shown how to compute the lexicographically maximum suffix of a string of n >= 2 characters over a totally ordered alphabet using at most (4/3)n - 5/3 three-way character comparisons. The best previous bound, which has stood unchallenged for more than 25 years, is (3/2)n - O(1) comparisons. We also prove an interesting property of an algorithm for computing the maximum suffix both with respect to a total order < and with respect to its inverse order >.
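The problem being optimised can be illustrated with the classic linear-time maximum-suffix scan (a relative of Duval's algorithm). This sketch implements only the standard algorithm, not the paper's comparison-optimal variant:

```python
def max_suffix(s: str) -> int:
    # Start index of the lexicographically maximum suffix of s.
    # i: best candidate so far, j: current challenger,
    # k: length of the common prefix matched between them.
    n = len(s)
    i, j, k = 0, 1, 0
    while j + k < n:
        if s[i + k] == s[j + k]:
            k += 1
        elif s[i + k] > s[j + k]:
            j = j + k + 1   # challenger loses; skip past the mismatch
            k = 0
        else:
            i = max(i + k + 1, j)  # candidate loses; challenger takes over
            j = i + 1
            k = 0
    return i

print(max_suffix("banana"))  # 2 -- the suffix "nana"
```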
ISBN:
(Print) 9783939897651
The sparse suffix sorting problem is to sort b = o(n) arbitrary suffixes of a string of length n using o(n) words of space in addition to the string. We present an O(n) time Monte Carlo algorithm using O(b log b) space and an O(n log b) time Las Vegas algorithm using O(b) space. This is a significant improvement over the best prior solutions by Bille et al. (ICALP 2013): a Monte Carlo algorithm running in O(n log b) time and O(b^(1+epsilon)) space or O(n log^2 b) time and O(b) space, and a Las Vegas algorithm running in O(n log^2 b + b^2 log b) time and O(b) space. All the above results are obtained with high probability, not just in expectation.
ISBN:
(Print) 9783642212536
In the present paper, we introduce and study the problem of computing, for any given finite set of words, a shuffle word with a minimum so-called scope coincidence degree. The scope coincidence degree is the maximum number of different symbols that parenthesise any position in the shuffle word. This problem is motivated by an application of a new automaton model and can be regarded as the problem of scheduling shared memory accesses of some parallel processes in a way that minimises the number of memory cells required. We investigate the complexity of this problem and show that it can be solved in polynomial time.
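As an illustration of the quantity being minimised, here is a small sketch that computes the scope coincidence degree of a given word, assuming a symbol parenthesises position i exactly when it occurs both strictly before and strictly after i (the paper's formal definition may differ in boundary details):

```python
def scope_coincidence_degree(w: str) -> int:
    # The scope of a symbol runs from its first to its last occurrence.
    # Degree at position i = number of symbols occurring both strictly
    # before and strictly after i; the word's degree is the maximum.
    first, last = {}, {}
    for i, c in enumerate(w):
        first.setdefault(c, i)
        last[c] = i
    return max(
        (sum(1 for c in first if first[c] < i < last[c])
         for i in range(len(w))),
        default=0,
    )

# "abcabc" interleaves the scopes of a, b, and c; at position 3,
# both b and c parenthesise the position.
print(scope_coincidence_degree("abcabc"))  # 2
```

Minimising this maximum over all interleavings of the input words corresponds, in the scheduling view, to minimising the number of memory cells live at any one time.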
ISBN:
(Print) 9788001057872
A new formal framework for stringology is proposed, which consists of a three-sorted logical theory S designed to capture the combinatorial reasoning about finite words. A witnessing theorem is proven which demonstrates how to extract algorithms for constructing strings from their proofs of existence. Various other applications of the theory are shown. The long term goal of this line of research is to introduce the tools of Proof Complexity to the analysis of strings.
ISBN:
(Print) 9788001044032
The Parikh vector of a string s over a finite ordered alphabet Sigma = {a_1, ..., a_sigma} is defined as the vector of multiplicities of the characters, i.e. p(s) = (p_1, ..., p_sigma), where p_i = |{j | s_j = a_i}|. A Parikh vector q occurs in s if s has a substring t with p(t) = q. The problem of searching for a query q in a text s of length n can be solved simply and optimally with a sliding window approach in O(n) time. We present two new algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm finds all occurrences of a given Parikh vector in a text (over a fixed alphabet of size sigma >= 2) and appears to have a sub-linear expected time complexity. The second algorithm only decides whether a given Parikh vector appears in a binary text; it iteratively constructs a linear-size data structure which then allows answering queries in constant time, for many queries even during the construction phase.
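The optimal sliding-window baseline mentioned above is easy to make concrete: every occurrence of q has fixed length m = sum of its multiplicities, so one left-to-right pass with an incrementally updated window count suffices. A sketch of that baseline (not of the paper's two new algorithms):

```python
from collections import Counter

def parikh_occurs(q, s: str):
    # q: Parikh vector as a character -> multiplicity map (no zeros).
    # Returns the start index of the first occurrence in s, or None.
    m = sum(q.values())
    if m == 0 or m > len(s):
        return None
    window = Counter(s[:m])
    if window == q:
        return 0
    for i in range(m, len(s)):
        window[s[i]] += 1          # character entering the window
        window[s[i - m]] -= 1      # character leaving the window
        if window[s[i - m]] == 0:
            del window[s[i - m]]   # keep zero counts out of the map
        if window == q:
            return i - m + 1
    return None

# Two a's and one b first co-occur in "abcaba" at position 3 ("aba").
print(parikh_occurs(Counter("aab"), "abcaba"))  # 3
```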
ISBN:
(Print) 0769512844
In this paper, we examine a number of methods for text processing, principally coming from computational biology, and examine in which manner they can apply to musical analysis. Then, we propose a number of modifications that can be made to these methods to allow for a better application to musical analysis. To this end, we first examine the practices of musical analysis. We focus on the field of extraction of motives. A common approach to this problem is to consider repetitions: whenever some part of the musical text is repeated, it can be considered as a motive. Detecting motives can either be based on a perfect match, or on inexact matching. To this end, the concept of similarity will be introduced and analysed, and its meaning will be defined in the scope of musical analysis. We also deal with the problem of the representation (or encoding) of the musical text. The role of encoding, and its consequences on the application of algorithms will be investigated.
ISBN:
(Print) 9783959771801
This work establishes several strong hardness results on the problem of finding an ordering on a string's alphabet that either minimizes or maximizes the number of factors in that string's Lyndon factorization. In doing so, we demonstrate that these ordering problems are sufficiently complex to model a wide variety of ordering constraint satisfaction problems (OCSPs). Based on this, we prove that (i) the decision versions of both the minimization and maximization problems are NP-complete, (ii) for both the minimization and maximization problems there does not exist a constant-factor approximation algorithm running in polynomial time under the Unique Games Conjecture, and (iii) there does not exist an algorithm to solve the minimization problem in time poly(|T|) * 2^(o(sigma log sigma)) for a string T over an alphabet of size sigma under the Exponential Time Hypothesis (essentially, the brute-force approach of trying every alphabet order is hard to improve significantly).
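For context, the quantity whose optimisation is shown hard is the number of factors produced by Duval's linear-time Lyndon factorization under a chosen alphabet order. A sketch parameterised by that order:

```python
def lyndon_factors(s: str, order: str) -> list[str]:
    # Duval's algorithm, with the alphabet order given explicitly as a
    # string listing characters from smallest to largest. The hardness
    # results concern choosing `order` to minimise or maximise
    # len(lyndon_factors(s, order)).
    rank = {c: i for i, c in enumerate(order)}
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and rank[s[k]] <= rank[s[j]]:
            k = i if rank[s[k]] < rank[s[j]] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

Under the usual order "abn" the factorization of "banana" has four factors (b, an, an, a), while under the reversed order "nba" it has only three (ba, na, na): the factor count genuinely depends on the alphabet order.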
ISBN:
(Digital) 9783031439803
ISBN:
(Print) 9783031439797; 9783031439803
The Lempel-Ziv (LZ) 77 factorization of a string is a widely-used algorithmic tool that plays a central role in compression and indexing. For a length-n string over a linearly-sortable alphabet, e.g., Sigma = {1, ..., sigma} with sigma = n^O(1), it can be computed in O(n) time. It is unknown whether this time can be achieved for the rightmost LZ parsing, where each referencing phrase points to its rightmost previous occurrence. The currently best solution takes O(n(1 + log sigma / sqrt(log n))) time (Belazzougui & Puglisi, SODA 2016). We show that this problem is much easier to solve for the LZ-End factorization (Kreft & Navarro, DCC 2010), where the rightmost factorization can be obtained in O(n) time for the greedy parsing (with phrases of maximal length), and in O(n + z sqrt(log z)) time for any LZ-End parsing of z phrases. We also make advances towards a linear-time solution for the general case. We show how to solve multiple non-trivial subsets of the phrases of any LZ-like parsing in O(n) time. As a prime example, we can find the rightmost occurrence of all phrases of length Omega(log^6.66 n / log^2 sigma) in O(n / log_sigma n) time and space.
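For reference, a deliberately quadratic sketch of the greedy parsing with rightmost sources: each phrase is the longest prefix of the remaining text occurring earlier, and among equal-length matches the latest source position is kept. The paper's contribution is achieving this in (near-)linear time, which this sketch makes no attempt at:

```python
def lz77_rightmost(s: str) -> list:
    # Greedy LZ-like parsing with rightmost sources. Matches may
    # overlap the current position (self-referencing phrases); the
    # `>=` below keeps the rightmost source among equal-length
    # matches. O(n^2) or worse -- for illustration only.
    n, i, phrases = len(s), 0, []
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            if l > 0 and l >= best_len:
                best_len, best_src = l, j
        if best_len == 0:
            phrases.append(s[i])                 # literal character
            i += 1
        else:
            phrases.append((best_src, best_len))  # (source, length)
            i += best_len
    return phrases

# In "abXabYab" the final "ab" has two previous occurrences;
# the rightmost one, at position 3, is reported.
print(lz77_rightmost("abXabYab"))
# ['a', 'b', 'X', (0, 2), 'Y', (3, 2)]
```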
ISBN:
(Print) 9780769549286; 9781467350488
Natural language plays a critical role in the design, development and maintenance of software systems. For example, bug reporting systems allow users to submit reports describing observed anomalies in free form English. However, the free form aspect makes the detection of duplicate reports a challenge due to the breadth and diversity of language used by individual reporters. Tokenization, stemming and stop word removal are commonly used techniques to normalize and reduce the language space. However, the impact of typographical errors and alternate spellings has not been analyzed in the research literature. Our research indicates that handling language problems during automated bug triage analysis can lead to a boost in performance. We show that the language used in software problem reporting is too specialized to benefit from domain independent spell checkers or lexical databases. Therefore, we present a novel approach using word distance and neighbor word likelihood measures for detecting and resolving language-based issues in open-source software problem reporting. We evaluate our approach using the complete Firefox repository until March 2012. Our results indicate measurable improvements in duplicate detection results, while reducing the language space for most frequently used words by 30%. Moreover, our method is language-agnostic and does not require a pre-built dictionary, thus making it suitable for use in a variety of systems.
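The word-distance half of such a measure is typically an edit distance. A standard Levenshtein sketch, which would flag typographical variants like "crahs" vs "crash"; the paper's actual approach additionally uses neighbour-word likelihood, which is not shown here:

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via the classic two-row dynamic program:
    # minimum number of insertions, deletions, and substitutions
    # turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # delete ca
                cur[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),    # substitute ca -> cb
            ))
        prev = cur
    return prev[-1]

print(edit_distance("crahs", "crash"))  # 2
```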