We present an optimal O(log log n) time algorithm on the CRCW PRAM which tests whether a square array, A, of size n × n, is superprimitive. If A is not superprimitive, the algorithm returns the quasiperiod, i.e.,...
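The notions involved can be illustrated in one dimension: a string is superprimitive if no proper prefix covers it, and otherwise its quasiperiod is the shortest covering prefix. The sketch below is a naive quadratic check of this 1D analogue, for illustration only; it is not the paper's O(log log n) CRCW PRAM algorithm for square arrays.

def quasiperiod(s):
    # Shortest proper prefix of s whose occurrences cover s (the quasiperiod),
    # or None if s is superprimitive. Naive O(n^2) check, 1D illustration only.
    n = len(s)
    for L in range(1, n):
        q = s[:L]
        if not s.endswith(q):
            continue
        occ, i = [], s.find(q)
        while i != -1:
            occ.append(i)
            i = s.find(q, i + 1)
        # q covers s iff it is a prefix and a suffix and consecutive
        # occurrences start at most L positions apart
        if all(b - a <= L for a, b in zip(occ, occ[1:])):
            return q
    return None

# quasiperiod("abaababaaba") -> "aba";  quasiperiod("abaab") -> None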
ISBN (Print): 9780769549286; 9781467350488
Natural language plays a critical role in the design, development and maintenance of software systems. For example, bug reporting systems allow users to submit reports describing observed anomalies in free-form English. However, this free-form nature makes the detection of duplicate reports a challenge due to the breadth and diversity of language used by individual reporters. Tokenization, stemming and stop word removal are commonly used techniques to normalize and reduce the language space. However, the impact of typographical errors and alternate spellings has not been analyzed in the research literature. Our research indicates that handling language problems during automated bug triage analysis can lead to a boost in performance. We show that the language used in software problem reporting is too specialized to benefit from domain-independent spell checkers or lexical databases. Therefore, we present a novel approach using word distance and neighbor word likelihood measures for detecting and resolving language-based issues in open-source software problem reporting. We evaluate our approach using the complete Firefox repository up to March 2012. Our results indicate measurable improvements in duplicate detection, while reducing the language space for the most frequently used words by 30%. Moreover, our method is language-agnostic and does not require a pre-built dictionary, making it suitable for use in a variety of systems.
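As a rough illustration of the word-distance idea (mapping rare token variants such as typos or alternate spellings onto a frequent near neighbor), here is a minimal sketch; the frequency cutoff and edit-distance threshold are illustrative assumptions, not the measures or parameters used in the paper.

from collections import Counter

def edit_distance(a, b):
    # Standard Levenshtein distance, O(|a|*|b|) dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalize_vocabulary(tokens, min_freq=5, max_dist=1):
    # Map each infrequent token to its most frequent neighbor within
    # max_dist edits (both thresholds are illustrative assumptions).
    freq = Counter(tokens)
    common = [w for w, c in freq.items() if c >= min_freq]
    mapping = {}
    for w, c in freq.items():
        if c >= min_freq:
            continue
        near = [v for v in common
                if abs(len(v) - len(w)) <= max_dist
                and edit_distance(w, v) <= max_dist]
        if near:
            mapping[w] = max(near, key=lambda v: freq[v])
    return mapping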
ISBN (Digital): 9783031439803
ISBN (Print): 9783031439797; 9783031439803
The k-spectrum of a string is the set of all distinct substrings of length k occurring in the string. K-spectra have many applications in bioinformatics, including pseudoalignment and genome assembly. The Spectral Burrows-Wheeler Transform (SBWT) has recently been introduced as an algorithmic tool to efficiently represent and query these objects. The longest common prefix (LCP) array for a k-spectrum is an array of length n that stores the length of the longest common prefix of adjacent k-mers as they occur in lexicographical order. The LCP array has at least two important applications, namely to accelerate pseudoalignment algorithms using the SBWT and to allow simulation of variable-order de Bruijn graphs within the SBWT framework. In this paper we explore algorithms to compute the LCP array efficiently from the SBWT representation of the k-spectrum. Starting with a straightforward O(nk) time algorithm, we describe algorithms that are efficient in both theory and practice. We show that the LCP array can be computed in optimal O(n) time, where n is the length of the SBWT of the spectrum. In practical genomics scenarios, we show that this theoretically optimal algorithm is indeed practical, but is often outperformed for smaller values of k by an asymptotically suboptimal algorithm that interacts better with the CPU cache. Our algorithms share some features with both classical Burrows-Wheeler inversion algorithms and LCP array construction algorithms for suffix arrays. Our C++ implementations of these algorithms are available at https://***/jnalanko/kmer-lcs.
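To fix the object being computed, the following sketch builds the k-spectrum and its LCP array directly from the string by sorting the distinct k-mers and comparing lexicographic neighbors, in the spirit of the straightforward O(nk)-type approach; the paper's algorithms instead work from the SBWT representation.

def kmer_lcp_array(text, k):
    # LCP array of the k-spectrum: spectrum is the sorted set of distinct
    # k-mers; entry i is the LCP of spectrum[i-1] and spectrum[i] (entry 0 is 0).
    spectrum = sorted({text[i:i + k] for i in range(len(text) - k + 1)})

    def lcp(a, b):
        l = 0
        while l < k and a[l] == b[l]:
            l += 1
        return l

    return [0] + [lcp(spectrum[i - 1], spectrum[i])
                  for i in range(1, len(spectrum))]

# kmer_lcp_array("ACGACT", 3) -> [0, 2, 0, 0]   (spectrum: ACG, ACT, CGA, GAC)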
The problem of generalized function matching can be defined as follows: given a pattern p = p(1)...p(m) and a text t = t(1)...t(n), find a mapping f : Sigma(p) -> Sigma(t)* and all text locations i such that f(p(1))f(p(2))...f(p(m)) = t(i)...t(j), a substring of t. By modifying the restrictions on the matching function f, one can obtain different matching problems, many of which have important applications. When f : Sigma(p) -> Sigma(t) we are faced with problems found in the well-established field of combinatorial pattern matching. If the single-character constraint is lifted and f : Sigma(p) -> Sigma(t)*, we obtain generalized function matching as introduced by Amir and Nor (JDA 2007). If we further constrain f to be injective, then we arrive at generalized parametrized matching as defined by Clifford et al. (SPIRE 2009). There are a number of important applications for pattern matching in computational biology, text editors and data compression, to name a few. Therefore, many efficient algorithms have been developed for a wide variety of specific problems, including finding tandem repeats in DNA sequences, optimizing embedded systems by reusing code, etc. In this work we present a heuristic algorithm illustrating a practical approach to tackling a variant of generalized function matching where f : Sigma(p) -> Sigma(t)+ and demonstrate its performance on human-produced text as well as random strings. (C) 2019 The Authors. Published by Elsevier B.V.
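A brute-force sketch of the variant where f : Sigma(p) -> Sigma(t)+ (every pattern symbol maps to a nonempty string) may make the problem concrete; it tries all image lengths by backtracking and is exponential in the worst case, unlike the heuristic developed in the paper.

def gfm_occurrence(pattern, text, start):
    # Search for a mapping f from pattern symbols to nonempty strings with
    # f(p[0])f(p[1])...f(p[m-1]) = text[start:j] for some j; returns the
    # mapping, or None. Exponential-time backtracking, illustration only.
    def backtrack(pi, ti, f):
        if pi == len(pattern):
            return dict(f)
        sym = pattern[pi]
        if sym in f:                       # symbol already has an image
            img = f[sym]
            if text.startswith(img, ti):
                return backtrack(pi + 1, ti + len(img), f)
            return None
        for end in range(ti + 1, len(text) + 1):   # try every nonempty image
            f[sym] = text[ti:end]
            res = backtrack(pi + 1, end, f)
            if res is not None:
                return res
            del f[sym]
        return None
    return backtrack(0, start, {})

# gfm_occurrence("aba", "xyzxy", 0) -> {'a': 'x', 'b': 'yz'}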
ISBN (Print): 9781457716133
This paper provides a deterministic algorithm for finding a longest open reading frame (ORF) among all alternative splicings of a given DNA sequence. Finding protein encoding regions is a fundamental problem in genomic DNA sequence analysis, and long ORFs generally provide good predictions of such regions. Although the number of splice variants is exponential in the number of optionally spliced regions, we are able in many cases to obtain quadratic or even linear performance. This efficiency is achieved by limiting the size of the search space for potential ORFs: by properly pruning the search space we can reduce the number of frames considered at any one time while guaranteeing that a longest open reading frame must be among the frames considered.
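For reference, the sketch below finds a longest ORF (ATG to stop codon, in frame) in a single, already spliced sequence by a naive scan of the three forward reading frames; the paper's contribution is doing this over all alternative splicings without enumerating the exponentially many variants.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(dna):
    # Longest ATG..stop open reading frame over the three forward reading
    # frames of one fixed sequence; naive linear scan per frame.
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best):
                    best = dna[start:i + 3]
                start = None
    return best

# longest_orf("CCATGAAATGA") -> "ATGAAATGA"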
ISBN (Print): 9788001044032
We consider the classic problem of computing (the length of) the longest common subsequence (LCS) between two strings A and B with lengths m and n, respectively. There are several input sensitive algorithms for this problem, such as the O(sigma n + min{Lm, L(n - L)}) algorithms by Rick [15] and Goeman and Clausen [5] and the O(sigma n + min{sigma d, Lm}) algorithms by Chin and Poon [4] and Rick [15]. Here L is the length of the LCS, d is the number of dominant matches between A and B, and sigma is the alphabet size. These algorithms require O(sigma n) time preprocessing for both A and B. We propose a new, fairly simple O(sigma m + min{Lm, L(n - L)}) time algorithm that works in an online manner: it needs to preprocess only A, and it can process B one character at a time, without knowing the whole string B beforehand. The algorithm also adapts well to the linear space scheme of Hirschberg [6] for recovering the LCS, which was not as easy with the above-mentioned algorithms. In addition, our scheme fits well into the context of incremental string comparison [2,10]. The original algorithm of Landau et al. [12] for this problem uses O(sigma m + Lm) space. By using our scheme instead, the space usage becomes O(sigma m + min{Lm, L(n - L)}).
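For contrast with these input sensitive bounds, the textbook O(mn) dynamic program already works online over B, consuming one character at a time with O(m) working space; a minimal sketch (not the paper's algorithm):

def lcs_length_online(A, B_stream):
    # Textbook O(m n) LCS dynamic program; B is consumed one character at a
    # time and only the previous column is kept, so O(m) working space.
    m = len(A)
    col = [0] * (m + 1)            # LCS of A[:i] and the part of B seen so far
    for b in B_stream:
        new = [0] * (m + 1)
        for i in range(1, m + 1):
            if A[i - 1] == b:
                new[i] = col[i - 1] + 1
            else:
                new[i] = max(col[i], new[i - 1])
        col = new
    return col[m]

# lcs_length_online("ABCBDAB", iter("BDCABA")) -> 4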
ISBN (Print): 9788001044032
Novel high throughput sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences in a single experiment and at a much lower cost than previous methods. In this paper, we address the problem of efficiently mapping and classifying millions of degenerate and weighted sequences to a reference genome, based on whether they occur exactly once in the genome or not, and by taking into consideration probability scores. In particular, we design parallel algorithms for Massive Exact and Approximate Unique Pattern Matching for degenerate and weighted sequences derived from high throughput sequencing technologies.
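A simplified, sequential sketch of the classification criterion: a degenerate read (IUPAC codes) is matched exactly against the reference and classified by whether it occurs exactly once. This only illustrates the matching condition, not the parallel algorithms or the weighted (probability-scored) case treated in the paper.

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def degenerate_occurrences(read, genome):
    # Start positions at which the degenerate (IUPAC-coded) read matches the
    # genome exactly; naive O(|genome| * |read|) scan.
    hits = []
    for i in range(len(genome) - len(read) + 1):
        if all(genome[i + j] in IUPAC[c] for j, c in enumerate(read)):
            hits.append(i)
    return hits

def occurs_uniquely(read, genome):
    # Classify the read by whether it occurs exactly once in the genome.
    return len(degenerate_occurrences(read, genome)) == 1

# degenerate_occurrences("ANG", "ACGATGAAG") -> [0, 3, 6]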
ISBN (Print): 9781450322591
Compression of collections, such as text databases, can both reduce space consumption and increase retrieval efficiency, through better caching and better exploitation of the memory hierarchy. A promising technique is relative Lempel-Ziv coding, in which a sample of material from the collection serves as a static dictionary; in previous work, this method demonstrated extremely fast decoding and good compression ratios, while allowing random access to individual items. However, there is a trade-off between dictionary size and compression ratio, motivating the search for a compact, yet similarly effective, dictionary. In previous work it was observed that, since the dictionary is generated by sampling, some of it (selected substrings) may be discarded with little loss in compression. Unfortunately, simple dictionary pruning approaches are ineffective. We develop a formal model of our approach, based on generating an optimal dictionary for a given collection within a memory bound. We derive measures for identifying low-value substrings in the dictionary, and show on a variety of sizes of text collection that halving the dictionary size leads to only marginal loss in compression ratio. This is a dramatic improvement on previous approaches.
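For context, a minimal greedy relative Lempel-Ziv parser against a static dictionary string is sketched below (naive longest-match search, no indexing); the paper's contribution concerns choosing and pruning the dictionary, not the parser itself.

def rlz_parse(text, dictionary):
    # Greedy relative Lempel-Ziv parse of text against a static dictionary:
    # each phrase is (position, length) in the dictionary, or a literal
    # character if no match exists. Naive quadratic longest-match search.
    phrases, i = [], 0
    while i < len(text):
        length, pos = len(text) - i, -1
        while length > 0:
            pos = dictionary.find(text[i:i + length])
            if pos != -1:
                break
            length -= 1
        if pos == -1:
            phrases.append(("literal", text[i]))
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases

def rlz_decode(phrases, dictionary):
    # Decoding simply copies dictionary slices and literals back together.
    return "".join(p[1] if p[0] == "literal" else dictionary[p[0]:p[0] + p[1]]
                   for p in phrases)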
ISBN (Print): 9781665455190
Longest Increasing Subsequence (LIS) is a fundamental statistic of a sequence, and has been studied for decades. While the LIS of a sequence of length n can be computed exactly in time O(n log n), the complexity of estimating the (length of the) LIS in sublinear time, especially when LIS << n, is still open. We show that for any n is an element of N and lambda = o(1), there exists a (randomized) non-adaptive algorithm that, given a sequence of length n with LIS >= lambda n, approximates the LIS up to a factor of 1/lambda^(o(1)) in n^(o(1))/lambda time. Our algorithm improves upon prior work substantially in terms of both approximation and run-time: (i) we provide the first sub-polynomial approximation for LIS in sub-linear time; and (ii) our run-time complexity essentially matches the trivial sample complexity lower bound of Omega(1/lambda), which is required to obtain any non-trivial approximation of the LIS. As part of our solution, we develop two novel ideas which may be of independent interest. First, we define a new Genuine-LIS problem, in which each sequence element may be either genuine or corrupted. In this model, the user receives unrestricted access to the actual sequence, but does not know a priori which elements are genuine. The goal is to estimate the LIS using genuine elements only, with the minimal number of tests for genuineness. The second idea, Precision Tree, enables accurate estimations for composition of general functions from "coarse" (sub-)estimates. Precision Tree essentially generalizes classical precision sampling, which works only for summations. As a central tool, the Precision Tree is pre-processed on a set of samples, which thereafter is repeatedly used by multiple components of the algorithm, improving their amortized complexity.
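The exact O(n log n) baseline mentioned above is the classical patience-sorting computation, sketched here for reference; the paper's sublinear-time estimator is an entirely different, sampling-based algorithm.

from bisect import bisect_left

def lis_length(seq):
    # Exact length of the longest strictly increasing subsequence in
    # O(n log n): tails[k] is the smallest possible last element of an
    # increasing subsequence of length k + 1 seen so far.
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

# lis_length([3, 1, 4, 1, 5, 9, 2, 6]) -> 4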
ISBN (Digital): 9783030004798
ISBN (Print): 9783030004798; 9783030004781
An extended special factor of a word x is a factor of x whose longest infix can be extended by at least two distinct letters to the left or to the right and still occur in x. It is called extended bispecial if it can be extended in both directions and still occur in x. Let rho(n) be the maximum number of extended bispecial factors over all words of length n. Almirantis et al. have shown that 2n - 6 <= rho(n) <= 3n - 4 [WABI 2017]. In this article, we show that there is no constant c < 3 such that rho(n) <= cn. We then exploit the connection between extended special factors and minimal absent words to construct a data structure for computing minimal absent words of a specific length in optimal time for integer alphabets generalising a result by Fujishige et al. [MFCS 2016]. As an application of our data structure, we show how to compare two words over an integer alphabet in optimal time improving on another result by Charalampopoulos et al. [Inf. Comput. 2018].
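The connection to minimal absent words rests on a simple characterization: w is a minimal absent word of x if w does not occur in x although both w[:-1] and w[1:] do. Below is a naive set-based computation of the minimal absent words of one specific length, for illustration only; the paper's data structure achieves this in optimal time for integer alphabets.

def minimal_absent_words(x, ell):
    # Minimal absent words of x of length ell (ell >= 2): words w that do not
    # occur in x although both their longest proper prefix and suffix do.
    alphabet = sorted(set(x))
    facts_short = {x[i:i + ell - 1] for i in range(len(x) - ell + 2)}  # length ell-1
    facts_long = {x[i:i + ell] for i in range(len(x) - ell + 1)}       # length ell
    maws = []
    for u in sorted(facts_short):
        for a in alphabet:
            w = u + a
            if w not in facts_long and w[1:] in facts_short:
                maws.append(w)
    return maws

# minimal_absent_words("abaab", 2) -> ['bb']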