In this paper we investigate some properties and algorithms related to a text sparsification technique based on the identification of local maxima in the given string. As the number of local maxima depends on the order assigned to the alphabet symbols, we first consider the case in which the order can be chosen arbitrarily. We show that finding an order that minimizes the number of local maxima in the given text string is an NP-hard problem. Then, we consider the case in which the order is fixed a priori. Even though the order is not necessarily optimal, we can exploit the property that the average number of local maxima induced by the order in an arbitrary text is approximately one third of the text length. In particular, we describe how to iterate the selection of local maxima for one or more rounds, so as to obtain a sparsified text. We show how to use this technique to filter access to unstructured texts, which appear to have no natural division into words. Finally, we experimentally show that our approach can be successfully used to build a space-efficient index that searches for sufficiently long patterns in a DNA sequence as quickly as a full index. (C) 2003 Elsevier B.V. All rights reserved.
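For intuition, here is a minimal Python sketch of the iterated local-maxima selection under a fixed alphabet order. The function names, the strict-inequality treatment of ties, and the handling of boundary positions are illustrative assumptions, not the paper's exact definitions.

```python
# A minimal sketch of iterated local-maxima sparsification under a fixed
# alphabet order (here, the natural character order).

def local_maxima(positions, text):
    """Return the positions whose symbol is strictly greater than both
    neighbours in the current subsequence of surviving positions."""
    maxima = []
    for j in range(1, len(positions) - 1):
        prev_c = text[positions[j - 1]]
        cur_c = text[positions[j]]
        next_c = text[positions[j + 1]]
        if cur_c > prev_c and cur_c > next_c:
            maxima.append(positions[j])
    return maxima

def sparsify(text, rounds=2):
    """Iterate local-maxima selection; each round keeps roughly one third
    of the surviving positions on average."""
    positions = list(range(len(text)))
    for _ in range(rounds):
        positions = local_maxima(positions, text)
    return positions

print(sparsify("CATTAGCAGAGTTCA", rounds=1))  # -> [5, 8]
```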
Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles and can be used to study the under- or over-representation of subwords, while also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects of entropic profiles. In particular, we propose linear-time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, besides being faster, make it possible to analyse longer sequences than state-of-the-art algorithms, even at high degrees of resolution.
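As a deliberately naive illustration of the underlying idea (not the exact normalised entropic-profile formula from the literature), one can score each position by a weighted sum of the global frequencies of the subwords ending there, up to resolution L. The weight `phi`, the default `L`, and the function name are assumptions of this sketch; the paper's contribution is that TST/ESA-based algorithms obtain such profiles in linear time, unlike this quadratic-style counting.

```python
from collections import Counter

def naive_profile(text, L=4, phi=2.0):
    """Toy position score: weighted global frequencies of the suffixes of
    length 1..L ending at each position (no normalisation)."""
    n = len(text)
    # Global counts of every subword of length 1..L; suffix-based
    # structures avoid materialising this table explicitly.
    counts = Counter(text[i:i + k]
                     for k in range(1, L + 1)
                     for i in range(n - k + 1))
    profile = []
    for i in range(n):
        score = sum((phi ** k) * counts[text[i - k + 1:i + 1]]
                    for k in range(1, min(L, i + 1) + 1))
        profile.append(score)
    return profile

print(naive_profile("ACGTACGA", L=2))
```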
One of the most ambitious trends in current biomedical research is the large-scale genomic sequencing of patients. Novel high-throughput (or next-generation) sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences (reads) in a single experiment, at a much lower cost than previously possible. Due to this massive amount of data, efficient algorithms for mapping these sequences to a reference genome are in great demand, and recently there has been ample work on publishing such algorithms. One important feature of these algorithms is support for multithreaded parallel computing to speed up the mapping process. In this paper, we design parallel algorithms, based on the message-passing parallelism model, to address this problem efficiently. The proposed algorithms also take into consideration the probability scores assigned to each base for occurring in a specific position of a sequence. In particular, we present parallel algorithms for mapping short degenerate and weighted DNA sequences to a reference genome.
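A minimal sketch of the message-passing layout such algorithms rely on, using mpi4py for illustration: the toy exact matcher below stands in for the paper's degenerate/weighted matching, and the reference and reads are made-up examples.

```python
from mpi4py import MPI

def map_reads(reference, reads):
    """Toy exact mapper: report every occurrence of each read."""
    hits = []
    for read in reads:
        pos = reference.find(read)
        while pos != -1:
            hits.append((read, pos))
            pos = reference.find(read, pos + 1)
    return hits

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    reference = "ACGTACGTGACCTGAACGT"
    reads = ["ACGT", "GACC", "TGAA", "CGTG"]
    chunks = [reads[i::size] for i in range(size)]  # one chunk per process
else:
    reference, chunks = None, None

reference = comm.bcast(reference, root=0)  # every process needs the reference
my_reads = comm.scatter(chunks, root=0)    # distribute the read set
all_hits = comm.gather(map_reads(reference, my_reads), root=0)

if rank == 0:
    print(sorted(hit for part in all_hits for hit in part))
```

Launched with, e.g., `mpiexec -n 4 python mapper.py` (a hypothetical script name), each rank maps its share of the reads independently; only the final gather synchronizes.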
Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem, such as ranked document retrieval and document listing with two patterns or with forbidden patterns, have been studied. We introduce the problem of document retrieval with forbidden extension. Let D = {T_1, T_2, ..., T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- the forbidden extension of the included pattern P^+. A forbidden extension query (P^+, P^-) asks to report all occ documents in D that contain P^+ as a substring but do not contain P^- as one. A top-k forbidden extension query (P^+, P^-, k) asks to report those k documents among the occ documents that are most relevant to P^+, where each document is given a unique fixed score (PageRank) and the relevance of a document is determined by its score. We present a linear index (in words) with O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following space-time trade-offs:
- O(n) space (in words) and O(|P^-| log σ + k) query time;
- |CSA| + |CSA*| + D log(n/D) + O(n) bits and O(search(P^-) + k · t_SA · log^{2+ε} n) query time, where ε > 0 is an arbitrarily small constant;
- |CSA| + O(n log D) bits and O(search(P^-) + (k + log D) log D) query time.
Here σ is the size of the alphabet, CSA (of size |CSA| bits) is the compressed suffix array of the concatenated text of all documents, CSA_d is the CSA of T_d, and |CSA*| = Σ_{d=1}^{D} |CSA_d|. Also, search(P^-) is the time for finding the suffix range of P^- using the CSA, and t_SA is the time for computing a suffix array value.
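For intuition about what the queries return, here is a naive baseline that scans every document instead of using the compressed indexes above; the documents and PageRank-style scores are hypothetical.

```python
def forbidden_extension(docs, p_plus, p_minus):
    """Report documents containing p_plus but not its extension p_minus."""
    assert p_minus.startswith(p_plus) and p_minus != p_plus
    return [d for d, text in enumerate(docs)
            if p_plus in text and p_minus not in text]

def top_k_forbidden_extension(docs, scores, p_plus, p_minus, k):
    """Top-k version: rank the reported documents by a fixed score."""
    hits = forbidden_extension(docs, p_plus, p_minus)
    return sorted(hits, key=lambda d: -scores[d])[:k]

docs = ["banana band", "ban the bans", "a banana stand"]
scores = [0.3, 0.9, 0.5]
print(forbidden_extension(docs, "ban", "band"))                   # [1, 2]
print(top_k_forbidden_extension(docs, scores, "ban", "band", 1))  # [1]
```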
Given a string P of length m over an alphabet Σ of size σ, a swapped version of P is a string derived from P by a series of local swaps, i.e., swaps of adjacent symbols, such that each symbol can participate in at most one swap. We present a theoretical analysis of the nondeterministic finite automaton for the language ⋃_{P′ ∈ Π(P)} Σ*P′ (the swap automaton, for short), where Π(P) is the set of swapped versions of P. Our study is based on the bit-parallel simulation of the same automaton due to Fredriksson, and reveals an interesting combinatorial property that links the automaton to the one for the language Σ*P. By exploiting this property and the method presented by Cantone et al. (2012), we obtain a bit-parallel encoding of the swap automaton which takes O(σ² ⌈k/w⌉) space and allows one to simulate the automaton on a string of length n in time O(n ⌈k/w⌉), where ⌈m/σ⌉ ≤ k ≤ m is the size of a specific factorization of P defined by Cantone et al. (2012) and w is the word size in bits. (C) 2014 Elsevier B.V. All rights reserved.
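For intuition about the language the swap automaton recognises, here is a naive reference implementation of swap matching (the bit-parallel encoding itself is not reproduced). The greedy check is safe: whenever window[i] == p[i] and a swap would also match, the two pattern symbols are equal, so the swap is redundant.

```python
def is_swapped_version(window, p):
    """Check whether `window` is p with disjoint adjacent swaps applied."""
    i, m = 0, len(p)
    while i < m:
        if window[i] == p[i]:
            i += 1
        elif (i + 1 < m and window[i] == p[i + 1]
              and window[i + 1] == p[i]):
            i += 2  # one local swap; neither symbol can swap again
        else:
            return False
    return True

def swap_match(text, p):
    m = len(p)
    return [i for i in range(len(text) - m + 1)
            if is_swapped_version(text[i:i + m], p)]

print(swap_match("abba", "ab"))  # [0, 2]: "ab" itself and swapped "ba"
```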
The guided tree edit distance problem is to find a minimum-cost series of edit operations that transforms two input forests F and G into isomorphic forests F' and G' such that a third input forest H is included in F' (and G'). The edit operations are relabeling a vertex and deleting a vertex. We present efficient algorithms for this problem that are faster than the previous algorithm of Peng and Ting [Z. Peng, H. Ting, Guided forest edit distance: Better structure comparisons by using domain-knowledge, in: Proc. 18th Symposium on Combinatorial Pattern Matching (CPM), 2007, pp. 28-39]. (C) 2008 Elsevier B.V. All rights reserved.
The longest common substring with k-mismatches problem is to find, given two strings S_1 and S_2, a longest substring A_1 of S_1 and A_2 of S_2 such that the Hamming distance between A_1 and A_2 is at most k. We introduce a practical O(nm)-time and O(1)-space solution for this problem, where n and m are the lengths of S_1 and S_2, respectively. This algorithm can also be used to compute the matching statistics with k mismatches of S_1 and S_2 in O(nm) time and O(m) space. Moreover, we present a theoretical solution for the k = 1 case which runs in O(n log m) time, assuming m ≤ n, and uses O(m) space, improving over the existing O(nm)-time and O(m)-space bound of Babenko and Starikovskaya [1]. (c) 2015 The Authors. Published by Elsevier B.V.
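A sketch of the diagonal-scanning idea behind such an O(nm)-time, O(1)-extra-space solution: on each diagonal, slide a window over the aligned character pairs while keeping at most k mismatches. Variable names are ours.

```python
def lcs_k_mismatches(s1, s2, k):
    """Longest pair of equal-length substrings with at most k mismatches."""
    n, m = len(s1), len(s2)
    best_len, best = 0, (0, 0)
    for d in range(-(m - 1), n):           # diagonal: s1[i] vs s2[i - d]
        lo = max(0, d)                      # window start (index into s1)
        hi = lo                             # window end (exclusive)
        mismatches = 0
        while hi < n and hi - d < m:
            if s1[hi] != s2[hi - d]:
                mismatches += 1
            hi += 1
            while mismatches > k:           # shrink from the left
                if s1[lo] != s2[lo - d]:
                    mismatches -= 1
                lo += 1
            if hi - lo > best_len:
                best_len, best = hi - lo, (lo, lo - d)
    i, j = best
    return s1[i:i + best_len], s2[j:j + best_len]

print(lcs_k_mismatches("abcdefg", "axcdxfg", 1))  # ('cdefg', 'cdxfg')
```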
Given a pattern P and a text T, both strings over a binary alphabet, the binary jumbled string matching problem consists in deciding whether any permutation of P occurs in T. The indexed version of this problem, i.e., preprocessing a string to efficiently answer such permutation queries, is hard and has been studied in the last few years. Currently the best bounds for this problem are O(n²/log² n) construction time (with O(n) space and O(1) query time) (Moosa and Rahman (2012) [1]) and O(r² log r) construction time (with O(|L|) space and O(log |L|) query time) (Badkobeh et al. (2012) [2]), where r is the length of the run-length encoding of T and |L| = O(n) is the size of the index. In this paper we present new results for this problem. Our first result is an alternative construction of the index by Badkobeh et al. (2012) [2] that obtains a trade-off between the space and the time complexity. It takes O(r² log k + n/k) time to build the index, has O(log k) query time, and uses O(n/k + |L|) space, where k is a parameter. The second result is an O(n² log² w / w)-time algorithm (with O(n) space and O(1) query time), based on word-level parallelism, where w is the word size in bits. (C) 2013 Elsevier B.V. All rights reserved.
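A sketch of the interval property that binary jumbled-matching indexes exploit: sliding a fixed-length window by one position changes its count of 1s by at most one, so for each length the achievable counts form a contiguous range, and storing (min, max) per length answers any permutation query in O(1). This naive O(n²)-time construction ignores the run-length and word-parallel speedups discussed above.

```python
def build_index(text):
    n = len(text)
    ones = [0] * (n + 1)
    for i, c in enumerate(text):            # prefix sums of 1s
        ones[i + 1] = ones[i] + (c == '1')
    index = {}
    for length in range(1, n + 1):
        counts = [ones[i + length] - ones[i] for i in range(n - length + 1)]
        index[length] = (min(counts), max(counts))
    return index

def jumbled_match(index, pattern):
    """Does some permutation of `pattern` occur in the indexed text?"""
    length, k = len(pattern), pattern.count('1')
    if length not in index:
        return False
    lo, hi = index[length]
    return lo <= k <= hi

idx = build_index("1011001")
print(jumbled_match(idx, "0101"))  # True: length-4 windows have 2..3 ones
```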
The Longest Common Subsequence (LCS) of two strings A, B is a well-studied problem with a wide range of applications. When each symbol of the input strings is assigned a positive weight, the problem becomes the Heaviest Common Subsequence (HCS) problem. In this paper we consider a different version of weighted LCS on Position Weight Matrices (PWM). The Position Weight Matrix was introduced as a tool to handle a set of sequences that are not identical, yet have many local similarities. Such a weighted sequence is a 'statistical image' of this set, where we are given the probability of every symbol's occurrence at every text location. We consider two possible definitions of LCS on PWM. For the first, we solve the LCS problem of z sequences in time O(z n^{z+1}). For the second, we consider the log-probability version of the problem, prove NP-hardness, and provide an approximation algorithm. (C) 2010 Elsevier B.V. All rights reserved.
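A hedged sketch of one way to read LCS on PWMs: treat a symbol as "present" at a position when its probability meets a threshold alpha, then run the classical DP, letting two positions match when they share a present symbol. The threshold reading, alpha, and the names are illustrative assumptions, not the paper's exact definitions; the log-probability variant (shown NP-hard) instead multiplies probabilities along the chosen subsequence.

```python
def lcs_pwm(pwm_a, pwm_b, alpha=0.5):
    """LCS over two PWMs (lists of {symbol: probability} columns)."""
    present = lambda col: {s for s, p in col.items() if p >= alpha}
    a = [present(col) for col in pwm_a]
    b = [present(col) for col in pwm_b]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] & b[j - 1]:         # some shared admissible symbol
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

pwm_a = [{'A': 0.9, 'C': 0.1}, {'A': 0.5, 'G': 0.5}, {'T': 1.0}]
pwm_b = [{'G': 0.6, 'A': 0.4}, {'T': 0.8, 'C': 0.2}]
print(lcs_pwm(pwm_a, pwm_b))  # 2: G at a[1]/b[0], then T at a[2]/b[1]
```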
For two strings a, b of lengths m, n, respectively, the longest common subsequence (LCS) problem consists in comparing a and b by computing the length of their LCS. In this paper, we define a generalisation, called "the all semi-local LCS problem", where each string is compared against all substrings of the other string, and all prefixes of each string are compared against all suffixes of the other string. An explicit representation of the output lengths is of size Θ((m + n)²). We show that the output can be represented implicitly by a geometric data structure of size O(m + n), allowing efficient queries of the individual output lengths. The currently best all string-substring LCS algorithm by Alves et al., based on previous work by Schmidt, can be adapted to produce the output in this form. We also develop the first all semi-local LCS algorithm, running in time o(mn) when m and n are reasonably close. Compared to a number of previous results, our approach presents an improvement in algorithm functionality, output representation efficiency, and/or running time. (C) 2008 Elsevier B.V. All rights reserved.
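A naive baseline that materialises the string-substring part of the semi-local output explicitly, by restarting the classical row-rolling DP at each start position of b: O(m n²) time and Θ(n²) output entries for this part alone, which is the blow-up the O(m + n)-size implicit representation avoids. (The prefix-suffix comparisons of the full semi-local problem are omitted here.)

```python
def all_string_substring_lcs(a, b):
    """table[(i, j)] = LCS length of a and b[i:j], in O(m n^2) time."""
    n = len(b)
    table = {}
    for i in range(n):                      # restart the DP at each start i
        suffix = b[i:]
        dp = [0] * (len(suffix) + 1)
        for ca in a:
            prev_diag = 0
            for j, cb in enumerate(suffix, 1):
                prev_diag, dp[j] = dp[j], (prev_diag + 1 if ca == cb
                                           else max(dp[j], dp[j - 1]))
        # Final row: dp[j - i] = LCS(a, b[i:j]) for every prefix of suffix.
        for j in range(i + 1, n + 1):
            table[(i, j)] = dp[j - i]
    return table

table = all_string_substring_lcs("baa", "abab")
print(table[(0, 4)], table[(1, 3)])  # 2 2: LCS vs "abab" and vs "ba"
```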