In this paper we investigate some properties and algorithms related to a text sparsification technique based on the identification of local maxima in the given string. As the number of local maxima depends on the order assigned to the alphabet symbols, we first consider the case in which the order can be chosen arbitrarily. We show that finding an order that minimizes the number of local maxima in the given text string is an NP-hard problem. Then, we consider the case in which the order is fixed a priori. Even though the order is not necessarily optimal, we can exploit the property that the average number of local maxima induced by the order in an arbitrary text is approximately one third of the text length. In particular, we describe how to iterate the selection of local maxima for one or more rounds, so as to obtain a sparsified text. We show how to use this technique to filter access to unstructured texts, which appear to have no natural division into words. Finally, we experimentally show that our approach can be successfully used to build a space-efficient index that searches for sufficiently long patterns in a DNA sequence as quickly as a full index. (C) 2003 Elsevier B.V. All rights reserved.
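For intuition, here is a minimal Python sketch of the iterated local-maxima selection under a fixed alphabet order. The function names, the strict-inequality treatment of ties, and the handling of boundary positions are illustrative assumptions, not the paper's exact definitions.

```python
# A minimal sketch of iterated local-maxima sparsification under a fixed
# alphabet order (here, the natural character order).

def local_maxima(positions, text):
    """Return the positions whose symbol is strictly greater than both
    neighbours in the current subsequence of surviving positions."""
    maxima = []
    for j in range(1, len(positions) - 1):
        prev_c = text[positions[j - 1]]
        cur_c = text[positions[j]]
        next_c = text[positions[j + 1]]
        if cur_c > prev_c and cur_c > next_c:
            maxima.append(positions[j])
    return maxima

def sparsify(text, rounds=2):
    """Iterate local-maxima selection; each round keeps roughly one third
    of the surviving positions on average."""
    positions = list(range(len(text)))
    for _ in range(rounds):
        positions = local_maxima(positions, text)
    return positions

print(sparsify("CATTAGCAGAGTTCA", rounds=1))  # -> [5, 8]
```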
Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles and can be used to study the under- or over-representation of subwords, while also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects of entropic profiles. In particular, we propose linear-time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, besides being faster, make it possible to analyse longer sequences than state-of-the-art algorithms, even at high degrees of resolution.
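As a deliberately naive illustration of the underlying idea (not the exact normalised entropic-profile formula from the literature), one can score each position by a weighted sum of the global frequencies of the subwords ending there, up to resolution L. The weight `phi`, the default `L`, and the function name are assumptions of this sketch; the paper's contribution is that TST/ESA-based algorithms obtain such profiles in linear time, unlike this quadratic-style counting.

```python
from collections import Counter

def naive_profile(text, L=4, phi=2.0):
    """Toy position score: weighted global frequencies of the suffixes of
    length 1..L ending at each position (no normalisation)."""
    n = len(text)
    # Global counts of every subword of length 1..L; suffix-based
    # structures avoid materialising this table explicitly.
    counts = Counter(text[i:i + k]
                     for k in range(1, L + 1)
                     for i in range(n - k + 1))
    profile = []
    for i in range(n):
        score = sum((phi ** k) * counts[text[i - k + 1:i + 1]]
                    for k in range(1, min(L, i + 1) + 1))
        profile.append(score)
    return profile

print(naive_profile("ACGTACGA", L=2))
```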
One of the most ambitious trends in current biomedical research is the large-scale genomic sequencing of patients. Novel high-throughput (or next-generation) sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences (reads) in a single experiment, at a much lower cost than previously possible. Due to this massive amount of data, efficient algorithms for mapping these sequences to a reference genome are in great demand, and recently there has been ample work on publishing such algorithms. One important feature of these algorithms is support for multithreaded parallel computing to speed up the mapping process. In this paper, we design parallel algorithms, based on the message-passing parallelism model, to address this problem efficiently. The proposed algorithms also take into consideration the probability scores assigned to each base for occurring in a specific position of a sequence. In particular, we present parallel algorithms for mapping short degenerate and weighted DNA sequences to a reference genome.
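A minimal sketch of the message-passing layout such algorithms rely on, using mpi4py for illustration: the toy exact matcher below stands in for the paper's degenerate/weighted matching, and the reference and reads are made-up examples.

```python
from mpi4py import MPI

def map_reads(reference, reads):
    """Toy exact mapper: report every occurrence of each read."""
    hits = []
    for read in reads:
        pos = reference.find(read)
        while pos != -1:
            hits.append((read, pos))
            pos = reference.find(read, pos + 1)
    return hits

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    reference = "ACGTACGTGACCTGAACGT"
    reads = ["ACGT", "GACC", "TGAA", "CGTG"]
    chunks = [reads[i::size] for i in range(size)]  # one chunk per process
else:
    reference, chunks = None, None

reference = comm.bcast(reference, root=0)  # every process needs the reference
my_reads = comm.scatter(chunks, root=0)    # distribute the read set
all_hits = comm.gather(map_reads(reference, my_reads), root=0)

if rank == 0:
    print(sorted(hit for part in all_hits for hit in part))
```

Launched with, e.g., `mpiexec -n 4 python mapper.py` (a hypothetical script name), each rank maps its share of the reads independently; only the final gather synchronizes.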
Document listing is a fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem, such as ranked document retrieval and document listing with two patterns or with forbidden patterns, have been studied. We introduce the problem of document retrieval with forbidden extension. Let D = {T_1, T_2, ..., T_D} be a collection of D string documents of n characters in total, and P^+ and P^- be two query patterns, where P^+ is a proper prefix of P^-. We call P^- the forbidden extension of the included pattern P^+. A forbidden extension query (P^+, P^-) asks to report all occ documents in D that contain P^+ as a substring but do not contain P^- as one. A top-k forbidden extension query (P^+, P^-, k) asks to report those k documents among the occ documents that are most relevant to P^+, where each document is given a unique fixed score (PageRank) and the relevance of a document is determined by its score. We present a linear index (in words) with O(|P^-| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following space-time trade-offs:
- O(n) space (in words) and O(|P^-| log σ + k) query time;
- |CSA| + |CSA*| + D log(n/D) + O(n) bits and O(search(P^-) + k · t_SA · log^{2+ε} n) query time, where ε > 0 is an arbitrarily small constant;
- |CSA| + O(n log D) bits and O(search(P^-) + (k + log D) log D) query time.
Here σ is the size of the alphabet, CSA (of size |CSA| bits) is the compressed suffix array of the concatenated text of all documents, CSA_d is the CSA of T_d, and |CSA*| = Σ_{d=1}^{D} |CSA_d|. Also, search(P^-) is the time for finding the suffix range of P^- using the CSA, and t_SA is the time for computing a suffix array value.
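For intuition about what the queries return, here is a naive baseline that scans every document instead of using the compressed indexes above; the documents and PageRank-style scores are hypothetical.

```python
def forbidden_extension(docs, p_plus, p_minus):
    """Report documents containing p_plus but not its extension p_minus."""
    assert p_minus.startswith(p_plus) and p_minus != p_plus
    return [d for d, text in enumerate(docs)
            if p_plus in text and p_minus not in text]

def top_k_forbidden_extension(docs, scores, p_plus, p_minus, k):
    """Top-k version: rank the reported documents by a fixed score."""
    hits = forbidden_extension(docs, p_plus, p_minus)
    return sorted(hits, key=lambda d: -scores[d])[:k]

docs = ["banana band", "ban the bans", "a banana stand"]
scores = [0.3, 0.9, 0.5]
print(forbidden_extension(docs, "ban", "band"))                   # [1, 2]
print(top_k_forbidden_extension(docs, scores, "ban", "band", 1))  # [1]
```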
Given a string P of length m over an alphabet Σ of size σ, a swapped version of P is a string derived from P by a series of local swaps, i.e., swaps of adjacent symbols, such that each symbol can participate in at most one swap. We present a theoretical analysis of the nondeterministic finite automaton for the language ⋃_{P′ ∈ Π(P)} Σ*P′ (the swap automaton, for short), where Π(P) is the set of swapped versions of P. Our study is based on the bit-parallel simulation of the same automaton due to Fredriksson, and reveals an interesting combinatorial property that links the automaton to the one for the language Σ*P. By exploiting this property and the method presented by Cantone et al. (2012), we obtain a bit-parallel encoding of the swap automaton which takes O(σ² ⌈k/w⌉) space and allows one to simulate the automaton on a string of length n in time O(n ⌈k/w⌉), where ⌈m/σ⌉ ≤ k ≤ m is the size of a specific factorization of P defined by Cantone et al. (2012) and w is the word size in bits. (C) 2014 Elsevier B.V. All rights reserved.
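For intuition about the language the swap automaton recognises, here is a naive reference implementation of swap matching (the bit-parallel encoding itself is not reproduced). The greedy check is safe: whenever window[i] == p[i] and a swap would also match, the two pattern symbols are equal, so the swap is redundant.

```python
def is_swapped_version(window, p):
    """Check whether `window` is p with disjoint adjacent swaps applied."""
    i, m = 0, len(p)
    while i < m:
        if window[i] == p[i]:
            i += 1
        elif (i + 1 < m and window[i] == p[i + 1]
              and window[i + 1] == p[i]):
            i += 2  # one local swap; neither symbol can swap again
        else:
            return False
    return True

def swap_match(text, p):
    m = len(p)
    return [i for i in range(len(text) - m + 1)
            if is_swapped_version(text[i:i + m], p)]

print(swap_match("abba", "ab"))  # [0, 2]: "ab" itself and swapped "ba"
```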
The guided tree edit distance problem is to find a minimum-cost series of edit operations that transforms two input forests F and G into isomorphic forests F' and G' such that a third input forest H is included in F' (and G'). The edit operations are relabeling a vertex and deleting a vertex. We present efficient algorithms for this problem that are faster than the previous algorithm of Peng and Ting [Z. Peng, H. Ting, Guided forest edit distance: Better structure comparisons by using domain-knowledge, in: Proc. 18th Symposium on Combinatorial Pattern Matching (CPM), 2007, pp. 28-39]. (C) 2008 Elsevier B.V. All rights reserved.
The longest common substring with k-mismatches problem is to find, given two strings S_1 and S_2, a longest substring A_1 of S_1 and A_2 of S_2 such that the Hamming distance between A_1 and A_2 is at most k. We introduce a practical O(nm)-time and O(1)-space solution for this problem, where n and m are the lengths of S_1 and S_2, respectively. This algorithm can also be used to compute the matching statistics with k mismatches of S_1 and S_2 in O(nm) time and O(m) space. Moreover, we present a theoretical solution for the k = 1 case which runs in O(n log m) time, assuming m ≤ n, and uses O(m) space, improving over the existing O(nm)-time and O(m)-space bound of Babenko and Starikovskaya [1]. (c) 2015 The Authors. Published by Elsevier B.V.
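A sketch of the diagonal-scanning idea behind such an O(nm)-time, O(1)-extra-space solution: on each diagonal, slide a window over the aligned character pairs while keeping at most k mismatches. Variable names are ours.

```python
def lcs_k_mismatches(s1, s2, k):
    """Longest pair of equal-length substrings with at most k mismatches."""
    n, m = len(s1), len(s2)
    best_len, best = 0, (0, 0)
    for d in range(-(m - 1), n):           # diagonal: s1[i] vs s2[i - d]
        lo = max(0, d)                      # window start (index into s1)
        hi = lo                             # window end (exclusive)
        mismatches = 0
        while hi < n and hi - d < m:
            if s1[hi] != s2[hi - d]:
                mismatches += 1
            hi += 1
            while mismatches > k:           # shrink from the left
                if s1[lo] != s2[lo - d]:
                    mismatches -= 1
                lo += 1
            if hi - lo > best_len:
                best_len, best = hi - lo, (lo, lo - d)
    i, j = best
    return s1[i:i + best_len], s2[j:j + best_len]

print(lcs_k_mismatches("abcdefg", "axcdxfg", 1))  # ('cdefg', 'cdxfg')
```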
Given a pattern P and a text T, both strings over a binary alphabet, the binary jumbled string matching problem consists in deciding whether any permutation of P occurs in T. The indexed version of this problem, i.e., preprocessing a string to efficiently answer such permutation queries, is hard and has been studied in the last few years. Currently the best bounds for this problem are O(n²/log² n) construction time (with O(n) space and O(1) query time) (Moosa and Rahman (2012) [1]) and O(r² log r) construction time (with O(|L|) space and O(log |L|) query time) (Badkobeh et al. (2012) [2]), where r is the length of the run-length encoding of T and |L| = O(n) is the size of the index. In this paper we present new results for this problem. Our first result is an alternative construction of the index by Badkobeh et al. (2012) [2] that obtains a trade-off between the space and the time complexity. It takes O(r² log k + n/k) time to build the index, has O(log k) query time, and uses O(n/k + |L|) space, where k is a parameter. The second result is an O(n² log² w / w)-time algorithm (with O(n) space and O(1) query time), based on word-level parallelism, where w is the word size in bits. (C) 2013 Elsevier B.V. All rights reserved.
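A sketch of the interval property that binary jumbled-matching indexes exploit: sliding a fixed-length window by one position changes its count of 1s by at most one, so for each length the achievable counts form a contiguous range, and storing (min, max) per length answers any permutation query in O(1). This naive O(n²)-time construction ignores the run-length and word-parallel speedups discussed above.

```python
def build_index(text):
    n = len(text)
    ones = [0] * (n + 1)
    for i, c in enumerate(text):            # prefix sums of 1s
        ones[i + 1] = ones[i] + (c == '1')
    index = {}
    for length in range(1, n + 1):
        counts = [ones[i + length] - ones[i] for i in range(n - length + 1)]
        index[length] = (min(counts), max(counts))
    return index

def jumbled_match(index, pattern):
    """Does some permutation of `pattern` occur in the indexed text?"""
    length, k = len(pattern), pattern.count('1')
    if length not in index:
        return False
    lo, hi = index[length]
    return lo <= k <= hi

idx = build_index("1011001")
print(jumbled_match(idx, "0101"))  # True: length-4 windows have 2..3 ones
```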
The Longest Common Subsequence (LCS) of two strings A, B is a well-studied problem with a wide range of applications. When each symbol of the input strings is assigned a positive weight, the problem becomes the Heaviest Common Subsequence (HCS) problem. In this paper we consider a different version of weighted LCS on Position Weight Matrices (PWM). The Position Weight Matrix was introduced as a tool to handle a set of sequences that are not identical, yet have many local similarities. Such a weighted sequence is a 'statistical image' of this set, where we are given the probability of every symbol's occurrence at every text location. We consider two possible definitions of LCS on PWM. For the first, we solve the LCS problem of z sequences in time O(z n^{z+1}). For the second, we consider the log-probability version of the problem, prove NP-hardness, and provide an approximation algorithm. (C) 2010 Elsevier B.V. All rights reserved.
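A hedged sketch of one way to read LCS on PWMs: treat a symbol as "present" at a position when its probability meets a threshold alpha, then run the classical DP, letting two positions match when they share a present symbol. The threshold reading, alpha, and the names are illustrative assumptions, not the paper's exact definitions; the log-probability variant (shown NP-hard) instead multiplies probabilities along the chosen subsequence.

```python
def lcs_pwm(pwm_a, pwm_b, alpha=0.5):
    """LCS over two PWMs (lists of {symbol: probability} columns)."""
    present = lambda col: {s for s, p in col.items() if p >= alpha}
    a = [present(col) for col in pwm_a]
    b = [present(col) for col in pwm_b]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] & b[j - 1]:         # some shared admissible symbol
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

pwm_a = [{'A': 0.9, 'C': 0.1}, {'A': 0.5, 'G': 0.5}, {'T': 1.0}]
pwm_b = [{'G': 0.6, 'A': 0.4}, {'T': 0.8, 'C': 0.2}]
print(lcs_pwm(pwm_a, pwm_b))  # 2: G at a[1]/b[0], then T at a[2]/b[1]
```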
For two strings a, b of lengths m, n, respectively, the longest common subsequence (LCS) problem consists in comparing a and b by computing the length of their LCS. In this paper, we define a generalisation, called "the all semi-local LCS problem", where each string is compared against all substrings of the other string, and all prefixes of each string are compared against all suffixes of the other string. An explicit representation of the output lengths is of size Θ((m + n)²). We show that the output can be represented implicitly by a geometric data structure of size O(m + n), allowing efficient queries of the individual output lengths. The currently best all string-substring LCS algorithm by Alves et al., based on previous work by Schmidt, can be adapted to produce the output in this form. We also develop the first all semi-local LCS algorithm, running in time o(mn) when m and n are reasonably close. Compared to a number of previous results, our approach presents an improvement in algorithm functionality, output representation efficiency, and/or running time. (C) 2008 Elsevier B.V. All rights reserved.
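A naive baseline that materialises the string-substring part of the semi-local output explicitly, by restarting the classical row-rolling DP at each start position of b: O(m n²) time and Θ(n²) output entries for this part alone, which is the blow-up the O(m + n)-size implicit representation avoids. (The prefix-suffix comparisons of the full semi-local problem are omitted here.)

```python
def all_string_substring_lcs(a, b):
    """table[(i, j)] = LCS length of a and b[i:j], in O(m n^2) time."""
    n = len(b)
    table = {}
    for i in range(n):                      # restart the DP at each start i
        suffix = b[i:]
        dp = [0] * (len(suffix) + 1)
        for ca in a:
            prev_diag = 0
            for j, cb in enumerate(suffix, 1):
                prev_diag, dp[j] = dp[j], (prev_diag + 1 if ca == cb
                                           else max(dp[j], dp[j - 1]))
        # Final row: dp[j - i] = LCS(a, b[i:j]) for every prefix of suffix.
        for j in range(i + 1, n + 1):
            table[(i, j)] = dp[j - i]
    return table

table = all_string_substring_lcs("baa", "abab")
print(table[(0, 4)], table[(1, 3)])  # 2 2: LCS vs "abab" and vs "ba"
```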