The suffix array is a crucial data structure for efficient string analysis. Over the course of twenty-six years, sequential suffix array construction algorithms have achieved O(n) time complexity and in-place sorting. In this paper, we present the Tunnel algorithm, the first large-scale parallel suffix array construction algorithm with a time complexity of O(n/p) based on the parallel random access machine (PRAM) model. The Tunnel algorithm is built on three key ideas: dividing the problem of size O(n) into p sub-problems of reduced size O(n/p) by replacing long suffixes with shorter prefixes of size at most a constant D; introducing a Tunnel mechanism to efficiently induce the order of a set of suffixes with long common prefixes; and developing a strategy to transform a partially ordered suffix set into a total order relation by iteratively applying the Tunnel inducing method. We provide a detailed description of the algorithm, along with a thorough analysis of its time and space complexity, to demonstrate its correctness and efficiency. The proposed Tunnel algorithm exhibits scalable performance, making it suitable for large string analytics on large-scale parallel systems. (C) 2023 Elsevier B.V. All rights reserved.
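As a point of reference for the data structure itself (not the Tunnel algorithm, which is parallel and considerably more involved), the following minimal Python sketch builds a suffix array by plain comparison sorting of suffixes; the example string and names are illustrative only.

```python
def suffix_array(s: str) -> list[int]:
    """Naive suffix array: suffix start positions sorted by suffix content.

    Only an illustration of the structure being built; large-scale
    constructions (such as the Tunnel algorithm) never materialize and
    compare whole suffixes like this.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


if __name__ == "__main__":
    text = "banana"
    for rank, i in enumerate(suffix_array(text)):
        print(rank, i, text[i:])   # 0 5 a / 1 3 ana / 2 1 anana / 3 0 banana / ...
```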
Suffix trees are an important data structure at the core of optimal solutions to many fundamental string problems, such as exact pattern matching, longest common substring, matching statistics, and longest repeated substring. Recent lines of research have focused on extending some of these problems to vertex-labeled graphs, either by using efficient ad-hoc approaches that do not generalize to all input graphs, or by indexing difficult graphs with worst-case exponential complexities. In the absence of a ubiquitous and polynomial tool like the suffix tree for labeled graphs, we introduce the labeled direct product of two graphs as a general tool for obtaining optimal algorithms in the worst case: we obtain conceptually simpler algorithms for the quadratic problems of string matching (SMLG) and longest common substring (LCSP) in labeled graphs. Our algorithms run in time linear in the size of the labeled product graph, which may be smaller than quadratic for some inputs, and their run-time is predictable, because the size of the labeled direct product graph can be precomputed efficiently. We also solve LCSP on graphs containing cycles, which was left as an open problem by Shimohira et al. in 2011. To show the power of the labeled product graph, we also apply it to solve the matching statistics (MSP) and the longest repeated string (LRSP) problems in labeled graphs. Moreover, we show that our (worst-case quadratic) algorithms are also optimal, conditioned on the Orthogonal Vectors Hypothesis. Finally, we complete the complexity picture around LRSP by studying it on undirected graphs.
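To make the central tool concrete, here is a small sketch of the labeled direct product under the assumed definition that product vertices are pairs of equally labeled vertices and product edges require an edge in both input graphs; the function name and graph encoding are illustrative, not taken from the paper.

```python
from collections import defaultdict

def labeled_direct_product(labels1, edges1, labels2, edges2):
    """Labeled direct product of two vertex-labeled directed graphs.

    labels*: dict vertex -> label; edges*: iterable of directed (u, v) pairs.
    Assumed definition: product vertices are the pairs (u, v) with equal
    labels, and ((u1, v1), (u2, v2)) is a product edge iff (u1, u2) and
    (v1, v2) are edges of the respective input graphs.
    """
    nodes = {(u, v) for u in labels1 for v in labels2
             if labels1[u] == labels2[v]}
    adj2 = defaultdict(list)
    for v1, v2 in edges2:
        adj2[v1].append(v2)
    prod_edges = [((u1, v1), (u2, v2))
                  for (u1, u2) in edges1
                  for v1 in list(adj2)
                  for v2 in adj2[v1]
                  if (u1, v1) in nodes and (u2, v2) in nodes]
    return nodes, prod_edges


if __name__ == "__main__":
    # Text graph: a tiny labeled cycle a -> b -> a -> ...
    t_labels = {0: "a", 1: "b"}
    t_edges = [(0, 1), (1, 0)]
    # Pattern "ab" encoded as a path graph.
    p_labels = {0: "a", 1: "b"}
    p_edges = [(0, 1)]
    nodes, edges = labeled_direct_product(p_labels, p_edges, t_labels, t_edges)
    print(sorted(nodes))   # [(0, 0), (1, 1)]
    print(edges)           # [((0, 0), (1, 1))]: the pattern spells a path in the product
```

For SMLG, one of the two inputs can be the path graph spelling the pattern, as above: an occurrence then corresponds to a product path that starts at a pair containing the first pattern position and ends at a pair containing the last.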
ISBN (print): 9783642163203
A string with many repetitions can be written compactly by replacing h-fold contiguous repetitions of a substring r with (r)^h. We refer to such a compact representation as a repetition representation string, or RRS, by which a set of disjoint or nested tandem arrays can be compacted. In this paper, we study the problem of finding a minimum RRS, or MRRS, where the size of an RRS is defined to be the sum of its component letter sizes and the sizes needed to describe the repetitions (.)^h, which are defined as w_R(h) using a repetition weight function w_R. We develop two dynamic programming algorithms to solve the problem. One is CMR, which works for any repetition weight function; the other is CMR-C, which is faster but can be applied only when the repetition weight function is constant. CMR-C is an O(w(n + z))-time algorithm using O(n + z) space for a given string of length n, where w and z are the number of distinct primitive tandem repeats and the number of their occurrences, respectively. Since w = O(n) and z = O(n log n) in the worst case, CMR-C is an O(n^2 log n)-time, O(n log n)-space algorithm, which is faster than CMR by a ((log n)/n) factor.
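The following brute-force dynamic program is only meant to make the MRRS objective concrete under the assumption of a constant repetition weight (every letter costs 1, every repetition construct costs REP_COST); it is not the CMR or CMR-C algorithm.

```python
from functools import lru_cache

REP_COST = 1   # assumed constant repetition weight: w_R(h) = 1 for every exponent h

def min_rrs_size(s: str) -> int:
    """Size of a minimum repetition representation of s under a constant
    repetition weight.  Brute-force memoized DP over substrings; didactic only."""

    @lru_cache(maxsize=None)
    def cost(t: str) -> int:
        if len(t) <= 1:
            return len(t)
        # Write t as a concatenation of two shorter representations ...
        best = min(cost(t[:i]) + cost(t[i:]) for i in range(1, len(t)))
        # ... or as a repetition (t[:p])^(len(t)//p) of a period p of t.
        n = len(t)
        for p in range(1, n // 2 + 1):
            if n % p == 0 and t == t[:p] * (n // p):
                best = min(best, cost(t[:p]) + REP_COST)
        return best

    return cost(s)


if __name__ == "__main__":
    print(min_rrs_size("abababab"))      # 3: written as (ab)^4 under these assumed weights
    print(min_rrs_size("xaaaayaaaay"))   # 5: written as x((a)^4y)^2
```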
Simon's congruence problem is to determine whether or not two strings have the same set of subsequences of length no greater than a given integer; the problem can be answered in linear time. We consider the Simon's congruence pattern matching problem, which looks for all substrings of a text that are congruent to a pattern under Simon's congruence. We propose a linear-time algorithm by reusing results from previous computations with the help of new data structures called X-trees and Y-trees. Moreover, we investigate several variants of the problem, such as identifying the shortest substring or subsequence of the text that is congruent to the pattern under Simon's congruence, or finding frequent matchings. We design efficient algorithms for these problems. We conclude the paper with two open problems: finding the longest congruent subsequence and optimizing the pattern matching problem.
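A brute-force check of Simon's congruence for tiny inputs, only to pin down the relation the paper works with; the algorithms above are linear time, whereas this sketch is exponential.

```python
from itertools import combinations

def subseqs_up_to(s: str, k: int) -> frozenset:
    """All distinct subsequences of s of length at most k (incl. the empty one).
    Exponential brute force -- for tiny inputs only."""
    out = {""}
    for length in range(1, k + 1):
        for idx in combinations(range(len(s)), length):
            out.add("".join(s[i] for i in idx))
    return frozenset(out)

def simon_congruent(u: str, v: str, k: int) -> bool:
    """u ~_k v: u and v have the same set of subsequences of length at most k."""
    return subseqs_up_to(u, k) == subseqs_up_to(v, k)


if __name__ == "__main__":
    print(simon_congruent("abab", "abba", 2))   # True
    print(simon_congruent("abab", "abba", 3))   # False: "aab" occurs only in "abab"
```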
The Hamming distance with shifts was introduced by Bookstein et al. as a generalization of the traditional Hamming distance to allow a tunable degree of fuzziness when comparing two binary sequences of the same length. We present a linear-time algorithm for computing this distance. The previous best time bound was quadratic.
A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1, ..., y_k over an alphabet Σ, we are asked to compute the set M^ℓ_{y1,...,yk} of minimal absent words of length at most ℓ of the collection {y_1, ..., y_k}. The set M^ℓ_{y1,...,yk} contains all the words x such that x is absent from all the words of the collection while there exist i, j such that the maximal proper suffix of x is a factor of y_i and the maximal proper prefix of x is a factor of y_j. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set M^ℓ_y of minimal absent words of a word y is equal to M^ℓ_{y1,...,yk} for any decomposition of y into a collection of words y_1, ..., y_k such that there is an overlap of length at least ℓ - 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the many available O(n)-time algorithms, because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_{y1,...,yN}|| = o(n) for all N ∈ [1, k], where ||S|| denotes the sum of the lengths of the words in set S. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12_{y1,...,yk}|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all of M^ℓ_{y1}, ..., M^ℓ_{y1,...,yk} can be computed in O(kn + Σ_{N=1}^{k} ||M^ℓ_{y1,...,yN}||) total time using O(MAXIN + MAXOUT) space, where MAXIN is the length of the longest word in {y_1, ..., y_k} and MAXOUT = max{||M^ℓ_{y1,...,yN}|| : N ∈ [1, k]}.
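For a single word, the definition can be made concrete with a brute-force enumeration (x is a minimal absent word exactly when x is absent while its maximal proper prefix and maximal proper suffix both occur); this sketch is not the output-sensitive incremental algorithm of the paper, and the names are illustrative.

```python
def minimal_absent_words(y: str, ell: int, alphabet=None):
    """Minimal absent words of y of length at most ell (brute force).

    x is a minimal absent word iff x does not occur in y while both its
    maximal proper prefix and maximal proper suffix do.
    """
    if alphabet is None:
        alphabet = sorted(set(y))
    factors = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}
    maws = set()
    for f in factors:                     # candidate maximal proper prefix
        if len(f) + 1 > ell:
            continue
        for c in alphabet:
            x = f + c
            if x not in factors and x[1:] in factors:
                maws.add(x)
    return sorted(maws)


if __name__ == "__main__":
    print(minimal_absent_words("abaab", 3))   # ['aaa', 'bab', 'bb']
```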
We present a comprehensive review of probabilistic arithmetic automata (PAAs), a general model to describe chains of operations whose operands depend on chance, along with two algorithms to numerically compute the distribution of the results of such probabilistic calculations. PAAs provide a unifying framework to approach many problems arising in computational biology and elsewhere. We present five different applications, namely 1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting times, and clump sizes under hidden Markov background models; 2) exact analysis of window-based pattern matching algorithms; 3) sensitivity of filtration seeds used to detect candidate sequence alignments; 4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and 5) read length statistics of 454 and IonTorrent sequencing reads. The diversity of these applications indicates the flexibility and unifying character of the presented framework. While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method: we introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. This procedure is used for all five discussed applications and greatly simplifies the construction of PAAs. Implementations are available as part of the MoSDi package. Its application programming interface facilitates the rapid development of new applications based on the PAA framework.
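The spirit of the DAA-plus-random-text-model construction can be sketched for the simplest of the listed applications, the distribution of pattern occurrence counts, using an i.i.d. text model instead of the finite-memory or hidden Markov models the paper handles; everything below is an illustrative toy, not the MoSDi implementation.

```python
from collections import defaultdict

def kmp_automaton(pattern: str, alphabet):
    """delta[q][c] = length of the longest prefix of `pattern` that is a
    suffix of pattern[:q] + c.  States are 0..len(pattern)."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            k = min(m, q + 1)
            while k > 0 and pattern[:k] != (pattern[:q] + c)[-k:]:
                k -= 1
            delta[q][c] = k
    return delta

def occurrence_count_distribution(pattern, text_len, char_probs):
    """Exact distribution of the number of (possibly overlapping) occurrences
    of `pattern` in an i.i.d. random text of length `text_len`.

    Toy instance of the PAA idea: the deterministic part is the pattern
    matching automaton plus a counter, and the joint distribution over
    (state, count) is propagated one text position at a time."""
    alphabet = list(char_probs)
    delta = kmp_automaton(pattern, alphabet)
    m = len(pattern)
    dist = defaultdict(float)
    dist[(0, 0)] = 1.0                       # (automaton state, #occurrences)
    for _ in range(text_len):
        nxt = defaultdict(float)
        for (q, cnt), p in dist.items():
            for c, pc in char_probs.items():
                q2 = delta[q][c]
                nxt[(q2, cnt + (q2 == m))] += p * pc
        dist = nxt
    counts = defaultdict(float)
    for (q, cnt), p in dist.items():
        counts[cnt] += p
    return dict(counts)


if __name__ == "__main__":
    # Pattern "aa" in a uniform random text of length 3 over {a, b}:
    # 5/8 no occurrence, 2/8 one occurrence, 1/8 two occurrences.
    print(occurrence_count_distribution("aa", 3, {"a": 0.5, "b": 0.5}))
```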
In the segment-based approach to sequence alignment, nucleic acid and protein sequence alignments are constructed from fragments, i.e., from pairs of ungapped segments of the input sequences. Given a set F of candidate fragments and a weighting function w : F → R_0^+, the score of an alignment is defined as the sum of the weights of the fragments it consists of, and the optimization problem is to find a consistent collection of pairwise disjoint fragments with maximum sum of weights. Herein, a sparse dynamic programming algorithm is described that solves the pairwise segment-alignment problem in O(L + N_max) space, where L is the maximum length of the input sequences and N_max ≤ #F holds. With a recently introduced weighting function w, small sets F of candidate fragments are sufficient to obtain alignments of high quality. As a result, the proposed algorithm runs in essentially linear space. (C) 2001 Elsevier Science Ltd. All rights reserved.
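A plain O(|F|^2) chaining dynamic program makes the optimization objective concrete; it computes the same kind of optimum but without the sparse, O(L + N_max)-space machinery of the paper, and the fragment encoding used here is an assumption for illustration.

```python
def max_weight_chain(fragments):
    """Maximum total weight of a chain of pairwise compatible fragments.

    A fragment is a tuple (i, j, length, weight): an ungapped pair of
    segments seq1[i:i+length] / seq2[j:j+length].  Fragment g may follow
    fragment f if g starts at or after the end of f in both sequences,
    so the chained segments are pairwise disjoint and co-linear.
    Simple O(|F|^2) dynamic programming; didactic only.
    """
    frags = sorted(fragments, key=lambda f: (f[0] + f[2], f[1] + f[2]))
    best = [f[3] for f in frags]          # best chain weight ending with frags[k]
    for k, (i, j, ln, w) in enumerate(frags):
        for q in range(k):
            pi, pj, pln, pw = frags[q]
            if pi + pln <= i and pj + pln <= j:
                best[k] = max(best[k], best[q] + w)
    return max(best, default=0.0)


if __name__ == "__main__":
    # (i, j, length, weight) -- toy candidate fragments
    F = [(0, 0, 3, 2.0), (2, 1, 2, 1.5), (4, 4, 3, 2.5), (8, 8, 2, 1.0)]
    print(max_weight_chain(F))            # 5.5 = 2.0 + 2.5 + 1.0
```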
Recently, Chowdhury et al. [5] proposed the longest common palindromic subsequence problem. It is a variant of the well-known LCS problem, which refers to finding a palindromic LCS between two strings T_1 and T_2. In this paper, we present a new O(n + R^2)-time algorithm, where n = |T_1| = |T_2| and R is the number of matches between T_1 and T_2. We also show that the average running time of our algorithm is O(n^4 / |Σ|^2), where Σ is the alphabet of T_1 and T_2. This improves the previously best algorithms, whose running times are O(n^4) and O(R^2 log^2 n log log n). (C) 2017 Elsevier B.V. All rights reserved.
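The straightforward O(n^4) recurrence referred to above can be written as a short memoized program; it only pins down the problem and is not the O(n + R^2) algorithm of the paper.

```python
from functools import lru_cache

def lcps(t1: str, t2: str) -> int:
    """Length of a longest common palindromic subsequence of t1 and t2.

    Straightforward O(n^4)-time memoized recurrence over the two substring
    ranges; shown only to define the problem."""

    @lru_cache(maxsize=None)
    def f(i, j, k, l):                    # ranges t1[i..j] and t2[k..l], inclusive
        if i > j or k > l:
            return 0
        if t1[i] == t1[j] == t2[k] == t2[l]:
            if i == j or k == l:          # one of the ranges contributes a single character
                return 1
            return 2 + f(i + 1, j - 1, k + 1, l - 1)
        return max(f(i + 1, j, k, l), f(i, j - 1, k, l),
                   f(i, j, k + 1, l), f(i, j, k, l - 1))

    if not t1 or not t2:
        return 0
    return f(0, len(t1) - 1, 0, len(t2) - 1)


if __name__ == "__main__":
    print(lcps("abacaba", "bacab"))   # 5: "bacab"
```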
Eliminating the possible redundancy from a set of candidate motifs occurring in an input string is fundamental in many applications. The existing techniques proposed to extract irredundant motifs are not suitable when the motifs to search for are structured, i.e., when they are made of two (or several) subwords that co-occur in a text string s of length n. The main effort of this work is studying and characterizing a compact class of tandem motifs, that is, pairs of substrings <m_1, m_2> occurring in tandem within a maximum distance of d symbols in s, where d is an integer constant given in input. To this aim, we first introduce the concept of maximality, related to four specific conditions that hold only for this class of motifs. Then, we eliminate the remaining redundancy by defining the notion of irredundancy for tandem motifs. We prove that the number of non-overlapping irredundant tandem motifs is O(d^2 n), which, considering d as a constant, leads to a linear number of tandems in the length of the input string. This is an order of magnitude less than previously developed compact indexes for tandem extraction. The notions and bounds provided for tandem motifs are generalized to the case r >= 2, where r is the number of subwords composing the motifs. Finally, we also provide an algorithm to extract irredundant tandem motifs. (C) 2013 Elsevier B.V. All rights reserved.
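A brute-force enumeration of tandem occurrences shows the kind of object being compacted and why redundancy elimination matters; the occurrence definition used here (a gap of at most d symbols between the two subword occurrences) is an assumption based on the abstract, and no maximality or irredundancy filtering is performed.

```python
def tandem_occurrences(s: str, d: int, max_len: int = 3):
    """Enumerate pairs <m1, m2> of substrings (each of length <= max_len)
    that occur in tandem in s, with at most d symbols between the end of
    the m1 occurrence and the start of the m2 occurrence.

    Brute force over all occurrence pairs; the output is highly redundant,
    which is exactly what the paper's compact classes avoid.  The occurrence
    definition is an assumption, not taken verbatim from the paper.
    """
    n = len(s)
    tandems = set()
    for i in range(n):
        for li in range(1, min(max_len, n - i) + 1):
            end1 = i + li
            for j in range(end1, min(end1 + d, n - 1) + 1):
                for lj in range(1, min(max_len, n - j) + 1):
                    tandems.add((s[i:i + li], s[j:j + lj]))
    return tandems


if __name__ == "__main__":
    ts = tandem_occurrences("abcabc", d=1, max_len=2)
    print(("ab", "ca") in ts)   # True: "ab" at 0, "ca" at 2, gap of 0 <= d symbols
    print(("ab", "bc") in ts)   # False: no "bc" starts within d symbols after an "ab" ends
```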