We present an optimal O(log log n) time algorithm on the CRCW PRAM which tests whether a square array, A, of size n × n, is superprimitive. If A is not superprimitive, the algorithm returns the quasiperiod, i.e.,...
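The notions involved can be illustrated in one dimension: a string is superprimitive if no proper prefix covers it, and otherwise its quasiperiod is the shortest covering prefix. The sketch below is a naive quadratic check of this 1D analogue, for illustration only; it is not the paper's O(log log n) CRCW PRAM algorithm for square arrays.

def quasiperiod(s):
    # Shortest proper prefix of s whose occurrences cover s (the quasiperiod),
    # or None if s is superprimitive. Naive O(n^2) check, 1D illustration only.
    n = len(s)
    for L in range(1, n):
        q = s[:L]
        if not s.endswith(q):
            continue
        occ, i = [], s.find(q)
        while i != -1:
            occ.append(i)
            i = s.find(q, i + 1)
        # q covers s iff it is a prefix and a suffix and consecutive
        # occurrences start at most L positions apart
        if all(b - a <= L for a, b in zip(occ, occ[1:])):
            return q
    return None

# quasiperiod("abaababaaba") -> "aba";  quasiperiod("abaab") -> None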
ISBN (Print): 9780769549286; 9781467350488
Natural language plays a critical role in the design, development and maintenance of software systems. For example, bug reporting systems allow users to submit reports describing observed anomalies in free-form English. However, this free-form nature makes the detection of duplicate reports a challenge due to the breadth and diversity of language used by individual reporters. Tokenization, stemming and stop word removal are commonly used techniques to normalize and reduce the language space. However, the impact of typographical errors and alternate spellings has not been analyzed in the research literature. Our research indicates that handling language problems during automated bug triage analysis can lead to a boost in performance. We show that the language used in software problem reporting is too specialized to benefit from domain-independent spell checkers or lexical databases. Therefore, we present a novel approach using word distance and neighbor word likelihood measures for detecting and resolving language-based issues in open-source software problem reporting. We evaluate our approach using the complete Firefox repository up to March 2012. Our results indicate measurable improvements in duplicate detection, while reducing the language space for the most frequently used words by 30%. Moreover, our method is language-agnostic and does not require a pre-built dictionary, making it suitable for use in a variety of systems.
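As a rough illustration of the word-distance idea (mapping rare token variants such as typos or alternate spellings onto a frequent near neighbor), here is a minimal sketch; the frequency cutoff and edit-distance threshold are illustrative assumptions, not the measures or parameters used in the paper.

from collections import Counter

def edit_distance(a, b):
    # Standard Levenshtein distance, O(|a|*|b|) dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalize_vocabulary(tokens, min_freq=5, max_dist=1):
    # Map each infrequent token to its most frequent neighbor within
    # max_dist edits (both thresholds are illustrative assumptions).
    freq = Counter(tokens)
    common = [w for w, c in freq.items() if c >= min_freq]
    mapping = {}
    for w, c in freq.items():
        if c >= min_freq:
            continue
        near = [v for v in common
                if abs(len(v) - len(w)) <= max_dist
                and edit_distance(w, v) <= max_dist]
        if near:
            mapping[w] = max(near, key=lambda v: freq[v])
    return mapping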
ISBN (Digital): 9783031439803
ISBN (Print): 9783031439797; 9783031439803
The k-spectrum of a string is the set of all distinct substrings of length k occurring in the string. K-spectra have many applications in bioinformatics, including pseudoalignment and genome assembly. The Spectral Burrows-Wheeler Transform (SBWT) has recently been introduced as an algorithmic tool to efficiently represent and query these objects. The longest common prefix (LCP) array for a k-spectrum is an array of length n that stores the length of the longest common prefix of adjacent k-mers as they occur in lexicographical order. The LCP array has at least two important applications, namely to accelerate pseudoalignment algorithms using the SBWT and to allow simulation of variable-order de Bruijn graphs within the SBWT framework. In this paper we explore algorithms to compute the LCP array efficiently from the SBWT representation of the k-spectrum. Starting with a straightforward O(nk) time algorithm, we describe algorithms that are efficient in both theory and practice. We show that the LCP array can be computed in optimal O(n) time, where n is the length of the SBWT of the spectrum. In practical genomics scenarios, we show that this theoretically optimal algorithm is indeed practical, but is often outperformed for smaller values of k by an asymptotically suboptimal algorithm that interacts better with the CPU cache. Our algorithms share some features with both classical Burrows-Wheeler inversion algorithms and LCP array construction algorithms for suffix arrays. Our C++ implementations of these algorithms are available at https://***/jnalanko/kmer-lcs.
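To fix the object being computed, the following sketch builds the k-spectrum and its LCP array directly from the string by sorting the distinct k-mers and comparing lexicographic neighbors, in the spirit of the straightforward O(nk)-type approach; the paper's algorithms instead work from the SBWT representation.

def kmer_lcp_array(text, k):
    # LCP array of the k-spectrum: spectrum is the sorted set of distinct
    # k-mers; entry i is the LCP of spectrum[i-1] and spectrum[i] (entry 0 is 0).
    spectrum = sorted({text[i:i + k] for i in range(len(text) - k + 1)})

    def lcp(a, b):
        l = 0
        while l < k and a[l] == b[l]:
            l += 1
        return l

    return [0] + [lcp(spectrum[i - 1], spectrum[i])
                  for i in range(1, len(spectrum))]

# kmer_lcp_array("ACGACT", 3) -> [0, 2, 0, 0]   (spectrum: ACG, ACT, CGA, GAC)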
The problem of generalized function matching can be defined as follows: given a pattern p = p(1)...p(m) and a text t = t(1)...t(n), find a mapping f : Sigma(p) -> Sigma(t)* and all text locations i such that f(p(1))f(p(2))...f(p(m)) = t(i)...t(j), a substring of t. By modifying the restrictions on the matching function f, one can obtain different matching problems, many of which have important applications. When f : Sigma(p) -> Sigma(t) we are faced with problems found in the well-established field of combinatorial pattern matching. If the single-character constraint is lifted and f : Sigma(p) -> Sigma(t)*, we obtain generalized function matching as introduced by Amir and Nor (JDA 2007). If we further constrain f to be injective, then we arrive at generalized parametrized matching as defined by Clifford et al. (SPIRE 2009). There are a number of important applications for pattern matching in computational biology, text editors and data compression, to name a few. Therefore, many efficient algorithms have been developed for a wide variety of specific problems, including finding tandem repeats in DNA sequences, optimizing embedded systems by reusing code, etc. In this work we present a heuristic algorithm illustrating a practical approach to tackling a variant of generalized function matching where f : Sigma(p) -> Sigma(t)+ and demonstrate its performance on human-produced text as well as random strings. (C) 2019 The Authors. Published by Elsevier B.V.
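A brute-force sketch of the variant where f : Sigma(p) -> Sigma(t)+ (every pattern symbol maps to a nonempty string) may make the problem concrete; it tries all image lengths by backtracking and is exponential in the worst case, unlike the heuristic developed in the paper.

def gfm_occurrence(pattern, text, start):
    # Search for a mapping f from pattern symbols to nonempty strings with
    # f(p[0])f(p[1])...f(p[m-1]) = text[start:j] for some j; returns the
    # mapping, or None. Exponential-time backtracking, illustration only.
    def backtrack(pi, ti, f):
        if pi == len(pattern):
            return dict(f)
        sym = pattern[pi]
        if sym in f:                       # symbol already has an image
            img = f[sym]
            if text.startswith(img, ti):
                return backtrack(pi + 1, ti + len(img), f)
            return None
        for end in range(ti + 1, len(text) + 1):   # try every nonempty image
            f[sym] = text[ti:end]
            res = backtrack(pi + 1, end, f)
            if res is not None:
                return res
            del f[sym]
        return None
    return backtrack(0, start, {})

# gfm_occurrence("aba", "xyzxy", 0) -> {'a': 'x', 'b': 'yz'}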
ISBN (Print): 9781457716133
This paper provides a deterministic algorithm for finding a longest open reading frame (ORF) among all alternative splicings of a given DNA sequence. Finding protein encoding regions is a fundamental problem in genomic DNA sequence analysis, and long ORFs generally provide good predictions of such regions. Although the number of splice variants is exponential in the number of optionally spliced regions, we are able in many cases to obtain quadratic or even linear performance. This efficiency is achieved by limiting the size of the search space for potential ORFs: by properly pruning the search space we can reduce the number of frames considered at any one time while guaranteeing that a longest open reading frame must be among the frames considered.
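For reference, the sketch below finds a longest ORF (ATG to stop codon, in frame) in a single, already spliced sequence by a naive scan of the three forward reading frames; the paper's contribution is doing this over all alternative splicings without enumerating the exponentially many variants.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(dna):
    # Longest ATG..stop open reading frame over the three forward reading
    # frames of one fixed sequence; naive linear scan per frame.
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best):
                    best = dna[start:i + 3]
                start = None
    return best

# longest_orf("CCATGAAATGA") -> "ATGAAATGA"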
ISBN (Print): 9788001044032
We consider the classic problem of computing (the length of) the longest common subsequence (LCS) between two strings A and B with lengths m and n, respectively. There are several input sensitive algorithms for this problem, such as the O(sigma n + min{Lm, L(n - L)}) algorithms by Rick [15] and Goeman and Clausen [5] and the O(sigma n + min{sigma d, Lm}) algorithms by Chin and Poon [4] and Rick [15]. Here L is the length of the LCS, d is the number of dominant matches between A and B, and sigma is the alphabet size. These algorithms require O(sigma n) time preprocessing for both A and B. We propose a new, fairly simple O(sigma m + min{Lm, L(n - L)}) time algorithm that works in an online manner: it needs to preprocess only A, and it can process B one character at a time, without knowing the whole string B beforehand. The algorithm also adapts well to the linear space scheme of Hirschberg [6] for recovering the LCS, which was not as easy with the above-mentioned algorithms. In addition, our scheme fits well into the context of incremental string comparison [2,10]. The original algorithm of Landau et al. [12] for this problem uses O(sigma m + Lm) space. By using our scheme instead, the space usage becomes O(sigma m + min{Lm, L(n - L)}).
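For contrast with these input sensitive bounds, the textbook O(mn) dynamic program already works online over B, consuming one character at a time with O(m) working space; a minimal sketch (not the paper's algorithm):

def lcs_length_online(A, B_stream):
    # Textbook O(m n) LCS dynamic program; B is consumed one character at a
    # time and only the previous column is kept, so O(m) working space.
    m = len(A)
    col = [0] * (m + 1)            # LCS of A[:i] and the part of B seen so far
    for b in B_stream:
        new = [0] * (m + 1)
        for i in range(1, m + 1):
            if A[i - 1] == b:
                new[i] = col[i - 1] + 1
            else:
                new[i] = max(col[i], new[i - 1])
        col = new
    return col[m]

# lcs_length_online("ABCBDAB", iter("BDCABA")) -> 4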
ISBN (Print): 9788001044032
Novel high throughput sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences in a single experiment and at a much lower cost than previous methods. In this paper, we address the problem of efficiently mapping and classifying millions of degenerate and weighted sequences to a reference genome, based on whether they occur exactly once in the genome or not, and by taking into consideration probability scores. In particular, we design parallel algorithms for Massive Exact and Approximate Unique Pattern Matching for degenerate and weighted sequences derived from high throughput sequencing technologies.
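A simplified, sequential sketch of the classification criterion: a degenerate read (IUPAC codes) is matched exactly against the reference and classified by whether it occurs exactly once. This only illustrates the matching condition, not the parallel algorithms or the weighted (probability-scored) case treated in the paper.

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def degenerate_occurrences(read, genome):
    # Start positions at which the degenerate (IUPAC-coded) read matches the
    # genome exactly; naive O(|genome| * |read|) scan.
    hits = []
    for i in range(len(genome) - len(read) + 1):
        if all(genome[i + j] in IUPAC[c] for j, c in enumerate(read)):
            hits.append(i)
    return hits

def occurs_uniquely(read, genome):
    # Classify the read by whether it occurs exactly once in the genome.
    return len(degenerate_occurrences(read, genome)) == 1

# degenerate_occurrences("ANG", "ACGATGAAG") -> [0, 3, 6]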
ISBN (Print): 9781450322591
Compression of collections, such as text databases, can both reduce space consumption and increase retrieval efficiency, through better caching and better exploitation of the memory hierarchy. A promising technique is relative Lempel-Ziv coding, in which a sample of material from the collection serves as a static dictionary; in previous work, this method demonstrated extremely fast decoding and good compression ratios, while allowing random access to individual items. However, there is a trade-off between dictionary size and compression ratio, motivating the search for a compact, yet similarly effective, dictionary. In previous work it was observed that, since the dictionary is generated by sampling, some of it (selected substrings) may be discarded with little loss in compression. Unfortunately, simple dictionary pruning approaches are ineffective. We develop a formal model of our approach, based on generating an optimal dictionary for a given collection within a memory bound. We derive measures for identifying low-value substrings in the dictionary, and show on a variety of sizes of text collection that halving the dictionary size leads to only marginal loss in compression ratio. This is a dramatic improvement on previous approaches.
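For context, a minimal greedy relative Lempel-Ziv parser against a static dictionary string is sketched below (naive longest-match search, no indexing); the paper's contribution concerns choosing and pruning the dictionary, not the parser itself.

def rlz_parse(text, dictionary):
    # Greedy relative Lempel-Ziv parse of text against a static dictionary:
    # each phrase is (position, length) in the dictionary, or a literal
    # character if no match exists. Naive quadratic longest-match search.
    phrases, i = [], 0
    while i < len(text):
        length, pos = len(text) - i, -1
        while length > 0:
            pos = dictionary.find(text[i:i + length])
            if pos != -1:
                break
            length -= 1
        if pos == -1:
            phrases.append(("literal", text[i]))
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases

def rlz_decode(phrases, dictionary):
    # Decoding simply copies dictionary slices and literals back together.
    return "".join(p[1] if p[0] == "literal" else dictionary[p[0]:p[0] + p[1]]
                   for p in phrases)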
ISBN (Print): 9781665455190
Longest Increasing Subsequence (LIS) is a fundamental statistic of a sequence, and has been studied for decades. While the LIS of a sequence of length n can be computed exactly in time O(n log n), the complexity of estimating the (length of the) LIS in sublinear time, especially when LIS << n, is still open. We show that for any n is an element of N and lambda = o(1), there exists a (randomized) non-adaptive algorithm that, given a sequence of length n with LIS >= lambda n, approximates the LIS up to a factor of 1/lambda^(o(1)) in n^(o(1))/lambda time. Our algorithm improves upon prior work substantially in terms of both approximation and run-time: (i) we provide the first sub-polynomial approximation for LIS in sub-linear time; and (ii) our run-time complexity essentially matches the trivial sample complexity lower bound of Omega(1/lambda), which is required to obtain any non-trivial approximation of the LIS. As part of our solution, we develop two novel ideas which may be of independent interest. First, we define a new Genuine-LIS problem, in which each sequence element may be either genuine or corrupted. In this model, the user receives unrestricted access to the actual sequence, but does not know a priori which elements are genuine. The goal is to estimate the LIS using genuine elements only, with the minimal number of tests for genuineness. The second idea, Precision Tree, enables accurate estimations for composition of general functions from "coarse" (sub-)estimates. Precision Tree essentially generalizes classical precision sampling, which works only for summations. As a central tool, the Precision Tree is pre-processed on a set of samples, which thereafter is repeatedly used by multiple components of the algorithm, improving their amortized complexity.
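The exact O(n log n) baseline mentioned above is the classical patience-sorting computation, sketched here for reference; the paper's sublinear-time estimator is an entirely different, sampling-based algorithm.

from bisect import bisect_left

def lis_length(seq):
    # Exact length of the longest strictly increasing subsequence in
    # O(n log n): tails[k] is the smallest possible last element of an
    # increasing subsequence of length k + 1 seen so far.
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

# lis_length([3, 1, 4, 1, 5, 9, 2, 6]) -> 4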
ISBN (Digital): 9783030004798
ISBN (Print): 9783030004798; 9783030004781
An extended special factor of a word x is a factor of x whose longest infix can be extended by at least two distinct letters to the left or to the right and still occur in x. It is called extended bispecial if it can be extended in both directions and still occur in x. Let rho(n) be the maximum number of extended bispecial factors over all words of length n. Almirantis et al. have shown that 2n - 6 <= rho(n) <= 3n - 4 [WABI 2017]. In this article, we show that there is no constant c < 3 such that rho(n) <= cn. We then exploit the connection between extended special factors and minimal absent words to construct a data structure for computing minimal absent words of a specific length in optimal time for integer alphabets generalising a result by Fujishige et al. [MFCS 2016]. As an application of our data structure, we show how to compare two words over an integer alphabet in optimal time improving on another result by Charalampopoulos et al. [Inf. Comput. 2018].
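The connection to minimal absent words rests on a simple characterization: w is a minimal absent word of x if w does not occur in x although both w[:-1] and w[1:] do. Below is a naive set-based computation of the minimal absent words of one specific length, for illustration only; the paper's data structure achieves this in optimal time for integer alphabets.

def minimal_absent_words(x, ell):
    # Minimal absent words of x of length ell (ell >= 2): words w that do not
    # occur in x although both their longest proper prefix and suffix do.
    alphabet = sorted(set(x))
    facts_short = {x[i:i + ell - 1] for i in range(len(x) - ell + 2)}  # length ell-1
    facts_long = {x[i:i + ell] for i in range(len(x) - ell + 1)}       # length ell
    maws = []
    for u in sorted(facts_short):
        for a in alphabet:
            w = u + a
            if w not in facts_long and w[1:] in facts_short:
                maws.append(w)
    return maws

# minimal_absent_words("abaab", 2) -> ['bb']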