In this article we shortly present new string pattern matching algorithm. The algorithm uses novel technique for skipping unnecessary comparisons in pattern searching phase. The pattern searching is applied in almost ...
详细信息
ISBN:
(纸本)9789532900910
In this article we shortly present new string pattern matching algorithm. The algorithm uses novel technique for skipping unnecessary comparisons in pattern searching phase. The pattern searching is applied in almost all branches of science such as bioinformatics, information security, text mining, etc. In the context of continuous increase of data, efficient algorithms are necessary to ensure that one can find a pattern in a sequence in a fast and accurate manner. pattern searching solves the problem of finding a pattern exhibiting certain properties within a given sequence of symbols. Concept of the new algorithm presented in this article is based on a character index in a pattern, aiming at, but not limited to patterns in DNA sequences.
The biological functions of an RNA molecule are largely determined by molecular configuration. Understanding the link between the structure and the biological functions has been considered one of the challenges in bio...
详细信息
ISBN:
(纸本)9789897583537
The biological functions of an RNA molecule are largely determined by molecular configuration. Understanding the link between the structure and the biological functions has been considered one of the challenges in biology. In this study, we face the problem of identifying a given structural pattern into an RNA pseudoknotfree secondary structure. We introduce a context-free grammar, Loop Grammar, that formalizes the primary and secondary structure of an RNA molecule as a composition of loops. Such composition is expressed as to concatenation or nesting of the simplest structural elements, hairpins, generated during the folding process when a bond between two nonconsecutive nucleotides is established. Then, we formalize the concatenation and nesting on Fatgraphs, oriented surfaces with boundary, and we define a Surface Loop Grammar, whose algebraic expressions uniquely identify such surfaces associated with given RNA structures. The terms of the Loop Grammar allow us to face the problems of identifying substructures considering both the primary and secondary structures, while the strings generated by Surface Loop Grammar permit to identify a given structural pattern in a secondary structure in terms of relations among hairpins. Both use the string pattern matching.
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology, We design a g...
详细信息
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology, We design a generic in-place framework that fits to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words, each of inverted right perpendicular log(2)(n)inverted left perpendicular bits, plus n bytes space, where n is the input string size. By using the in-place framework, we can find the exact and approximate k-mismatch SUS for every string position using a total of O(n) and O(n(2)) time, respectively, regardless of the value of k. Our framework does not involve any compressed or succinct data structures and thus is practical and easy to implement. Experimental study shows that the peak memory usage of our proposal is consistently 9n bytes for any string of size n, validating the claim that our solution is in-place. Further, our proposal uses much less memory and is much faster than the currently best work that has implementation for exact SUS finding. (C) 2017 Elsevier B.V. All rights reserved.
Presently, each Web site has its own topics and formats to arrange the page structure and present information. Therefore, there is a great need for value-added service that extracts information from multiple sources. ...
详细信息
ISBN:
(纸本)9783642239816
Presently, each Web site has its own topics and formats to arrange the page structure and present information. Therefore, there is a great need for value-added service that extracts information from multiple sources. Data extraction from HTML is usually performed by software modules called wrappers. In many studies of constructing wrapper, concluding the pattern of the Web site is a importance task in the beginning. This paper studies the problem of concluding pattern from a Web page that contains several nested structure and repeated structure. In our method, the algorithm bases on string pattern matching can discover the nested structure and the repeated structure in a Web page. Then a regular expression will be generated as the pattern of the Web site.
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a g...
详细信息
ISBN:
(纸本)9783662489710;9783662489703
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology. We design a generic in-place framework that fits to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words plus n bytes space, where n is the input string size. By using the in-place framework, we can find the exact and approximate k-mismatch SUS for every string position using a total of O(n) and O(n(2)) time, respectively, regardless of the value of k. Our framework does not involve any compressed or succinct data structures and thus is practical and easy to implement.
This paper presents two algorithms of string pattern matching. These algorithms employ the inverted lists to accommodate the stringpattern to be searched for. The first solution scans the text in a single pass for al...
详细信息
ISBN:
(纸本)9781424445189
This paper presents two algorithms of string pattern matching. These algorithms employ the inverted lists to accommodate the stringpattern to be searched for. The first solution scans the text in a single pass for all occurrences of stringpattern. The second solution, which improves the first one, takes the comparison times equal to the length of pattern plus the number of comparisons that lead to be mismatched.
We revisit the regular-expression matching problem with respect to prefix-freeness of the pattern. It is known that a prefix-free pattern gives only a linear number of matching substrings in the size of an input text....
详细信息
We revisit the regular-expression matching problem with respect to prefix-freeness of the pattern. It is known that a prefix-free pattern gives only a linear number of matching substrings in the size of an input text. We improve the previous algorithm and suggest an efficient algorithm that finds all pairs (start, end) of start and end positions of all matching substrings with a singles can of the input when the pattern is a prefix-free regular expression.
This paper presents the main content of IE (Internet Explore) Internet record data, what fields are of forensic interest and what information the available tools can extract. It shows that remnants may still be found ...
详细信息
ISBN:
(纸本)9781424452729
This paper presents the main content of IE (Internet Explore) Internet record data, what fields are of forensic interest and what information the available tools can extract. It shows that remnants may still be found in unallocated disk space even the IE Internet record normal file data is deleted. This paper proposes a string pattern matching algorithm to find IE Internet record data in unallocated space based on known internal record structures. The system developed can recover deleted IE records and it may be an approach to obtain evidence in unallocated disk space.
We introduce an algorithm that can be run in sublinear time and has, under certain circumstances, a high probability of greatly reducing the amount of data from the text that must be considered in order to solve strin...
详细信息
ISBN:
(纸本)1932415637
We introduce an algorithm that can be run in sublinear time and has, under certain circumstances, a high probability of greatly reducing the amount of data from the text that must be considered in order to solve string pattern matching problems.
We study different efficient implementations of an Aho-Corasick patternmatching automaton when searching for patterns in Unicode text. Much of the previous research has been based on the assumption of a relatively sm...
详细信息
We study different efficient implementations of an Aho-Corasick patternmatching automaton when searching for patterns in Unicode text. Much of the previous research has been based on the assumption of a relatively small alphabet, for example the 7-bit ASCII. Our aim is to examine the differences in performance arising from the use of a large alphabet, such as Unicode that is widely used today. The main concern is the representation of the transition function of the patternmatching automaton. We examine and compare array, linked list, hashing, balanced tree, perfect hashing, hybrid, triple-array, and doublearray representations. For perfect hashing, we present an algorithm that constructs the hash tables in expected linear time and linear space. We implement the Aho-Corasick automaton in Java using the different transition function representations, and we evaluate their performance. Triple-array and doublearray performed best in our experiments, with perfect hashing, hashing, and balanced tree coming next. We discovered that the array implementation has a slow preprocessing time when using the Unicode alphabet. It seems that the use of a large alphabet can slow down the preprocessing time of the automaton considerably depending on the transition function representation used. Copyright (C) 2006 John Wiley & Sons, Ltd.
暂无评论