Aho and Corasick presented a string pattern matching machine (hereafter called machine AC) to locate multiple keywords. However, the machine AC must be reconstructed all over again when a keyword is appended. This pap...
详细信息
Aho and Corasick presented a string pattern matching machine (hereafter called machine AC) to locate multiple keywords. However, the machine AC must be reconstructed all over again when a keyword is appended. This paper proposes an efficient algorithm to append a keyword for the machine AC. This paper presents the time efficiency comparison with the original algorithm using the actual simulation results. The simulation results show the speed up factor, by the algorithm proposed, to be between 25 and 270 fold when compared with the original algorithm by Aho and Corasick which requires the reconstruction of the entire machine AC.
This paper presents two algorithms of string pattern matching. These algorithms employ the inverted lists to accommodate the stringpattern to be searched for. The first solution scans the text in a single pass for al...
详细信息
ISBN:
(纸本)9781424445189
This paper presents two algorithms of string pattern matching. These algorithms employ the inverted lists to accommodate the stringpattern to be searched for. The first solution scans the text in a single pass for all occurrences of stringpattern. The second solution, which improves the first one, takes the comparison times equal to the length of pattern plus the number of comparisons that lead to be mismatched.
We introduce an algorithm that can be run in sublinear time and has, under certain circumstances, a high probability of greatly reducing the amount of data from the text that must be considered in order to solve strin...
详细信息
ISBN:
(纸本)1932415637
We introduce an algorithm that can be run in sublinear time and has, under certain circumstances, a high probability of greatly reducing the amount of data from the text that must be considered in order to solve string pattern matching problems.
Presently, each Web site has its own topics and formats to arrange the page structure and present information. Therefore, there is a great need for value-added service that extracts information from multiple sources. ...
详细信息
ISBN:
(纸本)9783642239816
Presently, each Web site has its own topics and formats to arrange the page structure and present information. Therefore, there is a great need for value-added service that extracts information from multiple sources. Data extraction from HTML is usually performed by software modules called wrappers. In many studies of constructing wrapper, concluding the pattern of the Web site is a importance task in the beginning. This paper studies the problem of concluding pattern from a Web page that contains several nested structure and repeated structure. In our method, the algorithm bases on string pattern matching can discover the nested structure and the repeated structure in a Web page. Then a regular expression will be generated as the pattern of the Web site.
This paper presents a Boyer-Moore-type algorithm for regular expression patternmatching, answering an open problem posed by Aho in 1980 (patternmatching in strings, Academic Press, New York, 1980, p. 342). The new a...
详细信息
This paper presents a Boyer-Moore-type algorithm for regular expression patternmatching, answering an open problem posed by Aho in 1980 (patternmatching in strings, Academic Press, New York, 1980, p. 342). The new algorithm handles patterns specified by regular expressions-a generalization of the Boyer-Moore and Commentz-Walter algorithms. Like the Boyer-Moore and Commentz-Walter algorithms, the new algorithm makes use of shift functions which can be precomputed and tabulated. The precomputation algorithms are derived, and it is shown that the required shift functions can be precomputed from Commentz-Walter's d(1) and d(2) shift functions. In certain cases, the Boyer-Moore (respectively Commentz-Walter) algorithm has greatly outperformed the Knuth-Morris-Pratt (respectively Aho-Corasick) algorithm (as discussed by Watson in his Ph.D. Thesis, Eindhoven University of Technology, September 1995, and in: N. Ziviani, R. Baeza-Yates, K. Guimaracs (Eds.), Proc. Third South American Workshop on string Processing, International Informatics Series, vol. 4, Carleton University Press, Recife, Brazil, 1996, pp. 280-294). In testing, the algorithm presented in this paper also frequently outperforms the regular expression generalization of the Aho-Corasick algorithm. (C) 2003 Elsevier B.V. All rights reserved.
We study different efficient implementations of an Aho-Corasick patternmatching automaton when searching for patterns in Unicode text. Much of the previous research has been based on the assumption of a relatively sm...
详细信息
We study different efficient implementations of an Aho-Corasick patternmatching automaton when searching for patterns in Unicode text. Much of the previous research has been based on the assumption of a relatively small alphabet, for example the 7-bit ASCII. Our aim is to examine the differences in performance arising from the use of a large alphabet, such as Unicode that is widely used today. The main concern is the representation of the transition function of the patternmatching automaton. We examine and compare array, linked list, hashing, balanced tree, perfect hashing, hybrid, triple-array, and doublearray representations. For perfect hashing, we present an algorithm that constructs the hash tables in expected linear time and linear space. We implement the Aho-Corasick automaton in Java using the different transition function representations, and we evaluate their performance. Triple-array and doublearray performed best in our experiments, with perfect hashing, hashing, and balanced tree coming next. We discovered that the array implementation has a slow preprocessing time when using the Unicode alphabet. It seems that the use of a large alphabet can slow down the preprocessing time of the automaton considerably depending on the transition function representation used. Copyright (C) 2006 John Wiley & Sons, Ltd.
We explore the, regular-expression matching problem with respect to prefix-freeness of the pattern. We prove that a prefix-free regular expression gives only a linear number of matching substrings in the size of a giv...
详细信息
We explore the, regular-expression matching problem with respect to prefix-freeness of the pattern. We prove that a prefix-free regular expression gives only a linear number of matching substrings in the size of a given text. Based on this observation, we propose an efficient algorithm for the prefix-free regular-expression matching problem. Furthermore, we suggest an algorithm to determine whether or not a given regular language is prefix-free. (c) 2007 Elsevier B.V. All rights reserved.
This paper describes an efficient multi-attribute patternmatching machine to locate all occurrences of any of a finite number of a sequence of rule structures in a series of input structures. The matching operation o...
详细信息
This paper describes an efficient multi-attribute patternmatching machine to locate all occurrences of any of a finite number of a sequence of rule structures in a series of input structures. The matching operation of the proposed machine is similar to the method of Aho-Corasick or the method of retrieval using a trie, however, the proposed machine has the following distinctive features: (1) The proposed machine enables us to match set representations containing multiple attributes;(2) It enables us to match separate components;(3) It enables us to match a rule consisting of an exclusive set. In this paper, their features are described in detail. Moreover, the patternmatching algorithm is evaluated by the theoretical observations and the experimental observations that are supported by the simulation results for a variety of rules for document processing as text proofreading, text reduction, and examining a relation between sentences.
Aho and Corasick presented a string pattern matching machine to locate multiple keywords. However, the AC machine could not match multi-attribute information. This paper describes an efficient multi-attribute pattern ...
详细信息
Aho and Corasick presented a string pattern matching machine to locate multiple keywords. However, the AC machine could not match multi-attribute information. This paper describes an efficient multi-attribute patternmatching machine to locate all occurrences of any of a finite number of the sequence of rule structures (called matching rules) in a sequence of input structures. The proposed algorithm enables us to match set representations containing multiple attributes. Therefore, confirming transition is decided by the relationship, whether the input structure includes the rule structure or not. Finally, the patternmatching algorithm is evaluated by theoretical analysis and the evaluation is supported by the simulation results with rules for the extraction of keywords. (C) 1998 Elsevier Science Inc. All rights reserved.
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology, We design a g...
详细信息
We revisit the exact shortest unique substring (SUS) finding problem, and propose its approximate version where mismatches are allowed, due to its applications in subfields such as computational biology, We design a generic in-place framework that fits to solve both the exact and approximate k-mismatch SUS finding, using the minimum 2n memory words, each of inverted right perpendicular log(2)(n)inverted left perpendicular bits, plus n bytes space, where n is the input string size. By using the in-place framework, we can find the exact and approximate k-mismatch SUS for every string position using a total of O(n) and O(n(2)) time, respectively, regardless of the value of k. Our framework does not involve any compressed or succinct data structures and thus is practical and easy to implement. Experimental study shows that the peak memory usage of our proposal is consistently 9n bytes for any string of size n, validating the claim that our solution is in-place. Further, our proposal uses much less memory and is much faster than the currently best work that has implementation for exact SUS finding. (C) 2017 Elsevier B.V. All rights reserved.
暂无评论