Huge amounts of biological data are stored in linear files. Biological proteins are sequences of amino acids. The quantities of data in these fields tend to increase year on year. string matching algorithms play a key...
详细信息
ISBN:
(纸本)9781424456987
Huge amounts of biological data are stored in linear files. Biological proteins are sequences of amino acids. The quantities of data in these fields tend to increase year on year. string matching algorithms play a key role in many computer science problems, and in the implementation of computer software. For this reason efficient string-matching algorithms should be used which use minimal computer storage and which minimize the searching response time. In this study, we propose a new algorithm called the Random string Matching Algorithm (RSMA). RSMA combines our enhanced preprocessing phase from the Berry Ravindran algorithm with our proposed new searching phase procedure. This variety of searching order allows our proposed algorithm to reduce the number of comparison characters and enhances the searching response time. Experimental results show that the RSMA algorithm offers a smaller number of comparisons and offers improved elapsed searching time when compared to other well-known algorithms.
Accelerating multi-pattern matching is a critical issue in building high-performance deep packet inspection systems. Achieving high-throughputs while reducing both memory-usage and memory-bandwidth needs is inherently...
详细信息
ISBN:
(纸本)9781424435128
Accelerating multi-pattern matching is a critical issue in building high-performance deep packet inspection systems. Achieving high-throughputs while reducing both memory-usage and memory-bandwidth needs is inherently difficult. In this paper, we propose a pattern (string) matching algorithm that achieves high throughput while limiting both memory-usage and memory-bandwidth. We achieve this by moving away from a byte-oriented processing of patterns to a block-oriented scheme. However, different from previous block-oriented approaches, our scheme uses variable-stride blocks. These blocks can be uniquely identified in both the pattern and the input stream, hence avoiding the multiplied memory costs which is intrinsic in previous approaches. We present the algorithm, tradeoffs, optimizations, and implementation details. Performance evaluation is done using the Snort and ClamAV pattern sets. Using our algorithm, the throughput of a single search engine can easily have a many-fold increase at a small storage cost, typically less than three bytes per pattern character.
This study focuses on the faster exact single pattern string matching algorithms. In all solutions, two variants of BOM, EBOM and FBOM are very efficient. We improved them and presented two algorithms named Simplified...
详细信息
We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method ...
详细信息
ISBN:
(纸本)9781605585123
We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the correction recall (the fraction of errors corrected) while keeping the correction precision (the fraction of suggestions that are correct) as high as possible. Evaluation results on a standard data set show that our method performs very well. Copyright 2009 ACM.
We develop bit-parallel algorithms for exact string matching. Our algorithms are variations of the BNDM and Shift-Or algorithms. At each alignment the algorithms read a q-gram before testing the state variable. In add...
详细信息
ISBN:
(纸本)9781615671489
We develop bit-parallel algorithms for exact string matching. Our algorithms are variations of the BNDM and Shift-Or algorithms. At each alignment the algorithms read a q-gram before testing the state variable. In addition we apply reading a 2-gram in one instruction. Our experiments show that many of the new variations are substantially faster than any previous string matching algorithm on x86 processors for English and DNA data.
We develop bit-parallel algorithms for exact string matching. Our algorithms are variations of the BNDM and Shift-Or algorithms. At each alignment the algorithms read a q-gram before testing the state variable. In add...
详细信息
Exact matching of single patterns in DNA and amino acid sequences is studied. We performed an extensive experimental comparison of algorithms presented in the literature. In addition, we introduce new variations of ea...
详细信息
Our digital universe is growing, creating exploding amounts of data which need to be searched, filtered and protected. stringsearching is at the core of the tools we use to curb this explosion, such as search engines...
详细信息
ISBN:
(纸本)9781424416936
Our digital universe is growing, creating exploding amounts of data which need to be searched, filtered and protected. stringsearching is at the core of the tools we use to curb this explosion, such as search engines, network intrusion detection systems, spam filters, and anti-virus programs. But as communication speed grows, our capability to perform stringsearching in real-time seems to fall behind. Multi-core architectures promise enough computational power to cope with the incoming challenge, but it is still unclear which algorithms and programming models to use to unleash this power. We have parallelized a popular stringsearching algorithm, Aho-Corasick, on the IBM Cell/B.E. processor, with the goal of performing exact string matching against large dictionaries. In this article we propose a novel approach to fully exploit the DMA-based communication mechanisms of the Cell/B.E. to provide an unprecedented level of aggregate performance with irregular access patterns. We have discovered that memory congestion plays a crucial role in determining the performance of this algorithm. We discuss three aspects of congestion: memory pressure, layout issues and hot spots, and we present a collection of algorithmic solutions to alleviate these problems and achieve quasi-optimal performance. The implementation of our algorithm provides a worst-case throughput of 2.5 Gbps, and a typical throughput between 3.3 and 4.4 Gbps, measured on realistic scenarios with a two-processor Cell/B.E. system.
string matching is one of the fundamentals in various text-processing applications such as text mining and content filtering systems. This paper describes a fast string matching algorithm using a compact pattern match...
详细信息
ISBN:
(纸本)9781424433964
string matching is one of the fundamentals in various text-processing applications such as text mining and content filtering systems. This paper describes a fast string matching algorithm using a compact pattern matching machine DAWG. A directed acyclic word graph (DAWG) is traditionally implemented with a 2-dimensional linked list or matrix. However, DAWGs with these structures have drawbacks, the lookup time or the linked list based one is slow and the space requirement of the matrix based one is large. Therefore, this paper proposes a novel DAWG based on a compacted double-array, which overcomes the drawbacks of traditional ones. Experimental results show that the novel DAWG is more efficient than traditional ones.
string matching is a fundamental issue in computer science. This paper presents a lightweight string matching algorithm for short pattern matching, in which less than 20 keywords are often involved in the pattern set....
详细信息
ISBN:
(纸本)9780769533087
string matching is a fundamental issue in computer science. This paper presents a lightweight string matching algorithm for short pattern matching, in which less than 20 keywords are often involved in the pattern set. The new algorithm makes use of condensed hash tables and computes the shift distance after each test by observing the character that immediately passes the test window. Experiments show that the new algorithm improves execution speed and decreases memory requirement. This algorithm is suitable for applications with small pattern set (i.e. containing up to 30 keywords), particularly for embedded equipments.
暂无评论