High utility sequential pattern (HUSP) mining has emerged as a novel topic in data mining, and its computational complexity is higher than that of frequent sequence mining and high-utility itemset mining. A number of algorithms have been proposed to solve this problem, but they mainly focus on mining HUSPs in static databases and do not take streaming data into account, where unbounded data arrive continuously and often at high speed. The efficiency of mining algorithms remains the main research topic in this field. In view of this, this paper proposes an efficient HUSP mining algorithm over data streams named HUSP-UT (Utility on Tail Tree), which is based on a tree structure. Substantial experiments on real datasets show that HUSP-UT identifies high-utility sequences efficiently. Compared with the state-of-the-art algorithm HUSP-Stream (HUSP mining over data streams), the proposed HUSP-UT performed significantly better in our experiments, especially in time efficiency, where it was up to one order of magnitude faster on some datasets. (c) 2019 The Authors. Published by Atlantis Press SARL.
Data mining encompasses various subfields, among which an important branch is high-utility itemset mining. Within this domain, an emerging field of interest is high utility sequential pattern mining (HUSPM), which aims to identify high utility sequential patterns (HUSPs) within databases. In practice, HUSPM has applications in many fields, including DNA sequence analysis and network intrusion detection. However, most HUSPM algorithms assume that the data in the database is accurate, which is not consistent with real-world conditions. Data uncertainty inevitably arises from the collection process, which involves sensors of varying degrees of precision. Although methods for high-utility probability sequential pattern mining (HUPSPM) on uncertain sequences have been proposed, their performance is unsatisfactory when dealing with a low utility or probability threshold or with large-scale datasets. Therefore, we propose an efficient HUPSPM algorithm called HUPSP-LAL. We propose a new probability calculation framework to represent the collected uncertain data mathematically. We design a compact structure, PUL - IA - EL, which HUPSP-LAL uses for projection to accelerate the calculation of the utility, probability, and upper bounds of the candidates. This paper also introduces two probability-based pruning strategies, complemented by two utility-based pruning strategies, all aimed at reducing the search space. Experimental results on real datasets indicate that HUPSP-LAL significantly outperforms the leading algorithms in terms of patterns, runtime, candidates, and memory consumption.
ISBN: 9781728143286 (print)
High utility sequential patterns (HUSPs) are common patterns that can be discovered from the data collected in many domains (e.g., retail, bioinformatics, mobile commerce). To extract these patterns, high utility sequential pattern mining (HUSPM) has been proposed in the recent decade. Although HUSPM algorithms give us a special perspective for analyzing the knowledge behind the collected data, they also raise the risk of privacy leakage and underlying security issues. This has led to the emergence of high utility sequential pattern hiding (HUSPH), whose purpose is to hide all HUSPs in the sequence database under a specified threshold. Many algorithms have been proposed on this topic. However, the existing algorithms are very time-consuming, which makes them unable to process massive real-world data quickly. In this paper, we propose an efficient algorithm named FH-HUSP (fast algorithm for hiding high utility sequential patterns) for HUSPH. Substantial experimental results show that the proposed algorithm can quickly hide all high utility sequential patterns under the specified minimum utility with relatively small modifications.
ISBN: 9781728143286 (print)
High utility sequential pattern mining (HUSPM) is an emerging topic in data mining. Compared with earlier topics (sequential pattern mining and high-utility itemset mining), HUSPM can provide more applicable knowledge, because it jointly considers utility, which indicates business value, and sequential order, which indicates the causality of different items. However, the combination of utility and sequential order brings considerable challenges and makes HUSPM more difficult than the earlier problems. In this paper, we propose two efficient algorithms, HUS-UT and HUS-Par, for HUSPM. The proposed HUS-UT algorithm adopts a novel data structure named Utility-Table to facilitate utility calculation, so it can find the desired patterns quickly. The HUS-Par algorithm is a parallel version of HUS-UT based on the thread model, which also exploits two balance strategies to improve efficiency. We also conduct substantial experiments to evaluate the performance of our algorithms. The experimental results show that our algorithms are much faster than the state-of-the-art algorithms.
ISBN: 9783030190637; 9783030190620 (print)
Regular pattern mining has emerged as one of the important sub-domains of data mining, with numerous applications. Although patterns that occur at a regular interval throughout the whole database can lead to interesting knowledge, examining the utility values of these patterns can reveal more useful information. In a sequence database, the task of mining regular high utility patterns can be more challenging. In this paper, we first propose a new algorithm for mining regular high utility sequential patterns from static databases. Because handling the incremental nature of data yields useful results in many applications in the current era of big data, we then extend our algorithm to mine regular high utility sequential patterns from dynamic databases. Evaluation results on several real-life datasets show the effectiveness of our two algorithms.
High utility sequential pattern mining is an emerging topic in pattern mining, which refers to identifying sequences with high utilities (e.g., profits) but possibly low frequencies. Because this problem lacks the downward closure property, most existing algorithms first generate candidate sequences with high sequence-weighted utilities (SWUs), where the SWU is an upper bound on the utilities of a sequence and all its supersequences, and then calculate the actual utilities of these candidates. This produces a large number of candidates, since the SWU is usually much larger than the real utilities of a sequence and all its supersequences. In view of this, we propose two tight utility upper bounds, prefix extension utility and reduced sequence utility, as well as two companion pruning strategies, and devise the HUS-Span algorithm to identify high utility sequential patterns by employing these two pruning strategies. In addition, since setting a proper utility threshold is usually difficult for users, we also propose the TKHUS-Span algorithm to identify top-k high utility sequential patterns by using these two pruning strategies. Three search strategies, guided depth-first search (GDFS), best-first search (BFS), and a hybrid search of BFS and GDFS, are also proposed to improve the efficiency of TKHUS-Span. Experimental results on some real and synthetic datasets show that HUS-Span and TKHUS-Span with the BFS strategy generate fewer candidate sequences and thus outperform prior algorithms in terms of mining efficiency.
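To illustrate why the SWU is a loose bound, the following is a minimal sketch, not the HUS-Span implementation. It assumes a simplified data model in which every element of a sequence holds a single (item, quantity) pair, and it uses the common definitions that the utility of a pattern in a sequence is the maximum over all of its occurrences and that the SWU sums the total utility of every sequence containing the pattern. The names `pattern_utility`, `swu`, `profit`, and the toy database are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumed simplification: one (item, quantity) pair per sequence element.
QSequence = List[Tuple[str, int]]


def sequence_utility(seq: QSequence, profit: Dict[str, int]) -> int:
    """Total utility of a whole sequence (quantity times unit profit)."""
    return sum(qty * profit[item] for item, qty in seq)


def pattern_utility(pattern: Sequence[str], seq: QSequence,
                    profit: Dict[str, int]) -> Optional[int]:
    """Maximum utility over all occurrences of `pattern` as a subsequence
    of `seq`, or None if the pattern does not occur."""
    # best[j] = best utility of matching the first j pattern items so far.
    best: List[Optional[int]] = [0] + [None] * len(pattern)
    for item, qty in seq:
        u = qty * profit[item]
        # Go right to left so one element fills at most one pattern position.
        for j in range(len(pattern), 0, -1):
            prev = best[j - 1]
            if item == pattern[j - 1] and prev is not None:
                cand = prev + u
                if best[j] is None or cand > best[j]:
                    best[j] = cand
    return best[len(pattern)]


def utility(pattern: Sequence[str], db: List[QSequence],
            profit: Dict[str, int]) -> int:
    """Actual utility of the pattern over the whole database."""
    values = (pattern_utility(pattern, s, profit) for s in db)
    return sum(v for v in values if v is not None)


def swu(pattern: Sequence[str], db: List[QSequence],
        profit: Dict[str, int]) -> int:
    """Sequence-weighted utility: sum of the full utilities of all
    sequences that contain the pattern."""
    return sum(sequence_utility(s, profit) for s in db
               if pattern_utility(pattern, s, profit) is not None)


# Hypothetical toy data for illustration only.
profit = {"a": 2, "b": 5, "c": 1}
db = [
    [("a", 1), ("b", 2), ("a", 3), ("c", 4)],
    [("b", 1), ("a", 2), ("c", 1)],
    [("c", 3), ("b", 2)],
]

print(utility(("a", "c"), db, profit), swu(("a", "c"), db, profit))
```

On this toy database the pattern (a, c) has utility 15 but SWU 32, so a threshold of 20 would keep it as a candidate even though it is not a high utility pattern. Tighter bounds such as those named in the abstract aim to shrink exactly this gap.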