ISBN: 9781479943036 (print)
The majority of existing data mining algorithms mine frequent itemsets from precise data. A well-known algorithm is FP-growth, which builds a compact FP-tree structure to capture important contents of precise data and mines frequent itemsets from the FP-tree. However, there are situations in which data are uncertain. To capture important contents (e.g., existential probabilities) of uncertain data for mining frequent itemsets, the UF-growth algorithm uses a UF-tree structure. However, the UF-tree can be large. Other tree structures for handling uncertain data may achieve compactness at the expense of looser upper bounds on expected supports. To solve this problem, we propose fast algorithms that use compact tree structures for capturing uncertain data with tightened upper bounds to expected support (tube) for frequent itemset mining from uncertain data. Experimental results show the tightness of tube provided by our algorithms and the compactness of our tree structures.
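As an illustration of the quantity being bounded, the expected support of an itemset in uncertain data is commonly defined as the sum, over all transactions, of the product of the existential probabilities of the itemset's items (assuming the items are independent). The sketch below computes it directly, without any tree structure; the sample database and the function names are hypothetical and are not taken from the paper.

```python
from math import prod

# Hypothetical uncertain database: each transaction maps an item to its
# existential probability. The values are made up for illustration.
transactions = [
    {"a": 0.9, "b": 0.8, "c": 0.5},
    {"a": 0.6, "c": 0.5},
    {"b": 0.5, "c": 1.0},
]


def expected_support(itemset, db):
    """Sum, over the transactions containing every item of `itemset`, of the
    product of those items' existential probabilities (independence assumed)."""
    total = 0.0
    for t in db:
        if all(item in t for item in itemset):
            total += prod(t[item] for item in itemset)
    return total


def is_frequent(itemset, db, minsup):
    """An itemset is frequent when its expected support reaches `minsup`."""
    return expected_support(itemset, db) >= minsup
```

For the sample data, `expected_support({"a", "c"}, transactions)` adds 0.9 × 0.5 from the first transaction and 0.6 × 0.5 from the second, giving roughly 0.75. Tree-based algorithms such as UF-growth aim to obtain the same value without rescanning the database for every candidate.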
ISBN: 9783319209104; 9783319209098 (print)
The article describes a method for constructing association rule mining algorithms from function blocks that have a unified interface and purely functional properties. Building association rule algorithms from function blocks makes it possible to modify existing algorithms and build new ones with minimal effort. In addition, the properties of the function blocks allow the algorithms to be transformed into parallel form, thus improving their efficiency.
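The abstract does not define the function blocks, so the following is only one possible reading, sketched under stated assumptions: every block is a pure function that takes a state and returns a new state (the assumed "unified interface"), and an Apriori-style miner is obtained by composing three such blocks. All names here (`count_support`, `prune_infrequent`, `generate_candidates`, `pipeline`) are hypothetical.

```python
from functools import reduce
from itertools import combinations


# Assumed unified interface: each block maps a state dict to a NEW state dict
# and never mutates its input, which keeps the blocks purely functional.
def count_support(state):
    counts = {c: sum(1 for t in state["transactions"] if c <= t)
              for c in state["candidates"]}
    return {**state, "counts": counts}


def prune_infrequent(state):
    kept = frozenset(c for c, n in state["counts"].items()
                     if n >= state["minsup"])
    return {**state, "frequent": state["frequent"] | kept, "level": kept}


def generate_candidates(state):
    # Join pairs of frequent k-itemsets into (k+1)-itemsets. The usual Apriori
    # subset pruning is omitted; the counting block filters such candidates.
    level = state["level"]
    k = len(next(iter(level))) + 1 if level else 0
    cands = frozenset(a | b for a, b in combinations(level, 2)
                      if len(a | b) == k)
    return {**state, "candidates": cands}


def pipeline(*blocks):
    """Compose blocks left to right into a single block."""
    return lambda state: reduce(lambda s, block: block(s), blocks, state)


def apriori(transactions, minsup):
    db = tuple(frozenset(t) for t in transactions)
    state = {
        "transactions": db,
        "minsup": minsup,
        "candidates": frozenset(frozenset([i]) for t in db for i in t),
        "frequent": frozenset(),
        "level": frozenset(),
    }
    step = pipeline(count_support, prune_infrequent, generate_candidates)
    while state["candidates"]:
        state = step(state)
    return state["frequent"]
```

Because each block depends only on its input state, a block can be swapped for a variant (for example, a different candidate generator) without touching the others, and a block such as `count_support` could in principle be evaluated over partitions of the transactions in parallel.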
Discovering complex and incomplete periodic patterns in event logs is a complicated and time-consuming task. This work shows that it is possible to discover complex and incomplete periodic patterns by finding simple patterns first and logically deriving complex and incomplete patterns from them later. The paper defines the syntax and semantics of a class of periodic patterns that frequently occur in event logs. A system of derivation rules proposed in the paper can be used to transform a set of periodic patterns into a logically equivalent set of patterns. The rules are used in the algorithms that derive complex and incomplete periodic patterns. A prototype implementation of the algorithms that discover complex and incomplete periodic patterns in event logs is presented.
Data mining is a powerful method for extracting knowledge from data. Raw data pose various challenges that make traditional methods unsuitable for knowledge extraction, and data mining is expected to handle various data types in all formats. The relevance of this paper is underlined by the fact that data mining is an object of research in many different areas. In this paper, we review previous work on knowledge extraction from medical data. The main idea is to describe key papers and provide some guidelines to help medical practitioners. Medical data mining is a multidisciplinary field that draws on both medicine and data mining, so previous work should be classified in a way that covers the requirements of users from various fields. For this reason, we studied papers published between 1999 and 2013 that aim to extract knowledge from structural medical data. We clarify medical data mining and its main goals, and study each paper in terms of six medical tasks: screening, diagnosis, treatment, prognosis, monitoring and management. For each task, five data mining approaches are considered: classification, regression, clustering, association and hybrid. At the end of each task, a brief summary and discussion are given. A standard framework based on CRISP-DM is additionally adapted to manage all activities. By way of discussion, current issues and future trends are mentioned. The number of works published in this area is substantial, and it is impossible to discuss all of them in a single paper. We hope this paper will make it possible to explore previous work and identify interesting areas for future research. (C) 2014 Elsevier Ltd. All rights reserved.
Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p-value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data and to formulate data mining problems in terms of statistical significance.
In the last few years, educational data mining has become an active area for discovering and extracting hidden knowledge about students from educational environment data. In this work, an attempt was made to manage the extracted information using mining techniques, which were applied to obtain groups of students with similar characteristics. Applying classification, clustering and association rule mining algorithms to the data stored in the e-learning (Moodle system) database made it possible to extract knowledge that helps to understand students' behaviors and patterns. Additionally, a Web application was developed as a tool for educators to monitor their students' learning behavior by tracking the number of assignments taken, the number of quizzes taken, the number of forum posts made and read by students, etc. The knowledge obtained can help instructors make decisions about their students' interaction with course activities in the Moodle system and create an efficient educational environment. In this research, a data mining tool called RapidMiner was used to mine the data from the Moodle system database, and a web application written in PHP was built to provide teachers with statistics.
ISBN: 9781479942749 (print)
Frequent pattern mining is an important data mining task. Since its introduction, it has drawn attention from many researchers. Consequently, many frequent pattern mining algorithms have been proposed to mine large varieties of high-value data, such as high volumes of shopper market basket data. In this paper, we propose a business intelligence (BI) solution for frequent pattern mining on social network data. Evaluation results show that our proposed BI solution is both space- and time-efficient. Moreover, we discuss the benefits and practicality of our BI solution, which reveals frequent social patterns, in real-life business applications.
ISBN: 9781479942749 (print)
UF-growth is a tree-based exact algorithm for mining frequent patterns from uncertain data. While it directly calculates the expected support of a pattern, it requires a significant amount of storage space to capture all existential probability values among the items. To eliminate the extra space requirement of UF-growth, the CUF-growth algorithm combines nodes with the same item by storing an upper bound on expected support. In this paper, we (i) introduce a new concept of domain item-specific capping (DISC) and (ii) propose three new scalable data analytics algorithms that use this concept to achieve a tighter upper bound than CUF-growth. Experimental results show the effectiveness of uncertain frequent pattern mining with tightened upper bounds provided by using the concept of DISC.
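The abstract does not give the bound itself. As a rough sketch of the kind of upper bound involved, the code below assumes a per-transaction cap equal to the product of the two highest existential probabilities in the transaction, which is how the CUF-growth cap is usually described; treat that formula, the sample data, and all function names as assumptions. The DISC refinement is not reproduced here because the abstract does not define it.

```python
from math import prod

# Hypothetical uncertain database (item -> existential probability).
uncertain_db = [
    {"a": 0.9, "b": 0.8, "c": 0.5},
    {"a": 0.6, "c": 0.5},
    {"b": 0.5, "c": 1.0},
]


def transaction_cap(t):
    """Assumed cap: product of the two highest probabilities in `t`
    (or the single probability when the transaction has only one item)."""
    probs = sorted(t.values(), reverse=True)
    return probs[0] if len(probs) == 1 else probs[0] * probs[1]


def cap_upper_bound(itemset, db):
    """Upper bound on expected support for itemsets with at least two items:
    sum of the caps of the transactions that contain the whole itemset."""
    return sum(transaction_cap(t) for t in db
               if all(item in t for item in itemset))


def exact_expected_support(itemset, db):
    """Exact value for comparison (items assumed independent)."""
    return sum(prod(t[item] for item in itemset) for t in db
               if all(item in t for item in itemset))
```

For `{"a", "c"}` the bound sums 0.9 × 0.8 and 0.6 × 0.5, about 1.02, while the exact expected support is about 0.75. The gap illustrates why a tighter, item-specific cap can reduce the number of false positives that must be filtered out later.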
Control charts have been widely recognised as important tools in system monitoring of abnormal behaviour and quality improvement. Traditional control charts have a major assumption that successive observations are uncorrelated and normally distributed. When this assumption is violated, the traditional control charts do not perform well, but instead show increased false alarm rates. In this study, we propose a data mining model adjustment control chart to address autocorrelation problems for cascade processes. The basic idea of the proposed control chart is to monitor the residuals obtained by data mining models. The data mining models used in this study include support vector regression and artificial neural networks. A simulation study was conducted to evaluate the performance of the proposed control chart and compare it with the standard regression adjustment control chart and the observations-based control chart in terms of average run length performance. The results showed that the proposed data mining model adjustment control charts yielded better performance than the two other methods considered in this study.
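The residual-monitoring idea can be sketched as follows. To keep the example self-contained, an ordinary least-squares line stands in for the support vector regression and neural network models used in the study, and the control limits are plain k-sigma limits estimated from in-control training residuals; the function names, the data layout, and the choice k = 3 are all assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept (stand-in model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_chart(train_x, train_y, new_x, new_y, k=3.0):
    """Fit the model on in-control data, set k-sigma limits on its residuals,
    and return the indices of new observations whose residual is out of limits."""
    slope, intercept = fit_line(train_x, train_y)
    train_res = [y - (slope * x + intercept) for x, y in zip(train_x, train_y)]
    center, sigma = mean(train_res), stdev(train_res)
    lcl, ucl = center - k * sigma, center + k * sigma
    alarms = []
    for i, (x, y) in enumerate(zip(new_x, new_y)):
        residual = y - (slope * x + intercept)
        if residual < lcl or residual > ucl:
            alarms.append(i)
    return alarms
```

In a cascade process, `x` would be the upstream quality variable and `y` the downstream one, so that the chart signals only when `y` departs from what the upstream stage explains. Replacing `fit_line` with a nonlinear model changes the residuals but not the monitoring logic.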
In many real-world applications, it is important to mine causal relationships where an event or event pattern causes certain outcomes with low probability. Discovering this kind of causal relationship can help us prevent or correct negative outcomes caused by their antecedents. In this paper, we propose an innovative data mining framework and apply it to mine potential causal associations in electronic patient data sets where the drug-related events of interest occur infrequently. Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a computational, fuzzy recognition-primed decision (RPD) model that we previously developed. On the basis of this new measure, a data mining algorithm was developed to mine the causal relationship between drugs and their associated adverse drug reactions (ADRs). The algorithm was tested on real patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The retrieved data included 16,206 patients (15,605 male, 601 female). The exclusive causal-leverage was employed to rank the potential causal associations between each of the three selected drugs (i.e., enalapril, pravastatin, and rosuvastatin) and 3,954 recorded symptoms, each of which corresponded to a potential ADR. The top 10 drug-symptom pairs for each drug were evaluated by the physicians on our project team. The numbers of symptoms considered likely to be real ADRs for enalapril, pravastatin, and rosuvastatin were 8, 7, and 6, respectively. These preliminary results indicate the usefulness of our method in finding potential ADR signal pairs for further analysis (e.g., epidemiology study) and investigation (e.g., case review) by drug safety professionals.